
LLM input shaping

Shapes LLM prompts to fit within a token budget before they hit the API. Uses tiktoken for token counting and drops oldest conversation turns first, then trims the query if still over budget. The budgeting logic itself is model-agnostic — the example uses gpt-4o-mini as the tokenizer target, but the pattern works the same way for any model with a token limit.

  1. The host generates prompt inputs (same system prefix, different conversation history and queries).
  2. Each task builds the full prompt and counts tokens with tiktoken.
  3. If over budget, it drops the oldest turns first, then trims query tokens.
  4. The host aggregates token savings and trim counts.
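The drop-then-trim policy in steps 2–3 can be sketched in isolation. This is a simplified, hypothetical sketch — not the actual token_budget.ts: the token counter is injected as a plain function so the policy stays testable, where the real file would count with tiktoken.

```typescript
type CountTokens = (text: string) => number;

type BudgetResult = {
  history: string[];
  query: string;
  trimmedTurns: number;
  queryWasTrimmed: boolean;
};

// Hypothetical helper illustrating the policy above: drop oldest turns
// first, then shorten the query, until the joined prompt fits the budget.
function applyBudget(
  systemPrefix: string,
  history: string[],
  query: string,
  maxInputTokens: number,
  countTokens: CountTokens,
): BudgetResult {
  const kept = [...history];
  let trimmedQuery = query;
  let trimmedTurns = 0;
  let queryWasTrimmed = false;
  const total = () =>
    countTokens([systemPrefix, ...kept, trimmedQuery].join("\n"));

  // Step 3a: drop the oldest conversation turns first.
  while (total() > maxInputTokens && kept.length > 0) {
    kept.shift();
    trimmedTurns++;
  }
  // Step 3b: if still over budget, trim the query from the end.
  while (total() > maxInputTokens && trimmedQuery.includes(" ")) {
    trimmedQuery = trimmedQuery.slice(0, trimmedQuery.lastIndexOf(" "));
    queryWasTrimmed = true;
  }
  return { history: kept, query: trimmedQuery, trimmedTurns, queryWasTrimmed };
}
```

Swapping the injected counter for a real tokenizer changes the counts, not the policy: the history always pays for the budget before the query does.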

Three files:

  • run_prompt_token_budget.ts — practical prompt-budgeting example for app logic
  • bench_prompt_token_budget.ts — dedicated mitata benchmark measuring budgeting throughput
  • token_budget.ts — the budgeting logic itself

Input:

const input = {
  model: "gpt-4o-mini",
  systemPrefix: "You are a docs assistant.",
  history: [
    "Need guidance on schema validation.",
    "Compare workers and batching.",
    "Keep the answer short.",
  ],
  query: "Give a migration plan and one code example.",
  maxInputTokens: 900,
};

Output shape:

{
  prompt: "...final prompt string...",
  rawInputTokens: 1120,
  inputTokens: 884,
  trimmedTurns: 1,
  queryWasTrimmed: false,
}

The useful part is not just the final prompt; you also get the bookkeeping needed to explain why a prompt was trimmed and by how much.
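As one illustration of putting that bookkeeping to work, a consumer could turn a plan into a human-readable log line. This helper is hypothetical (explainTrim is not part of the example files); it only reads the fields shown in the output shape above.

```typescript
// Matches the output shape documented above.
type PromptPlan = {
  prompt: string;
  rawInputTokens: number;
  inputTokens: number;
  trimmedTurns: number;
  queryWasTrimmed: boolean;
};

// Hypothetical helper: explain why a prompt shrank and by how much.
function explainTrim(plan: PromptPlan): string {
  if (plan.rawInputTokens <= plan.inputTokens) return "untrimmed";
  const saved = plan.rawInputTokens - plan.inputTokens;
  const parts = [`saved ${saved} tokens`];
  if (plan.trimmedTurns > 0) parts.push(`dropped ${plan.trimmedTurns} turn(s)`);
  if (plan.queryWasTrimmed) parts.push("query trimmed");
  return parts.join(", ");
}
```

For the example output shape above, this yields "saved 236 tokens, dropped 1 turn(s)".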

deno.sh
deno add jsr:@vixeny/knitting
deno add npm:tiktoken npm:openai npm:mitata

openai is optional: these examples only shape prompts and count tokens, and never call the API. mitata is needed only for the benchmark script.

bun.sh
bun src/run_prompt_token_budget.ts --threads 2 --requests 2000 --maxInputTokens 900 --model gpt-4o-mini --mode knitting

Expected output:

Prompt token budgeting
mode : knitting
model : gpt-4o-mini
threads : 2
requests : 2,000
maxInputTokens : 900
...
saved tokens : 389,064
trimmed runs : 1,247
...
took : 1820.00 ms
throughput : 1099 req/s

bun.sh
bun src/bench_prompt_token_budget.ts --threads 2 --requests 20000 --maxInputTokens 900 --model gpt-4o-mini --batch 32

Compares budgeting throughput on the host vs through workers. Worker tasks return compact totals (token/trim counters), not full prompt strings.
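The compact-return idea can be sketched with two hypothetical helpers (foldBatch and mergeTotals are illustrative, not the bench file's actual code): each worker folds its batch of plans into small counters, and the host merges counters instead of shipping prompt strings across the thread boundary.

```typescript
// Per-plan fields a worker needs for the counters (a subset of PromptPlan).
type PlanStats = {
  rawInputTokens: number;
  inputTokens: number;
  trimmedTurns: number;
};

type Totals = {
  rawTokens: number;
  budgetedTokens: number;
  trimmedRuns: number;
};

// Worker side: collapse a batch of plans into compact counters.
function foldBatch(plans: PlanStats[]): Totals {
  const t: Totals = { rawTokens: 0, budgetedTokens: 0, trimmedRuns: 0 };
  for (const p of plans) {
    t.rawTokens += p.rawInputTokens;
    t.budgetedTokens += p.inputTokens;
    if (p.trimmedTurns > 0) t.trimmedRuns++;
  }
  return t;
}

// Host side: combine the counters returned by each batch.
function mergeTotals(a: Totals, b: Totals): Totals {
  return {
    rawTokens: a.rawTokens + b.rawTokens,
    budgetedTokens: a.budgetedTokens + b.budgetedTokens,
    trimmedRuns: a.trimmedRuns + b.trimmedRuns,
  };
}
```

Counters serialize in constant size per batch, so transfer cost stays flat no matter how large the prompts are.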

run_prompt_token_budget.ts
import { createPool, isMain } from "@vixeny/knitting";
import {
  preparePrompt,
  preparePromptHost,
  type PromptInput,
  type PromptPlan,
} from "./token_budget.ts";

function intArg(name: string, fallback: number): number {
  const i = process.argv.indexOf(`--${name}`);
  if (i !== -1 && i + 1 < process.argv.length) {
    const value = Number(process.argv[i + 1]);
    if (Number.isFinite(value)) return Math.floor(value);
  }
  return fallback;
}

function strArg(name: string, fallback: string): string {
  const i = process.argv.indexOf(`--${name}`);
  if (i !== -1 && i + 1 < process.argv.length) {
    return String(process.argv[i + 1]);
  }
  return fallback;
}
const THREADS = Math.max(1, intArg("threads", 2));
const REQUESTS = Math.max(1, intArg("requests", 20_000));
const MAX_INPUT_TOKENS = Math.max(64, intArg("maxInputTokens", 900));
const MODE = strArg("mode", "knitting");
const MODEL = strArg("model", "gpt-4o-mini");
const SYSTEM_PREFIX = [
  "You are a docs assistant.",
  "Prefer concrete and short answers.",
  "If data is missing, say it directly.",
  "Do not invent unsupported behavior.",
].join("\n");

const TOPICS = [
  "token budgeting",
  "prompt caching",
  "parallel workers",
  "schema validation",
  "rendering pipelines",
  "markdown output",
  "compression tradeoffs",
  "latency under load",
];

function pick<T>(arr: T[], i: number): T {
  return arr[i % arr.length]!;
}

function makeHistory(i: number): string[] {
  const turns = 3 + (i % 10);
  const history = new Array<string>(turns);
  for (let t = 0; t < turns; t++) {
    const topic = pick(TOPICS, i + t);
    history[t] =
      `Need guidance on ${topic}. Include practical steps and one small code example.`;
  }
  return history;
}

function makeInput(i: number): PromptInput {
  const topicA = pick(TOPICS, i);
  const topicB = pick(TOPICS, i + 3);
  const query = [
    `Please compare ${topicA} with ${topicB}.`,
    "I care about cost per request and response quality.",
    "Give a short recommendation and a migration path.",
  ].join(" ");
  return {
    model: MODEL,
    systemPrefix: SYSTEM_PREFIX,
    history: makeHistory(i),
    query,
    maxInputTokens: MAX_INPUT_TOKENS,
  };
}
type Totals = {
  rawTokens: number;
  budgetedTokens: number;
  staticTokens: number;
  dynamicTokens: number;
  trimmedRuns: number;
  queryTrimmedRuns: number;
  turnsDropped: number;
};

function summarize(plans: PromptPlan[]): Totals {
  const totals: Totals = {
    rawTokens: 0,
    budgetedTokens: 0,
    staticTokens: 0,
    dynamicTokens: 0,
    trimmedRuns: 0,
    queryTrimmedRuns: 0,
    turnsDropped: 0,
  };
  for (const plan of plans) {
    totals.rawTokens += plan.rawInputTokens;
    totals.budgetedTokens += plan.inputTokens;
    totals.staticTokens += plan.staticTokens;
    totals.dynamicTokens += plan.dynamicTokens;
    totals.turnsDropped += plan.trimmedTurns;
    if (plan.trimmedTurns > 0) totals.trimmedRuns++;
    if (plan.queryWasTrimmed) totals.queryTrimmedRuns++;
  }
  return totals;
}
function runHost(inputs: PromptInput[]): Totals {
  const plans = inputs.map((input) => preparePromptHost(input));
  return summarize(plans);
}

async function runWorkers(inputs: PromptInput[]): Promise<Totals> {
  const pool = createPool({ threads: THREADS })({ preparePrompt });
  try {
    const jobs: Promise<PromptPlan>[] = [];
    for (let i = 0; i < inputs.length; i++) {
      jobs.push(pool.call.preparePrompt(inputs[i]!));
    }
    const plans = await Promise.all(jobs);
    return summarize(plans);
  } finally {
    pool.shutdown();
  }
}

function percent(saved: number, base: number): string {
  if (base <= 0) return "0.0%";
  return `${((saved / base) * 100).toFixed(1)}%`;
}
async function main() {
  const inputs = new Array<PromptInput>(REQUESTS);
  for (let i = 0; i < REQUESTS; i++) inputs[i] = makeInput(i);

  const started = performance.now();
  const totals = MODE === "host" ? runHost(inputs) : await runWorkers(inputs);
  const finished = performance.now();

  const tookMs = finished - started;
  const secs = Math.max(1e-9, tookMs / 1000);
  const reqPerSec = REQUESTS / secs;
  const savedTokens = Math.max(0, totals.rawTokens - totals.budgetedTokens);
  const cacheableTokensEstimate = totals.staticTokens;

  console.log("Prompt token budgeting");
  console.log("mode :", MODE);
  console.log("model :", MODEL);
  console.log("threads :", MODE === "host" ? 0 : THREADS);
  console.log("requests :", REQUESTS.toLocaleString());
  console.log("maxInputTokens :", MAX_INPUT_TOKENS.toLocaleString());
  console.log("raw tokens :", totals.rawTokens.toLocaleString());
  console.log("budgeted tokens :", totals.budgetedTokens.toLocaleString());
  console.log(
    "saved tokens :",
    `${savedTokens.toLocaleString()} (${percent(savedTokens, totals.rawTokens)})`,
  );
  console.log("trimmed runs :", totals.trimmedRuns.toLocaleString());
  console.log("query trimmed runs:", totals.queryTrimmedRuns.toLocaleString());
  console.log("turns dropped :", totals.turnsDropped.toLocaleString());
  console.log("cacheable estimate:", cacheableTokensEstimate.toLocaleString());
  console.log("took :", tookMs.toFixed(2), "ms");
  console.log("throughput :", reqPerSec.toFixed(0), "req/s");
}

if (isMain) {
  main().catch((error) => {
    console.error(error);
    process.exitCode = 1;
  });
}

Token budgeting is a preflight step that runs on every LLM request. If you’re handling high-throughput chat traffic — multiple users, long conversation histories — the tokenization and trimming work adds up. Offloading it to workers keeps your main thread focused on routing and I/O while budget calculations happen in parallel. It also gives you predictable input sizes, which helps with cost control and latency.