LLM cost optimization · public preview

Cut your AI bill
without rebuilding
your stack.

Measure where AI spend goes, then apply approved optimizations with fallback and traceable savings.

Levers: 6
Integrations: 3
Pay from savings: 25%

Fig 01 · Request path · t = 42ms

A · client → B · proxy → C · provider

Section 02 · Mechanisms

Six optimization levers, clearly scoped.

Five levers run inline on the request path: routing, cache, trim, downshift, and prompt compression. Batching is a separate async workflow for non-urgent jobs. Each lever is auditable, individually togglable, and measured against the evidence appropriate to the workload.

01●

Smart Routing

Sends each request to the most cost-effective AI model that can do the job well, with the decision made per prompt at request time.

predicate → model tier
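
A minimal sketch of the "predicate → model tier" idea, not Varsten's implementation. The types, rule names, tier labels, and token threshold below are all assumptions made for illustration: rules are checked in order, the first matching predicate picks the tier, and anything unmatched stays on the most capable tier.

```typescript
// Hypothetical types and rules; none of these names come from the Varsten API.
type Tier = "small" | "medium" | "large";

interface RoutedRequest {
  taskType: string;
  promptTokens: number;
}

interface RoutingRule {
  name: string;
  matches: (req: RoutedRequest) => boolean;
  tier: Tier;
}

// Rules are evaluated in order; the first predicate that matches wins.
const routingRules: RoutingRule[] = [
  {
    name: "short-classification",
    matches: (r) => r.taskType.startsWith("classification.") && r.promptTokens < 2000,
    tier: "small",
  },
  {
    name: "summaries",
    matches: (r) => r.taskType.startsWith("summarize."),
    tier: "medium",
  },
];

// Unmatched requests fall back to the most capable tier, so quality never
// drops just because no rule was written for a task.
function pickTier(req: RoutedRequest, fallback: Tier = "large"): Tier {
  const rule = routingRules.find((r) => r.matches(req));
  return rule ? rule.tier : fallback;
}
```

Defaulting to the largest tier keeps the sketch conservative: a missing rule costs money, not quality.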

02●

Semantic Cache

Saves past AI answers and reuses them whenever a new request means the same thing, even if it's worded differently.

pgvector · TTL bound
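
To illustrate the lookup, here is a small in-memory sketch. It assumes embeddings arrive as plain number arrays and uses a linear scan where a real deployment would run a pgvector nearest-neighbour query; the class name, the 0.95 similarity threshold, and the 60-second TTL are illustrative assumptions.

```typescript
// Hypothetical in-memory stand-in for a vector store with TTL-bound entries.
interface CacheEntry {
  embedding: number[];
  answer: string;
  expiresAt: number; // epoch milliseconds
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class SemanticCacheSketch {
  private entries: CacheEntry[] = [];
  private threshold: number;
  private ttlMs: number;

  constructor(threshold = 0.95, ttlMs = 60_000) {
    this.threshold = threshold;
    this.ttlMs = ttlMs;
  }

  put(embedding: number[], answer: string, now: number = Date.now()): void {
    this.entries.push({ embedding, answer, expiresAt: now + this.ttlMs });
  }

  // Returns the closest unexpired answer at or above the threshold, if any.
  get(embedding: number[], now: number = Date.now()): string | undefined {
    this.entries = this.entries.filter((e) => e.expiresAt > now);
    let best: CacheEntry | undefined;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(entry.embedding, embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best?.answer;
  }
}
```

The threshold is the main tuning knob: set it too low and differently-meant prompts share an answer, too high and rewordings miss the cache.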

03●

Token Trim

Cleans up prompts in real time by stripping out repeated text, extra whitespace, and unnecessary history before sending.

structural · policy gated
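
A rough sketch of structural trimming, under assumed rules: collapse runs of spaces and blank lines, drop a message that exactly repeats the one before it, and keep only the most recent history turns. The function name and the `maxHistory` parameter are hypothetical, and which rules actually apply would be decided by policy.

```typescript
// Hypothetical message shape and trimming rules, for illustration only.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function trimMessages(messages: ChatMessage[], maxHistory: number): ChatMessage[] {
  const cleaned: ChatMessage[] = [];
  for (const message of messages) {
    // Collapse runs of spaces/tabs and squeeze 3+ newlines down to one blank line.
    const content = message.content
      .replace(/[ \t]+/g, " ")
      .replace(/\n{3,}/g, "\n\n")
      .trim();
    if (content.length === 0) continue;
    // Skip a message that exactly repeats the previous kept message.
    const previous = cleaned[cleaned.length - 1];
    if (previous && previous.role === message.role && previous.content === content) continue;
    cleaned.push({ role: message.role, content });
  }
  // System messages are always kept and moved to the front;
  // only the last `maxHistory` non-system turns survive.
  const system = cleaned.filter((m) => m.role === "system");
  const history = cleaned.filter((m) => m.role !== "system");
  const recent = maxHistory > 0 ? history.slice(-maxHistory) : [];
  return [...system, ...recent];
}
```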

04●

Prompt Compression

Rewrites long system instructions into permanently shorter versions to cut token costs, requiring human review before going live.

eval/replay · hash-matched substitution
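
One way to read "hash-matched substitution", sketched under assumptions: a reviewed compression is stored against the SHA-256 hash of the exact original prompt, and the shorter version is swapped in only when an incoming system prompt hashes to the same value. The class and method names are hypothetical.

```typescript
import { createHash } from "node:crypto";

const sha256Hex = (text: string): string =>
  createHash("sha256").update(text, "utf8").digest("hex");

// Hypothetical registry of human-approved compressions, keyed by prompt hash.
class ApprovedCompressions {
  private byHash = new Map<string, string>();

  // Called only after a reviewer signs off on the compressed wording.
  approve(original: string, compressed: string): void {
    this.byHash.set(sha256Hex(original), compressed);
  }

  // Exact-hash match substitutes; any edit to the original, even one
  // trailing space, changes the hash and leaves the prompt untouched.
  resolve(systemPrompt: string): string {
    return this.byHash.get(sha256Hex(systemPrompt)) ?? systemPrompt;
  }
}
```

Matching on the hash means a prompt that drifts after review silently falls back to its uncompressed form until someone approves a new version.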

05●

Model Downshift

Uses test history to safely move routine tasks to cheaper models, testing changes on a small scale first with instant rollback.

eval-driven · canary window
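
A canary window could look roughly like the following sketch. The policy shape, the model names, and the use of an FNV-1a hash for bucketing are assumptions: a fixed percentage of requests is moved to the cheaper model, assignment is deterministic per request key, and setting `enabled` to false rolls everything back at once.

```typescript
// Hypothetical policy shape; model names are placeholders.
interface DownshiftPolicy {
  from: string;
  to: string;
  canaryPercent: number; // 0..100
  enabled: boolean; // flip to false for immediate rollback
}

// FNV-1a hash mapped to a stable bucket in 0..99.
function canaryBucket(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % 100;
}

function chooseModel(requested: string, requestKey: string, policy: DownshiftPolicy): string {
  // Disabled policies and unrelated models pass through unchanged.
  if (!policy.enabled || requested !== policy.from) return requested;
  return canaryBucket(requestKey) < policy.canaryPercent ? policy.to : requested;
}
```

Hashing the request key, rather than rolling a random number, keeps a given key on the same side of the canary across retries, which makes before/after comparisons easier to trace.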

06●

Batching

Groups non-urgent requests together in the background to process them at heavily discounted batch rates.

async API · off-path
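
As a sketch of the off-path split, assuming each job carries an `urgent` flag (a hypothetical field): urgent jobs stay on the inline path, and everything else is grouped into fixed-size batches for later submission to a provider's batch endpoint.

```typescript
// Hypothetical job shape; `urgent` is an assumed flag, not a Varsten field.
interface Job {
  id: string;
  urgent: boolean;
}

function planBatches(jobs: Job[], maxBatchSize: number): { inline: Job[]; batches: Job[][] } {
  if (maxBatchSize < 1) throw new RangeError("maxBatchSize must be at least 1");
  const inline = jobs.filter((job) => job.urgent);
  const deferred = jobs.filter((job) => !job.urgent);
  const batches: Job[][] = [];
  // Chunk deferred jobs in arrival order.
  for (let i = 0; i < deferred.length; i += maxBatchSize) {
    batches.push(deferred.slice(i, i + maxBatchSize));
  }
  return { inline, batches };
}
```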

Section 03 · Integration

Match your security needs.

Three ways to connect. Each provides a different level of security and control. The SDK wrapper is recommended for production traffic, the base URL is fastest for evaluation, and the metadata-only path is strictest for sensitive workloads.

Recommended · Optimized + fail-open

Production SDK

The OpenAI wrapper sends healthy traffic through Varsten and falls back direct-to-provider on Varsten-origin failures before provider output starts. Your provider key stays local for fallback.

→ Direct provider fallback
→ Per-request metadata supported
→ Optimizations run where configured

example.ts · v0.1.0

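import { VarstenOpenAI, VarstenTrace } from "@varsten/openai";

// Both keys stay in your environment; the OpenAI key is used for direct fallback.
const client = new VarstenOpenAI({
  varstenApiKey: process.env.VARSTEN_API_KEY,
  openaiApiKey: process.env.OPENAI_API_KEY,
  // Called when a request bypasses Varsten and goes direct to the provider.
  onFallback: (event) => {
    console.warn("varsten fallback", event.reasonCode);
  },
});

const trace = new VarstenTrace();

// Placeholder conversation; replace with your own messages.
const messages = [{ role: "user" as const, content: "Where is my order?" }];

await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages,
  },
  {
    // Per-request metadata used to attribute spend and savings.
    varsten: trace.metadata({
      feature: "support_agent",
      taskType: "classification.intent",
      customerId: "cust_123",
    }),
  },
);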

Section 04 · Pricing

Verified savings, or you pay nothing.

Opportunity estimates are free. Paid savings use an accepted evidence method, such as direct avoided cost, holdback comparison, or approved replay, and applicable overhead is subtracted before fees are calculated.

Fee < Savings · always
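
The arithmetic can be sketched as follows. This is an illustration rather than a billing specification: the function and field names are made up, and the 25% rate is treated as the maximum, so an actual fee could be lower. Overhead comes off first, and if nothing is left there is no fee.

```typescript
// Hypothetical fee calculation; 0.25 reflects the stated 25% cap.
interface FeeBreakdown {
  verifiedSavingsUsd: number;
  feeUsd: number;
  netSavingsUsd: number;
}

function computeFee(grossSavingsUsd: number, overheadUsd: number, feeRate = 0.25): FeeBreakdown {
  // Overhead is subtracted before the fee; negative results count as zero savings.
  const verifiedSavingsUsd = Math.max(0, grossSavingsUsd - overheadUsd);
  const feeUsd = verifiedSavingsUsd * feeRate;
  return { verifiedSavingsUsd, feeUsd, netSavingsUsd: verifiedSavingsUsd - feeUsd };
}
```

For example, $1,000 of gross savings with $200 of overhead gives $800 of verified savings, a fee of at most $200, and at least $600 kept.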

Plan · 01 · Audit mode

Observe

Free · no credit card

Connect via Quick Eval or Metadata Only to audit your live traffic and map out estimated savings, with no behavior-changing optimization applied.

✓ Monitor AI spend
✓ 100k requests/month
✓ Savings recommendations
✓ Quick Eval or Metadata Only
✓ No credit card required

Start a free audit

Plan · 02 · Early access

Optimize

25% of verified savings

Unlocks the optimization engine: inline routing, cache, trim, compression, and downshift, plus async batching for eligible jobs. Pricing is capped at 25% of verified savings.

✓ Everything in Free, plus:
✓ Automated cost savings
✓ Unlimited requests/month
✓ Production-safe SDK integration
✓ Controls, guardrails, rollback

Request early access

Plan · 03 · Custom

Enterprise

For custom pricing that doesn't scale with your bill,
we negotiate a rate and fee cap.

Discuss an enterprise pilot
