
Monitor and control your LLM token spend


Across consultations on bringing AI into client products, the same concerns keep surfacing: unpredictable spend, lack of visibility into usage, and the risk of runaway bills from bugs or developers’ misuse. Clients ask for monitoring, budgets, alerts, and cost allocation at various levels of granularity — by project, feature, environment, region, and team — along with credible forecasts. To streamline those conversations, I decided to put the key guidance and patterns into this blog post.

What are tokens? #

Quick primer for non-technical readers — LLMs do not read text the way people do. Instead of letters or whole words, they process tokens: small chunks of text (think pieces of words, roughly like syllables). Every prompt you send and every reply you get uses tokens, which is what your usage and costs are based on.

You can think of an LLM as a professional speaker you hire to read and write. You pay not for the whole speech at once, but for each small piece of text they produce or consume.
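
To make this concrete, here is a minimal sketch that counts the tokens in a prompt with the tiktoken library and estimates the cost. The encoding name and the per-million-token price are illustrative assumptions; check your model's documentation and current price list.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by many recent OpenAI models;
# pick the encoding that matches the model you actually call.
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached quarterly report in three bullet points."
tokens = encoding.encode(prompt)

# Illustrative price only -- substitute your model's real input-token price.
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # USD, assumed

cost = len(tokens) / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS
print(f"{len(tokens)} tokens, estimated input cost ${cost:.6f}")
```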

Why do LLM providers limit your options? #

Many people are surprised by how little direct cost control the largest LLM providers give you. In this article I focus on the OpenAI Platform and Google AI Studio; similar patterns apply to AWS Bedrock and Azure AI Studio.

Can you limit spending in OpenAI Platform? In theory, yes. OpenAI uses a prepaid model, so you cannot spend more than the available balance. In practice, this creates a situation where, to ensure continuity of your services, someone with finance access must watch the account and top up the balance in real time.

Google AI Studio approaches this differently. The service is provided within Google Cloud and uses postpaid billing. You attach a card, and at the end of the month the amount is charged automatically. How much will you pay? You find out after the fact.

All major providers offer spend alerts. As the name suggests, these are only alerts — their goal is to inform you, not to stop the spend. There are well-documented cases where a service misconfigured by a user (through lack of knowledge or any other reason) racked up a $72,000 bill overnight despite the alerts: https://www.theregister.com/2020/12/10/google_cloud_over_run/

There is no “hard stop” budget you can set to block further usage when reached. Providers frame this as protecting your service availability. It also aligns with their revenue model.

Beyond billing, clients need to trace what tokens were used for. OpenAI and Google let you structure API keys by organization and project. OpenAI previously allowed a free-form “comment” header to tag usage, which was easy to bypass. Today, API requests must target a specific project; mismatches are rejected: https://help.openai.com/en/articles/9186755-managing-projects-in-the-api-platform
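
For illustration, this is how project scoping typically looks with the official OpenAI Python SDK: the client is pinned to a single project, so every request is attributed to it in the usage dashboard. The key and project ID below are placeholders.

```python
# pip install openai
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],  # project-scoped or user key
    project="proj_example123",             # placeholder project ID
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Ping"}],
)
# Per-request token usage, attributed to the project above.
print(response.usage)
```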

Gaps remain:

  • You cannot set per-project or per-key hard limits.
  • You cannot temporarily disable keys; you can only delete and rotate them.
  • Misconfigured access controls can let employees misuse accounts, and attribution can be unclear.

In short: you are responsible for proactive monitoring and control. You can probably already see the pattern: in theory you control your budget, but nothing actually stops you from overspending. The good news is that effective protection is achievable with the right setup; how robust it is depends on the resources you can invest.

Token monitoring requirements #

The initial assumptions were loosely stated in the introduction. Below are the requirements a practical setup should meet.

Must-haves:

  • Separate API keys with management at different levels (project, feature, region, environment), ideally with a hierarchy
  • Usage stats and logs for LLM calls, aggregated at those levels
  • Users with roles and service accounts
  • Budgets at multiple levels (project, feature, region, etc.)
  • Ability to opt out of provider training on your data
  • Provider-agnostic approach to avoid lock-in

Nice-to-haves:

  • Single sign-on (SSO) to avoid duplicate user management
  • A self-service portal so users can request or create keys within policy

Note: This post covers API access to LLMs, not chat interfaces. Providers usually do better on the chat UI side.

Solutions #

In this section I cover two options that work well for small and mid-size companies, plus a note on enterprise paths. First, here are the solution-specific details.

OpenRouter.ai #

OpenRouter provides a unified API for 400+ models across multiple providers, with no extra markup beyond the model’s listed price. For our use case, two features stand out:

  • Per-key budget limits
  • OpenAI-compatible API that lets you switch among many models without code changes

Because it is OpenAI-compatible, OpenRouter integrates quickly into most projects and client libraries. It is also widely supported by third-party tools. As a SaaS, there is no need to run your own infrastructure.
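
Because only the base URL, the key, and the model identifier change, switching an existing OpenAI-based application over is usually a matter of a few lines. A minimal sketch (the model name is one example from OpenRouter's catalog):

```python
# pip install openai
import os
from openai import OpenAI

# Same client library as before -- only base URL, key, and model change.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # OpenRouter model IDs include the provider prefix
    messages=[{"role": "user", "content": "Ping"}],
)
print(response.choices[0].message.content)
```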

Pros:

  • Easy way to start monitoring
  • Lets you switch between many models without code changes
  • No additional fees beyond model pricing
  • Can disable training on your data

Cons:

  • External internet service; not suitable if you require all traffic to stay within your network
  • Regular users can create API keys and change their spend limits, which weakens permission controls
  • No advanced budget policies beyond per-key limits
  • No built-in AI guardrails

OpenRouter does not meet every requirement, but it is still a solid proposition for starting your journey.

Note: OpenRouter allows you to bring your own provider API keys. If you have better pricing with a provider, you can use it and keep OpenRouter as a proxy.

LiteLLM #

LiteLLM Proxy is another highly recommended option. It gives you more ways to control token spend and define richer budget limits at the level of:

  • API key
  • Project or group

It introduces useful concepts that help control spend:

  • Users (admin, user, read-only) and service accounts
  • Groups
  • Projects
  • Virtual API keys

Together, these enable fine-grained control and monitoring at each level. You can also offer a self-service portal so users can create new tokens within policy. For example, a developer can create a key scoped to their project to check whether a new feature increases token usage in a test environment.
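
As a sketch of that self-service flow, the snippet below asks the proxy to mint a virtual key with a monthly budget. The field names follow LiteLLM's key-generation endpoint, but treat the exact payload, proxy address, and master key as assumptions to verify against your deployment.

```python
# pip install requests
import requests

LITELLM_PROXY_URL = "http://localhost:4000"  # assumed proxy address
MASTER_KEY = "sk-litellm-master-key"         # placeholder admin key

# Mint a virtual key limited to $25 per 30 days, tagged with project metadata.
resp = requests.post(
    f"{LITELLM_PROXY_URL}/key/generate",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "max_budget": 25.0,        # USD budget for this key (assumed field)
        "budget_duration": "30d",  # budget reset window (assumed field)
        "metadata": {"project": "checkout", "env": "test"},
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["key"])  # hand this virtual key to the developer
```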

LiteLLM, like OpenRouter, is OpenAI API–compatible, so application changes are minimal.
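
In the application itself, the change again comes down to the base URL and the key: the virtual key issued by the proxy replaces the provider key, and spend is tracked against its budget. The proxy address below is an assumption.

```python
# pip install openai
import os
from openai import OpenAI

# Only the endpoint and the key change -- the rest of the application stays as-is.
client = OpenAI(
    base_url="http://localhost:4000",           # assumed LiteLLM proxy address
    api_key=os.environ["LITELLM_VIRTUAL_KEY"],  # virtual key minted by the proxy
)
```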

Beyond the core requirements, LiteLLM includes features that will help both early on and as you scale. You can enable guardrails to reduce prompt injection and check responses for signs of hallucination by adding validation and policy checks at the proxy. Observability is also simpler: you can attach Langfuse at the proxy layer, so you get tracing and analytics without changing application code. These extras make LiteLLM useful not only for the first phase of adoption but also for long-term governance and quality.

One important note: LiteLLM Proxy is self-hosted (you run it and provide a domain) or available as an enterprise offering (details are limited publicly). Depending on your needs, self-hosting can be a benefit or a blocker.

Pros:

  • Flexible budget constraints and policies
  • Optional guardrails (e.g., prompt injection defenses and hallucination checks)
  • Proxy-level observability: native Langfuse integration without application changes
  • Support for locally hosted LLMs, still exposed via the OpenAI API format
  • Limits can be defined per time window (day, week, month, etc.)

Cons:

  • Unlike OpenRouter, you must bring and manage provider access and infrastructure
  • SSO is available only with the enterprise license

Gray area:

  • Self-hosted: a pro or a con depending on your company’s capabilities

Other solutions #

The two options above are my starting recommendations. Depending on your needs, you may consider other paths. They often require more configuration and may not include an admin UI.

For enterprise clients, consider setting up Kong AI Gateway (https://konghq.com/products/kong-ai-gateway) or building a custom solution on Istio. These can be tailored to project needs but come with higher complexity and ownership.

Conclusion and next steps #

We looked at two recommended paths to control and monitor LLM spend: OpenRouter and LiteLLM. Each offers a different balance of speed, control, and ownership. OpenRouter is a hosted, OpenAI‑compatible drop-in that helps you move fast with per-key budgets and broad model choice. LiteLLM is a self-hosted proxy with richer policies, guardrails, and easy observability, which pays off as your usage grows.

Both approaches can succeed, and the better choice depends on your needs—how quickly you must ship, your security and compliance requirements, where you want to run infrastructure, and how much control you need over policies and analytics. Many teams start with OpenRouter for quick wins and later adopt LiteLLM for deeper governance without large application changes.

If you’d like help deciding, I offer consultations to assess your context, choose the right option, and design a practical rollout—budgets, access, logging, and guardrails included. Contact me to discuss your requirements and plan a path that gives you predictable costs and confident scaling.