
name: ai-infrastructure-litellm
description: LiteLLM proxy server setup, TypeScript client patterns via OpenAI SDK, model routing, fallbacks, load balancing, spend tracking, virtual keys, and production deployment

LiteLLM Proxy Patterns

Quick Guide: LiteLLM is an OpenAI-compatible proxy (AI gateway) that routes requests to 100+ LLM providers. TypeScript clients connect via the standard OpenAI SDK with baseURL pointed at the proxy. Configure models, fallbacks, load balancing, and budgets in config.yaml. Use provider/model-name format in litellm_params.model (e.g., anthropic/claude-sonnet-4-20250514). The model_name in config is the user-facing alias clients request. Virtual keys require PostgreSQL. Master key must start with sk-.


<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST use the provider/model-name format in litellm_params.model -- e.g., anthropic/claude-sonnet-4-20250514, openai/gpt-4o, azure/my-deployment -- the provider prefix is how LiteLLM routes to the correct API)

(You MUST set model_name as the user-facing alias that clients request -- this is NOT the provider model ID, it is the name your TypeScript client passes as model)

(You MUST point the OpenAI SDK baseURL at the proxy URL (e.g., http://localhost:4000) and pass the proxy key as apiKey -- do NOT use provider API keys directly in client code)

(You MUST start master keys with sk- -- LiteLLM rejects master keys that do not follow this prefix convention)

(You MUST configure database_url pointing to PostgreSQL before using virtual keys, spend tracking, or team/user management -- these features require persistent storage)

</critical_requirements>


Auto-detection: LiteLLM, litellm, litellm_params, litellm_settings, LLM proxy, LLM gateway, model_list, master_key, virtual keys, model fallback, load balancing LLM, provider/model, anthropic/claude, openai/gpt, azure/, litellm --config, LITELLM_MASTER_KEY, LITELLM_SALT_KEY

When to use:

  • Running a unified LLM gateway that routes to multiple providers (OpenAI, Anthropic, Azure, Bedrock, etc.)
  • Configuring model fallbacks, load balancing, or routing strategies across deployments
  • Managing API key access with virtual keys, per-key budgets, and rate limits
  • Tracking spend across models, teams, users, and tags
  • Deploying a self-hosted OpenAI-compatible proxy with Docker

Key patterns covered:

  • Proxy server config.yaml structure (model_list, litellm_settings, router_settings, general_settings)
  • TypeScript client setup via OpenAI SDK pointed at proxy
  • Model routing with provider prefixes and user-facing aliases
  • Fallback chains (regular, context window, content policy, default)
  • Load balancing strategies (simple-shuffle, least-busy, usage-based, latency-based, cost-based)
  • Virtual keys with budgets, rate limits, and model restrictions
  • Spend tracking per key, user, team, and tag
  • Docker Compose production deployment

When NOT to use:

  • Calling a single LLM provider directly with no proxy layer -- use the provider's SDK directly
  • Building a Python application that calls LiteLLM as a library -- this skill covers the proxy server + TypeScript client pattern
  • When you need framework-specific chat UI hooks -- use a framework-integrated AI SDK

Examples Index

  • examples/core.md -- complete config with general_settings, Docker setup, streaming, metadata tagging
  • examples/routing.md -- content policy fallbacks, all five routing strategies, priority routing with order, combining fallbacks with load balancing
  • examples/keys-and-spend.md -- team management, spend queries by key, user, team, and tag, rate limit tiers


<philosophy>

Philosophy

LiteLLM Proxy is an AI gateway -- a single OpenAI-compatible endpoint that routes to 100+ LLM providers. TypeScript applications never talk to providers directly; they talk to the proxy using the standard OpenAI SDK.

Core principles:

  1. Provider abstraction -- Client code uses a single baseURL and standard OpenAI SDK. Switching providers means changing config.yaml, not application code.
  2. Two-layer naming -- model_name is what clients request (e.g., "claude-sonnet"). litellm_params.model is the actual provider routing (e.g., "anthropic/claude-sonnet-4-20250514"). This decouples client code from provider specifics.
  3. Resilience via config -- Fallbacks, retries, load balancing, and cooldowns are all declared in config.yaml. No application-level retry logic needed.
  4. Spend governance -- Virtual keys, per-key budgets, rate limits, and tag-based tracking give fine-grained cost control without changing client code.
  5. OpenAI compatibility -- Any client, SDK, or tool that works with OpenAI's API works with LiteLLM. No custom SDK required.
</philosophy>
<patterns>

Core Patterns

Pattern 1: Minimal config.yaml

The proxy needs a config.yaml with at least one model defined. model_name is client-facing; litellm_params.model is the provider route.

# config.yaml
model_list:
  - model_name: claude-sonnet # What clients request
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514 # Provider/model route
      api_key: os.environ/ANTHROPIC_API_KEY # Never hardcode keys

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

Why good: Two-layer naming decouples clients from providers; the os.environ/ syntax reads secrets from the environment at runtime

# BAD: Missing provider prefix, hardcoded key
model_list:
  - model_name: claude-sonnet-4-20250514 # Using provider model ID as name
    litellm_params:
      model: claude-sonnet-4-20250514 # No provider prefix -- routing fails
      api_key: sk-ant-abc123 # Hardcoded API key

Why bad: Without anthropic/ prefix, LiteLLM cannot route to the correct provider; hardcoded keys are a security risk; using the provider model ID as model_name couples clients to provider naming
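
A quick way to confirm the proxy picked up the config is to list the aliases it exposes. A minimal sketch via the OpenAI SDK, assuming the proxy serves the OpenAI-compatible models endpoint on localhost:4000 so that client.models.list() returns the model_name aliases:

// verify-models.ts -- confirm the proxy sees the configured aliases
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: process.env.LITELLM_API_KEY,
});

const models = await client.models.list();
for (const model of models.data) {
  console.log(model.id); // Expect "claude-sonnet", "gpt-4o"
}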

See: examples/core.md for complete config with general_settings, Docker setup


Pattern 2: TypeScript Client via OpenAI SDK

Connect to the proxy using the standard OpenAI SDK. Point baseURL at the proxy, use the proxy key as apiKey.

// lib/llm-client.ts
import OpenAI from "openai";

const PROXY_URL = "http://localhost:4000";

const client = new OpenAI({
  baseURL: PROXY_URL,
  apiKey: process.env.LITELLM_API_KEY, // Virtual key or master key
});

export { client };

// usage.ts
import { client } from "./lib/llm-client.js";

const completion = await client.chat.completions.create({
  model: "claude-sonnet", // model_name from config.yaml, NOT provider model ID
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain TypeScript generics." },
  ],
});

console.log(completion.choices[0].message.content);

Why good: Standard OpenAI SDK, no custom dependencies; model name matches config.yaml model_name; proxy key keeps provider keys server-side

// BAD: Using provider model ID, provider API key
const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: process.env.ANTHROPIC_API_KEY, // Wrong -- use proxy key
});

const completion = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-20250514", // Wrong -- use model_name alias
  messages: [{ role: "user", content: "Hello" }],
});

Why bad: Provider API key bypasses proxy auth and virtual key controls; using provider model ID instead of alias couples client to provider naming and bypasses proxy routing logic
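
Streaming works through the same client with the standard stream: true flag -- as noted in the gotchas below, the proxy passes streams through transparently. A minimal sketch reusing the client from above:

// streaming.ts
import { client } from "./lib/llm-client.js";

const stream = await client.chat.completions.create({
  model: "claude-sonnet",
  messages: [{ role: "user", content: "Explain TypeScript generics." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}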

See: examples/core.md for streaming, metadata tagging


Pattern 3: Fallback Chains

Configure model fallbacks so requests automatically retry on a different model when the primary fails.

# config.yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY

litellm_settings:
  num_retries: 2 # Retries per model before fallback
  fallbacks: [{ "claude-sonnet": ["gpt-4o"] }] # General fallback chain
  context_window_fallbacks: [{ "gpt-4o": ["claude-sonnet"] }] # Context overflow fallback
  default_fallbacks: ["gpt-4o"] # Catch-all for any model failure

Why good: Fallbacks use model_name aliases (not provider IDs); ordered chains are tried sequentially; separate chains handle context overflow vs. general errors
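
Fallbacks are invisible to client code -- the request targets claude-sonnet either way. As a hedged sketch, one way to see whether a fallback fired is to inspect the response's model field, assuming the proxy echoes the model that actually served the request:

// Reusing the client from Pattern 2
const completion = await client.chat.completions.create({
  model: "claude-sonnet", // Primary; proxy retries, then falls back to gpt-4o
  messages: [{ role: "user", content: "Hello" }],
});

// Assumption: the serving model is echoed back, so a fallback shows up here
console.log(completion.model);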

See: examples/routing.md for content policy fallbacks, combining with load balancing


Pattern 4: Load Balancing Across Deployments

Multiple entries with the same model_name create a load-balanced group. The proxy distributes requests using the configured strategy.

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-eastus
      api_base: https://eastus.openai.azure.com/
      api_key: os.environ/AZURE_EASTUS_KEY
      rpm: 100 # Requests per minute for this deployment

  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-westus
      api_base: https://westus.openai.azure.com/
      api_key: os.environ/AZURE_WESTUS_KEY
      rpm: 100

router_settings:
  routing_strategy: usage-based-routing # Route to deployment with lowest RPM/TPM usage
  num_retries: 2
  timeout: 30

Why good: The same model_name across entries creates automatic load balancing; per-deployment rpm/tpm limits enable usage-aware routing
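
Client code does not change per strategy: every request targets the shared gpt-4o alias, and the proxy picks a deployment. A small sketch exercising the balanced group with concurrent requests (reusing the client from Pattern 2):

const PROMPTS = ["Summarize doc A", "Summarize doc B", "Summarize doc C"];

// Each request is routed independently; deployment details stay server-side
const completions = await Promise.all(
  PROMPTS.map((content) =>
    client.chat.completions.create({
      model: "gpt-4o", // The load-balanced model_name group
      messages: [{ role: "user", content }],
    }),
  ),
);

console.log(completions.map((c) => c.choices[0].message.content));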

See: examples/routing.md for all five routing strategies, priority routing with order


Pattern 5: Virtual Keys with Budgets

Virtual keys let you distribute access with per-key budgets, rate limits, and model restrictions. Requires PostgreSQL.

# config.yaml
general_settings:
  master_key: sk-litellm-master-key-change-me # Must start with sk-
  database_url: os.environ/DATABASE_URL # PostgreSQL required

# Generate a virtual key via API
curl 'http://localhost:4000/key/generate' \
  -H 'Authorization: Bearer sk-litellm-master-key-change-me' \
  -H 'Content-Type: application/json' \
  -d '{
    "models": ["claude-sonnet", "gpt-4o"],
    "max_budget": 50.0,
    "duration": "30d",
    "metadata": {"team": "backend", "project": "search"}
  }'
# Returns: { "key": "sk-generated-key-abc123", ... }

Why good: Per-key model restrictions, budget caps, and expiry; metadata enables tag-based spend tracking; master key authentication protects key generation
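
The same call from TypeScript, as a sketch mirroring the curl example above (the response contains more fields than the key shown here):

// generate-key.ts -- mint a scoped virtual key via the proxy API
const response = await fetch("http://localhost:4000/key/generate", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.LITELLM_MASTER_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    models: ["claude-sonnet", "gpt-4o"],
    max_budget: 50.0,
    duration: "30d",
    metadata: { team: "backend", project: "search" },
  }),
});

const { key } = (await response.json()) as { key: string };
console.log(key); // Distribute this to the team, not the master key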

See: examples/keys-and-spend.md for team management, spend queries, rate limit tiers


Pattern 6: Spend Tracking with Tags

Attach metadata tags to requests for granular cost attribution. The proxy tracks spend automatically per key, user, team, and tag.

// Tag requests for cost attribution
const completion = await client.chat.completions.create({
  model: "claude-sonnet",
  messages: [{ role: "user", content: "Summarize this document." }],
  // LiteLLM-specific: pass metadata for spend tracking
  metadata: {
    tags: ["project:search", "team:backend"],
    trace_user_id: "user-123",
  },
} as any); // metadata is a LiteLLM extension, not in OpenAI types

Why good: Tags enable cost attribution by project, team, or feature without changing model routing; the per-request cost appears in the x-litellm-response-cost response header
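
To read that header programmatically, the OpenAI SDK's .withResponse() helper exposes the raw HTTP response alongside the parsed completion (a sketch; header name as noted above):

const { data, response } = await client.chat.completions
  .create({
    model: "claude-sonnet",
    messages: [{ role: "user", content: "Summarize this document." }],
  })
  .withResponse();

// Per-request cost as reported by the proxy
console.log(response.headers.get("x-litellm-response-cost"));
console.log(data.choices[0].message.content);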

When to use: When you need cost visibility across teams, projects, or features

See: examples/keys-and-spend.md for querying spend by tag, user, and team

</patterns>

<decision_framework>

Decision Framework

Do You Need a Proxy?

Do you call multiple LLM providers?
+-- YES -> LiteLLM Proxy adds value (unified API, routing, fallbacks)
+-- NO -> Do you need budgets, rate limits, or virtual keys?
    +-- YES -> LiteLLM Proxy (governance layer)
    +-- NO -> Do you need fallbacks or load balancing?
        +-- YES -> LiteLLM Proxy (reliability layer)
        +-- NO -> Use the provider SDK directly (simpler)

Which Routing Strategy?

What is your priority?
+-- Even distribution       -> simple-shuffle (default)
+-- Minimize latency        -> latency-based-routing
+-- Respect rate limits     -> usage-based-routing
+-- Minimize cost           -> cost-based-routing
+-- Handle concurrent load  -> least-busy

Virtual Keys vs Master Key Only

Do you have multiple teams or users?
+-- YES -> Virtual keys (per-team budgets, model restrictions)
|   Requires: PostgreSQL database
+-- NO -> Do you need spend tracking?
    +-- YES -> Virtual keys (even for single user, enables spend logs)
    |   Requires: PostgreSQL database
    +-- NO -> Master key only (simplest setup, no database needed)

</decision_framework>


<red_flags>

RED FLAGS

High Priority Issues:

  • Missing provider prefix in litellm_params.model (e.g., claude-sonnet-4-20250514 instead of anthropic/claude-sonnet-4-20250514) -- proxy cannot route without the prefix
  • Hardcoding provider API keys in config.yaml instead of using os.environ/VAR_NAME -- security breach risk
  • Using provider model IDs as model_name -- couples all clients to provider naming, breaks when you switch providers
  • Master key not starting with sk- -- LiteLLM silently rejects it
  • Using virtual keys without a PostgreSQL database_url -- key generation fails

Medium Priority Issues:

  • Not setting num_retries in litellm_settings -- defaults to 0, no retries on transient failures
  • Confusing model_name (client-facing alias) with litellm_params.model (provider route) -- most common config mistake
  • Not setting rpm/tpm on deployments when using usage-based-routing -- routing strategy has no data to work with
  • Missing LITELLM_SALT_KEY in production -- virtual key credentials stored without encryption

Common Mistakes:

  • Passing anthropic/claude-sonnet-4-20250514 as the model parameter in TypeScript client code -- use the model_name alias instead
  • Expecting metadata field to be typed in OpenAI SDK -- it is a LiteLLM extension, requires as any or extra_body
  • Setting fallbacks using provider model IDs instead of model_name aliases -- fallbacks reference model names, not provider routes
  • Forgetting that config.yaml changes require a proxy restart (or a call to the /config/update API endpoint)

Gotchas & Edge Cases:

  • The os.environ/ syntax in config.yaml (no $ prefix) is LiteLLM-specific -- not standard YAML environment variable substitution
  • model_name matching is exact -- "claude-sonnet" and "Claude-Sonnet" are different models
  • default_fallbacks do NOT apply to ContentPolicyViolationError or ContextWindowExceededError -- use the specialized fallback types for those
  • The proxy adds a network hop -- expect 5-20ms additional latency compared to direct provider calls
  • rpm/tpm limits in config are per-deployment, not per-model-group -- a model group with 3 deployments at rpm: 100 each gets 300 RPM total
  • Virtual key spend tracking is eventually consistent -- the spend field on a key may lag a few seconds behind actual usage
  • The /v1/ prefix on endpoints is optional -- both http://localhost:4000/chat/completions and http://localhost:4000/v1/chat/completions work
  • Streaming through the proxy works transparently -- no special configuration needed on the proxy side
  • The LiteLLM admin UI is available at http://localhost:4000/ui when the proxy is running

</red_flags>


<critical_reminders>

CRITICAL REMINDERS

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST use the provider/model-name format in litellm_params.model -- e.g., anthropic/claude-sonnet-4-20250514, openai/gpt-4o, azure/my-deployment -- the provider prefix is how LiteLLM routes to the correct API)

(You MUST set model_name as the user-facing alias that clients request -- this is NOT the provider model ID, it is the name your TypeScript client passes as model)

(You MUST point the OpenAI SDK baseURL at the proxy URL (e.g., http://localhost:4000) and pass the proxy key as apiKey -- do NOT use provider API keys directly in client code)

(You MUST start master keys with sk- -- LiteLLM rejects master keys that do not follow this prefix convention)

(You MUST configure database_url pointing to PostgreSQL before using virtual keys, spend tracking, or team/user management -- these features require persistent storage)

Failure to follow these rules will produce misconfigured proxies with broken routing, security issues, or missing spend data.

</critical_reminders>