ai-infrastructure-litellm
name: ai-infrastructure-litellm
description: LiteLLM proxy server setup, TypeScript client patterns via OpenAI SDK, model routing, fallbacks, load balancing, spend tracking, virtual keys, and production deployment
LiteLLM Proxy Patterns
Quick Guide: LiteLLM is an OpenAI-compatible proxy (AI gateway) that routes requests to 100+ LLM providers. TypeScript clients connect via the standard OpenAI SDK with baseURL pointed at the proxy. Configure models, fallbacks, load balancing, and budgets in config.yaml. Use the provider/model-name format in litellm_params.model (e.g., anthropic/claude-sonnet-4-20250514). The model_name in config is the user-facing alias clients request. Virtual keys require PostgreSQL. The master key must start with sk-.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST use the provider/model-name format in litellm_params.model -- e.g., anthropic/claude-sonnet-4-20250514, openai/gpt-4o, azure/my-deployment -- the provider prefix is how LiteLLM routes to the correct API)
(You MUST set model_name as the user-facing alias that clients request -- this is NOT the provider model ID, it is the name your TypeScript client passes as model)
(You MUST point the OpenAI SDK baseURL at the proxy URL (e.g., http://localhost:4000) and pass the proxy key as apiKey -- do NOT use provider API keys directly in client code)
(You MUST start master keys with sk- -- LiteLLM rejects master keys that do not follow this prefix convention)
(You MUST configure database_url pointing to PostgreSQL before using virtual keys, spend tracking, or team/user management -- these features require persistent storage)
</critical_requirements>
Auto-detection: LiteLLM, litellm, litellm_params, litellm_settings, LLM proxy, LLM gateway, model_list, master_key, virtual keys, model fallback, load balancing LLM, provider/model, anthropic/claude, openai/gpt, azure/, litellm --config, LITELLM_MASTER_KEY, LITELLM_SALT_KEY
When to use:
- Running a unified LLM gateway that routes to multiple providers (OpenAI, Anthropic, Azure, Bedrock, etc.)
- Configuring model fallbacks, load balancing, or routing strategies across deployments
- Managing API key access with virtual keys, per-key budgets, and rate limits
- Tracking spend across models, teams, users, and tags
- Deploying a self-hosted OpenAI-compatible proxy with Docker
Key patterns covered:
- Proxy server config.yaml structure (model_list, litellm_settings, router_settings, general_settings)
- TypeScript client setup via OpenAI SDK pointed at proxy
- Model routing with provider prefixes and user-facing aliases
- Fallback chains (regular, context window, content policy, default)
- Load balancing strategies (simple-shuffle, least-busy, usage-based, latency-based, cost-based)
- Virtual keys with budgets, rate limits, and model restrictions
- Spend tracking per key, user, team, and tag
- Docker Compose production deployment
When NOT to use:
- Calling a single LLM provider directly with no proxy layer -- use the provider's SDK directly
- Building a Python application that calls LiteLLM as a library -- this skill covers the proxy server + TypeScript client pattern
- When you need framework-specific chat UI hooks -- use a framework-integrated AI SDK
Examples Index
- Core: Config & Client Setup -- config.yaml structure, TypeScript OpenAI SDK client, model routing, Docker deployment
- Routing & Reliability -- Fallbacks, load balancing, cooldowns, retries, priority routing
- Keys & Spend -- Virtual keys, budgets, rate limits, spend tracking, team management
<philosophy>
Philosophy
LiteLLM Proxy is an AI gateway -- a single OpenAI-compatible endpoint that routes to 100+ LLM providers. TypeScript applications never talk to providers directly; they talk to the proxy using the standard OpenAI SDK.
Core principles:
- Provider abstraction -- Client code uses a single baseURL and the standard OpenAI SDK. Switching providers means changing config.yaml, not application code.
- Two-layer naming -- model_name is what clients request (e.g., "claude-sonnet"). litellm_params.model is the actual provider routing (e.g., "anthropic/claude-sonnet-4-20250514"). This decouples client code from provider specifics.
- Resilience via config -- Fallbacks, retries, load balancing, and cooldowns are all declared in config.yaml. No application-level retry logic needed.
- Spend governance -- Virtual keys, per-key budgets, rate limits, and tag-based tracking give fine-grained cost control without changing client code.
- OpenAI compatibility -- Any client, SDK, or tool that works with OpenAI's API works with LiteLLM. No custom SDK required.
</philosophy>
<patterns>
Core Patterns
Pattern 1: Minimal config.yaml
The proxy needs a config.yaml with at least one model defined. model_name is client-facing; litellm_params.model is the provider route.
# config.yaml
model_list:
  - model_name: claude-sonnet # What clients request
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514 # Provider/model route
      api_key: os.environ/ANTHROPIC_API_KEY # Never hardcode keys
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
Why good: Two-layer naming decouples clients from providers, os.environ/ syntax reads secrets from environment at runtime
# BAD: Missing provider prefix, hardcoded key
model_list:
  - model_name: claude-sonnet-4-20250514 # Using provider model ID as name
    litellm_params:
      model: claude-sonnet-4-20250514 # No provider prefix -- routing fails
      api_key: sk-ant-abc123 # Hardcoded API key
Why bad: Without anthropic/ prefix, LiteLLM cannot route to the correct provider; hardcoded keys are a security risk; using the provider model ID as model_name couples clients to provider naming
See: examples/core.md for complete config with general_settings, Docker setup
Pattern 2: TypeScript Client via OpenAI SDK
Connect to the proxy using the standard OpenAI SDK. Point baseURL at the proxy, use the proxy key as apiKey.
// lib/llm-client.ts
import OpenAI from "openai";
const PROXY_URL = "http://localhost:4000";
const client = new OpenAI({
  baseURL: PROXY_URL,
  apiKey: process.env.LITELLM_API_KEY, // Virtual key or master key
});
export { client };
// usage.ts
import { client } from "./lib/llm-client.js";
const completion = await client.chat.completions.create({
  model: "claude-sonnet", // model_name from config.yaml, NOT provider model ID
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain TypeScript generics." },
  ],
});
console.log(completion.choices[0].message.content);
Why good: Standard OpenAI SDK, no custom dependencies; model name matches config.yaml model_name; proxy key keeps provider keys server-side
// BAD: Using provider model ID, provider API key
const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: process.env.ANTHROPIC_API_KEY, // Wrong -- use proxy key
});
const completion = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-20250514", // Wrong -- use model_name alias
  messages: [{ role: "user", content: "Hello" }],
});
Why bad: Provider API key bypasses proxy auth and virtual key controls; using provider model ID instead of alias couples client to provider naming and bypasses proxy routing logic
See: examples/core.md for streaming, metadata tagging
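Streaming needs no extra proxy configuration; the standard OpenAI SDK stream flag works through the proxy as-is. A minimal sketch, reusing the client and the claude-sonnet alias from above (the file name is illustrative):

```typescript
// streaming.ts -- illustrative sketch of streaming through the proxy
import { client } from "./lib/llm-client.js";

const stream = await client.chat.completions.create({
  model: "claude-sonnet", // model_name alias from config.yaml
  messages: [{ role: "user", content: "Explain TypeScript generics." }],
  stream: true, // proxy forwards provider stream chunks transparently
});

for await (const chunk of stream) {
  // Each chunk follows the OpenAI streaming delta format
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```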
Pattern 3: Fallback Chains
Configure model fallbacks so requests automatically retry on a different model when the primary fails.
# config.yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
litellm_settings:
  num_retries: 2 # Retries per model before fallback
  fallbacks: [{ "claude-sonnet": ["gpt-4o"] }] # General fallback chain
  context_window_fallbacks: [{ "gpt-4o": ["claude-sonnet"] }] # Context overflow fallback
  default_fallbacks: ["gpt-4o"] # Catch-all for any model failure
Why good: Fallbacks use model_name aliases (not provider IDs), ordered chains tried sequentially, separate chains for context overflow vs general errors
See: examples/routing.md for content policy fallbacks, combining with load balancing
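From the client's side, fallbacks are invisible: the proxy retries and reroutes internally, and the SDK only surfaces an error once the whole chain is exhausted. A sketch of handling that terminal failure using the standard OpenAI SDK error type (nothing here is LiteLLM-specific):

```typescript
// fallback-aware-call.ts -- illustrative sketch of handling a fully exhausted chain
import OpenAI from "openai";
import { client } from "./lib/llm-client.js";

try {
  const completion = await client.chat.completions.create({
    model: "claude-sonnet", // proxy may transparently serve this via gpt-4o per fallbacks
    messages: [{ role: "user", content: "Summarize the release notes." }],
  });
  console.log(completion.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.APIError) {
    // Surfaced only after the proxy's retries and configured fallbacks have failed
    console.error(`LLM call failed after fallbacks: ${error.status} ${error.message}`);
  }
  throw error;
}
```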
Pattern 4: Load Balancing Across Deployments
Multiple entries with the same model_name create a load-balanced group. The proxy distributes requests using the configured strategy.
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-eastus
      api_base: https://eastus.openai.azure.com/
      api_key: os.environ/AZURE_EASTUS_KEY
      rpm: 100 # Requests per minute for this deployment
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-westus
      api_base: https://westus.openai.azure.com/
      api_key: os.environ/AZURE_WESTUS_KEY
      rpm: 100
router_settings:
  routing_strategy: usage-based-routing # Route to deployment with lowest RPM/TPM usage
  num_retries: 2
  timeout: 30
Why good: Same model_name across entries creates automatic load balancing, rpm/tpm limits per deployment enable usage-aware routing
See: examples/routing.md for all five routing strategies, priority routing with order
Pattern 5: Virtual Keys with Budgets
Virtual keys let you distribute access with per-key budgets, rate limits, and model restrictions. Requires PostgreSQL.
# config.yaml
general_settings:
  master_key: sk-litellm-master-key-change-me # Must start with sk-
  database_url: os.environ/DATABASE_URL # PostgreSQL required
# Generate a virtual key via API
curl 'http://localhost:4000/key/generate' \
  -H 'Authorization: Bearer sk-litellm-master-key-change-me' \
  -H 'Content-Type: application/json' \
  -d '{
    "models": ["claude-sonnet", "gpt-4o"],
    "max_budget": 50.0,
    "duration": "30d",
    "metadata": {"team": "backend", "project": "search"}
  }'
# Returns: { "key": "sk-generated-key-abc123", ... }
Why good: Per-key model restrictions, budget caps, and expiry; metadata enables tag-based spend tracking; master key authentication protects key generation
See: examples/keys-and-spend.md for team management, spend queries, rate limit tiers
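The same /key/generate call can be scripted from TypeScript, for example in a provisioning script. This sketch simply mirrors the curl request above using fetch; the proxy URL constant and the LITELLM_MASTER_KEY environment variable name follow this skill's conventions but are otherwise assumptions:

```typescript
// generate-key.ts -- provisioning sketch mirroring the curl example above
const PROXY_URL = "http://localhost:4000";

const response = await fetch(`${PROXY_URL}/key/generate`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.LITELLM_MASTER_KEY}`, // master key, never a provider key
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    models: ["claude-sonnet", "gpt-4o"],
    max_budget: 50.0,
    duration: "30d",
    metadata: { team: "backend", project: "search" },
  }),
});

const { key } = (await response.json()) as { key: string };
console.log(`Virtual key: ${key}`); // e.g. sk-generated-key-abc123
```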
Pattern 6: Spend Tracking with Tags
Attach metadata tags to requests for granular cost attribution. The proxy tracks spend automatically per key, user, team, and tag.
// Tag requests for cost attribution
const completion = await client.chat.completions.create({
  model: "claude-sonnet",
  messages: [{ role: "user", content: "Summarize this document." }],
  // LiteLLM-specific: pass metadata for spend tracking
  metadata: {
    tags: ["project:search", "team:backend"],
    trace_user_id: "user-123",
  },
} as any); // metadata is a LiteLLM extension, not in OpenAI types
Why good: Tags enable cost attribution by project, team, or feature without changing model routing; the per-request cost appears in the x-litellm-response-cost response header
When to use: When you need cost visibility across teams, projects, or features
See: examples/keys-and-spend.md for querying spend by tag, user, and team
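To read the x-litellm-response-cost header from TypeScript, the OpenAI SDK's withResponse() helper exposes the raw HTTP response alongside the parsed completion. A sketch, under the assumption that your proxy version emits the header for this request:

```typescript
// cost-tracking.ts -- sketch for reading the per-request cost header
import { client } from "./lib/llm-client.js";

const { data: completion, response } = await client.chat.completions
  .create({
    model: "claude-sonnet",
    messages: [{ role: "user", content: "Summarize this document." }],
    metadata: { tags: ["project:search", "team:backend"] },
  } as any) // metadata is a LiteLLM extension, not in OpenAI types
  .withResponse();

// Header availability depends on proxy version and configuration
const cost = response.headers.get("x-litellm-response-cost");
console.log(`Response cost (USD): ${cost ?? "not reported"}`);
console.log(completion.choices[0].message.content);
```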
</patterns>
<decision_framework>
Decision Framework
Do You Need a Proxy?
Do you call multiple LLM providers?
+-- YES -> LiteLLM Proxy adds value (unified API, routing, fallbacks)
+-- NO -> Do you need budgets, rate limits, or virtual keys?
    +-- YES -> LiteLLM Proxy (governance layer)
    +-- NO -> Do you need fallbacks or load balancing?
        +-- YES -> LiteLLM Proxy (reliability layer)
        +-- NO -> Use the provider SDK directly (simpler)
Which Routing Strategy?
What is your priority?
+-- Even distribution -> simple-shuffle (default)
+-- Minimize latency -> latency-based-routing
+-- Respect rate limits -> usage-based-routing
+-- Minimize cost -> cost-based-routing
+-- Handle concurrent load -> least-busy
Virtual Keys vs Master Key Only
Do you have multiple teams or users?
+-- YES -> Virtual keys (per-team budgets, model restrictions)
|         Requires: PostgreSQL database
+-- NO -> Do you need spend tracking?
    +-- YES -> Virtual keys (even for single user, enables spend logs)
    |         Requires: PostgreSQL database
    +-- NO -> Master key only (simplest setup, no database needed)
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Missing provider prefix in litellm_params.model (e.g., claude-sonnet-4-20250514 instead of anthropic/claude-sonnet-4-20250514) -- proxy cannot route without the prefix
- Hardcoding provider API keys in config.yaml instead of using os.environ/VAR_NAME -- security breach risk
- Using provider model IDs as model_name -- couples all clients to provider naming, breaks when you switch providers
- Master key not starting with sk- -- LiteLLM silently rejects it
- Using virtual keys without a PostgreSQL database_url -- key generation fails
Medium Priority Issues:
- Not setting num_retries in litellm_settings -- defaults to 0, no retries on transient failures
- Confusing model_name (client-facing alias) with litellm_params.model (provider route) -- most common config mistake
- Not setting rpm/tpm on deployments when using usage-based-routing -- routing strategy has no data to work with
- Missing LITELLM_SALT_KEY in production -- virtual key credentials stored without encryption
Common Mistakes:
- Passing anthropic/claude-sonnet-4-20250514 as the model parameter in TypeScript client code -- use the model_name alias instead
- Expecting the metadata field to be typed in the OpenAI SDK -- it is a LiteLLM extension and requires as any or extra_body
- Setting fallbacks using provider model IDs instead of model_name aliases -- fallbacks reference model names, not provider routes
- Forgetting that config.yaml changes require a proxy restart (or use the /config/update API endpoint)
Gotchas & Edge Cases:
- The os.environ/ syntax in config.yaml (no $ prefix) is LiteLLM-specific -- not standard YAML environment variable substitution
- model_name matching is exact -- "claude-sonnet" and "Claude-Sonnet" are different models
- When using default_fallbacks, they do NOT apply to ContentPolicyViolationError or ContextWindowExceededError -- use the specialized fallback types for those
- The proxy adds a network hop -- expect 5-20ms additional latency compared to direct provider calls
- rpm/tpm limits in config are per-deployment, not per-model-group -- a model group with 3 deployments at rpm: 100 each gets 300 RPM total
- Virtual key spend tracking is eventually consistent -- the spend field on a key may lag a few seconds behind actual usage
- The /v1/ prefix on endpoints is optional -- both http://localhost:4000/chat/completions and http://localhost:4000/v1/chat/completions work
- Streaming through the proxy works transparently -- no special configuration needed on the proxy side
- The LiteLLM admin UI is available at http://localhost:4000/ui when the proxy is running
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST use the provider/model-name format in litellm_params.model -- e.g., anthropic/claude-sonnet-4-20250514, openai/gpt-4o, azure/my-deployment -- the provider prefix is how LiteLLM routes to the correct API)
(You MUST set model_name as the user-facing alias that clients request -- this is NOT the provider model ID, it is the name your TypeScript client passes as model)
(You MUST point the OpenAI SDK baseURL at the proxy URL (e.g., http://localhost:4000) and pass the proxy key as apiKey -- do NOT use provider API keys directly in client code)
(You MUST start master keys with sk- -- LiteLLM rejects master keys that do not follow this prefix convention)
(You MUST configure database_url pointing to PostgreSQL before using virtual keys, spend tracking, or team/user management -- these features require persistent storage)
Failure to follow these rules will produce misconfigured proxies with broken routing, security issues, or missing spend data.
</critical_reminders>