ai-infrastructure-huggingface-inference

reviewed

Hugging Face Inference SDK patterns for TypeScript/Node.js — InferenceClient setup, chat completion, text generation, streaming, embeddings, image generation, audio transcription, translation, summarization, and Inference Endpoints



Hugging Face Inference Patterns

Quick Guide: Use @huggingface/inference (v4+) to access 200k+ ML models on the Hugging Face Hub. Use InferenceClient with chatCompletion() for OpenAI-compatible chat, textGeneration() for raw text completion, chatCompletionStream() for streaming, featureExtraction() for embeddings, textToImage() for image generation, and automaticSpeechRecognition() for audio transcription. Set provider to route through inference providers (Cerebras, Together, Groq, etc.) or use endpointUrl for dedicated Inference Endpoints.


<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST always pass an access token to InferenceClient -- never deploy without authentication)

(You MUST use chatCompletion() / chatCompletionStream() for conversational LLM tasks -- these follow the OpenAI-compatible message format)

(You MUST handle errors using InferenceClientError and its subclasses -- never use bare catch blocks without error type checking)

(You MUST specify a model parameter for every inference call -- do not rely on the SDK's recommended-model fallback, which is unpredictable in production)

(You MUST never hardcode access tokens -- always use environment variables via process.env.HF_TOKEN)

</critical_requirements>


Auto-detection: Hugging Face, huggingface, @huggingface/inference, InferenceClient, HfInference, hf.chatCompletion, hf.textGeneration, hf.featureExtraction, hf.textToImage, hf.automaticSpeechRecognition, hf.translation, hf.summarization, hf.textToSpeech, chatCompletionStream, textGenerationStream, HF_TOKEN, inference provider, Inference Endpoints

When to use:

  • Accessing any of the 200k+ models hosted on the Hugging Face Hub
  • Running chat completion with open-source LLMs (Qwen, Mistral, Llama, etc.)
  • Generating embeddings with sentence-transformer models for semantic search
  • Generating images from text prompts (FLUX, Stable Diffusion)
  • Transcribing audio with automatic speech recognition models
  • Running translation, summarization, text classification, or NER tasks
  • Deploying models on dedicated Inference Endpoints for production use
  • Using third-party inference providers (Cerebras, Together, Groq, Replicate, etc.) through a unified API

Key patterns covered:

  • InferenceClient initialization and configuration
  • Chat Completion API (OpenAI-compatible messages format, streaming)
  • Text generation (raw completion, streaming)
  • Embeddings via feature extraction
  • Image generation (text-to-image)
  • Audio transcription (automatic speech recognition)
  • Translation, summarization, and text classification
  • Inference Endpoints (dedicated deployments)
  • Inference Providers (routing through third-party services)
  • Error handling with typed error classes

When NOT to use:

  • If you only use OpenAI models -- use the OpenAI SDK directly
  • If you need a provider-agnostic unified SDK with structured outputs and tool calling -- use a higher-level AI SDK
  • If you need to fine-tune or train models -- use the @huggingface/hub package or Python transformers



<philosophy>

Philosophy

The @huggingface/inference SDK provides a unified TypeScript client for accessing hundreds of thousands of ML models through multiple backends: serverless Inference Providers, dedicated Inference Endpoints, and local servers.

Core principles:

  1. Model-agnostic access -- One client, any model on the Hub. Swap models by changing the model parameter without code changes.
  2. Provider flexibility -- Route inference through 20+ providers (Cerebras, Together, Groq, Replicate, etc.) with a single provider parameter, or deploy your own Inference Endpoints.
  3. Task-oriented API -- Methods map to ML tasks (chatCompletion, textToImage, automaticSpeechRecognition), not raw HTTP endpoints.
  4. OpenAI-compatible chat -- chatCompletion() uses the OpenAI message format (role + content), making migration between providers easy.
  5. Streaming as async generators -- chatCompletionStream() and textGenerationStream() return AsyncGenerator, consumed with for await...of.
</philosophy>
<patterns>

Core Patterns

Pattern 1: Client Setup

Initialize with your Hugging Face access token, read from an environment variable. The token authenticates every request.

// lib/hf-client.ts -- basic setup
import { InferenceClient } from "@huggingface/inference";

const client = new InferenceClient(process.env.HF_TOKEN);

export { client };
// lib/hf-client.ts -- with custom endpoint
const ENDPOINT_URL =
  "https://your-endpoint.us-east-1.aws.endpoints.huggingface.cloud/v1/";

const client = new InferenceClient(process.env.HF_TOKEN, {
  endpointUrl: ENDPOINT_URL,
});

export { client };

Why good: Token from env var, named constant for endpoint URL, named export

// BAD: Hardcoded token, no named export
const hf = new InferenceClient("hf_abc123xyz");
export default hf;

Why bad: Hardcoded token is a security risk, default export violates conventions

See: examples/core.md for provider routing, local endpoints, and endpoint helper


Pattern 2: Chat Completion (OpenAI-Compatible)

Use chatCompletion() for conversational LLM tasks. Follows the OpenAI message format.

const MAX_TOKENS = 512;
const TEMPERATURE = 0.1;

const response = await client.chatCompletion({
  model: "Qwen/Qwen3-32B",
  provider: "cerebras",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Explain TypeScript generics." },
  ],
  max_tokens: MAX_TOKENS,
  temperature: TEMPERATURE,
});

console.log(response.choices[0].message.content);

Why good: Named constants for parameters, explicit model and provider, system message for behavior

// BAD: No model specified, magic numbers, no system message
const response = await client.chatCompletion({
  messages: [{ role: "user", content: "do something" }],
  max_tokens: 512,
  temperature: 0.1,
});

Why bad: Missing required model, magic numbers, vague prompt, no system instruction

See: examples/core.md for multi-turn conversations and provider selection


Pattern 3: Streaming Chat Completion

Use chatCompletionStream() for streaming responses. Returns an AsyncGenerator.

const MAX_TOKENS = 512;
let fullResponse = "";

for await (const chunk of client.chatCompletionStream({
  model: "Qwen/Qwen3-32B",
  provider: "cerebras",
  messages: [{ role: "user", content: "Explain async/await in TypeScript." }],
  max_tokens: MAX_TOKENS,
})) {
  if (chunk.choices && chunk.choices.length > 0) {
    const content = chunk.choices[0].delta.content;
    if (content) {
      process.stdout.write(content);
      fullResponse += content;
    }
  }
}
console.log(); // newline

Why good: Async generator consumed with for await, progressive output, null checks on chunk data

// BAD: Not checking chunk.choices, ignoring null content
for await (const chunk of client.chatCompletionStream({
  model: "...",
  messages: [],
})) {
  process.stdout.write(chunk.choices[0].delta.content); // May throw on null
}

Why bad: No null check -- choices may be empty, content may be null between chunks

See: examples/core.md for text generation streaming
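The null checks in Pattern 3 can be factored into a small reusable guard. The helper below is hypothetical (not part of the SDK); it is written against a minimal chunk shape so the pattern is clear:

```typescript
// Minimal chunk shape produced by chatCompletionStream()
interface StreamChunk {
  choices?: Array<{ delta?: { content?: string | null } }>;
}

// Hypothetical helper: returns the delta text, or "" when the chunk
// carries no content (empty choices, null delta between tokens, etc.)
function getDeltaContent(chunk: StreamChunk): string {
  return chunk.choices?.[0]?.delta?.content ?? "";
}
```

Inside the streaming loop, `process.stdout.write(getDeltaContent(chunk))` then replaces the nested if checks without risking a throw on empty chunks.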


Pattern 4: Text Generation (Raw Completion)

Use textGeneration() for prompt continuation without the chat message format.

const MAX_NEW_TOKENS = 250;

const result = await client.textGeneration({
  model: "mistralai/Mixtral-8x7B-v0.1",
  provider: "together",
  inputs: "The key benefits of TypeScript are",
  parameters: { max_new_tokens: MAX_NEW_TOKENS },
});

console.log(result.generated_text);

Why good: Named constant, clear prompt, explicit provider, direct access to generated_text

See: examples/core.md for streaming text generation
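For streaming text generation, textGenerationStream() yields chunks carrying a token object (text plus a flag for special tokens such as end-of-sequence). The consumption loop can be sketched against that minimal chunk shape; the collectTokens helper is hypothetical, written so the pattern works on any stream of such chunks:

```typescript
// Minimal chunk shape yielded by textGenerationStream(): each chunk
// carries one token; special tokens (e.g. end-of-sequence) are flagged.
interface TextGenChunk {
  token: { text: string; special: boolean };
}

// Hypothetical helper: accumulate generated text from a token stream,
// skipping special tokens.
async function collectTokens(
  stream: AsyncIterable<TextGenChunk>,
): Promise<string> {
  let output = "";
  for await (const chunk of stream) {
    if (!chunk.token.special) {
      output += chunk.token.text;
    }
  }
  return output;
}
```

In practice the stream argument would be `client.textGenerationStream({ model, inputs, parameters })`.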


Pattern 5: Embeddings (Feature Extraction)

Use featureExtraction() for generating vector embeddings for semantic search and RAG.

const embeddings = await client.featureExtraction({
  model: "sentence-transformers/all-MiniLM-L6-v2",
  inputs: "That is a happy person",
});
// Returns: number[] for a single input (nested arrays for batch inputs)

Why good: Purpose-built embedding model, simple input/output

See: examples/tasks.md for batch embeddings and cosine similarity
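Embedding vectors are typically compared with cosine similarity. The function below is plain math, not part of the SDK -- a self-contained sketch:

```typescript
// Cosine similarity between two equal-length vectors:
// dot(a, b) / (|a| * |b|). Result is in [-1, 1]; 1 means same direction.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Given two featureExtraction() results, `cosineSimilarity(embA, embB)` approaches 1 for semantically similar sentences.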


Pattern 6: Image Generation (Text-to-Image)

Use textToImage() to generate images from text prompts. Returns a Blob.

const imageBlob = await client.textToImage({
  model: "black-forest-labs/FLUX.1-dev",
  inputs: "a serene mountain landscape at sunset",
  provider: "replicate",
});
// imageBlob is a Blob -- write to file or convert to buffer

Why good: Explicit model and provider, descriptive prompt
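Persisting the returned Blob is a two-step conversion using only Node built-ins. The saveBlob helper and the file path are illustrative, not SDK API:

```typescript
import { writeFile } from "node:fs/promises";

// Convert a Blob (as returned by textToImage()) to a Node Buffer
// via its ArrayBuffer, then write it to disk.
async function saveBlob(blob: Blob, path: string): Promise<void> {
  const buffer = Buffer.from(await blob.arrayBuffer());
  await writeFile(path, buffer);
}
```

Usage: `await saveBlob(imageBlob, "output/landscape.png");`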

See: examples/tasks.md for saving images, image-to-image, and output formats


Pattern 7: Audio Transcription

Use automaticSpeechRecognition() for speech-to-text.

import { readFileSync } from "node:fs";

const result = await client.automaticSpeechRecognition({
  model: "facebook/wav2vec2-large-960h-lv60-self",
  data: readFileSync("audio/recording.flac"),
});

console.log(result.text);

Why good: Uses the data parameter with a binary file buffer, reads the transcription from result.text

See: examples/tasks.md for Whisper models, audio classification, and text-to-speech


Pattern 8: Error Handling

Always catch InferenceClientError and its subclasses. Re-throw unexpected errors.

import {
  InferenceClientError,
  InferenceClientInputError,
  InferenceClientProviderApiError,
  InferenceClientProviderOutputError,
  InferenceClientHubApiError,
} from "@huggingface/inference";

try {
  const result = await client.chatCompletion({
    model: "Qwen/Qwen3-32B",
    messages: [{ role: "user", content: "Hello" }],
  });
} catch (error) {
  if (error instanceof InferenceClientProviderApiError) {
    console.error("Provider API error:", error.message);
    console.error("Request:", error.request);
    console.error("Response:", error.response);
  } else if (error instanceof InferenceClientHubApiError) {
    console.error("Hub API error:", error.message);
  } else if (error instanceof InferenceClientProviderOutputError) {
    console.error("Malformed provider response:", error.message);
  } else if (error instanceof InferenceClientInputError) {
    console.error("Invalid input:", error.message);
  } else if (error instanceof InferenceClientError) {
    console.error("Inference error:", error.message);
  } else {
    throw error; // Re-throw non-inference errors
  }
}

Why good: Specific error types for each failure mode, request/response details for debugging, re-throws unexpected errors

See: examples/core.md for full error handling patterns

</patterns>

<decision_framework>

Decision Framework

Which Method to Use

What is your task?
+-- Conversational LLM (messages) -> chatCompletion() / chatCompletionStream()
+-- Raw text continuation -> textGeneration() / textGenerationStream()
+-- Embeddings for search/RAG -> featureExtraction()
+-- Image from text prompt -> textToImage()
+-- Speech to text -> automaticSpeechRecognition()
+-- Text to speech -> textToSpeech()
+-- Language translation -> translation()
+-- Summarize long text -> summarization()
+-- Classify text -> textClassification()
+-- Named entity recognition -> tokenClassification()
+-- Classify image -> imageClassification()
+-- Detect objects -> objectDetection()
+-- Caption an image -> imageToText()
+-- Answer questions from context -> questionAnswering()

Chat Completion vs Text Generation

Do you have a conversation with roles (system/user/assistant)?
+-- YES -> chatCompletion() / chatCompletionStream()
|   Uses OpenAI-compatible message format
|   Supports system messages, multi-turn
+-- NO -> Do you want to continue/complete a text prompt?
    +-- YES -> textGeneration() / textGenerationStream()
    |   Takes raw text input via 'inputs'
    +-- NO -> Use a task-specific method instead

Serverless vs Dedicated

What are your deployment needs?
+-- Prototyping / low volume -> Serverless Inference Providers (provider: "auto")
|   Free tier available, shared infrastructure, may have cold starts
+-- Production / high volume -> Inference Endpoints (endpointUrl)
|   Dedicated GPU, autoscaling, scale-to-zero, private infrastructure
+-- Local development -> Local endpoint (endpointUrl: "http://localhost:8080")
    Works with llama.cpp, Ollama, vLLM, TGI, LiteLLM

When to Use This SDK vs Others

Do you need access to 200k+ open-source models?
+-- YES -> Use @huggingface/inference
+-- NO -> Do you only use OpenAI models?
    +-- YES -> Not this skill's scope -- use the OpenAI SDK directly
    +-- NO -> Do you need structured outputs / tool calling?
        +-- YES -> Not this skill's scope -- use a higher-level AI SDK
        +-- NO -> @huggingface/inference works for most ML tasks

</decision_framework>


<red_flags>

RED FLAGS

High Priority Issues:

  • Hardcoding access tokens instead of using environment variables (security breach risk)
  • Using bare catch blocks without checking InferenceClientError types (hides API errors, loses debug info)
  • Omitting the model parameter -- always specify the model explicitly for predictable behavior (the SDK can pick a recommended model if omitted, but this is unreliable for production)
  • Not consuming chatCompletionStream() / textGenerationStream() generators (tokens are silently lost)
  • Using textGeneration() for conversational tasks instead of chatCompletion() (wrong API shape)

Medium Priority Issues:

  • Not specifying provider when a specific provider is needed (default "auto" picks based on your HF settings, which may not be optimal)
  • Not checking chunk.choices length before accessing chunk.choices[0] in streaming (may throw on empty chunks)
  • Using request() / streamingRequest() directly -- these are deprecated, use task-specific methods
  • Ignoring max_tokens / max_new_tokens limits (output may be truncated or excessively long)
  • Not handling model loading time for serverless inference (cold models return 503, then load)

Common Mistakes:

  • Confusing chatCompletion() parameters with textGeneration() parameters -- chat uses messages + max_tokens, text generation uses inputs + parameters.max_new_tokens
  • Using inputs parameter with chatCompletion() -- it uses messages, not inputs
  • Using messages parameter with textGeneration() -- it uses inputs, not messages
  • Forgetting that textToImage() returns a Blob, not a URL or Buffer
  • Treating featureExtraction() output as always a flat array -- shape depends on the model (can be nested arrays for batch inputs)
  • Not passing data (binary) for audio/image tasks, or passing a string path instead of the actual file buffer

Gotchas & Edge Cases:

  • The provider: "auto" default selects providers based on your HF account settings at hf.co/settings/inference-providers -- not by availability or speed. Set an explicit provider for predictable routing.
  • Serverless models may need time to load (cold start). First requests to a cold model may return 503 errors while the model warms up. The SDK handles retries, but initial requests can be slow.
  • When using Inference Endpoints with endpointUrl, the model parameter is often ignored because the endpoint serves a specific model.
  • chatCompletion() is OpenAI-API compatible -- it works with any OpenAI-compatible endpoint, not just Hugging Face.
  • HfInference is still exported for backward compatibility but InferenceClient is the current class name.
  • Third-party provider API keys can be passed as the accessToken -- when authenticated with a non-HF key, requests go directly to the provider instead of through HF's routing layer.
  • Tree-shakeable imports (import { textGeneration } from "@huggingface/inference") require passing accessToken as a parameter instead of constructor.
  • textToImage() supports multiple output types: blob (default), url, dataUrl, or json via the options outputType parameter.
  • Translation requires parameters.src_lang and parameters.tgt_lang for many-to-many models like mbart-large-50-many-to-many-mmt.
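For the translation gotcha, the argument shape looks like the object below. Only the args object is shown (no network call); the language codes follow mbart-50's suffixed convention ("en_XX", "fr_XX"), and the input sentence is illustrative:

```typescript
// Argument shape for translation() with a many-to-many model.
// src_lang / tgt_lang must use the model's own language codes --
// mbart-50 expects suffixed codes such as "en_XX" and "fr_XX".
const translationArgs = {
  model: "facebook/mbart-large-50-many-to-many-mmt",
  inputs: "TypeScript adds static types to JavaScript.",
  parameters: { src_lang: "en_XX", tgt_lang: "fr_XX" },
};
```

Pass this to `client.translation(translationArgs)`; the result exposes the translated string as translation_text.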

</red_flags>


<critical_reminders>

CRITICAL REMINDERS

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST always pass an access token to InferenceClient -- never deploy without authentication)

(You MUST use chatCompletion() / chatCompletionStream() for conversational LLM tasks -- these follow the OpenAI-compatible message format)

(You MUST handle errors using InferenceClientError and its subclasses -- never use bare catch blocks without error type checking)

(You MUST specify a model parameter for every inference call -- do not rely on the SDK's recommended-model fallback, which is unpredictable in production)

(You MUST never hardcode access tokens -- always use environment variables via process.env.HF_TOKEN)

Failure to follow these rules will produce insecure, unreliable, or silently failing AI integrations.

</critical_reminders>