By Jangwook Kim

Deploy Llama 4 to production with Meta Llama Stack's OpenAI-compatible API. This guide covers distributions, vLLM, Ollama, safety, agents, and cost-effective hosting.
Running open-source LLMs in production has always had a catch: you pick a backend (Ollama, vLLM, llama.cpp), write your integration code against that backend's specific API, and then find yourself locked in. Swap the backend for performance or cost reasons and you're rewriting client code.
Meta Llama Stack solves exactly this problem. It's an open-source AI application server that sits in front of any backend and exposes a single, OpenAI-compatible API layer. The same /v1/chat/completions call that works against your local Ollama instance in development routes to vLLM or AWS Bedrock in production — with zero application-code changes.
As of April 2026, the repository has over 8,200 GitHub stars and is under active development by Meta's open-source team. It ships with native support for Llama 4 Scout and Llama 4 Maverick, alongside older Llama 3.x models. If you're running or planning to run open-weight Llama models in production, Llama Stack is the infrastructure layer worth knowing.
Most frameworks for LLM deployment focus on one thing: inference serving. Llama Stack goes wider. It provides a unified API layer covering seven concerns: inference, safety, agents, memory (RAG), tool execution, evaluation, and telemetry.
The architecture has two layers: a distribution (a pre-configured bundle of provider implementations) and the Llama Stack server (a single process that routes API calls to whichever providers are configured in that distribution).
Your application only ever talks to the Llama Stack server. Swapping backends, safety models, or vector stores is a config change, not a code change.
A distribution is the unit of deployment in Llama Stack. It bundles together one provider for each API component and packages them into a runnable server.
Meta ships several official distributions out of the box:
| Distribution | Inference Backend | Best For |
|---|---|---|
| `ollama` | Ollama | Local development, CPU/Apple Silicon |
| `vllm` | vLLM | GPU production servers |
| `tgi` | HuggingFace TGI | HuggingFace-native stacks |
| `together` | Together AI | Managed API, no GPU needed |
| `fireworks` | Fireworks AI | Low-latency managed inference |
| `bedrock` | AWS Bedrock | AWS-native production |
| `openai` | OpenAI API | Hybrid open/closed LLM routing |
The pattern: develop with ollama, deploy with vllm or a managed service. The API your application uses doesn't change.
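Since the endpoint is the only thing that differs between environments, the switch can be driven entirely by configuration. A minimal sketch; the `LLAMA_STACK_URL` variable name is an assumption for illustration, not a Llama Stack convention:

```python
import os

def llama_stack_base_url(default: str = "http://localhost:8321/v1") -> str:
    """Resolve the Llama Stack endpoint from the environment.

    Development falls back to the local Ollama-backed server; production
    sets LLAMA_STACK_URL to point at the vLLM-backed deployment. Client
    code that reads the URL this way never changes between environments.
    """
    return os.environ.get("LLAMA_STACK_URL", default)

# client = OpenAI(base_url=llama_stack_base_url(), api_key="not-required")
```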
You can also build custom distributions if you need to mix providers — for example, using Fireworks for inference but self-hosted ChromaDB for vector storage.
To follow along, install the Python client:

```bash
pip install llama-stack-client
```
First, make sure Ollama is running and has pulled the model:
```bash
ollama pull llama3.3
```
Then start the Llama Stack server pointing at the Ollama distribution:
```bash
export INFERENCE_MODEL="llama3.3"
export LLAMA_STACK_PORT=8321

docker run -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -e INFERENCE_MODEL=$INFERENCE_MODEL \
  llamastack/distribution-ollama:latest
```
The server is now running at http://localhost:8321 and exposes standard OpenAI-compatible endpoints.
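A quick sanity check is to list the server's registered models through the OpenAI-compatible `/v1/models` endpoint. A stdlib-only sketch; the network call assumes a server is actually running at `base_url`, so the parsing helper is kept separate so it can be exercised offline:

```python
import json
import urllib.request

def model_ids(models_payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in models_payload.get("data", [])]

def list_models(base_url: str = "http://localhost:8321") -> list[str]:
    """Fetch and parse /v1/models from a running Llama Stack server."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return model_ids(json.load(resp))

# Against a live Ollama-backed server, list_models() should include "llama3.3".
```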
Use the OpenAI Python client directly — point it at the Llama Stack server:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8321/v1",
    api_key="not-required"  # Llama Stack handles auth separately
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[
        {"role": "user", "content": "Explain attention mechanisms in one paragraph."}
    ]
)

print(response.choices[0].message.content)
```
That's it. Any existing code that uses the OpenAI Python SDK can point at a Llama Stack server instead, with no further changes.
Alternatively, use the native Llama Stack client:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.chat_completion(
    model_id="llama3.3",
    messages=[{"role": "user", "content": "What are the Llama 4 model sizes?"}]
)

print(response.completion_message.content)
```
The native client exposes additional Llama Stack-specific APIs (agents, memory, safety) that aren't part of the OpenAI SDK interface.
For GPU production deployments, swap the distribution from ollama to vllm.
First, start vLLM serving the model:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2
```
Then point the Llama Stack server at it:

```bash
export INFERENCE_MODEL="meta-llama/Llama-4-Scout-17B-16E-Instruct"
export VLLM_URL="http://localhost:8000"
export LLAMA_STACK_PORT=8321

docker run -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e INFERENCE_MODEL=$INFERENCE_MODEL \
  -e VLLM_URL=$VLLM_URL \
  llamastack/distribution-vllm:latest
```
Now your application code is unchanged — it still talks to http://your-server:8321/v1. The only thing that moved was the backend.
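Before cutting traffic over to the new backend, it is worth gating the switch on a readiness probe. A stdlib-only sketch; polling `/v1/models` and treating any HTTP 200 as healthy is an assumption, and the `_fetch` parameter exists only to make the helper testable without a live server:

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str, timeout_s: float = 60.0,
                     poll_s: float = 2.0,
                     _fetch=urllib.request.urlopen) -> bool:
    """Poll the server's /v1/models endpoint until it answers or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with _fetch(f"{base_url}/v1/models") as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(poll_s)
    return False
```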
Llama Stack supports the full Llama 4 family. The two currently available production models:
| Model | Total Params | Active Params | Context | Best For |
|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 10M tokens | Single GPU (quantized), balanced tasks |
| Llama 4 Maverick | 400B | 17B | 1M tokens | Multi-GPU, high-quality output |

Both are MoE (Mixture of Experts) models under Meta's open-weight license. Because every expert's weights stay loaded, memory needs track total parameters, not active ones: Scout fits on a single 80GB GPU only with int4 quantization, and Maverick requires a multi-GPU node even when quantized.
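For sizing MoE deployments, the key point is that all expert weights stay resident even though only about 17B parameters are active per token, so memory scales with total parameters. A rough weights-only estimate (activations and KV cache add overhead on top):

```python
def weight_memory_gb(total_params_billion: float, bits_per_param: float) -> float:
    """Approximate GB of memory needed just to hold the model weights.

    total params (billions) * bits per param / 8 bits-per-byte = GB.
    """
    return total_params_billion * bits_per_param / 8

# Llama 4 Scout, 109B total parameters:
#   BF16 (16-bit): weight_memory_gb(109, 16) -> 218.0 GB
#   int4 ( 4-bit): weight_memory_gb(109, 4)  -> 54.5 GB
```

This is why active-parameter counts are a throughput number, not a memory number: a 17B-active MoE still needs the full 109B weights in VRAM (or unified memory).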
Llama Stack's agents API goes well beyond basic chat completion. An agent maintains a session, executes multi-step plans, and calls tools.
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Create an agent with web search enabled
agent_config = {
    "model": "llama3.3",
    "instructions": "You are a research assistant. Use search when you need current information.",
    "tools": [{"type": "brave_search", "api_key": "YOUR_BRAVE_API_KEY"}],
    "max_infer_iters": 5,
}

agent = client.agents.create(**agent_config)

session = client.agents.sessions.create(
    agent_id=agent.agent_id,
    session_name="research-session",
)

# Turn 1
response = client.agents.turns.create(
    agent_id=agent.agent_id,
    session_id=session.session_id,
    messages=[{"role": "user", "content": "What are the latest Llama 4 benchmarks?"}],
    stream=True,
)

for chunk in response:
    if hasattr(chunk, "event") and chunk.event.payload.event_type == "turn_complete":
        print(chunk.event.payload.turn.output_message.content)
```
The agent automatically decides when to call the search tool, reads the results, and synthesizes a final answer — all within Llama Stack's orchestration layer.
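When the stream carries many intermediate events (inference steps, tool calls), a small filter keeps the handling tidy. A sketch that mirrors the attribute names used in the streaming loop above; those names follow the objects shown there, and the helper itself is illustrative:

```python
def completed_turn_messages(events) -> list[str]:
    """Collect final output messages from a stream of agent turn events,
    ignoring intermediate step events."""
    messages = []
    for chunk in events:
        payload = getattr(getattr(chunk, "event", None), "payload", None)
        if payload is not None and getattr(payload, "event_type", None) == "turn_complete":
            messages.append(payload.turn.output_message.content)
    return messages
```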
Built-in tools include:
- `brave_search` — web search via Brave Search API
- `wolfram_alpha` — math and science queries
- `code_interpreter` — sandboxed Python execution

Every Llama Stack distribution can run Llama Guard as the safety provider, filtering both inputs and outputs against configurable policy categories.
```python
# Check a response against safety policy
safety_response = client.safety.run_shield(
    shield_id="meta-llama/Llama-Guard-3-8B",
    messages=[{"role": "assistant", "content": response_text}],
)

if safety_response.violation:
    print(f"Safety violation: {safety_response.violation.user_message}")
else:
    print(response_text)
```
Safety categories can be tuned per deployment. Production use cases often enable all defaults; internal developer tools might relax some restrictions.
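The shield check composes naturally into a gate that every outbound response passes through. A minimal sketch; the `Violation` dataclass here is illustrative, mirroring the `violation`/`user_message` fields used above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Violation:
    user_message: str  # mirrors safety_response.violation.user_message

def gate_response(response_text: str, violation: Optional[Violation]) -> str:
    """Return the model output, or the shield's refusal text on a violation."""
    if violation is not None:
        return violation.user_message
    return response_text
```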
Llama Stack ships additional safety capabilities alongside Llama Guard, including Prompt Guard for prompt-injection and jailbreak detection and a code-scanner shield (based on Code Shield) for flagging insecure generated code.
The Memory API supports four storage types for different use cases:
| Type | Best For |
|---|---|
| Vector (FAISS, ChromaDB, Weaviate) | Semantic similarity search, RAG |
| Key-Value (Redis, PostgreSQL) | Session state, structured lookup |
| Keyword (BM25) | Exact-match and hybrid search |
| Graph (Neo4j) | Relationship-based retrieval |
Adding RAG to an agent is a configuration change:
```python
# Create a memory bank (vector store)
memory_bank = client.memory_banks.register(
    memory_bank_id="my-docs",
    params={
        "memory_bank_type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size_in_tokens": 512,
        "overlap_size_in_tokens": 64,
    },
)

# Insert documents
client.memory.insert(
    bank_id="my-docs",
    documents=[
        {"document_id": "doc-1", "content": "Llama 4 Scout has a 10M token context window..."}
    ],
)
```
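The `chunk_size_in_tokens` and `overlap_size_in_tokens` parameters describe a standard sliding window over the tokenized document. A sketch of that windowing logic; this is illustrative, since Llama Stack's actual chunker lives inside the memory provider:

```python
def sliding_chunks(tokens: list[str], size: int = 512,
                   overlap: int = 64) -> list[list[str]]:
    """Split a token sequence into windows of `size` tokens, where each
    window shares `overlap` tokens with its predecessor."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing boundary-straddling facts.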
In production, PostgreSQL is the recommended backend for both vector storage and key-value persistence, replacing in-memory FAISS for durability across restarts.
Llama Stack ships a complete OpenTelemetry-native telemetry system. Traces, spans, and events flow from the server to any OTEL-compatible backend (Jaeger, Grafana Tempo, Datadog, etc.).
```bash
# Enable OTEL tracing in your distribution config
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=llama-stack-prod
```
Every inference call, agent step, tool invocation, and safety check becomes a traced span. This gives you per-request latency breakdowns, token-usage attribution, and end-to-end visibility into multi-step agent runs.
For teams already using Langfuse or other LLM observability tools, Llama Stack's OTEL output integrates cleanly with existing dashboards.
**Using the wrong distribution for your hardware.** The `ollama` distribution works fine on CPU and Apple Silicon, but for A100/H100 servers, `vllm` gives 3–5x better throughput. Don't use CPU-tier distributions for GPU production workloads.
**Not setting model IDs consistently.** The model identifier in your API call must match exactly what the provider backend has loaded. With vLLM this is usually the full HuggingFace path (`meta-llama/Llama-4-Scout-17B-16E-Instruct`); with Ollama it's the short tag (`llama3.3`). Mismatches return a 404 that looks like a server error.
**Skipping safety in development.** Llama Guard evaluation adds roughly 50ms of latency. Developers sometimes disable it locally to speed up iteration, then forget to re-enable it before production. Treat safety configuration as part of your deployment checklist, not a late addition.
**Ignoring session management for agents.** Agent sessions accumulate context across turns. For production services that handle many concurrent users, set a session TTL and clean up sessions explicitly, or you'll see memory growth over time.
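A sketch of that explicit cleanup: track last-use timestamps client-side and periodically expire idle sessions. The tracker is plain Python; the deletion call you would pair it with (deleting a session through the client) is an assumption about the API:

```python
import time
from typing import Optional

class SessionTracker:
    """Track agent-session last-use times and report idle ones for cleanup."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._last_used: dict[str, float] = {}

    def touch(self, session_id: str, now: Optional[float] = None) -> None:
        """Record activity on a session."""
        self._last_used[session_id] = time.time() if now is None else now

    def expired(self, now: Optional[float] = None) -> list[str]:
        """Remove and return sessions idle longer than the TTL."""
        now = time.time() if now is None else now
        stale = [s for s, t in self._last_used.items() if now - t > self.ttl]
        for s in stale:
            del self._last_used[s]
        return stale

# Periodically: delete each session id returned by tracker.expired()
# through the Llama Stack client to free server-side state.
```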
**Mounting volumes incorrectly for model weights.** The Docker images expect model weights at specific paths. If the volume mount doesn't match, the container downloads models on startup: slow, expensive, and fragile in autoscaling environments. Pre-pull weights and mount them at the documented path.
**Can Llama Stack serve models other than Llama?**

Yes. Llama Stack supports any model available through its provider backends. The `openai` distribution lets you route to GPT-4o and other OpenAI models, the `anthropic` provider connects to Claude, and the `vllm` distribution serves any HuggingFace-compatible model. The "Llama" branding is about the defaults, not a hard constraint.
**How does Llama Stack compare to LiteLLM?**

LiteLLM focuses on unified API routing to managed providers (OpenAI, Anthropic, Azure, etc.) with cost tracking and fallback logic. Llama Stack is broader: it includes self-hosting, agent orchestration, safety, RAG, and evaluation. For a team running managed cloud providers, LiteLLM is simpler. For teams self-hosting Llama models who need agents and safety, Llama Stack adds significant value beyond what LiteLLM offers.
**Is Llama Stack production-ready?**

The core inference and OpenAI-compatible endpoints are stable and used in production deployments. The agent, memory, and evaluation APIs are under more active development. As of version 0.2.x (April 2026), production use is most reliable for inference + safety use cases. Agents work well but have more API surface area that can change between minor versions.
**What hardware does Llama 4 Scout need?**

Scout has 17B active parameters but 109B total, and an MoE model must keep all expert weights in memory: roughly 218GB in BF16, or around 55GB with 4-bit quantization. In practice that means a single 80GB H100 or A100 running an int4 build. Apple Silicon can hold a quantized Scout only on the highest unified-memory configurations, at reduced throughput. Maverick (400B total) needs at least 4×80GB GPUs with int4 quantization, and more in BF16.
**Can I run Llama Stack without Docker?**

Yes. Install the package with `pip install llama-stack` and launch the server with the `llama stack run` CLI, pointing it at your distribution's config file. Docker is recommended for production for isolation and reproducibility, but the Python package works for development and custom deployments.
Meta Llama Stack is the cleanest path from local Llama model experimentation to production deployment. Its distribution model — develop with Ollama, deploy with vLLM, never change your API client — removes the most common painful rewrite in open-source LLM adoption.
The OpenAI compatibility layer is the practical unlock: teams already using the OpenAI Python SDK can switch to self-hosted Llama 4 by changing one line (base_url). Combined with built-in agents, safety, RAG, and telemetry, Llama Stack positions itself as a full infrastructure layer, not just a model server.
For the current state of production deployments: inference and safety are solid; agents and RAG are functional with active API evolution. A reasonable approach is to start with inference + safety in production, then evaluate the agents API for lower-stakes workloads while it stabilizes.
**Bottom Line**
Llama Stack is the best available open-source infrastructure layer for Llama 4 production deployments. The OpenAI-compatible API and swappable distribution model eliminate the usual vendor lock-in trade-off of self-hosting — you get full control without rewriting your application code when you scale up or change backends.
Prefer a deep-dive walkthrough? Watch the full video on YouTube.