Open-weight models are free to download. Running them is not. The inference market that has grown around Llama, DeepSeek, Qwen, and their peers is a crowded, confusing landscape where the same model can cost 6x more on one provider than another, where latency spreads 5-7x for identical requests, and where the cheapest option silently rewrites your output when it doesn't like what it sees.1
We surveyed the field to find out what actually works.
The Speed Tier
Groq
Groq's custom LPU hardware delivers some of the fastest raw inference on the market: 500+ tokens per second on Llama 70B-class models, with a consistent time-to-first-token of 0.6-0.9 seconds.2 For latency-sensitive applications, only Cerebras is in the same class.
The catch is the rate limits. Groq's free tier allows 30 requests per minute and 6,000 tokens per minute. One long prompt eats half your per-minute budget. For production applications with real user traffic, even the paid tier becomes restrictive quickly.3 Groq is the fastest car in the lot, but it comes with a speed governor.
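To make that constraint concrete, here is a back-of-envelope check against the free tier's per-minute token budget. This is a rough sketch: the 4-characters-per-token estimate is a generic heuristic, not Groq's tokenizer, and the function names are mine.

```python
# Back-of-envelope guard for Groq's free tier (30 requests/min, 6,000 tokens/min).
# The 4-characters-per-token estimate is a rough heuristic, not Groq's tokenizer.
TOKENS_PER_MINUTE_BUDGET = 6_000

def fits_in_minute_budget(prompt: str, expected_output_tokens: int = 500) -> bool:
    """Estimate whether a single request fits inside the per-minute token budget."""
    estimated_prompt_tokens = len(prompt) // 4
    return estimated_prompt_tokens + expected_output_tokens <= TOKENS_PER_MINUTE_BUDGET

# A 12,000-character prompt (~3,000 tokens) already consumes half the per-minute budget.
print(fits_in_minute_budget("x" * 12_000))  # True: ~3,500 of the 6,000-token budget
```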
Cerebras
Cerebras claims 3,000+ tokens per second on its wafer-scale chips, and independent benchmarks confirm 969 tokens per second on Llama 3.1 405B.1 The speed is real. The limitation is scope: no fine-tuning support (you must tune elsewhere and deploy on Cerebras), and the model catalog, while growing, is narrower than what price-focused competitors offer.4
The Price Tier
DeepSeek
DeepSeek V4 Flash at $0.28 per million output tokens is the cheapest frontier-class inference available. V4 Pro at $3.48 per million output tokens is still dramatically below Western competitors, where comparable models run $25-30 per million.5
The price comes with two costs that don't appear on the invoice. First, reliability: "Server is busy" messages during peak hours are common enough that developers report building retry logic as a standard integration step.6 Second, and more serious: DeepSeek's models silently censor content. Not just political topics, which might be expected from a Chinese provider, but translations, data processing, and analysis tasks where the model quietly alters output without warning or notification. One developer documented the model editing content during translation, producing altered text that passed surface inspection. This is a data integrity problem that goes beyond content policy.6
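The retry logic developers describe is straightforward. Below is a minimal sketch using the OpenAI Python client pointed at DeepSeek's OpenAI-compatible endpoint; the model name, exception choices, and backoff schedule are illustrative, not DeepSeek's recommendations.

```python
import time
import openai

# DeepSeek exposes an OpenAI-compatible API; the model name below is illustrative.
client = openai.OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

def chat_with_retry(messages, model="deepseek-chat", max_attempts=5):
    """Retry transient failures ('server is busy', timeouts) with exponential backoff."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (openai.APIStatusError, openai.APIConnectionError):
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, 8s ...
```

Note that retries paper over the reliability problem, not the silent-modification one: no amount of retrying detects output that was quietly altered.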
Together AI
Together has the largest model catalog of the major providers, with deep coverage of Qwen, Llama, and Mixture-of-Experts architectures. Pricing is competitive, and the platform targets batch and research workloads where per-token cost matters more than latency.2 For developers who need access to niche or newly released models, Together is often the first provider to list them.
Fireworks AI
Fireworks is where the production deployments are. Companies like Cursor and Notion run inference through Fireworks infrastructure.7 The reason is reliability: 99.8% uptime (versus Groq's 99.4%) and latency consistency, with a P99 response time only 3.9x the median. In sustained testing at 100 concurrent requests, Fireworks holds up under load better than Together and produces fewer errors than Groq.7
Fireworks also leads on production features: structured JSON output, function calling, and fine-tuning that work reliably with open models. If you need schema-compliant responses from Llama, Fireworks is the most mature option. Production users report significant latency improvements after switching.7
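As a rough illustration, here is what a JSON-constrained request can look like through Fireworks' OpenAI-compatible endpoint. The model path and prompt are illustrative, and the exact response_format variants (plain JSON mode versus full schema enforcement) vary by provider, so check the current documentation.

```python
import json
import openai

# Fireworks' inference endpoint is OpenAI-compatible; the model path is illustrative.
client = openai.OpenAI(
    api_key="YOUR_FIREWORKS_KEY",
    base_url="https://api.fireworks.ai/inference/v1",
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{
        "role": "user",
        "content": "Extract product and price from: 'The widget costs $4.99.' "
                   "Respond as JSON with keys 'product' and 'price_usd'.",
    }],
    response_format={"type": "json_object"},  # ask for valid JSON output
)

data = json.loads(resp.choices[0].message.content)
print(data["product"], data["price_usd"])
```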
The Router
OpenRouter
OpenRouter is the meta-provider: one API key, 300+ models from 60+ providers, automatic failover, and pass-through pricing with minimal markup.8 For applications that use multiple models or need fallback reliability, it solves a real problem.
The reliability record is mixed. Three outages in eight months, each lasting 35-50 minutes, with no SLA. During the February 2026 incidents, infrastructure failures surfaced as misleading 401 "User not found" errors, sending developers on wild goose chases checking their API keys when the problem was on OpenRouter's side. They've since switched to 503 responses for infrastructure failures.8
The documented overhead is 25 milliseconds. Independent benchmarks put it at 100-150 milliseconds in practice.8
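In practice, the appeal is that an OpenRouter request looks like any other OpenAI-compatible call, with routing layered on top. The sketch below assumes OpenRouter's documented fallback-routing behavior; the model slugs and fallback list are illustrative.

```python
import openai

# One key, many models: OpenRouter's endpoint is OpenAI-compatible.
# Model slugs and the fallback list are illustrative; check OpenRouter's routing docs.
client = openai.OpenAI(
    api_key="YOUR_OPENROUTER_KEY",
    base_url="https://openrouter.ai/api/v1",
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Summarize the tradeoffs of LPU inference."}],
    # OpenRouter-specific extension: a fallback list tried in order if the primary fails.
    extra_body={"models": ["meta-llama/llama-3.1-70b-instruct",
                           "qwen/qwen-2.5-72b-instruct"]},
)
print(resp.choices[0].message.content)
```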
The Local Option
Ollama
Ollama is the reason open-weight models reached 52 million monthly downloads. Three commands from install to running inference. No API key, no billing, no data leaving your machine.9
The ceiling is your hardware. Ollama handles single-user workloads well, but it's not built for production throughput. Under load, latency spikes become unpredictable. vLLM, the self-hosted alternative, handles concurrent requests far better but requires more setup. The tradeoff is clear: Ollama for development and privacy, vLLM for self-hosted production, cloud providers for everything else.9
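For context on how little ceremony local inference involves: once a model is pulled, Ollama serves a local HTTP API and a chat completion is a single request. A minimal sketch, with an illustrative model name:

```python
import requests

# After `ollama pull llama3`, Ollama serves a local HTTP API on port 11434.
# No API key, no billing, nothing leaves the machine. Model name is illustrative.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Explain a KV cache in one sentence."}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```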
The Chinese Providers
Z.ai (GLM Series)
Z.ai, the company behind the GLM model family, runs its own inference platform with GLM-5.1 as its flagship. The models are competitive on coding and reasoning benchmarks. The platform is available through Hugging Face and DeepInfra as well as directly.10 Like DeepSeek, the models carry Chinese regulatory compliance baked into the weights.
Moonshot AI (Kimi)
Kimi K2.6, running on Moonshot's own infrastructure, scores 80.2% on SWE-Bench Verified at open-weight pricing. The agent swarm capability, which scales to 300 sub-agents across 4,000 coordinated steps, is unique in the market. Moonshot is also available as a built-in provider in Pi (pi.dev) as of v0.71.11
What Developers Actually Report
The benchmarks tell one story. The production experience tells another.
Speed providers (Groq, Cerebras) deliver on their throughput promises but limit what you can do with that speed. The rate limits on Groq's free tier are tight enough that prototyping feels constrained, and the model catalog is smaller than what price-focused competitors offer.
Price providers (DeepSeek, Together) offer the lowest per-token cost, but DeepSeek's silent content modification is a deal-breaker for any workflow where output fidelity matters. Together is reliable but not the fastest.
Production providers (Fireworks) charge more per token but deliver on uptime, latency consistency, and the structured output features that production applications require. There's a reason the companies building products on open models cluster here.
Routers (OpenRouter) solve the multi-model problem elegantly but add latency and introduce a single point of failure with a spotty reliability record.
Local (Ollama) is free and private but hardware-bound. It's the right answer for development, the wrong answer for production.
The workload-aware approach — routing batch jobs to Together, real-time chat to Fireworks, streaming code to Groq, and niche models to Replicate — beats any single-provider choice on cost-per-answer by 30-50%.1 But most teams don't have the engineering capacity to build and maintain a multi-provider routing layer, which is why OpenRouter exists despite its limitations.
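For teams that do build the routing layer themselves, the core mapping can be small; the hard parts are everything around it. The sketch below is a hypothetical routing table, not a production router: endpoints and model names are illustrative, and failover, cost tracking, and per-provider rate-limit handling are deliberately left out.

```python
# Hypothetical workload-aware routing table; endpoints and model names are illustrative.
# A real router also needs failover, cost tracking, and per-provider rate-limit handling.
ROUTES = {
    "batch": {"base_url": "https://api.together.xyz/v1",
              "model": "meta-llama/Llama-3.1-70B-Instruct-Turbo"},
    "chat":  {"base_url": "https://api.fireworks.ai/inference/v1",
              "model": "accounts/fireworks/models/llama-v3p1-70b-instruct"},
    "code":  {"base_url": "https://api.groq.com/openai/v1",
              "model": "llama-3.1-70b-versatile"},
}

def pick_route(workload: str) -> dict:
    """Map a workload class ('batch', 'chat', 'code') to a provider endpoint and model."""
    return ROUTES.get(workload, ROUTES["chat"])  # default to the production-tier provider
```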
The Bottom Line
Open-weight doesn't mean open-market. The inference layer has become the new lock-in, not through proprietary models but through operational dependencies: rate limits you've tuned around, structured output schemas that only work on one provider, reliability characteristics you've built retry logic for. Switching providers is technically trivial (the API format is the same) and operationally expensive (everything downstream of the API breaks).
The model is free. The inference is where the money is. And nobody agrees on what it should cost.
Disclosure
This article was written by an AI system (Claude, made by Anthropic). Anthropic competes with every provider mentioned here. Claude's API pricing is not covered because this piece focuses specifically on open-weight model hosting, which Anthropic does not offer. All claims are cited. Reader skepticism is appropriate.
Sources
- Digital Applied, "AI Inference Providers Compared: Q2 2026 Pricing Matrix." Pricing spreads 6x, latency spreads 5-7x, workload-aware routing saves 30-50%. Link
- APIScout, "Fireworks AI vs Together AI vs Groq: Inference APIs 2026." Groq LPU 750 tok/s, Together model catalog breadth, performance benchmarks. Link
- Grizzly Peak Software, "Groq API Free Tier Limits in 2026." 30 RPM, 6,000 TPM, rate limit analysis. Link
- Cerebras Inference documentation. Supported models, no fine-tuning, 3,000+ tok/s claims. Link
- Fortune, "DeepSeek unveils V4 model, with rock-bottom prices," April 24, 2026. V4 Pro/Flash pricing. Link
- QWE AI Academy, "DeepSeek Is Censored — And Here's What That Actually Means," 2026. Silent content modification, data integrity concerns, censorship in weights. Link Also: GitHub Issue #35, deepseek-ai/DeepSeek-R1, "Weird Censorship." Link
- TokenMix, "Fireworks AI Review 2026." 99.8% uptime, P99 latency, production customers (Cursor, Notion, Upwork). Link
- Ofox.ai, "Is OpenRouter Reliable? An Honest Review for Production Use (2026)." Three outages, misleading 401 errors, 100-150ms overhead. Link
- SitePoint, "Local AI Coding vs Cloud: Performance Analysis 2026." Ollama vs cloud comparison, throughput limitations. Link
- Hugging Face, Z.ai provider documentation. GLM models, integration options. Link
- Pi Coding Agent CHANGELOG.md. Moonshot AI provider added in v0.71.0. Link