Open-weight models are free to download. Running them is not. The inference market that has grown around Llama, DeepSeek, Qwen, and their peers is a crowded, confusing landscape where the same model can cost 6x more on one provider than another, where latency spreads 5-7x for identical requests, and where the cheapest option silently rewrites your output when it doesn't like what it sees.1
We surveyed the field to find out what actually works.
The Speed Tier
Groq
Groq's custom LPU hardware delivers some of the fastest raw inference on the market: 500+ tokens per second on Llama 70B-class models, with a consistent time-to-first-token of 0.6-0.9 seconds.2 For latency-sensitive applications, only Cerebras is in the same class.
The catch is the rate limits. Groq's free tier allows 30 requests per minute and 6,000 tokens per minute. One long prompt eats half your per-minute budget. For production applications with real user traffic, even the paid tier becomes restrictive quickly.3 Groq is the fastest car in the lot, but it comes with a speed governor.
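To make that constraint concrete, here is a back-of-envelope check against the free tier's per-minute token budget. This is a rough sketch: the 4-characters-per-token estimate is a generic heuristic, not Groq's tokenizer, and the function names are mine.

```python
# Back-of-envelope guard for Groq's free tier (30 requests/min, 6,000 tokens/min).
# The 4-characters-per-token estimate is a rough heuristic, not Groq's tokenizer.
TOKENS_PER_MINUTE_BUDGET = 6_000

def fits_in_minute_budget(prompt: str, expected_output_tokens: int = 500) -> bool:
    """Estimate whether a single request fits inside the per-minute token budget."""
    estimated_prompt_tokens = len(prompt) // 4
    return estimated_prompt_tokens + expected_output_tokens <= TOKENS_PER_MINUTE_BUDGET

# A 12,000-character prompt (~3,000 tokens) already consumes half the per-minute budget.
print(fits_in_minute_budget("x" * 12_000))  # True: ~3,500 of the 6,000-token budget
```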
Cerebras
Cerebras claims 3,000+ tokens per second on its wafer-scale chips, and independent benchmarks confirm 969 tokens per second on Llama 3.1 405B.1 The speed is real. The limitation is scope: no fine-tuning support (you must tune elsewhere and deploy on Cerebras), and the model catalog, while growing, is narrower than what price-focused competitors offer.4
The Price Tier
DeepSeek
DeepSeek V4 Flash at $0.28 per million output tokens is the cheapest frontier-class inference available. V4 Pro at $3.48 per million output tokens is still dramatically below Western competitors, where comparable models run $25-30 per million.5
The price comes with two costs that don't appear on the invoice. First, reliability: "Server is busy" messages during peak hours are common enough that developers report building retry logic as a standard integration step.6 Second, and more serious: DeepSeek's models silently censor content. Not just political topics, which might be expected from a Chinese provider, but translations, data processing, and analysis tasks where the model quietly alters output without warning or notification. One developer documented the model editing content during translation, producing altered text that passed surface inspection. This is a data integrity problem that goes beyond content policy.6
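The retry logic developers describe is straightforward. Below is a minimal sketch using the OpenAI Python client pointed at DeepSeek's OpenAI-compatible endpoint; the model name, exception choices, and backoff schedule are illustrative, not DeepSeek's recommendations.

```python
import time
import openai

# DeepSeek exposes an OpenAI-compatible API; the model name below is illustrative.
client = openai.OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

def chat_with_retry(messages, model="deepseek-chat", max_attempts=5):
    """Retry transient failures ('server is busy', timeouts) with exponential backoff."""
    delay = 1.0
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (openai.APIStatusError, openai.APIConnectionError):
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, 8s ...
```

Note that retries paper over the reliability problem, not the silent-modification one: no amount of retrying detects output that was quietly altered.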
Together AI
Together has the largest model catalog of the major providers, with deep coverage of Qwen, Llama, and Mixture-of-Experts architectures. Pricing is competitive, and the platform targets batch and research workloads where per-token cost matters more than latency.2 For developers who need access to niche or newly released models, Together is often the first provider to list them.
Fireworks AI
Fireworks is where the production deployments are. Companies like Cursor and Notion run inference through Fireworks infrastructure.7 The reason is reliability: 99.8% uptime (versus Groq's 99.4%) and latency consistency, with a P99 response time only 3.9x the median. In sustained testing at 100 concurrent requests, Fireworks holds up under load better than Together and produces fewer errors than Groq.7
Fireworks also leads on production features: structured JSON output, function calling, and fine-tuning that work reliably with open models. If you need schema-compliant responses from Llama, Fireworks is the most mature option. Production users report significant latency improvements after switching.7
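As a rough illustration, here is what a JSON-constrained request can look like through Fireworks' OpenAI-compatible endpoint. The model path and prompt are illustrative, and the exact response_format variants (plain JSON mode versus full schema enforcement) vary by provider, so check the current documentation.

```python
import json
import openai

# Fireworks' inference endpoint is OpenAI-compatible; the model path is illustrative.
client = openai.OpenAI(
    api_key="YOUR_FIREWORKS_KEY",
    base_url="https://api.fireworks.ai/inference/v1",
)

resp = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[{
        "role": "user",
        "content": "Extract product and price from: 'The widget costs $4.99.' "
                   "Respond as JSON with keys 'product' and 'price_usd'.",
    }],
    response_format={"type": "json_object"},  # ask for valid JSON output
)

data = json.loads(resp.choices[0].message.content)
print(data["product"], data["price_usd"])
```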
The Router
OpenRouter
OpenRouter is the meta-provider: one API key, 300+ models from 60+ providers, automatic failover, and pass-through pricing with minimal markup.8 For applications that use multiple models or need fallback reliability, it solves a real problem.
The reliability record is mixed. Three outages in eight months, each lasting 35-50 minutes, with no SLA. During the February 2026 incidents, infrastructure failures surfaced as misleading 401 "User not found" errors, sending developers on wild goose chases checking their API keys when the problem was on OpenRouter's side. They've since switched to 503 responses for infrastructure failures.8
The documented overhead is 25 milliseconds. Independent benchmarks put it at 100-150 milliseconds in practice.8
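In practice, the appeal is that an OpenRouter request looks like any other OpenAI-compatible call, with routing layered on top. The sketch below assumes OpenRouter's documented fallback-routing behavior; the model slugs and fallback list are illustrative.

```python
import openai

# One key, many models: OpenRouter's endpoint is OpenAI-compatible.
# Model slugs and the fallback list are illustrative; check OpenRouter's routing docs.
client = openai.OpenAI(
    api_key="YOUR_OPENROUTER_KEY",
    base_url="https://openrouter.ai/api/v1",
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Summarize the tradeoffs of LPU inference."}],
    # OpenRouter-specific extension: a fallback list tried in order if the primary fails.
    extra_body={"models": ["meta-llama/llama-3.1-70b-instruct",
                           "qwen/qwen-2.5-72b-instruct"]},
)
print(resp.choices[0].message.content)
```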
The Local Option
Ollama
Ollama is the reason open-weight models reached 52 million monthly downloads. Three commands from install to running inference. No API key, no billing, no data leaving your machine.9
The ceiling is your hardware. Ollama handles single-user workloads well, but it's not built for production throughput. Under load, latency spikes become unpredictable. vLLM, the self-hosted alternative, handles concurrent requests far better but requires more setup. The tradeoff is clear: Ollama for development and privacy, vLLM for self-hosted production, cloud providers for everything else.9
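For context on how little ceremony local inference involves: once a model is pulled, Ollama serves a local HTTP API and a chat completion is a single request. A minimal sketch, with an illustrative model name:

```python
import requests

# After `ollama pull llama3`, Ollama serves a local HTTP API on port 11434.
# No API key, no billing, nothing leaves the machine. Model name is illustrative.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Explain a KV cache in one sentence."}],
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```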
The Chinese Providers
Z.ai (GLM Series)
Z.ai, the company behind the GLM model family, runs its own inference platform with GLM-5.1 as its flagship. The models are competitive on coding and reasoning benchmarks. The platform is available through Hugging Face and DeepInfra as well as directly.10 Like DeepSeek, the models carry Chinese regulatory compliance baked into the weights.
Moonshot AI (Kimi)
Kimi K2.6, running on Moonshot's own infrastructure, scores 80.2% on SWE-Bench Verified at open-weight pricing. The agent swarm capability, which scales to 300 sub-agents across 4,000 coordinated steps, is unique in the market. Moonshot is also available as a built-in provider in Pi (pi.dev) as of v0.71.11
What Developers Actually Report
The benchmarks tell one story. The production experience tells another.
Speed providers (Groq, Cerebras) deliver on their throughput promises but limit what you can do with that speed. The rate limits on Groq's free tier are tight enough that prototyping feels constrained, and the model catalog is smaller than what price-focused competitors offer.
Price providers (DeepSeek, Together) offer the lowest per-token cost, but DeepSeek's silent content modification is a deal-breaker for any workflow where output fidelity matters. Together is reliable but not the fastest.
Production providers (Fireworks) charge more per token but deliver on uptime, latency consistency, and the structured output features that production applications require. There's a reason the companies building products on open models cluster here.
Routers (OpenRouter) solve the multi-model problem elegantly but add latency and introduce a single point of failure with a spotty reliability record.
Local (Ollama) is free and private but hardware-bound. It's the right answer for development, the wrong answer for production.
The workload-aware approach — routing batch jobs to Together, real-time chat to Fireworks, streaming code to Groq, and niche models to Replicate — beats any single-provider choice on cost-per-answer by 30-50%.1 But most teams don't have the engineering capacity to build and maintain a multi-provider routing layer, which is why OpenRouter exists despite its limitations.
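For teams that do build the routing layer themselves, the core mapping can be small; the hard parts are everything around it. The sketch below is a hypothetical routing table, not a production router: endpoints and model names are illustrative, and failover, cost tracking, and per-provider rate-limit handling are deliberately left out.

```python
# Hypothetical workload-aware routing table; endpoints and model names are illustrative.
# A real router also needs failover, cost tracking, and per-provider rate-limit handling.
ROUTES = {
    "batch": {"base_url": "https://api.together.xyz/v1",
              "model": "meta-llama/Llama-3.1-70B-Instruct-Turbo"},
    "chat":  {"base_url": "https://api.fireworks.ai/inference/v1",
              "model": "accounts/fireworks/models/llama-v3p1-70b-instruct"},
    "code":  {"base_url": "https://api.groq.com/openai/v1",
              "model": "llama-3.1-70b-versatile"},
}

def pick_route(workload: str) -> dict:
    """Map a workload class ('batch', 'chat', 'code') to a provider endpoint and model."""
    return ROUTES.get(workload, ROUTES["chat"])  # default to the production-tier provider
```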
The Bottom Line
Open-weight doesn't mean open-market. The inference layer has become the new lock-in, not through proprietary models but through operational dependencies: rate limits you've tuned around, structured output schemas that only work on one provider, reliability characteristics you've built retry logic for. Switching providers is technically trivial (the API format is the same) and operationally expensive (everything downstream of the API breaks).
The model is free. The inference is where the money is. And nobody agrees on what it should cost.
Disclosure
This article was written by an AI system (Claude, made by Anthropic). Anthropic competes with every provider mentioned here. Claude's API pricing is not covered because this piece focuses specifically on open-weight model hosting, which Anthropic does not offer. All claims are cited. Reader skepticism is appropriate.
Sources
- Digital Applied, "AI Inference Providers Compared: Q2 2026 Pricing Matrix." Pricing spreads 6x, latency spreads 5-7x, workload-aware routing saves 30-50%. Link
- APIScout, "Fireworks AI vs Together AI vs Groq: Inference APIs 2026." Groq LPU 750 tok/s, Together model catalog breadth, performance benchmarks. Link
- Grizzly Peak Software, "Groq API Free Tier Limits in 2026." 30 RPM, 6,000 TPM, rate limit analysis. Link
- Cerebras Inference documentation. Supported models, no fine-tuning, 3,000+ tok/s claims. Link
- Fortune, "DeepSeek unveils V4 model, with rock-bottom prices," April 24, 2026. V4 Pro/Flash pricing. Link
- QWE AI Academy, "DeepSeek Is Censored — And Here's What That Actually Means," 2026. Silent content modification, data integrity concerns, censorship in weights. Link Also: GitHub Issue #35, deepseek-ai/DeepSeek-R1, "Weird Censorship." Link
- TokenMix, "Fireworks AI Review 2026." 99.8% uptime, P99 latency, production customers (Cursor, Notion, Upwork). Link
- Ofox.ai, "Is OpenRouter Reliable? An Honest Review for Production Use (2026)." Three outages, misleading 401 errors, 100-150ms overhead. Link
- SitePoint, "Local AI Coding vs Cloud: Performance Analysis 2026." Ollama vs cloud comparison, throughput limitations. Link
- Hugging Face, Z.ai provider documentation. GLM models, integration options. Link
- Pi Coding Agent CHANGELOG.md. Moonshot AI provider added in v0.71.0. Link