Local LLMs Are Now a Decision Lens, Not a Benchmark Race

By Sarah Collins, Director of Research & Intelligence — TPM Content Research

110 tokens per second on 12GB of VRAM is the wrong headline. The right story is what local inference makes possible, and what it changes about which decisions you can make in the room. The TPMs who treat local LLMs as a benchmark race will write a thousand posts about model speed. The TPMs who treat them as a decision lens will quietly ship programs the rest of the organization cannot even start.

Two signals landed close together this month. Qwen3.6 hit 110 tokens per second on a 12GB consumer GPU. Needle, a 26-million-parameter model, outperformed Gemini on narrow tool-calling tasks. Qwen3.6 is the cost-and-velocity signal. Needle is the specialization signal. Together they describe one shift: the prototype threshold has been crossed.

The threshold that matters

The threshold that matters is not "can local match hosted." It is "can local do enough, fast enough, that the answer to 'should we build this' does not require a procurement conversation first."

Qwen3.6 at 110 tok/s on 12GB of VRAM is comfortably above the prototype threshold for almost any internal workflow task: classification, summarization, structured extraction, draft generation, code review against a small codebase, conversational triage. The model fits on a laptop. The inference runs without a network. The cost per call is electricity. The latency is bounded by the GPU, not the queue. For a TPM trying to answer the question "does this AI workflow make sense for our team," that is enough to run the experiment without involving procurement, security review, or vendor evaluation. The experiment runs on the engineer's desk. The result is real.
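To make "fast enough" concrete, here is a rough back-of-envelope sketch in Python. The workload (500 tickets, about 150 generated tokens each) is a made-up assumption for illustration, and the estimate counts generation time only. It ignores prompt processing and model load time, so treat the result as a lower bound.

```python
# Back-of-envelope: how long does a batch take at a given generation speed?
# The workload numbers are illustrative assumptions, not measurements.

def batch_seconds(num_items: int, output_tokens_per_item: int,
                  tokens_per_second: float) -> float:
    """Estimate generation time for a batch, counting output tokens only.

    Ignores prompt processing, model load time, and any batching effects.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_items * output_tokens_per_item / tokens_per_second


if __name__ == "__main__":
    # Hypothetical experiment: 500 support tickets, ~150-token summary each,
    # at the 110 tok/s figure quoted above.
    seconds = batch_seconds(500, 150, 110.0)
    print(f"{seconds / 60:.1f} minutes")  # prints "11.4 minutes"
```

Under those assumptions the whole batch finishes in roughly the length of a coffee break, which is the sense in which the experiment fits on one desk in one afternoon.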

Needle's 26-million-parameter tool-calling model outperforming Gemini on a narrow benchmark is the same signal at a different scale. The thesis is not "small models beat big models." The thesis is "for a specific workflow, a small specialized model can be the right answer, and the path from prototype to shipped is shorter than the path through vendor selection and procurement." The TPM takeaway is direct: the prototype-to-evidence loop is now fast enough that you can run it before you ask for budget.

The decision lens: when local beats hosted

If you are a TPM deciding where to run inference, here is the five-question decision lens. The answer is rarely "always local" or "always hosted." The answer is "local for these workflows, hosted for those workflows, and the boundary moves as the program matures."

1. Data sensitivity. If the workflow touches PII, PHI, regulated data, customer content under NDA, or any data the legal team has flagged as sensitive, the default is local inference. The cost of a data egress incident is not a benchmark consideration. It is a program-ending consideration. Local inference removes the egress from the architecture entirely. The model is on the workstation or the VPC. The data does not leave the boundary.

2. Iteration velocity. If the workflow is in prototype and you expect to change prompts, swap models, or experiment with retrieval strategies weekly, local inference removes the procurement and approval loop from the iteration cycle. Hosted inference gives you scale. Local inference gives you speed of change. For the first three months of any AI program, speed of change is the binding constraint. Scale becomes the binding constraint later.

3. Latency budget. If the workflow is interactive (a developer waiting for a code completion, a support agent waiting for a suggested reply, a customer waiting for a search result), local inference wins on latency because the round trip is GPU-bound, not network-bound. Hosted inference in the low hundreds of milliseconds is fine. Hosted inference in the multiple-second range is a different product. Local inference in the tens of milliseconds is a different product again. Treat these as illustrative orders of magnitude, not benchmarks.

4. Cost at scale. If the workflow is high-volume and low-margin (bulk classification, large-scale extraction, log triage), local inference wins on cost because the marginal cost is electricity, not per-token API spend. The break-even point depends on utilization. A model running on a workstation at 80% utilization has a very different cost curve than a model running at 5% utilization waiting for the next API call.

5. Capability frontier. If the workflow requires the most capable model available (long-context reasoning, frontier multimodal understanding, novel scientific analysis), local inference loses because the capability frontier lives in hosted frontier models. This is the routing-out criterion for the lens. A workflow that fails question 5 does not run local. The decision is not local versus hosted. The decision is which workflows can run on local and which need to ride the frontier. Most programs need both.
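The utilization point in question 4 can be made concrete with a little arithmetic. The sketch below is one possible cost model, not a definitive one: it assumes the hardware price is amortized over a fixed lifetime and that the machine draws its load power for the whole period, which overstates electricity at low utilization. The function name and every input figure are hypothetical placeholders; substitute your own numbers.

```python
def local_cost_per_million_tokens(
    hardware_cost: float,        # purchase price of the workstation or GPU
    lifetime_hours: float,       # hours over which the hardware is amortized
    power_watts: float,          # assumed draw, charged for every hour
    electricity_per_kwh: float,  # price per kilowatt-hour
    tokens_per_second: float,    # generation speed while busy
    utilization: float,          # fraction of time spent generating, 0 < u <= 1
) -> float:
    """Rough cost per million generated tokens for a locally hosted model."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hourly_hardware = hardware_cost / lifetime_hours
    hourly_power = power_watts / 1000 * electricity_per_kwh
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (hourly_hardware + hourly_power) / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: a 2,000-unit machine amortized over three years,
    # 300 W draw, 0.15 per kWh, 110 tok/s while busy.
    common = dict(hardware_cost=2000, lifetime_hours=3 * 8760, power_watts=300,
                  electricity_per_kwh=0.15, tokens_per_second=110)
    busy = local_cost_per_million_tokens(utilization=0.80, **common)
    idle = local_cost_per_million_tokens(utilization=0.05, **common)
    print(f"80% utilization: {busy:.2f} per million tokens")
    print(f" 5% utilization: {idle:.2f} per million tokens")
```

In this model the hourly cost is fixed, so cost per token is inversely proportional to utilization: the 5% machine costs 16 times more per token than the 80% machine. The number to compare against is your hosted per-token price, which is the break-even the question refers to.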

The five questions are the decision lens. The answer is not a benchmark. It is a workflow-by-workflow assignment.
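One way to turn the five questions into a workflow-by-workflow assignment is a small routing function. The sketch below is illustrative only. The `Workflow` fields, the `NEEDS_REVIEW` outcome for workflows that are both sensitive and frontier-dependent, and the hosted default when no question favors local are all assumptions added here, since the lens itself does not say how to break those ties.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    LOCAL = "local"
    HOSTED = "hosted"
    NEEDS_REVIEW = "needs review"  # assumption: a human resolves conflicts


@dataclass(frozen=True)
class Workflow:
    name: str
    sensitive_data: bool    # Q1: PII, PHI, regulated or NDA-covered data
    weekly_iteration: bool  # Q2: prompts or models change weekly
    interactive: bool       # Q3: a person is waiting on the response
    high_volume: bool       # Q4: bulk, low-margin processing
    needs_frontier: bool    # Q5: requires the most capable model available


def place(w: Workflow) -> tuple[Placement, str]:
    """Return a placement and the reason behind it."""
    if w.needs_frontier and w.sensitive_data:
        return Placement.NEEDS_REVIEW, "sensitive data but frontier capability required"
    if w.needs_frontier:
        return Placement.HOSTED, "Q5: capability frontier"
    if w.sensitive_data:
        return Placement.LOCAL, "Q1: data sensitivity"
    reasons = [label for flag, label in (
        (w.weekly_iteration, "Q2: iteration velocity"),
        (w.interactive, "Q3: latency budget"),
        (w.high_volume, "Q4: cost at scale"),
    ) if flag]
    if reasons:
        return Placement.LOCAL, ", ".join(reasons)
    # Assumption: with nothing favoring local, fall back to hosted.
    return Placement.HOSTED, "no local-favoring criterion applies"


if __name__ == "__main__":
    portfolio = [
        Workflow("ticket triage", True, True, False, True, False),
        Workflow("research synthesis", False, False, False, False, True),
        Workflow("contract analysis", True, False, False, False, True),
    ]
    for w in portfolio:
        placement, reason = place(w)
        print(f"{w.name}: {placement.value} ({reason})")
```

The ordering encodes the lens: question 5 routes a workflow out first, question 1 sets the default, and questions 2 through 4 accumulate as supporting reasons. Re-running the function as a program matures is a cheap way to see the boundary move.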

The TPM angle: prototype velocity is underweighted

Prototype velocity is the most underrated reason to run local inference in your program, and it gets too little weight in the local-inference conversation.

I have watched a prototype sit in procurement for six weeks because the team needed a hosted API account to even run the first experiment. The experiment that would have justified the budget could not start without the budget. Local inference breaks that loop. The engineer downloads a model. The experiment runs that afternoon. The data that justifies the program exists before the program exists.

What this does not solve

I will be specific about the limits, because local inference enthusiasm is real and the gap between "the laptop can run the model" and "the program is ready" is wide.

Local inference does not solve frontier capability gaps. If your workflow needs the most capable model available, local inference is not the answer. The frontier lives in hosted models. The decision is which workflows can run on local models and which need the frontier. Most programs will end up with both layers in the architecture.

Local inference does not solve governance by itself. A local model still has to follow the same data handling, retention, and access control policies as a hosted model. Running locally does not exempt the program from SOC 2, ISO 27001, or your internal AI policy. The governance work is the same. The data boundary is smaller.

Local inference does not solve procurement forever. If the local model becomes a production dependency and the program needs uptime, redundancy, observability, and support, the procurement conversation returns in a different form. The decision shifts from "buy API access" to "buy GPU capacity, support contracts, and an MLOps platform." That is a real conversation with real budget. Local inference delays that conversation; it does not eliminate it.

The signal that matters most

The signal that matters most is the 290-point anticipation thread for Qwen3.7. What the thread signals is not the model itself but a community organizing around local inference as the default expectation. The Qwen line ships open weights on a predictable cadence, runs on every consumer GPU tier, and is the model most engineering teams reach for first when they want to skip the hosted API. The default is shifting under our feet.

If your AI program is gated on procurement or data sensitivity or latency or iteration velocity, the decision lens above is the conversation to have this week. Local inference is no longer a research project. It is a decision you can make today.


Send me your local-versus-hosted decision — which workflow tipped it, what changed in your program velocity, and what surprised you. DM me on LinkedIn (Doron Katz). I am collecting working patterns into a public local inference playbook; five workflows would let me ship it next month.