Tok is a *token* – the smallest unit a language model thinks in. Weight is, well, a *weight* – one of the 800 million numbers that make up the model's brain. They will be the recurring narrators of every chapter.
When you talk to a normal AI app, your text leaves your device, crosses the internet, hits a giant warehouse full of GPUs, and the answer travels all the way back. That round trip is 100–300 ms even on great wifi – and it's billed by the word.
OpenAI charges around $0.0015–$0.06 per 1,000 tokens depending on the model. For a chat app with 10k daily users averaging 2,000 tokens each, that's roughly $30–$1,200/day. Local inference: $0.
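The arithmetic spelled out, using the figures above (the prices and usage numbers are illustrative assumptions, not measured data):

```ts
// Back-of-the-envelope daily cost of cloud inference, using the figures above.
const usersPerDay = 10_000;
const tokensPerUser = 2_000;                       // prompts + completions, averaged
const totalTokens = usersPerDay * tokensPerUser;   // 20,000,000 tokens/day

const cheapRate = 0.0015 / 1000;                   // $ per token, low end
const premiumRate = 0.06 / 1000;                   // $ per token, high end

console.log(totalTokens * cheapRate);              // ≈ $30/day
console.log(totalTokens * premiumRate);            // ≈ $1,200/day
```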
Most cloud APIs reserve the right to log inputs for 'service improvement' (read: training). Even when they don't, your words still travel across networks and servers you don't control. Local inference means the words literally never leave your device.
Browser-local LLMs work on subway commutes, intercontinental flights, rural areas with bad signal, school networks that block API endpoints, and that one cafe where the wifi pretends to work but doesn't.
The whole bet of Mentria: shrink the model small enough that it fits inside a browser tab. No round trip. No bill. No data leaving the device. The brain lives where you live.
Once loaded (~600 MB, cached forever), the model runs entirely on your device's GPU via WebGPU. The page never phones home for inference – only for the initial weight download.
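A minimal sketch of that load-once pattern using standard browser APIs (the weight URL and cache name are hypothetical; Mentria's real loader may differ):

```ts
// Download the weights once, serve them from the browser cache ever after.
async function loadWeights(url = "/models/base-q4.bin"): Promise<ArrayBuffer> {
  if (!("gpu" in navigator)) {
    throw new Error("WebGPU not available in this browser");
  }
  const cache = await caches.open("model-weights-v1");

  // If we've downloaded the weights before, this never touches the network.
  let response = await cache.match(url);
  if (!response) {
    response = await fetch(url);              // one-time ~600 MB download
    await cache.put(url, response.clone());
  }
  return response.arrayBuffer();              // later uploaded into GPU buffers
}
```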
1) Models like Qwen3.5-0.8B are small enough to run locally yet smart enough to be useful. 2) Browsers got direct GPU access via WebGPU. 3) Quantization techniques got good enough to shrink models 3× without breaking them.
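Mentria's exact quantization scheme isn't described here; the ~3× figure typically comes from going from 16-bit floats to roughly 4–5 bits per weight with grouped scales. The sketch below shows the simpler int8 version of the same core idea – store small integers plus one scale per group instead of full-precision floats:

```ts
// Illustrative symmetric quantization of one weight group.
function quantizeGroup(weights: Float32Array): { q: Int8Array; scale: number } {
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  const scale = maxAbs / 127 || 1;   // avoid divide-by-zero for all-zero groups
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    q[i] = Math.max(-127, Math.min(127, Math.round(weights[i] / scale)));
  }
  return { q, scale };
}

// Dequantize on the fly when a weight is needed: w ≈ q * scale.
function dequantizeGroup(q: Int8Array, scale: number): Float32Array {
  return Float32Array.from(q, (v) => v * scale);
}
```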
Transformers.js is a generalist that supports every model, every backend, every browser. That's a hard contract. To uphold it, it can't deeply optimize for any one model. Mentria does the opposite – one model family, one backend, as fast as possible.
LoRA adapters are tiny files (from ~150 KB up to ~40 MB) that change what the model is *good at* – math, code, poetry, summarization. Mentria swaps them mid-conversation in under 20 ms without reloading the 600 MB base. Transformers.js can't do this.
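Why an adapter can be that small and that cheap to swap, assuming the standard LoRA formulation (the matrix names and rank below are illustrative, not Mentria's actual adapter format): a layer's output is the frozen base projection plus a low-rank correction, so swapping an adapter only replaces two small matrices per layer.

```ts
// Minimal LoRA sketch: y = W·x + (alpha/r) · B·(A·x).
// A is (r x dIn), B is (dOut x r), and r is tiny (e.g. 8), so an adapter is
// kilobytes-to-megabytes and swapping it never touches the base weights.
type Matrix = Float32Array[];   // row-major: array of rows

function matVec(m: Matrix, x: Float32Array): Float32Array {
  const out = new Float32Array(m.length);
  for (let i = 0; i < m.length; i++) {
    let sum = 0;
    for (let j = 0; j < x.length; j++) sum += m[i][j] * x[j];
    out[i] = sum;
  }
  return out;
}

function loraForward(
  baseW: Matrix, A: Matrix, B: Matrix, x: Float32Array, alpha: number, r: number
): Float32Array {
  const base = matVec(baseW, x);           // frozen base projection
  const delta = matVec(B, matVec(A, x));   // tiny adapter path
  for (let i = 0; i < base.length; i++) base[i] += (alpha / r) * delta[i];
  return base;
}
```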
Local will not match GPT-5 or Claude Opus on hard tasks. But it wins on privacy, offline, cost, latency, and 'works inside an app that can't afford per-query billing.' Pick the right tool per task – Mentria is built for that second list.
Now that you know *why* Mentria exists, the next chapter walks one token through the entire engine end to end. After that, each of the next twelve chapters zooms in on a single component.