Week 10: I Built OfflineLM for a Friend

Week 10 of Shipping Every Week.

A friend reached out a couple of weeks ago with a problem I couldn't stop thinking about. After nine weeks heads-down on Powr, building something for someone else, for a problem I'd never have found on my own, left me more energized than I'd felt in a while.

The problem

He's training to be a child psychologist and works at a school. Part of the job is writing up evaluations and IEPs — the individualized education programs that decide what support a kid actually gets. These documents are consequential. They're also written for other clinicians, in dense clinical language: percentiles, processing-speed indices, terms like executive function and dysregulation that mean something precise to a specialist and nothing to anyone else.

But the people who most need to understand them are parents. And the average adult in the US reads at about a fifth-grade level. So every report he writes, he ends up hand-translating into plain language a parent can follow. It's slow, it's repetitive, and it's exactly the kind of work an LLM is genuinely good at.

Except he can't use ChatGPT. These are children's private psychological records. Legally and ethically, that data cannot be pasted into a cloud chatbot. The single most useful tool for the job was the one tool he wasn't allowed to touch. That's the gap.

So last week I built OfflineLM

It's a desktop app that runs the model entirely on your own machine. The child's documents never leave the computer — no upload, no API call, no training on their data. You drop in the report, chat with it, and it rewrites the whole thing to a fifth-grade reading level, then hands you back a Word document you can edit.

The "entirely on your own machine" part is the whole product, so it's worth being concrete about how. It's a Tauri desktop app — a Rust core with a web UI — and the model itself runs locally through llama.cpp: a small, capable open model (Phi-4 Mini, around 2.5 GB, quantized to run on a normal laptop) executing on the user's own hardware. There's no server in the loop because there can't be one. It reads and writes real .docx files directly, because that's the format these reports actually live in, and the data on disk is encrypted at rest with a key held in the operating system's keychain rather than sitting next to the file it protects.

The hard part wasn't the rewriting

Any model can make text simpler. The hard part was trusting the rewrite enough to put it in front of a parent making decisions about their kid. Two things had to be true, and both had to be checkable, not just hoped for.

1. It actually has to be readable. "Simpler" is not a feeling, it's a measurement, so I built a real readability gate. It scores the output with a panel of established readability formulas (Flesch-Kincaid, Gunning Fog, Coleman-Liau, and others) and takes the consensus rather than trusting any single one, with a target around a fifth-grade level. On top of the math, it checks the text against a curated list of about a hundred clinical and psych terms that must never appear undefined, terms like "executive function" and "percentile," and it's smart enough to notice when a term is defined inline and let it through. One undefined jargon term is a hard fail. When the output fails the gate, the specific failures get handed back to the model to try again, before a human ever sees a word of it.

2. It can't quietly change the meaning. This is the scary one. The dangerous failure isn't bad grammar, it's "may indicate a learning disability" becoming "has a learning disability." A simplification that hardens a hedge, drops a qualifier, or adds a claim that wasn't there is worse than no simplification at all, because it's confidently wrong about a child. So the guarantee here has a human at its center, by design: nothing is released without the psychologist reviewing the original and the simplified version side by side and approving it. That review is not optional and never will be. On top of that human backstop, I'm building an automated check that compares the two versions and flags exactly these failure modes (added, dropped, or hardened claims), so the reviewer gets pointed straight at the lines most likely to have drifted instead of having to re-read everything cold.
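For the curious, here is a minimal Python sketch of what a readability gate like the one in point 1 could look like. It is not OfflineLM's actual code. The three-term JARGON set, the median as the "consensus," the syllable heuristic, the inline-definition pattern, and the generate callable standing in for the local model are all assumptions made for illustration. Only the three formulas themselves are standard.

```python
import re
from statistics import median

TARGET_GRADE = 5.0
# Hypothetical stand-in for the curated list of about a hundred terms.
JARGON = {"executive function", "percentile", "dysregulation"}


def count_syllables(word):
    """Rough heuristic: count vowel groups, discount a silent trailing 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if n > 1 and word.endswith("e") and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)


def grade_scores(text):
    """Grade-level estimates from three standard readability formulas."""
    sentences = max(len([s for s in re.split(r"[.!?]+", text) if s.strip()]), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    syllables = [count_syllables(w) for w in words]
    letters = sum(len(w.replace("'", "")) for w in words)
    hard = sum(1 for s in syllables if s >= 3)  # "complex" words for Gunning Fog
    return {
        "flesch_kincaid": 0.39 * n / sentences + 11.8 * sum(syllables) / n - 15.59,
        "gunning_fog": 0.4 * (n / sentences + 100 * hard / n),
        "coleman_liau": 0.0588 * (100 * letters / n)
        - 0.296 * (100 * sentences / n) - 15.8,
    }


def undefined_jargon(text):
    """Return listed terms that appear without a crude inline definition."""
    lowered = text.lower()
    missing = []
    for term in sorted(JARGON):
        pattern = r"\b" + re.escape(term) + r"s?"
        if not re.search(pattern + r"\b", lowered):
            continue
        # Assumed definition cues: "term (…)", "term is/are/means …", "term, which is …"
        if not re.search(pattern + r"\s*(\(|,?\s*(which\s+)?(means|is|are)\b)", lowered):
            missing.append(term)
    return missing


def gate(text):
    """Pass only if the median grade is on target and no jargon is undefined."""
    grade = median(grade_scores(text).values())
    failures = [f"undefined term: {t}" for t in undefined_jargon(text)]
    if grade > TARGET_GRADE:
        failures.append(f"reading grade {grade:.1f} is above target {TARGET_GRADE:.1f}")
    return {"passed": not failures, "grade": grade, "failures": failures}


def rewrite_until_pass(generate, source, max_attempts=3):
    """Retry loop: feed the specific failures back to the (hypothetical) model."""
    feedback = []
    for _ in range(max_attempts):
        draft = generate(source, feedback)
        result = gate(draft)
        if result["passed"]:
            return draft
        feedback = result["failures"]
    return None  # nothing passed; do not show a failing draft to anyone
```

The design point is that the failure list doubles as the retry prompt: each entry names one concrete thing to fix, so the next attempt has something specific to work with. The definition check is deliberately crude here ("his percentile is low" would slip through), which is one more reason the human review in point 2 stays mandatory.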

Then I found out a lot of the parents he works with speak Spanish, so I added a second flow: translate a Spanish-language IEP into clear English, faithfully, without simplifying — a different job with the same non-negotiable rule that no finding, number, or diagnosis may change.
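Both flows share the rule that meaning must not drift, and two of the cheapest signals to check are lost hedges and changed numbers. The Python sketch below is one way such flags could work, not the check the app actually ships. The HEDGES list, the word-overlap sentence matching, and the number pattern are assumptions for illustration. The hedge check only makes sense for the English-to-English simplification flow, and the number check ignores numbers spelled out as words.

```python
import re
from collections import Counter

# Hypothetical hedge list; a real one would be reviewed by a clinician.
HEDGES = {"may", "might", "could", "possibly", "likely", "suggests", "appears", "seems"}
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_set(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))


def hardened_hedges(original, rewrite):
    """Pair each hedged source sentence with its closest rewrite sentence
    (by shared words) and flag the pair if the rewrite lost every hedge."""
    candidates = sentences(rewrite)
    flags = []
    for src in sentences(original):
        src_words = word_set(src)
        if not candidates or not (src_words & HEDGES):
            continue
        closest = max(candidates, key=lambda s: len(word_set(s) & src_words))
        if not (word_set(closest) & HEDGES):
            flags.append((src, closest))
    return flags


def number_drift(source, output):
    """Compare numeric tokens as multisets, treating a decimal comma
    (common in Spanish) the same as a decimal point."""
    def numbers(text):
        return Counter(m.group().replace(",", ".") for m in NUMBER.finditer(text))

    src, out = numbers(source), numbers(output)
    return {
        "missing": sorted((src - out).elements()),
        "added": sorted((out - src).elements()),
    }
```

Called on "Results may indicate a learning disability." against "Sam has a learning disability.", hardened_hedges returns that pair as a flag, which is the kind of line a reviewer should look at first. Flags like these only direct attention; they cannot prove a rewrite is faithful, so they sit under the human review rather than replacing it.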

It's local-first, encrypted at rest, and open source under the MIT license as of this weekend.

A real product for a real problem, built for one person I know, with no business model and no growth plan attached. After nine weeks of agonizing over conversion rates, that was clarifying. This is the part of building I like most: someone has a problem only software can fix, and you can be the one who fixes it.

What's a tool you wish existed for your job?
