The Problem with Generic AI Responses
Ask Claude to explain OAuth and you get a technically correct answer aimed at a statistical average of everyone who might ask that question. It scaffolds too much for an engineer, too little for a non-technical founder. It uses jargon when plain language would land better, then over-explains things you already know to compensate.
This is not a model quality problem. It is a calibration problem. The model has no idea who you are.
Darmok fixes this by watching how you write and building a live profile from your actual messages, then injecting routing signals before every response so Claude adjusts to the specific person it is talking to.
Named After Star Trek
The name comes from "Darmok," Star Trek TNG season 5, episode 2. In it, the Tamarians communicate entirely through shared narrative references. When those references are not shared, communication collapses no matter how fluent both sides are.
That is the real problem with AI communication. It is not vocabulary. It is whether the explanation lands on ground the person already stands on.
How It Works
Darmok runs as a UserPromptSubmit hook in Claude Code. Before Claude sees your message, the hook pulls together three pieces:
- Calibrator sidecar: a FastAPI service running locally at port 8050. It parses every message you send using spaCy, extracts three metrics (lexical diversity, syntactic depth, domain density), and stores them in a rolling SQLite window. After 20 messages, it produces a reliable User Capability Profile.
- DARMOK signal: built from your profile. Tells Claude the detected intent mode (action, status check, expand, explore, converse, prism), a hard brevity cap derived from your message length, and a routing decision: direct, substitute, or darmok.
- LEXICON signal: compares topic terms in your current prompt against your personal vocabulary index. Flags any terms you have never used in your own messages so Claude knows to anchor or explain them rather than assume familiarity.
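To make the "rolling SQLite window" concrete, here is a minimal sketch using only the standard library. It assumes the three metrics arrive as plain floats (in Darmok they come from spaCy, which this sketch does not call), and the table name, schema, and 50-message window size are hypothetical. Only the 20-message threshold comes from the description above. `type_token_ratio` is a swapped-in, common stand-in for lexical diversity, not necessarily what the calibrator computes.

```python
import re
import sqlite3
from statistics import mean
from typing import Optional

# The 20-message threshold comes from the article; the window size,
# table name, and schema below are hypothetical.
MIN_MESSAGES = 20
WINDOW_SIZE = 50


def type_token_ratio(text: str) -> float:
    """A simple stand-in for lexical diversity: unique words / total words."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) per-message metrics table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS metrics (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               lexical_diversity REAL NOT NULL,
               syntactic_depth REAL NOT NULL,
               domain_density REAL NOT NULL
           )"""
    )


def record(conn: sqlite3.Connection, lexical: float, depth: float, density: float) -> None:
    """Store one message's metrics and drop rows that fall outside the window."""
    conn.execute(
        "INSERT INTO metrics (lexical_diversity, syntactic_depth, domain_density) "
        "VALUES (?, ?, ?)",
        (lexical, depth, density),
    )
    conn.execute(
        "DELETE FROM metrics WHERE id NOT IN "
        "(SELECT id FROM metrics ORDER BY id DESC LIMIT ?)",
        (WINDOW_SIZE,),
    )
    conn.commit()


def profile(conn: sqlite3.Connection) -> Optional[dict]:
    """Average the window into a profile, or None (degraded mode) if too few messages."""
    rows = conn.execute(
        "SELECT lexical_diversity, syntactic_depth, domain_density FROM metrics"
    ).fetchall()
    if len(rows) < MIN_MESSAGES:
        return None
    lexical, depth, density = zip(*rows)
    return {
        "lexical_diversity": mean(lexical),
        "syntactic_depth": mean(depth),
        "domain_density": mean(density),
        "messages": len(rows),
    }
```

Averaging over a bounded window lets the profile drift with the user's writing rather than locking in their first sessions; whether Darmok averages or weights recent messages more heavily is not stated.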
The resulting signals land in Claude's context as a system block before the response is generated. Total overhead is around 150 milliseconds and a few hundred tokens per turn.
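A rough sketch of the hook side, assuming Claude Code passes the hook a JSON payload on stdin with a `prompt` field and adds a UserPromptSubmit hook's stdout to the context. The `/signals` path, the request body, and the response fields (`mode`, `route`, `cap`, `unknown_terms`) are hypothetical; only port 8050 comes from the description above. The sidecar call is injectable so the formatting logic can be exercised without a running service.

```python
import json
import urllib.request
from typing import Callable, Optional

# Hypothetical endpoint: the article places the calibrator on port 8050,
# but the "/signals" path and the response fields are assumptions.
CALIBRATOR_URL = "http://localhost:8050/signals"


def fetch_signals(prompt: str, timeout: float = 0.5) -> Optional[dict]:
    """Ask the sidecar for routing signals; return None if it is unreachable."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    req = urllib.request.Request(
        CALIBRATOR_URL, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # URLError and socket timeouts are OSErrors; bad JSON is a ValueError.
        return None


def format_block(signals: dict) -> str:
    """Render the DARMOK and LEXICON signals as plain text for Claude's context."""
    cap = signals.get("cap")
    lines = [
        "DARMOK mode={} route={} cap={}".format(
            signals.get("mode", "converse"),
            signals.get("route", "direct"),
            "none" if cap is None else cap,
        )
    ]
    unknown = signals.get("unknown_terms") or []
    if unknown:
        lines.append("LEXICON unfamiliar=" + ",".join(unknown))
    return "\n".join(lines)


def handle_hook(raw_input: str, fetch: Callable[[str], Optional[dict]] = fetch_signals) -> str:
    """Return the text a UserPromptSubmit hook would print to stdout."""
    try:
        payload = json.loads(raw_input)
    except ValueError:
        return ""
    signals = fetch(payload.get("prompt", ""))
    return format_block(signals) if signals else ""
```

A hook script would then do `print(handle_hook(sys.stdin.read()))`. Failing open (printing nothing when the sidecar is down) keeps a dead calibrator from blocking prompts, and the short timeout keeps the per-turn overhead bounded.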
Three Routing Tiers
The calibrator computes an Explanation Gap Score by comparing the complexity of the explanation space against your measured profile. That score maps to one of three routes:
- Direct: your vocabulary matches the explanation space. Claude answers normally without adjustment.
- Substitute: label gap detected. You understand the concept but not the jargon. Claude strips technical terms and restates using words you have actually used.
- Darmok: frame gap detected. The concept has no everyday equivalent. Claude finds an anchor in your known domains and explains through that anchor before introducing technical terms.
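The score-to-route mapping can be sketched as two thresholds. Everything numeric here is hypothetical: the article does not publish how the Explanation Gap Score is computed or where the cut-offs sit, so `gap_score` below is just "explanation complexity minus measured capability, floored at zero", and `SUBSTITUTE_AT` and `DARMOK_AT` are placeholder values.

```python
from enum import Enum

# Hypothetical cut-offs; the real thresholds are not published.
SUBSTITUTE_AT = 0.2
DARMOK_AT = 0.5


class Route(Enum):
    DIRECT = "direct"
    SUBSTITUTE = "substitute"
    DARMOK = "darmok"


def gap_score(explanation_complexity: float, user_capability: float) -> float:
    """Hypothetical gap: how far the explanation space exceeds the user's profile."""
    return max(0.0, explanation_complexity - user_capability)


def route(score: float) -> Route:
    """Map a gap score onto one of the three routing tiers."""
    if score < SUBSTITUTE_AT:
        return Route.DIRECT
    if score < DARMOK_AT:
        return Route.SUBSTITUTE
    return Route.DARMOK
```

A single scalar treats substitute and darmok as points on one scale; the article frames them as different kinds of gap (label versus frame), so the real calibrator may use more than one signal to tell them apart.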
The difference between substitute and darmok matters. Substitute is vocabulary translation. Darmok builds a bridge from the person's side, not the AI's.
The Lexicon Component
The personal lexicon is a word frequency index rebuilt from your message history in the calibrator database. It solves a specific failure mode: the calibrator measures how you write, but not what you know about a given topic.
Consider a person with networking experience who writes conversationally in a Claude session. Their profile scores low on domain density. The system routes direct and explains OAuth at default depth, missing that they already understand the mechanics and only need the vocabulary anchored.
The lexicon check adds a second layer: if you ask about OAuth and have never typed the word "oauth" before, that term gets flagged regardless of your overall profile score, so Claude knows to anchor that specific term rather than assume familiarity with the whole concept.
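A minimal sketch of the lexicon check: count every word the user has typed, then flag prompt terms with a count of zero. The tokenizer regex and the tiny stopword list are hypothetical simplifications, and `build_word_index` is an illustrative name, not the contents of the repository's build_lexicon.py.

```python
import re
from collections import Counter
from typing import Iterable, List

WORD_RE = re.compile(r"[a-z][a-z0-9'-]*")

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {
    "a", "an", "and", "do", "does", "explain", "how", "i", "is", "it",
    "me", "of", "the", "to", "what", "work", "works", "you",
}


def tokenize(text: str) -> List[str]:
    return WORD_RE.findall(text.lower())


def build_word_index(messages: Iterable[str]) -> Counter:
    """Word frequencies across everything the user has typed."""
    index: Counter = Counter()
    for message in messages:
        index.update(tokenize(message))
    return index


def unfamiliar_terms(prompt: str, index: Counter) -> List[str]:
    """Terms in the prompt the user has never used themselves, in prompt order."""
    flagged: List[str] = []
    for word in tokenize(prompt):
        if word in STOPWORDS or word in flagged:
            continue
        if index[word] == 0:  # Counter returns 0 for unseen keys
            flagged.append(word)
    return flagged
```

For a user whose history mentions VPN tunnels, DNS resolvers, and expired tokens, `unfamiliar_terms("explain how OAuth works", index)` returns `["oauth"]`. A binary "ever typed it" test is exactly the weakness noted later: it cannot tell exposure from understanding.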
Brevity as a Signal
One of the cleaner behaviors that emerged from the system: the response length cap scales with your message length. A nine-word message gets a twenty-seven-word cap; write longer and the cap expands. Certain modes (action, expand, status check) lift the cap entirely, because you have signaled you want the full answer.
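The nine-to-twenty-seven example implies a 3x multiplier. This sketch assumes that ratio holds at every length and counts words by whitespace splitting; both are assumptions, since the article gives only the one data point. Returning `None` stands for "cap lifted".

```python
from typing import Optional

# 3x matches the article's example (nine words -> twenty-seven);
# applying it uniformly at every length is an assumption.
CAP_MULTIPLIER = 3
UNCAPPED_MODES = {"action", "expand", "status check"}


def brevity_cap(message: str, mode: str) -> Optional[int]:
    """Return a word cap for the response, or None when the mode lifts it."""
    if mode in UNCAPPED_MODES:
        return None
    return CAP_MULTIPLIER * len(message.split())
```

A production version would probably want a floor, so a one-word "why?" does not cap the reply at three words; the article does not say whether Darmok has one.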
This prevents the common failure of a short question getting a five-paragraph response the person has to skim for the actual answer.
Validated Against Vanilla
To test whether the calibration actually worked, I asked a vanilla Claude Sonnet instance (fresh session, no profile, no signals) the same question I had asked through Darmok: "explain how OAuth works." Both answers went to Claude Opus for blind evaluation.
Opus scored the Darmok answer higher on accessibility and anchoring (the KeePassXC reference, the valet key analogy). It scored the vanilla answer higher on accuracy and density, specifically because vanilla included the two-step security rationale that Darmok had skipped.
The verdict: calibration helped, but it slightly under-trusted the user's networking baseline and dropped the one technical insight they would have found satisfying. That is the next layer to build: distinguishing between terms a person has been exposed to versus terms they have actively built with.
Cold Start and Backfill
A fresh install starts with no profile, and the system runs in degraded mode until 20 messages accumulate. If you have prior conversation history (ChatGPT exports, prior Claude sessions), you can backfill: pipe the messages through the calibrator's /analyze endpoint, then run build_lexicon.py to rebuild the vocabulary index immediately. The SIGIL in the repository walks through exactly when this is and is not worth doing.
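A sketch of the backfill step for a ChatGPT export. It assumes the export's conversations.json is a list of conversations, each with a `mapping` of nodes whose `message` carries `author.role`, `content.parts`, and `create_time`; that layout is an assumption and may change between export versions. The `{"text": ...}` request body for /analyze is also hypothetical; only the endpoint name comes from the article.

```python
import json
import urllib.request
from typing import Iterator

# Hypothetical: the /analyze request body is assumed to be {"text": ...}.
ANALYZE_URL = "http://localhost:8050/analyze"


def user_messages(export: list) -> Iterator[str]:
    """Yield user-authored text from a ChatGPT export (structure assumed)."""
    for conversation in export:
        nodes = (conversation.get("mapping") or {}).values()
        messages = [n["message"] for n in nodes if n.get("message")]
        messages.sort(key=lambda m: m.get("create_time") or 0)
        for message in messages:
            if (message.get("author") or {}).get("role") != "user":
                continue
            parts = (message.get("content") or {}).get("parts") or []
            # Non-string parts (images, attachments) are skipped.
            text = " ".join(p for p in parts if isinstance(p, str)).strip()
            if text:
                yield text


def post_for_analysis(text: str) -> None:
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        ANALYZE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()


def backfill(path: str) -> int:
    """Send every user message in the export to the calibrator; return the count."""
    with open(path, encoding="utf-8") as f:
        export = json.load(f)
    count = 0
    for text in user_messages(export):
        post_for_analysis(text)
        count += 1
    return count
```

Only user turns are sent, since the profile is meant to describe how you write, not how the assistant answered. After `backfill("conversations.json")` finishes, build_lexicon.py rebuilds the vocabulary index as described above.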
What Is Next
Three things are scoped but not yet built:
- Outcome logging: capture implicit signals from follow-up messages (correction, acceptance, expansion, abandonment) as training labels to validate whether calibration actually landed.
- ML mode classifier: the current mode detection uses keyword patterns. A small classifier trained on labeled message data would be more reliable, especially for ambiguous prompts.
- Topic knowledge layer: check Hyphae recall results for the current topic to distinguish exposure evidence (you stored a credential) from understanding evidence (you explained the concept). This is the fix for the OAuth under-trust failure.
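For comparison with the planned classifier, this is roughly what keyword-pattern mode detection looks like. The patterns are hypothetical, and "prism" is left out because the article does not describe what triggers it. Matching is first-hit-wins, which is the source of the ambiguity problem: a prompt such as "what is the status of the OAuth work" matches the status check pattern before the explore pattern, purely because of list order.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
MODE_PATTERNS = [
    ("status check", re.compile(r"\b(status|is it (done|running)|did it (work|finish)|progress)\b")),
    ("action", re.compile(r"^(please\s+)?(fix|add|run|create|write|update|delete|install|deploy)\b")),
    ("expand", re.compile(r"\b(more detail|elaborate|go deeper|expand on|tell me more)\b")),
    ("explore", re.compile(r"\b(explain|how does|how do|why does|what is|what are)\b")),
]
DEFAULT_MODE = "converse"


def detect_mode(prompt: str) -> str:
    """Return the first mode whose pattern matches, else the default."""
    text = prompt.strip().lower()
    for mode, pattern in MODE_PATTERNS:
        if pattern.search(text):
            return mode
    return DEFAULT_MODE
```

A trained classifier would replace the ordered list with scores per mode, so overlapping cues can be weighed instead of resolved by position.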
The core system is working and live. Repository: github.com/benolenick/darmok
Design Principle
The ocean knows all depths. One wave reaches the shore, just high enough. Darmok is the mechanism that decides how high.