WST: Building an AI-Powered Knowledge Extraction Pipeline for Technical Books

The Problem with Knowledge Trapped in Books

I've been building AI agents that need deep, specific technical knowledge — the kind that exists in books but isn't accessible at the moment you need it. Not "what is Kubernetes" — every model knows that. What you actually need in the middle of debugging a production issue is: the exact kubectl debug invocation for an OOMKilled container, or which pg_stat view shows lock contention on a specific table, or what the three distinguishing criteria are between two similar drug interactions.

That knowledge exists. It's in books like Kubernetes in Action, Database Internals, Designing Data-Intensive Applications, and hundreds of other technical references across every domain. But it's trapped inside 500-page narratives. You can't do a semantic search across your entire technical library in real time. An agent can't ctrl+F a book while it's working through a problem.

So I built WST to solve exactly that.

The Name: Wan Shi Tong

Wan Shi Tong — "He Who Knows Ten Thousand Things" — is the spirit library owl from Avatar: The Last Airbender. He guards an infinite underground library containing all knowledge ever gathered by the world. The metaphor felt right: a vast library of technical knowledge, accessible by query, returning exactly what you need at the moment you need it.

WST on GitHub: github.com/benolenick/WST — open source, MIT licensed.

How It Works: Four Stages

WST is a four-stage pipeline. Drop books into a folder, run the pipeline, and within hours your entire technical library becomes semantically searchable.

Stage 1: Ingest

The ingester extracts raw text from whatever format your books are in. PDFs go through pdftotext (poppler) for layout-aware extraction, with PyPDF2 as a fallback. EPUBs use ebooklib. AZW3 and MOBI files get converted to EPUB first via Calibre's ebook-convert. CHM files — still common for older technical references — get extracted with 7z and parsed with BeautifulSoup.
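The format routing above can be sketched as a small dispatcher that builds the external command for each file type. This is an illustration, not WST's actual ingester: the function name, the output layout, and the choice of the -layout and -y flags are assumptions, although pdftotext, ebook-convert, and 7z do accept the arguments shown.

```python
from pathlib import Path
from typing import List, Optional


def extraction_command(book: Path, out_dir: Path) -> Optional[List[str]]:
    """Return the external command for a book, or None if no tool is needed.

    Hypothetical helper: the real ingester may structure this differently.
    """
    suffix = book.suffix.lower()
    if suffix == ".pdf":
        # -layout asks pdftotext to preserve the physical page layout.
        return ["pdftotext", "-layout", str(book), str(out_dir / f"{book.stem}.txt")]
    if suffix in {".azw3", ".mobi"}:
        # Calibre converts to EPUB; the EPUB is then read like any other.
        return ["ebook-convert", str(book), str(out_dir / f"{book.stem}.epub")]
    if suffix == ".chm":
        # 7z unpacks the HTML pages; -o sets the output directory, -y skips prompts.
        return ["7z", "x", str(book), f"-o{out_dir / book.stem}", "-y"]
    # EPUBs are parsed in-process (for example with ebooklib), so no command.
    return None
```

A caller would pass the returned list to subprocess.run and fall back to a pure-Python reader such as PyPDF2 if the PDF command fails.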

Each book gets a content hash so the pipeline knows when re-processing is needed. All state is tracked in a local state.json file.

$ python3 pipeline.py ingest
  [ingest] kubernetes_in_action.pdf ... OK (2,847,391 chars)
  [ingest] designing_data_intensive_apps.pdf ... OK (1,923,847 chars)
  [ingest] database_internals.epub ... OK (1,104,293 chars)

The extracted text is written to the extracted/ directory.
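A minimal sketch of the content-hash check, to show how a pipeline can decide whether a book needs re-processing. SHA-256 and the state.json layout (one entry per filename with a "hash" key) are assumptions for illustration; the source only says that a content hash is stored in state.json.

```python
import hashlib
import json
from pathlib import Path


def content_hash(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def load_state(state_file: Path) -> dict:
    """Read the state file, or start empty if it does not exist yet."""
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def needs_ingest(book: Path, state_file: Path) -> bool:
    """True if the book is new or its bytes changed since the last run."""
    state = load_state(state_file)
    return state.get(book.name, {}).get("hash") != content_hash(book)


def record_ingest(book: Path, state_file: Path) -> None:
    """Store the current hash so an unchanged book is skipped next time."""
    state = load_state(state_file)
    state[book.name] = {"hash": content_hash(book)}  # assumed schema
    state_file.write_text(json.dumps(state, indent=2))
```

Hashing the bytes rather than comparing modification times means a re-downloaded but identical file is not processed twice.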

Stage 2: Extract

This is where the real work happens. The extractor chunks each text file into 8,000-character pieces with a 500-character overlap (so a sentence or command split at one chunk boundary still appears whole in the neighboring chunk), then sends each chunk to a local LLM via Ollama with a specialized extraction prompt.
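Fixed-size chunking with overlap can be sketched in a few lines. The function name and signature are hypothetical; only the 8,000 and 500 defaults come from the description above.

```python
def chunk_text(text: str, size: int = 8000, overlap: int = 500) -> list[str]:
    """Split text into windows of `size` chars; neighbors share `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The last 500 characters of each chunk are repeated at the start of the next, which is also one reason the same fact can be extracted twice from a single book.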

The prompt is the key design decision — covered in detail below. The LLM returns a JSON array of facts for each chunk. Those facts get saved to the facts/ directory.

$ python3 pipeline.py extract
  [extract] kubernetes_in_action
    356 chunks (2,847,391 chars)
    chunk 1/356 ... 12 facts
    chunk 2/356 ... 8 facts
    chunk 3/356 ... 19 facts
    ...
    Total: 3,847 unique facts (from 4,102 raw)
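The per-chunk call can be sketched as follows. The /api/generate endpoint, the "stream" flag, the "response" field, and port 11434 are Ollama's documented defaults; everything else is an assumption, including the function names, the timeout, and the idea that each fact is a plain string. Models often wrap JSON in extra prose, so the parser looks for the outermost array rather than trusting the whole reply.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default port


def parse_facts(raw: str) -> list[str]:
    """Pull the outermost JSON array out of a model reply; [] on failure."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Assumption: facts are strings; anything else is discarded.
    return [s.strip() for s in items if isinstance(s, str) and s.strip()]


def extract_chunk(chunk: str, prompt: str, model: str) -> list[str]:
    """Send one chunk to a local Ollama server and return the parsed facts."""
    body = json.dumps(
        {"model": model, "prompt": f"{prompt}\n\n{chunk}", "stream": False}
    ).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        reply = json.loads(resp.read())
    return parse_facts(reply.get("response", ""))
```

Returning an empty list on malformed output lets a long run continue past a bad chunk instead of aborting the whole book.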

Stage 3: Dedup

Before seeding new facts into Memoria (the semantic search store described under "Integration with Memoria" below), WST checks each one against what's already there using semantic similarity. Any fact scoring above 0.85 cosine similarity to an existing entry gets dropped. This prevents the knowledge base from becoming cluttered with near-duplicate facts across books that cover the same topics.

$ python3 pipeline.py dedup
  Memoria online: 2,847 existing facts
  [dedup] kubernetes_in_action: 3,847 facts to check
    Kept 2,931, removed 916 duplicates
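The threshold check itself is simple once each fact has an embedding. The sketch below uses plain Python lists as stand-ins for embedding vectors and a linear scan in place of an index lookup. It also compares each new fact against facts kept earlier in the same batch, which is my own addition and not something the description above states.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup(new, existing, threshold: float = 0.85) -> list[str]:
    """Keep facts whose similarity to every known vector is <= threshold.

    new: list of (fact_text, vector) pairs; existing: list of vectors.
    """
    kept = []
    seen = list(existing)
    for fact, vec in new:
        if all(cosine(vec, other) <= threshold for other in seen):
            kept.append(fact)
            seen.append(vec)  # assumption: also dedup within the batch
    return kept
```

With real embeddings the scan would be replaced by a nearest-neighbor query, since only the single closest existing vector matters for the comparison.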

Stage 4: Seed

The deduplicated facts get loaded into Memoria via its /memorize API endpoint. Memoria uses sentence-transformers (the all-MiniLM-L6-v2 model) to encode each fact as a vector embedding and stores it in a FAISS index, enabling sub-second semantic search across thousands of facts.

$ python3 pipeline.py seed
  Memoria online: 2,847 facts before seeding
  [seed] kubernetes_in_action: 2,931 facts
    seeded 2,931, errors 0
  FV: 2,847 → 5,778 facts (+2,931)
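A seeding loop might look like the sketch below. Only the /memorize path and the local address (taken from the search example later in this post) come from the source. The payload field names "text" and "source" are guesses, so check Memoria's actual request schema before relying on them.

```python
import json
import urllib.request

MEMORIA_URL = "http://127.0.0.1:8000"


def build_memorize_request(fact: str, source: str) -> urllib.request.Request:
    """Build the POST request for one fact (field names are assumed)."""
    payload = json.dumps({"text": fact, "source": source}).encode()
    return urllib.request.Request(
        f"{MEMORIA_URL}/memorize",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def seed(facts: list[str], source: str) -> tuple[int, int]:
    """Send each fact to Memoria; return (seeded, errors) counts."""
    seeded = errors = 0
    for fact in facts:
        try:
            with urllib.request.urlopen(build_memorize_request(fact, source), timeout=30):
                seeded += 1
        except OSError:  # URLError and HTTPError are both OSError subclasses
            errors += 1
    return seeded, errors
```

Counting errors instead of raising matches the "seeded N, errors M" summary line in the output above.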

The Extraction Prompt: Why It Matters

The quality of extracted knowledge depends entirely on what you ask the LLM to look for. A naive "summarize this text" prompt produces useless output — vague descriptions of concepts that don't help when you need a specific answer to a specific problem.

WST's prompt asks for two very specific things:

Individual Actionable Facts

Commands and procedures with exact syntax (not "use kubectl" — the actual command with all flags and arguments)
Configuration values, default ports, file paths, environment variables, thresholds
Decision criteria — the specific conditions under which you'd choose one approach over another
Tool-specific tips with actual parameters and flags
Error messages mapped to their specific resolutions

Decision Trees in IF/THEN Format

This is the part that makes WST genuinely useful for AI agents. The prompt specifically asks the LLM to extract multi-step procedures as conditional decision trees:

"K8s Pod Debugging: IF (Pod == CrashLoopBackOff) AND (Exit Code == 137)
  THEN check resource limits with kubectl describe pod
  → IF (Last State == OOMKilled)
    THEN increase memory limit or profile app memory usage
  → IF (limits seem adequate)
    THEN check for memory leaks with kubectl exec + profiler"

"Database Replication Lag: IF (replica_lag > 30s) AND (write_throughput == normal)
  THEN check replica IO thread status
  → IF (Seconds_Behind_Master increasing)
    THEN check slow query log on replica
  → IF (IO thread stopped)
    THEN check network + SHOW SLAVE STATUS for error"

"Circuit Breaker Pattern: IF (error_rate > threshold) AND (consecutive_failures > N)
  THEN open circuit breaker → return fallback response
  → after timeout period, enter half-open state
  → IF (next request succeeds) THEN close circuit
  → IF (next request fails) THEN re-open circuit"

When an agent encounters a specific condition, it queries Memoria and gets back the complete decision tree of what to check and try next. The knowledge is already structured as a procedure the agent can follow directly — no need to reason about it from first principles.

The prompt also strips out everything that isn't actionable: general theory, historical context, architecture overviews, basic definitions. A technical book might be 70% background and 30% specific techniques and procedures. WST extracts the 30%.
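To make the two requirements and the exclusion list concrete, here is a hypothetical prompt template built from the points in this section. The wording is illustrative and is not WST's actual prompt; the constant name, the function name, and the "{title}" and "{chunk}" placeholders are my own.

```python
# Illustrative wording only; the real WST prompt may differ substantially.
EXTRACTION_PROMPT = """\
You are extracting knowledge from the technical book "{title}".

Return ONLY a JSON array of strings. Each string is one of:
1. An individual actionable fact: an exact command with all flags, a
   configuration value, default port, file path, environment variable,
   threshold, decision criterion, or an error message with its resolution.
2. A decision tree for a multi-step procedure, written as
   "Topic: IF (condition) AND (condition) THEN action -> IF (...) THEN ...".

Skip general theory, historical context, architecture overviews, and basic
definitions. If the text contains nothing actionable, return [].

TEXT:
{chunk}
"""


def build_prompt(chunk: str, title: str) -> str:
    """Fill the template for one chunk of one book."""
    return EXTRACTION_PROMPT.format(title=title, chunk=chunk)
```

Asking for a bare JSON array and an explicit empty result keeps the output machine-parseable even for chunks that are pure background.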

Web Source Integration

Beyond books, WST can ingest structured web documentation — any repository with well-organized YAML, markdown, or structured data files.

Sources with well-defined schemas get parsed directly without involving an LLM. Each structured entry gets converted into a searchable fact. This is fast and reliable — no LLM hallucination risk, no token cost.

Unstructured documentation (prose-heavy markdown wikis, tutorial sites) goes through the full LLM extraction stage. WST converts the markdown to plain text, groups it by section, and queues it for extraction just like book content.

The ingest_web.py script handles the detection and routing — it figures out whether a source is structured enough for direct parsing or needs the LLM treatment.
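One simple way to approximate that routing is by file type, with structured entries flattened straight into fact strings. This is a sketch under assumptions, not the logic in ingest_web.py: the suffix sets, the route labels, and the "key: value" fact format are all hypothetical, and a real detector would presumably inspect content as well, since markdown can be either structured or prose.

```python
from pathlib import Path

# Assumed groupings; a real detector would likely look at file content too.
STRUCTURED_SUFFIXES = {".yaml", ".yml", ".json", ".csv"}
PROSE_SUFFIXES = {".md", ".markdown", ".rst", ".txt", ".html"}


def route(path: Path) -> str:
    """Return 'direct' (parse without an LLM), 'llm', or 'skip'."""
    suffix = path.suffix.lower()
    if suffix in STRUCTURED_SUFFIXES:
        return "direct"
    if suffix in PROSE_SUFFIXES:
        return "llm"
    return "skip"


def entry_to_fact(entry: dict) -> str:
    """Flatten one structured record into a single searchable string."""
    parts = [f"{key}: {value}" for key, value in entry.items()
             if value not in (None, "", [])]
    return "; ".join(parts)
```

Because the direct path is deterministic, the same input always yields the same facts, which also makes re-runs cheap to verify.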

Priority Queue Integration

WST is a background process — it should never compete with live work. When running on a shared machine that also handles active AI agent sessions, resource contention is a real concern. A large LLM loaded for book extraction will conflict with whatever model the agent is using for its current task.

WST handles this in two ways:

Model conflict detection: Before each chunk, WST checks Ollama's /api/ps endpoint to see if a different model is currently running. If so, it backs off exponentially (starting at 2 seconds, up to 5 minutes) rather than forcing a costly model swap.
AI-Shaman queue (optional): If you're running an AI-Shaman priority queue, WST submits all jobs at bulk priority, which automatically yields to higher-priority tasks. Configure via the SHAMAN_QUEUE environment variable.

This means you can leave WST running overnight to process a large library, and it will politely yield whenever something else needs the LLM.
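The model conflict check can be sketched like this. Ollama's /api/ps endpoint does return a "models" list with a "name" per loaded model; the doubling factor, the function names, and the injectable is_busy and sleep parameters are assumptions made so the loop is easy to test.

```python
import json
import time
import urllib.request

OLLAMA_PS_URL = "http://127.0.0.1:11434/api/ps"  # Ollama's default port


def backoff_delays(start: float = 2.0, cap: float = 300.0):
    """Yield 2, 4, 8, ... seconds, doubling each time, capped at 5 minutes."""
    delay = start
    while True:
        yield delay
        delay = min(delay * 2, cap)


def other_model_loaded(my_model: str) -> bool:
    """True if Ollama currently has any model loaded other than ours."""
    with urllib.request.urlopen(OLLAMA_PS_URL, timeout=10) as resp:
        running = json.loads(resp.read()).get("models", [])
    return any(m.get("name") != my_model for m in running)


def wait_for_turn(my_model: str, is_busy=other_model_loaded, sleep=time.sleep) -> None:
    """Block until no other model is loaded, sleeping longer after each check."""
    for delay in backoff_delays():
        if not is_busy(my_model):
            return
        sleep(delay)
```

Calling wait_for_turn before each chunk keeps the check cheap when the machine is idle, because the first probe returns immediately.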

Integration with Memoria

WST is built to work with Memoria, a lightweight semantic search API I built as the knowledge layer for AI agents. Memoria encodes facts as vector embeddings with sentence-transformers, stores them in a FAISS index, and runs a cosine similarity search to find the most relevant facts for any natural-language query.

From an agent's perspective, using WST-loaded knowledge looks like this:

import urllib.request, json

# Agent needs to debug a Kubernetes issue
query = "kubernetes pod crashloopbackoff OOMKilled debugging"

payload = json.dumps({"query": query, "top_k": 3}).encode()
req = urllib.request.Request(
    "http://127.0.0.1:8000/search",
    data=payload,
    headers={"Content-Type": "application/json"}
)
with urllib.request.urlopen(req) as resp:
    results = json.loads(resp.read())

# Returns:
# "K8s Pod Debugging: IF (Pod == CrashLoopBackOff) AND (Exit Code == 137)
#   THEN check resource limits with kubectl describe pod
#   → IF (Last State == OOMKilled) THEN increase memory limit..."

The agent gets back the exact procedure it needs, extracted from technical reference material, structured as a decision tree it can follow directly.

Why This Is Hard (and Worth Doing)

Extracting structured knowledge from books is a genuinely difficult problem. Books are written for sequential reading — the knowledge is woven into explanations, examples, and narrative context. You need an LLM that can distinguish "here is the exact command you'd run" from "here is a discussion about the concept behind the command." Naive summarization loses the precision that makes extracted facts useful.

The dedup problem is also non-trivial. Five books on distributed systems will all cover consensus algorithms. Without semantic deduplication, your knowledge base fills up with near-identical facts that dilute search result quality. The 0.85 cosine similarity threshold is tuned to let through facts that differ in substance while blocking rephrasings of the same fact.

And the prompt engineering matters more than you'd expect. The difference between "extract important information" and "extract actionable facts as IF/THEN decision trees" is the difference between a knowledge base of vague summaries and one that actually helps an agent solve problems.

The payoff: once your library is processed, you have a searchable knowledge base that any tool, agent, or script can query with a single HTTP call. Weeks of reading become accessible in milliseconds. And it works for any domain — engineering, medicine, law, research — anywhere books contain knowledge that needs to be findable.

The full code is on GitHub: github.com/benolenick/WST