Case Study · Applied AI · 2026
Materials Science Literature RAG
A production RAG and agent system over materials-science corpora, built solo and deployed live, held to the standard I learned in a regulated test lab: it doesn't ship until an eval proves it works. Three surfaces (browse, hybrid search, and an agent that cites its sources) share one retrieval module. Every non-trivial decision is written down in an ADR.
What it is
Ask a materials-engineering question (which brass alloys are least susceptible to dezincification, how the PREN equation works, how Charpy V-notch impact behavior varies) and get an answer cited back to the source page, with retrieval taking about a quarter-second on a warm cache. Under the hood, three surfaces sit on one retrieval module:
- Browse — a hierarchical view of the corpus, document by document.
- Search — hybrid retrieval (keyword + semantic), exposed as both an HTTP API and a CLI.
- Chat — a Level-2 agentic loop with tool use, SSE streaming, and inline citations. The agent calls the same /search every other surface uses, plus read_section, web_search, fetch_url, and cite_check.
Retrieval is a pure function: (query, filters) → ranked chunks. No surface owns its own retrieval logic; the agent is just another caller. That one decision keeps the system testable: I can evaluate retrieval in isolation, without driving the whole agent.
The corpus is two halves: 30 publicly-redistributable documents (~4,300 pages from NIST, USGS, DOE national labs, NASA, FHWA, NPL, and CC-licensed arXiv reviews) and a private ASM Handbook corpus used for evaluation. They're reported separately throughout, because their gold standards aren't commensurable.
Architecture
The production deployment is a deliberate two-host shape: a public edge that gates and proxies, and a single application host that owns compute and data.
Cloudflare Access gates the hostname at the edge, so only authenticated visitors reach the proxy; the shared-secret header is defense-in-depth so nobody with the raw Fly URL can burn Anthropic credits directly. The whole thing runs on shared-cpu-1x with 2 GB of RAM and scales to zero when idle.
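A minimal sketch of the shared-secret check on the application host, under stated assumptions: the header name `x-proxy-secret` is hypothetical, and the function is framework-neutral rather than the project's actual FastAPI wiring. It compares in constant time and fails closed when no secret is configured.

```python
import hmac
from collections.abc import Mapping

PROXY_SECRET_HEADER = "x-proxy-secret"  # hypothetical header name


def is_from_trusted_proxy(headers: Mapping[str, str], expected_secret: str) -> bool:
    """Return True only if the request carries the secret the edge proxy injects."""
    if not expected_secret:
        # Fail closed: a missing configuration must not open the app to everyone.
        return False
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    presented = normalized.get(PROXY_SECRET_HEADER, "")
    # compare_digest avoids leaking how many leading bytes matched via timing.
    return hmac.compare_digest(presented.encode("utf-8"), expected_secret.encode("utf-8"))
```

A request that reaches the raw application URL without passing through the proxy has no such header, so it can be rejected before any retrieval or LLM call runs.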
Retrieval
Keyword search nails exact terms and notation; semantic search handles paraphrase and concept. Neither wins alone on a technical corpus, so the system runs both and fuses the two ranked lists with reciprocal rank fusion (RRF).
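A minimal sketch of that fusion step, assuming plain reciprocal rank fusion over ranked lists of chunk IDs. The constant `k = 60` is the conventional default from the RRF literature, not a value confirmed by this project, and the function and field names are illustrative. Each retriever's rank is kept next to the fused score so a merged result can be traced back to where it came from.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: dict[str, list[str]],
    k: int = 60,  # conventional RRF constant; assumed, not the project's setting
    top_n: int = 5,
) -> list[tuple[str, float, dict[str, int]]]:
    """Fuse ranked ID lists: score(d) = sum over retrievers of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    provenance: dict[str, dict[str, int]] = defaultdict(dict)
    for retriever_name, ranked_ids in rankings.items():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
            provenance[chunk_id][retriever_name] = rank  # 1-based rank per retriever
    # Highest fused score first; chunk ID breaks ties deterministically.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [(cid, scores[cid], dict(provenance[cid])) for cid in ordered[:top_n]]
```

RRF uses only ranks, never raw scores, which is why it can combine a full-text rank with a vector-distance rank without calibrating the two scales against each other.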
Canonical config, chosen against the eval harness rather than by feel: multi-qa-mpnet-base-cos-v1 embeddings, 384-token chunks at 20% overlap, hnsw.ef_search = 160. The /search/explain endpoint attaches per-retriever provenance to every merged result, so I can see why a chunk surfaced, which is what makes the failures debuggable.
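To make the chunking numbers concrete: 384 tokens at 20% overlap means each window shares roughly 77 tokens with the previous one, so the window advances about 307 tokens at a time. The sketch below assumes a simple sliding window over an already-tokenized document; the function name and the rounding choice are assumptions, and the real pipeline may instead split on section or sentence boundaries.

```python
def chunk_tokens(tokens: list[str], size: int = 384, overlap: float = 0.20) -> list[list[str]]:
    """Split a token list into fixed-size windows that overlap by a fraction of size."""
    if size <= 0 or not 0.0 <= overlap < 1.0:
        raise ValueError("size must be positive and overlap must be in [0, 1)")
    # 384 * 0.20 rounds to 77 shared tokens, giving a stride of 307.
    stride = max(1, size - round(size * overlap))
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

The overlap is there so that a passage straddling a window boundary still appears whole in at least one chunk.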
Evaluation
This is the part most RAG demos skip. There are 95 passage-level labeled retrieval cases across the two corpora, scored as NDCG@5, Recall@5, and MRR. Every chunking, embedding, and parameter change is measured against this harness, not vibes. Numbers are reported per-corpus, not as a flattering merged aggregate, because the two gold standards aren't comparable.
| Corpus | Cases | NDCG@5 | Recall@5 | MRR |
|---|---|---|---|---|
| asm-private | 56 | 0.1913 | 0.0897 | 0.3575 |
| public | 39 | 0.1129 | 0.1018 | 0.1882 |
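For reference, the three metrics can be sketched as follows. This assumes binary relevance (a chunk either matches a labeled gold passage or it does not), unique chunk IDs in each ranking, and no rank cutoff on MRR; the harness's actual grading scheme is not stated here, so treat these as the textbook definitions rather than its exact implementation.

```python
import math


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Binary-relevance NDCG@k: discounted gain over the best achievable gain."""
    dcg = sum(1.0 / math.log2(i + 2) for i, cid in enumerate(ranked[:k]) if cid in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for rank, cid in enumerate(ranked, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: list[float]) -> float:
    """Average per-case scores into a corpus-level number (MRR is mean of RR)."""
    return sum(values) / len(values) if values else 0.0
```

Each function scores a single labeled case; the per-corpus figures are means over that corpus's cases only, which is what keeps the two rows of the table separate.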
These are honest numbers on a hard target: strict passage-level relevance on dense reference handbooks. The point isn't the absolute score. It's that every change to retrieval gets measured, so when something's off I can find where instead of guessing. Two examples of the harness doing real work:
- An early diagnosis took NDCG@5 up nearly 400% after a 40-configuration parameter sweep, a four-way backend benchmark, and a knowledge-graph hypothesis I tested and rejected. That's the story in Building a Search Engine That Actually Works.
- A later regression on the ASM corpus traced to Unicode normalization; NFKC at ingest plus raising ef_search from 40 to 160 recovered it with zero regression on the public set. Written up in ADR-0013.
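To illustrate why Unicode normalization matters on this kind of text: handbook PDFs often carry ligatures, subscript digits, and compatibility symbols that look identical to their plain forms but do not match them as strings. The sketch below shows NFKC folding using Python's standard `unicodedata` module; the function name is hypothetical, and applying the same normalization to incoming queries is an assumption rather than something ADR-0013 is known to specify.

```python
import unicodedata


def normalize_for_index(text: str) -> str:
    """Fold compatibility characters to canonical forms before indexing."""
    return unicodedata.normalize("NFKC", text)


# Subscript digits in a chemical formula fold to ASCII digits.
formula = normalize_for_index("Fe\u2082O\u2083")   # "Fe2O3"

# The single-codepoint "fi" ligature expands to two letters.
ligature = normalize_for_index("\ufb01ller")       # "filler"

# The one-character degree-Celsius symbol becomes degree sign + "C".
celsius = normalize_for_index("\u2103")            # "\u00b0C"
```

Without this step, a keyword query for "Fe2O3" cannot match a chunk whose text was extracted with subscript codepoints, even though the two render almost identically.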
Beyond retrieval, there are agent-behavior eval tracks (does the agent cite correctly, refuse when it should, stay grounded) and a performance SLO battery. The eval methodology itself is captured in ADR-0009.
The agent, and why it checks itself
The chat surface is a hand-rolled async tool-use loop on the Anthropic SDK, with no orchestration framework. Claude Sonnet drives the loop; it can call /search, read_section, web_search, and fetch_url. Before an answer is returned, a separate cheaper model (Haiku) runs cite_check, which verifies the answer's claims against the passages actually cited. An answer that can't be grounded doesn't go out clean.
That mirrors how I've worked my whole career: in a regulated lab, a result isn't trusted because it sounds right; it's trusted because the measurement system was validated first. The agent gets the same treatment. This is also why there's no LangChain: when the agent loop is the interesting part, I want it legible and testable, not buried under abstractions (ADR-0001).
Performance & cost
Wall-clock search latency from the deployed app, across the cross-region hop to Neon. Cold-cache is the first hit per query (encoder miss + DB round-trip + FTS scan + RRF); warm-cache is a repeat within the query-embedding LRU window.
| Query | Cold-cache | Warm-cache |
|---|---|---|
| dezincification | 1.75 s | 0.24 s |
| PREN equation | 1.57 s | 0.22 s |
| Charpy V-notch impact | 1.54 s | 0.24 s |
| stress corrosion cracking | 1.30 s | 0.26 s |
The ~230 ms warm path is encoder-cache-hit + FTS GIN scan + HNSW probe + RRF + serialization. The 1.3–1.7 s first run is the encoder cache miss plus cross-region DB latency, both expected for a two-host shape, and down from the ~2–4 s a pre-pivot baseline ran on the same queries. All of it runs about $12 to $20 a month (it came in at $12 last month): Fly.io scales to zero, Neon is serverless, and the LLM is metered. After idle, the first request warms for ~5–10 s, surfaced to the user as a "warming up" event rather than a stall.
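The warm-path saving from the query-embedding cache can be sketched with the standard library's `functools.lru_cache`. The wrapper name, the cache size, and the whitespace normalization of the cache key are assumptions; the `encode` callable stands in for whatever runs the embedding model.

```python
from functools import lru_cache
from typing import Callable

Vector = tuple[float, ...]  # tuples are immutable, so cached results stay safe to share


def cached_encoder(encode: Callable[[str], Vector], maxsize: int = 1024):
    """Wrap an expensive encoder so repeated queries skip the model call."""

    @lru_cache(maxsize=maxsize)  # maxsize is an assumed value, not the project's
    def _cached(normalized_query: str) -> Vector:
        return encode(normalized_query)

    def embed(query: str) -> Vector:
        # Collapse whitespace so trivially different spellings share one entry.
        return _cached(" ".join(query.split()))

    embed.cache_info = _cached.cache_info  # expose hit/miss counters for monitoring
    return embed
```

On a cache hit the encoder is never invoked, which leaves only the database work and serialization on the warm path.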
How it's built
Python 3.12, FastAPI (async-only), Pydantic v2, SQLAlchemy 2.0 + asyncpg, Alembic, structlog. A vanilla-JS + Tailwind frontend, no SPA framework. Every pull request is gated in CI on ruff, black, mypy --strict, the pytest suite, and a Docker build. Thirteen ADRs record the decisions that mattered, including the ones that didn't work out, kept honest rather than retconned.
What I'd do next
It's not finished, and the backlog says so: chasing the slow-tail latency to its root cause, tuning cite_check precision, and a further round of ASM-corpus search-quality work.