Case Study · Applied AI · 2026

Materials Science Literature RAG

A production RAG and agent system over materials-science corpora, built solo and deployed live, held to the standard I learned in a regulated test lab: it doesn't ship until an eval proves it works. Three surfaces (browse, hybrid search, and an agent that cites its sources) share one retrieval module. Every non-trivial decision is written down in an ADR.

View the repo ↗ Request live-app access The engineering writeup

Status

Live in production

Fly.io + Neon, behind Cloudflare Access

Retrieval

Hybrid BM25 + dense + RRF

One Postgres: tsvector FTS + pgvector

Evaluation

95 labeled cases

Per-corpus NDCG@5 / Recall@5 / MRR

Agent

Tool-using loop

Direct Anthropic SDK, no LangChain (ADR-0001)

Cost

~$12–20 / mo

All-in at portfolio scale

Discipline

13 ADRs · CI-gated

ruff · black · mypy --strict · pytest · Docker

What it is

Ask a materials-engineering question: "Which brass alloys are least susceptible to dezincification?", the PREN equation, Charpy V-notch impact behavior, and get an answer cited back to the source page, in about a quarter-second on a warm cache. Under the hood, three surfaces sit on one retrieval module:

Browse — a hierarchical view of the corpus, document by document.
Search — hybrid retrieval (keyword + semantic), exposed as both an HTTP API and a CLI.
Chat — a Level-2 agentic loop with tool use, SSE streaming, and inline citations. The agent calls the same /search every other surface uses, plus read_section, web_search, fetch_url, and cite_check.

Retrieval is a pure function: (query, filters) → ranked chunks. No surface owns its own retrieval logic; the agent is just another caller. That one decision keeps the system testable, I can evaluate retrieval in isolation, without driving the whole agent.

The corpus is two halves: 30 publicly-redistributable documents (~4,300 pages from NIST, USGS, DOE national labs, NASA, FHWA, NPL, and CC-licensed arXiv reviews) and a private ASM Handbook corpus used for evaluation. They're reported separately throughout, and they aren't commensurable.

Architecture

The production deployment is a deliberate two-host shape: a public edge that gates and proxies, and a single application host that owns compute and data.

BrowserSSE for streaming chat

↓

Cloudflare Pages + AccessTLS · OTP gate on /portfolio/* · static Astro site

↓

Pages Function (proxy)Forwards over HTTPS + shared secret header

↓

FastAPI on Fly.io async uvicorn · lazy-loaded encoder pool · scale-to-zero ↔ Anthropic Claude: Sonnet (agent loop) · Haiku (cite_check)

↓

Neon · serverless Postgrespgvector (HNSW) + tsvector (GIN) + sessions · one DB

Cloudflare Access gates the hostname at the edge, so only authenticated visitors reach the proxy; the shared-secret header is defense-in-depth so nobody with the raw Fly URL can burn Anthropic credits directly. The whole thing runs on shared-cpu-1x with 2 GB of RAM and scales to zero when idle.

Retrieval

Keyword search nails exact terms and notation; semantic search handles paraphrase and concept. Neither wins alone on a technical corpus, so the system runs both and fuses them:

Query

Postgres FTStsvector · GIN

Densepgvector · HNSW

RRF mergedeterministic tie-break

Ranked chunks+ per-retriever provenance

Canonical config, chosen against the eval harness rather than by feel: multi-qa-mpnet-base-cos-v1 embeddings, 384-token chunks at 20% overlap, hnsw.ef_search = 160. The /search/explain endpoint attaches per-retriever provenance to every merged result, so I can see why a chunk surfaced, which is what makes the failures debuggable.

Evaluation

This is the part most RAG demos skip. There are 95 passage-level labeled retrieval cases across the two corpora, scored as NDCG@5, Recall@5, and MRR. Every chunking, embedding, and parameter change is measured against this harness, not vibes. Numbers are reported per-corpus, not as a flattering merged aggregate, because the two gold standards aren't comparable.

Corpus	Cases	NDCG@5	Recall@5	MRR
asm-private	56	0.1913	0.0897	0.3575
public	39	0.1129	0.1018	0.1882

These are honest numbers on a hard target: strict passage-level relevance on dense reference handbooks. The point isn't the absolute score. It's that every change to retrieval gets measured, so when something's off I can find where instead of guessing. Two examples of the harness doing real work:

An early diagnosis took NDCG@5 up nearly 400% after a 40-configuration parameter sweep, a four-way backend benchmark, and a knowledge-graph hypothesis I tested and rejected. That's the story in Building a Search Engine That Actually Works.
A later regression on the ASM corpus traced to Unicode normalization; NFKC at ingest plus raising ef_search from 40 to 160 recovered it with zero regression on the public set. Written up in ADR-0013.

Beyond retrieval, there are agent-behavior eval tracks (does the agent cite correctly, refuse when it should, stay grounded) and a performance SLO battery. The eval methodology itself is captured in an ADR (ADR-0009).

The agent, and why it checks itself

The chat surface is a hand-rolled async tool-use loop on the Anthropic SDK, with no orchestration framework. Claude Sonnet drives the loop; it can call /search, read_section, web_search, and fetch_url. Before an answer is returned, a separate cheaper model (Haiku) runs cite_check, which verifies the answer's claims against the passages actually cited. An answer that can't be grounded doesn't go out clean.

That mirrors how I've worked my whole career: in a regulated lab, a result isn't trusted because it sounds right; it's trusted because the measurement system was validated first. The agent gets the same treatment. This is also why there's no LangChain: when the agent loop is the interesting part, I want it legible and testable, not buried under abstractions (ADR-0001).

Performance & cost

Wall-clock search latency from the deployed app, across the cross-region hop to Neon. Cold-cache is the first hit per query (encoder miss + DB round-trip + FTS scan + RRF); warm-cache is a repeat within the query-embedding LRU window.

Query	Cold-cache	Warm-cache
dezincification	1.75 s	0.24 s
PREN equation	1.57 s	0.22 s
Charpy V-notch impact	1.54 s	0.24 s
stress corrosion cracking	1.30 s	0.26 s

The ~230 ms warm path is encoder-cache-hit + FTS GIN scan + HNSW probe + RRF + serialization. The 1.3–1.7 s first run is the encoder cache miss plus cross-region DB latency, both expected for a two-host shape, and down from the ~2–4 s a pre-pivot baseline ran on the same queries. All of it runs about $12 to $20 a month (it came in at $12 last month): Fly.io scales to zero, Neon is serverless, and the LLM is metered. After idle, the first request warms for ~5–10 s, surfaced to the user as a "warming up" event rather than a stall.

How it's built

Python 3.12, FastAPI (async-only), Pydantic v2, SQLAlchemy 2.0 + asyncpg, Alembic, structlog. A vanilla-JS + Tailwind frontend, no SPA framework. Every pull request is gated in CI on ruff, black, mypy --strict, the pytest suite, and a Docker build. Thirteen ADRs record the decisions that mattered, including the ones that didn't work out, kept honest rather than retconned.

What I'd do next

It's not finished, and the backlog says so: chasing the slow-tail latency to its root cause, tuning cite_check precision, and a further round of ASM-corpus search-quality work.

View the repo ↗ Request access to the live app