If you're an engineer, the context you need to do your job is scattered across a dozen systems: decisions made in chat threads, the why behind a change buried in a pull request, design rationale in a wiki page, a commitment made in a meeting that nobody wrote down. Most of it decays. Six months later you're re-deriving a decision you already made because the only record was a Slack thread that scrolled into oblivion.
I built a pipeline to fix that — an LLM-assisted system that captures raw material from those systems and synthesizes it into a durable, searchable knowledge base. This post is the design, generalized. I'm deliberately keeping it tool- and employer-neutral; the value is in the architecture, which ports to any stack.
The three-layer model
The system has three layers, and keeping them separate is the whole game:
Layer 1: sources/ (raw captures, one file per artifact, lightly structured)
Layer 2: wiki/ (synthesized pages: the compressed, durable knowledge)
Layer 3: schema (the protocol: conventions, frontmatter, ingest rules)
- Sources are raw. A captured chat thread, a PR snapshot, a meeting transcript — verbatim, with metadata, written to a file. Cheap and lossless.
- The wiki is synthesized. A synthesis agent reads new sources and folds them into durable pages: a glossary entry, an incident postmortem, a reusable pattern. The wiki's job is compression, not mirroring.
- The schema is the contract both layers obey — filename conventions, frontmatter shape, the ingest protocol. It changes rarely and deliberately.
The insight that made this work: capture and synthesis are different problems with different failure modes, so decouple them. Capture must be fast and reliable (you're stealing a moment to save something). Synthesis can be slow and batched (it runs later, reads everything new, thinks hard). Couple them and a slow synthesis step blocks you from capturing at all. Decouple them and capture is instant; synthesis happens on its own cadence and reads whatever has accumulated.
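A minimal sketch of that decoupling, assuming sources are Markdown files in one directory. The function names and the `.synthesized` marker file are hypothetical, introduced only for illustration; the original pipeline may track processed sources differently.

```python
from pathlib import Path


def capture(sources_dir: Path, name: str, text: str) -> Path:
    """Fast path: write the raw artifact to disk and return immediately."""
    sources_dir.mkdir(parents=True, exist_ok=True)
    path = sources_dir / name
    path.write_text(text, encoding="utf-8")
    return path


def pending_sources(sources_dir: Path, done_marker: Path) -> list[Path]:
    """Slow path, step 1: list sources no synthesis run has processed yet."""
    seen: set[str] = set()
    if done_marker.exists():
        # Assumption: the marker file holds one processed filename per line.
        seen = set(done_marker.read_text(encoding="utf-8").splitlines())
    return sorted(p for p in sources_dir.glob("*.md") if p.name not in seen)


def mark_synthesized(done_marker: Path, paths: list[Path]) -> None:
    """Slow path, step 2: record what has been folded into the wiki."""
    with done_marker.open("a", encoding="utf-8") as fh:
        for p in paths:
            fh.write(p.name + "\n")
```

Capture never waits on synthesis here: `capture` only writes a file, and the batch job picks up whatever has accumulated whenever it next runs.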
What to capture, per source
Every source type — chat, issue tracker, code host, docs/wiki, meeting transcripts, ad-hoc paste-ins — gets the same five-part treatment:
- Capture-worthy signals. What's actually worth keeping. For a chat platform: threads where a decision is announced ("let's go with…", "approved", "ship it"), threads where you're mentioned and haven't replied, high-engagement threads in channels you watch.
- Noise to skip. Bot notifications. Your own status pings. Reaction-only activity. The stuff that would bury the signal.
- Output shape. The frontmatter and filename the capture writes — so the synthesis layer can consume it without guessing.
- Cadence. Daily sweep? On-demand? Weekly?
- Quirks. The source-specific gotchas (pagination, permalink expiry, HTML-to-markdown fidelity).
A captured issue-tracker ticket, for instance, lands as structured frontmatter plus a body:
---
source: issue-tracker
key: PROJ-1234
captured: 2026-06-26T08:00:00-04:00
status: Released
last_event: 2026-06-25T16:22:00-04:00
---
That last_event field is the high-water mark: on the next sweep, the capturer fetches only events newer than it, instead of re-pulling the whole ticket.
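As a rough sketch of how the high-water mark could be applied, assume each fetched event is a dict with an ISO 8601 timestamp under a hypothetical `at` key. Both function names are illustrative, not part of any tracker's API.

```python
from datetime import datetime


def events_since(events: list[dict], last_event: str) -> list[dict]:
    """Return only the events strictly newer than the stored high-water mark."""
    mark = datetime.fromisoformat(last_event)
    return [e for e in events if datetime.fromisoformat(e["at"]) > mark]


def next_high_water_mark(events: list[dict], last_event: str) -> str:
    """Advance the mark to the newest event seen, or keep it if nothing is newer."""
    mark = datetime.fromisoformat(last_event)
    newest = max((datetime.fromisoformat(e["at"]) for e in events), default=mark)
    return newest.isoformat() if newest > mark else last_event
```

Because the timestamps carry UTC offsets, the comparison stays correct even when the tracker and the capturer report times in different zones.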
Capture mechanisms — and the trap everyone hits
There are three mechanism families, and choosing the right one per source is most of the design:
| Mechanism | Best for | Trade-off |
|---|---|---|
| Skills (slash commands) | On-demand, mid-conversation capture | You have to remember to invoke them |
| Hooks (event-driven) | Local events you control | Narrower than the name suggests — see below |
| Scheduled tasks (periodic sweeps) | External-system polling | Latency of up to one sweep interval; needs dedup state |
Here's the trap, and it's worth stating loudly because it cost me an afternoon: AI-agent hooks fire on conversation events — the agent stopped, a tool is about to run, the user submitted a prompt — not on external events. There is no "hook on PR opened" or "hook on Slack message." If you want event-driven capture from an external system, the real path is something outside the agent's hook system entirely: a git post-merge hook on your own machine, or a CI action that writes to a synced directory. Conflating "agent hook" with "webhook" sends you building something that can't exist.
So the actual recommendation per source is mostly: scheduled sweep for external systems (poll daily, dedup against state), on-demand skill for user-initiated capture (/capture <url> when you decide something matters), and hooks only in the narrow spots where a local event is the trigger.
Dedup and freshness via filename conventions
State is the enemy of reliability, so I push dedup into filenames instead of a database. Two shapes:
- Time-stamped capture: <source>-<id>-<YYYY-MM-DD>.md, for artifacts with no stable identity (a daily chat digest, an ad-hoc article). Each capture is its own file.
- Stable-entity capture: <source>-<id>.md, for things with a canonical identity, recaptured in place (a ticket, a PR, a wiki page).
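The two filename shapes can be sketched as one helper. The function name and the slug rule (replacing characters outside letters, digits, dot, underscore, and hyphen) are assumptions made for this example, not something the conventions above specify.

```python
import re
from datetime import date
from typing import Optional


def _slug(part: str) -> str:
    """Keep filenames portable: collapse anything unusual into a hyphen."""
    return re.sub(r"[^A-Za-z0-9._-]+", "-", part).strip("-")


def capture_filename(source: str, ident: str, on: Optional[date] = None) -> str:
    """Stable-entity name when `on` is None, time-stamped name otherwise."""
    stem = f"{_slug(source)}-{_slug(ident)}"
    if on is not None:
        stem += f"-{on.isoformat()}"  # YYYY-MM-DD
    return stem + ".md"
```

The name is a pure function of its inputs, so recapturing the same ticket always resolves to the same path and overwrites or extends that file instead of creating a sibling.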
Recapture semantics depend on the source's own model:
- Append-only, newest-first for sources with event timelines (tickets, PRs): each recapture prepends a dated section; old content stays.
- Versioned-replace with diff preservation for edit-replace sources (wiki pages): replace the body, but keep the prior version under a "## Previous version" heading so the diff isn't lost.
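A sketch of both recapture modes, operating on the body text below the frontmatter. The "## Captured" heading format is an assumption, as is the choice to retain only the single most recent prior version.

```python
PREVIOUS = "## Previous version"


def prepend_section(existing: str, new_events: str, captured_on: str) -> str:
    """Append-only, newest-first: put a dated section above the old content."""
    section = f"## Captured {captured_on}\n\n{new_events.strip()}\n"
    return section + ("\n" + existing if existing else "")


def versioned_replace(existing: str, new_body: str) -> str:
    """Replace the body, keeping the prior version under a heading."""
    # Assumption: only the latest prior version is kept, so drop any older one.
    prior = existing.split(PREVIOUS, 1)[0].strip()
    result = new_body.strip() + "\n"
    if prior:
        result += f"\n{PREVIOUS}\n\n{prior}\n"
    return result
```

Keeping one prior version bounds file growth; a pipeline that needs full history could lean on version control for the sources directory instead.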
The filename is the dedup key. Two captures of the same ticket can never land under two different names — which, before I imposed this, is exactly what happened.
When does a raw source become a wiki page?
Not everything earns synthesis. A source gets promoted to its own durable page when any of these holds:
- Recurrence — the concept shows up 3+ times across captures and notes. Recurring relevance means compression pays off.
- Decision-of-record — it documents a choice that'll be referenced later (an architecture decision, an incident root cause). Decisions get a page on first capture; the value is findability.
- Reusable pattern — it describes a technique that applies beyond its origin. A pattern extracted from one incident belongs in the wiki because it'll apply to the next one.
Everything else stays raw. A one-off thread that resolved itself, a ticket that shipped and was never revisited: those live in sources/ as a searchable archive but never clutter the synthesized layer. Synthesis over enumeration: one wiki page can cite ten sources. The wiki compresses; it doesn't mirror.
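The promotion rule reduces to a three-way "any" check. In this sketch the `Candidate` type and its field names are hypothetical, and how the two boolean flags get set (by a person or by the synthesis agent) is left out of scope.

```python
from dataclasses import dataclass

RECURRENCE_THRESHOLD = 3  # "3+ times" in the recurrence criterion


@dataclass
class Candidate:
    mentions: int = 0                 # appearances across captures and notes
    decision_of_record: bool = False  # a choice that will be referenced later
    reusable_pattern: bool = False    # a technique that applies beyond its origin


def should_promote(c: Candidate) -> bool:
    """Promote when any one criterion holds; everything else stays in sources/."""
    return (
        c.mentions >= RECURRENCE_THRESHOLD
        or c.decision_of_record
        or c.reusable_pattern
    )
```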
The one rule you can't skip: sensitive content
The moment you point automated capture at chat and meetings, you're one bad sweep away from archiving someone's DM, an HR conversation, or customer-confidential material. The rule has to be default-deny for the sensitive class: skip private channels, skip DMs, require explicit confirmation before persisting a meeting transcript. It's safer to under-capture and manually add than to over-capture and have to scrub. Design this in from line one, not after the first incident.
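One way to encode default-deny for chat and meeting artifacts, as a sketch. The `Artifact` fields and the `kind` values are assumptions; the point is that every default and every fall-through path refuses to persist.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    kind: str                # assumed values: "channel", "dm", "meeting"
    is_private: bool = True  # unknown visibility is treated as private
    confirmed: bool = False  # explicit user confirmation to persist


def may_persist(a: Artifact) -> bool:
    """Allow only what is explicitly cleared; deny everything else."""
    if a.kind == "channel":
        return not a.is_private  # private channels are skipped
    if a.kind == "meeting":
        return a.confirmed       # transcripts need explicit confirmation
    return False                 # DMs and any unrecognized kind
```

A new artifact kind added later is denied until someone writes a rule for it, which is the under-capture bias described above.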
Why this is worth the ceremony
"Why this much machinery for personal notes?" is a fair question, and the honest answer is: because the alternative — capturing by hand, when you remember, in whatever tool is open — doesn't scale past a few weeks of good intentions. The pipeline's entire purpose is to make capture cheaper than not capturing, and to make synthesis happen whether or not you feel like it that day.
The tools are interchangeable — your chat platform, your agent runner, your note format. The architecture is the durable part: raw capture decoupled from batched synthesis, dedup encoded in filenames, a promotion rule that keeps the synthesized layer small, and default-deny on anything sensitive. Build that, and the context you need stops decaying.
— Parker Jones, parkerjones.dev