Investigation 02

AI Email Context

Email was the proving ground. The real investigation was about what kind of context layer makes AI output usable for real work.

AI Product Development | Active

1. The Problem

AI writing tools are impressive until you try to use them for your actual work.

Ask a general AI to draft a client email, and you get something competent but hollow. It doesn’t know who you’re writing to, what you agreed on last week, how you usually open messages, or which project this is referencing. You spend more time editing than you would have spent writing from scratch.

The obvious fix — dumping all your context into a prompt — hits a ceiling quickly. You can’t paste your entire client history into every email draft. And even if you could, it would be expensive, slow, and fragile.

The question this investigation set out to answer: can a structured context layer make AI-drafted emails reliably good, for real work, without manual prompt-stuffing every time?

2. What I Built

AI Email Context is a structured context-and-retrieval system, tested through email. It uses a Notion workspace as its memory layer — contacts, decisions, tone profiles, and relationship history — retrieving the right context at runtime based on who the AI is writing to and what it's doing.

Structured Retrieval Layer

Instead of prompting the AI with raw context, the system retrieves structured data at runtime from these five Notion-backed sources.

  • Contacts Database: one record per person with history, contact type, and relationship notes
  • Tone Profile: written style guide describing personality and writing habits
  • Decision Log: structured record of key decisions across active projects
  • Email Examples: real sent emails used as few-shot style references
  • Skill Log: record of every draft generated, for audit trail and iteration

Each source plays a specific role. The Contacts Database is the routing layer — it determines who an email is from, what context to load, and whether to draft a reply. The Tone Profile describes how I actually write, not formatting rules but personality. The Decision Log lets the AI reference what was agreed without me restating it. Email Examples anchor voice through real sent messages. And the Skill Log creates an audit trail across every run.
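To make those roles concrete, here is a minimal sketch of how the retrieved records might be typed in code. All class and field names are illustrative assumptions, not the actual Notion schema; the Skill Log is omitted because it records outputs rather than feeding drafts.

```python
from dataclasses import dataclass, field

@dataclass
class Contact:
    """One record per person (the routing layer)."""
    name: str
    contact_type: str           # e.g. "client", "collaborator" (assumed values)
    relationship_notes: str
    history: list[str] = field(default_factory=list)

@dataclass
class Decision:
    """One entry in the Decision Log."""
    project: str
    summary: str
    decided_on: str             # ISO date string

@dataclass
class ContextBundle:
    """Everything retrieved for a single drafting run."""
    contact: Contact
    tone_profile: str           # free-text style guide describing personality
    decisions: list[Decision]
    email_examples: list[str]   # real sent emails, used as few-shot references
```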

3. How It Works

Runtime Retrieval Process

How the agent uses context at runtime

When a draft or inbox scan is triggered, the system does not load everything it knows. It first identifies the contact and task, then pulls only the context that is relevant to that specific action — relationship history, tone guidance, prior decisions, and supporting examples. This keeps prompts focused, reduces token usage, and improves output quality.
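A minimal sketch of what this task-aware loading could look like, assuming hypothetical mode and source names (the real system's identifiers are not shown in this write-up):

```python
# Each mode declares which sources it needs, so a briefing run never
# pays for draft-only context. Mode and source names are illustrative.
SOURCES_BY_MODE = {
    "briefing": ["contacts", "decision_log"],
    "draft":    ["contacts", "tone_profile", "decision_log", "email_examples"],
    "scan":     ["contacts"],
}

def load_context(mode: str, fetch) -> dict:
    """Pull only the sources the given mode needs.

    `fetch` is any callable(source_name) -> data, e.g. a Notion query.
    """
    needed = SOURCES_BY_MODE.get(mode)
    if needed is None:
        raise ValueError(f"unknown mode: {mode}")
    return {source: fetch(source) for source in needed}
```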

The retrieval step is designed to be task-aware and auditable. If the agent is not confident about a detail, it does not invent one. Instead, it inserts a [VERIFY: ...] tag inline so uncertainty is visible and can be reviewed before anything is sent.
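The [VERIFY: ...] convention is easy to enforce mechanically before anything is sent. A minimal sketch of such a pre-send check; the tag format comes from the text, while the function name is mine:

```python
import re

# Matches the inline uncertainty tags the agent inserts, e.g.
# "per our call on [VERIFY: date of last call]".
VERIFY_TAG = re.compile(r"\[VERIFY:\s*(.*?)\]")

def unverified_details(draft: str) -> list[str]:
    """Return every [VERIFY: ...] annotation left in a draft.

    A non-empty result means the draft needs human review before sending.
    """
    return VERIFY_TAG.findall(draft)
```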

That workflow mattered more than the drafting itself. The goal was not just to generate email faster, but to test whether structured, runtime-loaded context could make AI output more usable for real work.

4. What the Testing Showed

Tone fidelity improved significantly with real context

Without the context layer, AI drafts were correct but generic. With the Notion-backed Tone Profile, drafts matched the actual voice: warm opener, bold section headers for multi-part messages, action-oriented close. The style guide works because it describes personality, not just formatting rules.

Contextual accuracy removed most of the editing work

Without project context, AI drafts are vague and require heavy editing to be usable. With the Decision Log and Contact Notes, drafts referenced the right project names, recent decisions, and relationship context without me restating any of it. External first-user testing produced drafts that needed only minimal editing, even before any calibration period.

The first real bug: silent data loss at the retrieval layer

During the scheduled morning scan, the system used a semantic search query to fetch active contacts from Notion. It returned 10 of 18 contacts — silently, with no error. Eight contacts were simply excluded from the scan, including several active clients.

This is a platform limitation in Notion’s search API, not a logic error. The fix required replacing semantic search with a direct database fetch, adding count verification, and building a fallback protocol. The new approach consistently retrieves all 18 contacts and logs the result.
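The hardened retrieval path can be sketched as follows. `query_db`, the `expected` count, and the error handling are illustrative, not the system's actual code:

```python
def fetch_active_contacts(query_db, expected: int) -> list[dict]:
    """Fetch all active contacts and fail loudly on a short result.

    `query_db` is a callable returning every row of the contacts
    database (e.g. a paginated Notion /databases/{id}/query loop),
    replacing the semantic search that silently dropped records.
    """
    contacts = query_db()
    if len(contacts) < expected:
        # Silent partial retrieval is the failure mode to guard against:
        # abort and surface the discrepancy instead of scanning with a
        # truncated contact list.
        raise RuntimeError(
            f"retrieved {len(contacts)} of {expected} contacts; aborting scan"
        )
    return contacts
```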

The broader lesson: for any AI system that depends on retrieval, silent partial retrieval is a critical failure mode. It doesn’t break loudly — it just quietly misses things.

External testing surfaced setup UX gaps — and confirmed the system is resilient

A colleague installed the system from scratch and ran it across several days, including intentional stress tests. The functionality held up well, but early runs surfaced real setup gaps:

  • The “always run” permission prompts weren’t clearly flagged as something the user needed to approve during setup. A guided test run step was added so users see and approve these prompts before enabling scheduled tasks.
  • The task monitor sidebar is collapsed by default. Added an explicit step to open it during setup so users can see what’s running.
  • Gmail drafts weren’t landing in the correct thread — a threading bug where the draft was created without passing the thread identifier. Fixed and verified in the smoke test checklist.
  • The [VERIFY: ...] tags weren’t prominent enough — these are flags for human review, and they need to be visible enough that the user actually sees them before sending.

The stress test that stood out: the Gmail connector was intentionally disconnected before a morning run to see what would happen. The system didn’t fail silently. It notified via the briefing that Gmail was unavailable, still produced two draft responses from partial inbox access, surfaced them in the briefing for copy/paste review, and flagged connector errors for attention.

The briefing also proactively flagged an unknown sender as someone worth adding to the Contacts DB — a small detail, but it shows the system doing useful triage beyond just drafting replies.

This feedback round accelerated quality significantly. A day-one external install going from "works but confusing" to "smooth" in a single iteration, then holding up under intentional failure testing, is a meaningful signal.

Token Efficiency & Optimization

Model selection and mode-aware context loading significantly reduced token consumption without degrading output quality.

Daily Pro Plan Usage: -64% drop

  • Day 1 (Claude Opus): ~25% of the 5-hour window
  • Day 2 (Claude Sonnet): ~16%
  • Day 3 (Optimized Sonnet): ~9%

Switching from Opus to Sonnet and optimizing context loading reduced usage of the 5-hour window.

Briefing Mode Context Size: -70% tokens

  • Before: 3,000-5,000 tokens
  • Mode-aware: 800-1,500 tokens

Loading all context for every mode burned budget. Mode-aware optimization only loads what's needed.

The more interesting signal came from continued testing. After switching from Opus to Sonnet, token usage dropped from ~25% of the Pro plan’s 5-hour window on day one to ~16% on day two, and ~9% by day three — without any change in output quality. Model selection turns out to be meaningful configuration, not just a preference.

The initial version loaded all context for every mode regardless of whether the mode needed it. A briefing run was consuming as much context as a full draft run.

Mode-aware optimization brought briefing mode from approximately 3,000-5,000 tokens down to 800-1,500. A further optimization — caching active contacts in a single reference page rather than fetching individual records — reduced the contact retrieval step from multiple round-trips to one fetch. Time-bounded queries eliminated repeated processing of already-reviewed emails.
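Both query optimizations can be sketched in a few lines; the class, function names, and cache shape are illustrative, not the system's actual code:

```python
from datetime import datetime, timedelta, timezone

class ContactCache:
    """One fetch per run instead of a round-trip per contact record.

    `fetch_all` is any callable() -> list of contact dicts, e.g. a read
    of the single cached reference page described above.
    """

    def __init__(self, fetch_all):
        self._fetch_all = fetch_all
        self._contacts = None

    def get(self) -> list[dict]:
        if self._contacts is None:          # fetch once, reuse thereafter
            self._contacts = self._fetch_all()
        return self._contacts

def since_last_scan(hours: int = 24) -> str:
    """Gmail-style query fragment bounding the scan window, so
    already-reviewed emails are never reprocessed."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    return f"after:{int(cutoff.timestamp())}"
```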

System stability required staggering scheduled tasks

Running the morning briefing and inbox scan as a single chained task caused the app to freeze when connectors were slow on startup. The fix: stagger them 30 minutes apart. Each runs independently and releases connector resources before the next begins.
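The staggering fix amounts to offsetting task start times so each run releases connector resources before the next begins. A minimal illustrative sketch (task names mirror the text; the scheduling mechanics are assumed):

```python
from datetime import datetime, timedelta

def staggered_schedule(start: datetime, tasks: list[str],
                       gap_minutes: int = 30) -> dict[str, datetime]:
    """Assign each task a start time `gap_minutes` after the previous one,
    instead of chaining them into a single run."""
    return {task: start + timedelta(minutes=i * gap_minutes)
            for i, task in enumerate(tasks)}
```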

Connector latency on cold starts is a genuine failure mode for production scheduled AI agents, not an edge case.

5. What This Approach Gets Right

  • No custom infrastructure. Notion’s databases, pages, and relations are enough to build a structured retrieval layer. No vector database, no embeddings pipeline.
  • Human-editable context. Contacts, decisions, and tone profile can be updated directly in Notion without touching code. The AI picks up changes on the next run.
  • Structured retrieval enforces consistency. Database schemas mean the agent always knows which fields to expect.
  • Incremental improvement is visible. The Skill Log creates an audit trail across runs, making it possible to see where draft quality improved and where gaps remain.

6. Where It Hits Limits

The context layer works well within Notion. But it has no awareness across tools: it doesn't know that a sender in a Gmail thread is the same person as a contact in the database, or that a Google Drive file relates to a specific client unless the file name matches a known prefix. Entity resolution across sources (email, Drive, calendar, project management) requires manual maintenance or naming conventions as workarounds.

For a single-user system managing a known set of contacts, this is manageable. At scale, or across a team, the manual maintenance burden grows quickly.

This is the more interesting question the investigation opened up. The pattern works — structured context genuinely improves AI output. But it works because I built and maintained every piece of the context layer by hand: every contact record, every decision log entry, every relationship note. The system doesn't know that "John" in a Gmail thread and "Hornaments" in a Google Drive filename refer to the same client unless I've already told it. It can't connect a calendar invite to a project unless the naming convention holds. The context is structured, but the connections between sources are manual.

That's a solvable problem — and it's the one I'm investigating next. The question isn't whether AI can generate good output — this investigation proved it can, given the right context. The question is what it takes to build a context layer that connects across sources automatically, so the retrieval isn't limited to what one person can maintain in one tool.

7. Current Status

AI Email Context is in active daily use at v3.5. Scheduled tasks run every morning. The system is documented and installable by others — external install testing confirmed the setup guide works end-to-end.

GitHub release is the next milestone.

8. Takeaways

  • Structured context meaningfully improves AI output quality. The difference between a generic AI draft and a contextually accurate one isn’t better prompting — it’s better retrieval.
  • Silent failures are the dangerous ones. Partial contact retrieval looked like success until it was audited. Logging and count verification aren’t optional for production AI systems.
  • External testing finds what internal testing misses. Setup gaps were invisible to me because I already knew how the system worked. First-user testing revealed the actual experience.
  • Token efficiency is a design constraint, not an afterthought. An agent that loads everything on every run burns context budget quickly. Mode-aware optimization is worth building early — and model selection matters more than expected. Switching from Opus to Sonnet cut usage by roughly a third with no quality loss.
  • Resilience should be designed in, not bolted on. A system that fails gracefully — notifying the user, preserving what it can, flagging what needs attention — is meaningfully more useful than one that just stops. The Gmail disconnection test made this concrete.
  • The pattern generalizes, but the wiring doesn't — yet. Structured context, mode-scoped retrieval, human-in-the-loop quality controls — none of this is specific to email. The question this investigation leaves open is whether the same approach holds when the context layer spans multiple sources and the connections between them aren't manually maintained. That's where this work is heading.
