Investigation 02

AI Email Context

Exploring how local project knowledge can give AI enough context to draft clearer, lower-effort replies without manual prompt-stuffing.

AI Product Development | Active

Scheduled AI Agent Modes

Five specialized modes — each triggered on schedule or on demand — working from the same Notion context layer.

Draft

Context-aware email reply drafting with tone matching

Inbox Scan

Identifies emails needing response with priority scoring

Daily Briefing

Morning summary of pending items and action needed

Decision Extraction

Pulls commitments and decisions from email threads

Resource Scanner

Indexes links and attachments into Notion knowledge base

1. The Problem

AI writing tools are impressive until you try to use them for your actual work.

Ask a general AI to draft a client email, and you get something competent but hollow. It doesn’t know who you’re writing to, what you agreed on last week, how you usually open messages, or which project this is referencing. You spend more time editing than you would have spent writing from scratch.

The obvious fix — dumping all your context into a prompt — hits a ceiling quickly. You can’t paste your entire client history into every email draft. And even if you could, it would be expensive, slow, and fragile.

The question this investigation set out to answer: can a structured context layer make AI-drafted emails reliably good, for real work, without manual prompt-stuffing every time?

2. What I Built

AI Email Context is an AI-assisted email management system that drafts contextual replies, scans for emails needing attention, and generates daily briefings — using a Notion workspace as its memory.

Instead of prompting the AI with raw context, the system retrieves structured data at runtime from five Notion-backed sources:

Contacts Database

One record per person, with communication history, contact type, and relationship notes.

Tone Profile

A written style guide the AI reads before drafting; not a set of rules, but a description of how I actually write.

Decision Log

A structured record of key decisions across active projects, so the AI can reference what was agreed without me restating it.

Email Examples

Real sent emails used as few-shot style references.

Skill Log

A record of every draft generated, for audit trail and iteration.
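The shape of this context layer can be sketched in a few dataclasses. This is a minimal illustration of how the five sources fit together; the field names are assumptions for this sketch, not the actual Notion schema.

```python
from dataclasses import dataclass, field

@dataclass
class Contact:
    # Illustrative fields, not the real Notion property names.
    name: str
    contact_type: str              # e.g. "client", "collaborator"
    relationship_notes: str = ""
    history: list[str] = field(default_factory=list)

@dataclass
class ContextLayer:
    contacts: list[Contact]
    tone_profile: str              # prose style guide, read before drafting
    decision_log: list[str]        # key decisions across active projects
    email_examples: list[str]      # few-shot style references
    skill_log: list[dict] = field(default_factory=list)  # audit trail

    def context_for(self, sender: str) -> dict:
        """Bundle only the context relevant to one incoming email."""
        contact = next((c for c in self.contacts if c.name == sender), None)
        return {
            "contact": contact,
            "tone": self.tone_profile,
            "decisions": self.decision_log,
            "examples": self.email_examples,
        }
```

The point of the structure is the last method: a draft request pulls one contact record plus the shared tone and decision context, rather than the whole workspace.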

3. How It Works

The system runs as a scheduled AI agent, with five modes: Draft, Inbox Scan, Daily Briefing, Decision Extraction, and Resource Scanner.


When a scan or draft runs, the agent:

  • Retrieves active contacts from Notion
  • Filters emails against that contact list — ignoring promotional, automated, and already-replied threads
  • Loads the relevant context (tone profile, decision log entries, relationship notes) for each email that needs a response
  • Drafts a reply and saves it to Gmail as a draft in the correct thread
  • Logs the run, including which context sources were used
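The filtering step in that sequence can be sketched as follows. This is a simplified illustration assuming a plain dict per email; the real agent works against Gmail threads, and the flag names here are hypothetical.

```python
# Contacts would be fetched from the Notion database at runtime;
# this set is a stand-in for that retrieval step.
ACTIVE_CONTACTS = {"ana@client.com", "ben@partner.io"}

def needs_response(email: dict, contacts: set[str]) -> bool:
    """Keep only emails from known contacts that still await a reply."""
    if email["from"] not in contacts:
        return False                      # unknown sender: skip (or triage)
    if email.get("automated") or email.get("promotional"):
        return False                      # ignore promo/automated mail
    if email.get("already_replied"):
        return False                      # thread already handled
    return True

def scan(inbox: list[dict], contacts: set[str]) -> list[dict]:
    """Return the queue of emails that proceed to context loading."""
    queue = [e for e in inbox if needs_response(e, contacts)]
    # Each queued email then gets its context loaded and a reply drafted
    # into the correct Gmail thread; that part is omitted here.
    return queue
```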


The [VERIFY: ...] tag system flags anything the agent isn’t certain about — rather than hallucinating a detail, it marks it for human review inline in the draft.
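Surfacing those flags before a draft is sent is a simple extraction pass. A minimal sketch, assuming the tag format shown above:

```python
import re

# Matches inline [VERIFY: ...] tags and captures the note inside.
VERIFY_PATTERN = re.compile(r"\[VERIFY:\s*(.*?)\]")

def verify_flags(draft: str) -> list[str]:
    """Return every inline [VERIFY: ...] note so it can be shown
    for human review before the draft is sent."""
    return VERIFY_PATTERN.findall(draft)

draft = (
    "Hi Ana, confirming the call on [VERIFY: Thursday or Friday?] "
    "and the [VERIFY: Q3 budget figure] we discussed."
)
# verify_flags(draft) -> ["Thursday or Friday?", "Q3 budget figure"]
```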

4. What the Testing Showed

Tone fidelity improved significantly with real context

Without the context layer, AI drafts were correct but generic. With the Notion-backed Tone Profile, drafts matched the actual voice: warm opener, bold section headers for multi-part messages, action-oriented close. The style guide works because it describes personality, not just formatting rules.

Contextual accuracy removed most of the editing work

Without project context, AI drafts are vague and require heavy editing to be usable. With the Decision Log and Contact Notes, drafts referenced the right project names, recent decisions, and relationship context — without me restating any of it. External first-user testing produced drafts requiring minimal editing, before any calibration period.

The first real bug: silent data loss at the retrieval layer

During the scheduled morning scan, the system used a semantic search query to fetch active contacts from Notion. It returned 10 of 18 contacts — silently, with no error. Eight contacts were simply excluded from the scan, including several active clients.

This is a platform limitation in Notion’s search API, not a logic error. The fix required replacing semantic search with a direct database fetch, adding count verification, and building a fallback protocol. The new approach consistently retrieves all 18 contacts and logs the result.

The broader lesson: for any AI system that depends on retrieval, silent partial retrieval is a critical failure mode. It doesn’t break loudly — it just quietly misses things.
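The fix pattern, count verification plus a fallback, can be sketched like this. The fetcher callables here are stand-ins; in the real system the primary was Notion's semantic search and the fallback a direct database fetch.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("retrieval")

def fetch_contacts(primary, fallback, expected: int) -> list:
    """Retrieve contacts, treating a short result as a failure, not success.

    `primary` and `fallback` are callables returning contact lists
    (illustrative: e.g. semantic search vs. direct database fetch).
    """
    contacts = primary()
    if len(contacts) < expected:
        # A partial result with no error is exactly the silent failure
        # mode described above: log it loudly and try the fallback.
        log.warning("partial retrieval: %d of %d contacts; using fallback",
                    len(contacts), expected)
        contacts = fallback()
    if len(contacts) != expected:
        raise RuntimeError(
            f"retrieval still incomplete: {len(contacts)}/{expected}")
    return contacts
```

The key design choice is that the expected count is checked explicitly: a result that merely "looks like a list of contacts" is never trusted on its own.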

External testing surfaced setup UX gaps — and confirmed the system is resilient

A colleague installed the system from scratch and ran it across several days, including intentional stress tests. The functionality held up well, but early runs surfaced real setup gaps:

  • The “always run” permission prompts weren’t clearly flagged as something the user needed to approve during setup. A guided test run step was added so users see and approve these prompts before enabling scheduled tasks.
  • The task monitor sidebar is collapsed by default. Added an explicit step to open it during setup so users can see what’s running.
  • Gmail drafts weren’t landing in the correct thread — a threading bug where the draft was created without passing the thread identifier. Fixed and verified in the smoke test checklist.
  • The [VERIFY: ...] tags weren’t prominent enough — these are flags for human review, and they need to be visible enough that the user actually sees them before sending.

The stress test that stood out: the Gmail connector was intentionally disconnected before a morning run to see what would happen. The system didn’t fail silently. It notified via the briefing that Gmail was unavailable, still produced two draft responses from partial inbox access, surfaced them in the briefing for copy/paste review, and flagged connector errors for attention.

The briefing also proactively flagged an unknown sender as someone worth adding to the Contacts DB — a small detail, but it shows the system doing useful triage beyond just drafting replies.

This feedback round accelerated quality significantly. A day-one external install going from “works but confusing” to “smooth” in a single iteration — and then holding up to intentional failure testing — is a meaningful signal.

Token efficiency required deliberate design

Token Efficiency & Optimization

Model selection and mode-aware context loading significantly reduced token consumption without degrading output quality.

Daily Pro plan usage (share of the 5-hour window): Day 1 (Claude Opus) ~25%; Day 2 (Claude Sonnet) ~16%; Day 3 (optimized Sonnet) ~9%. Switching from Opus to Sonnet and optimizing context loading cut usage by 64% overall.

Briefing mode context size: 3,000-5,000 tokens before; 800-1,500 with mode-aware loading, roughly 70% fewer. Loading all context for every mode burned budget; mode-aware optimization only loads what’s needed.

The more interesting signal came from continued testing. After switching from Opus to Sonnet, token usage dropped from ~25% of the Pro plan’s 5-hour window on day one to ~16% on day two, and ~9% by day three — without any change in output quality. Model selection turns out to be meaningful configuration, not just a preference.

The initial version loaded all context for every mode regardless of whether the mode needed it. A briefing run was consuming as much context as a full draft run.

Mode-aware optimization brought briefing mode from approximately 3,000-5,000 tokens down to 800-1,500. A further optimization — caching active contacts in a single reference page rather than fetching individual records — reduced the contact retrieval step from multiple round-trips to one fetch. Time-bounded queries eliminated repeated processing of already-reviewed emails.
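Mode-aware loading amounts to each mode declaring the sources it needs instead of inheriting everything. A minimal sketch; the per-source token costs are illustrative estimates, not measured values.

```python
# Approximate token cost per context source (illustrative numbers).
CONTEXT_SOURCES = {
    "tone_profile": 1200,
    "decision_log": 1500,
    "contact_notes": 900,
    "email_examples": 1400,
    "pending_items": 600,
}

# Each mode declares only what it needs; briefing skips the style sources.
MODE_CONTEXT = {
    "draft":    ["tone_profile", "decision_log", "contact_notes",
                 "email_examples"],
    "briefing": ["pending_items", "contact_notes"],
    "scan":     ["contact_notes"],
}

def context_budget(mode: str) -> int:
    """Total token cost of only the sources this mode actually loads."""
    return sum(CONTEXT_SOURCES[s] for s in MODE_CONTEXT[mode])
```

With these example numbers, a briefing run costs 1,500 tokens against 5,000 for a full draft run, the same order of reduction the testing showed.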

System stability required staggering scheduled tasks

Running the morning briefing and inbox scan as a single chained task caused the app to freeze when connectors were slow on startup. The fix: stagger them 30 minutes apart. Each runs independently and releases connector resources before the next begins.

Connector latency on cold starts is a genuine failure mode for production scheduled AI agents, not an edge case.
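The staggering itself is trivial to express; the insight is that the tasks must be scheduled independently rather than chained. A small sketch of computing the offset start times:

```python
from datetime import datetime, timedelta

def stagger(start: str, tasks: list[str], gap_minutes: int = 30) -> dict:
    """Assign each task its own start time, `gap_minutes` apart, so no two
    scheduled runs contend for connectors on a cold start."""
    t = datetime.strptime(start, "%H:%M")
    schedule = {}
    for task in tasks:
        schedule[task] = t.strftime("%H:%M")
        t += timedelta(minutes=gap_minutes)
    return schedule

# stagger("07:00", ["briefing", "inbox_scan"])
# -> {"briefing": "07:00", "inbox_scan": "07:30"}
```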

5. What This Approach Gets Right

  • No custom infrastructure. Notion’s databases, pages, and relations are enough to build a structured retrieval layer. No vector database, no embeddings pipeline.
  • Human-editable context. Contacts, decisions, and tone profile can be updated directly in Notion without touching code. The AI picks up changes on the next run.
  • Structured retrieval enforces consistency. Database schemas mean the agent always knows which fields to expect.
  • Incremental improvement is visible. The Skill Log creates an audit trail across runs, making it possible to see where draft quality improved and where gaps remain.

6. Where It Hits Limits

The context layer works well within Notion. But it doesn’t know that “John” in a Gmail thread is the same person as a contact in the database, or that a Google Drive file relates to a specific client unless the file name matches a known prefix. Resolution across sources — email, Drive, calendar, project management — requires manual maintenance or naming conventions as workarounds.

For a single-user system managing a known set of contacts, this is manageable. At scale, or across a team, the manual maintenance burden grows quickly.

7. Current Status

Email Brain, the system built in this investigation, is in active daily use at v3.5. Scheduled tasks run every morning. The system is documented and installable by others — external install testing confirmed the setup guide works end-to-end.

GitHub release is the next milestone.

8. Takeaways

  • Structured context meaningfully improves AI output quality. The difference between a generic AI draft and a contextually accurate one isn’t better prompting — it’s better retrieval.
  • Silent failures are the dangerous ones. Partial contact retrieval looked like success until it was audited. Logging and count verification aren’t optional for production AI systems.
  • External testing finds what internal testing misses. Setup gaps were invisible to me because I already knew how the system worked. First-user testing revealed the actual experience.
  • Token efficiency is a design constraint, not an afterthought. An agent that loads everything on every run burns context budget quickly. Mode-aware optimization is worth building early — and model selection matters more than expected. Switching from Opus to Sonnet cut usage by roughly a third with no quality loss.
  • Resilience should be designed in, not bolted on. A system that fails gracefully — notifying the user, preserving what it can, flagging what needs attention — is meaningfully more useful than one that just stops. The Gmail disconnection test made this concrete.

9. Links & Resources

  • Email Brain GitHub Repository (GitHub)
  • Context Architecture Documentation (Notion)
  • External Testing Feedback Log (Notion)