Investigation 03

WordPress to HTML

Using structured evaluation and prompt refinement to test whether AI can reliably convert WordPress pages into production-ready static HTML.

AI Product Investigation | Active

End-to-End Test Pipeline

Five sequential stages — from setup to scored results — each run independently to ensure clean evaluation.

Setup

Reference screenshots, folder structure, URL validation

Orchestrator

AI model converts WordPress → static HTML + assets

Judge Agent

Independent LLM scores output across 7 dimensions (×3 median)

HITL Review

Human confirms scores with side-by-side screenshot comparison

Record

Final scores → Notion tracker, weighted totals, pass/fail

1. Overview

WordPress-to-static conversion is a useful but messy problem. A page can look correct at first glance while still failing on structure, SEO, accessibility, or maintainability. As the lead investigator on this project, I wanted to test whether AI could make this migration process faster and more reliable.

I designed the evaluation framework, built the prompt workflows, and acted as the quality judge across all iterations.

I explored the problem through prompt workflows, model comparisons, scoring, and review steps. The core insight: visual similarity is easy to fake, and models will actively exploit weak evaluation criteria to appear successful. The real challenge is building a workflow that can catch this — one that checks quality, verifies structure, and produces something trustworthy.

2. Product Question

The core decision I was trying to inform was: Can an AI-driven conversion workflow be reliable enough to be genuinely useful in a production environment?

Secondary questions included:

  • Where does AI provide the most leverage in the migration process?
  • What elements still strictly require human review?
  • What quality bar would make this usable for real client projects?

3. Approach

I treated this as a structured workflow design problem, not a one-shot prompt test. I tested multiple models on the same conversion task and compared the results using a rigorous scoring framework.

The workflow was broken down into distinct stages: generation, judging, human review, logging, and prompt versioning. Outputs were evaluated using weighted criteria including visual likeness, content likeness, interaction fidelity, SEO fidelity, accessibility, and asset integrity.

4. What I Built

Prompt Workflow Interface

I created and refined orchestrator prompts specifically tuned for parsing builder-heavy WordPress markup and extracting clean semantic HTML.

Prompt Evolution

Key inflection points where product decisions reshaped the workflow — each triggered by something that broke or a lesson learned.

V1.0

Single-pass generation

Initial approach — one prompt, one output. Fast but brittle; outputs looked passable but failed on structure and metadata.

V2.0

Added blocking requirements + hard stops

Too many “passing” outputs were missing critical elements. Introduced mandatory checks that could halt the pipeline.

V3.1

Content-first two-pass build + minimum thresholds

Trying to do everything at once caused both content and styling to fail — split into two passes.

V3.2

Interaction fidelity scoring + judge automation

Interactive elements (navs, accordions, modals) were consistently broken. Added dedicated scoring and an independent LLM judge.

V3.4

Anti-exploit hardening + structural integrity

Models found shortcuts to inflate scores without real quality. Tightened rubric to prevent gaming the evaluation.
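The v2.0 blocking requirements above can be sketched as pre-publish checks over the generated HTML, where any failure halts the pipeline. The specific checks below are illustrative assumptions, not the actual rule set.

```python
import re

# Hypothetical blocking checks in the spirit of v2.0's hard stops:
# each is a predicate over the generated HTML; any failure halts the run.
BLOCKING_CHECKS = {
    "has_title":     lambda html: bool(re.search(r"<title>[^<]+</title>", html)),
    "has_meta_desc": lambda html: 'name="description"' in html,
    "has_h1":        lambda html: "<h1" in html,
    "imgs_have_alt": lambda html: all("alt=" in tag
                                      for tag in re.findall(r"<img[^>]*>", html)),
}

def hard_stop_failures(html: str) -> list[str]:
    """Return the names of failed blocking checks (empty list means proceed)."""
    return [name for name, check in BLOCKING_CHECKS.items() if not check(html)]
```

The point of encoding these as hard stops rather than score deductions is that a run with a missing title or meta description never reaches the judge at all.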

Evaluation & Scoring System

I built a scoring system to compare outputs consistently, using minimum thresholds so outputs could not “pass” while failing core requirements like visual likeness, content likeness, or interaction fidelity.

Scoring Weights (v3.2)

Scores are 0–5 per cell. Weighted totals calculated per model per page type, then averaged for an overall score.

  • Visual Likeness: 25%
  • Content Likeness: 25%
  • Interaction Fidelity: 10%
  • SEO Fidelity: 10%
  • Accessibility: 5%
  • Asset Integrity: 5%
  • Turns to Completion: 20%
  • Weighted Total: 100%

Minimum Threshold Rule

Even if the weighted total exceeds the fidelity threshold, the run cannot pass if any mission-critical category falls below its hard floor.

Visual Likeness

≥ 85%

Layout, typography, color, and responsive fidelity

Content Likeness

≥ 95%

Text completeness, order, and CTA accuracy

Interaction Fidelity

≥ 70%

Accordions, nav, tabs, dropdowns, and modals
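The weighted total plus the minimum-threshold rule can be expressed compactly. The weights and floors below come from the v3.2 rubric above; the overall pass threshold of 80% is an assumed value for illustration, since the exact figure isn't stated here.

```python
# Weights from the v3.2 rubric (each dimension is scored 0-5).
WEIGHTS = {
    "visual_likeness":      0.25,
    "content_likeness":     0.25,
    "interaction_fidelity": 0.10,
    "seo_fidelity":         0.10,
    "accessibility":        0.05,
    "asset_integrity":      0.05,
    "turns_to_completion":  0.20,
}

# Hard floors (fractions of the 0-5 scale) from the minimum-threshold rule.
FLOORS = {
    "visual_likeness":      0.85,
    "content_likeness":     0.95,
    "interaction_fidelity": 0.70,
}

def evaluate(scores: dict[str, float], pass_threshold: float = 0.80):
    """Weighted total in [0, 1]; the run fails if any mission-critical
    dimension falls below its hard floor, regardless of the total."""
    total = sum(WEIGHTS[k] * scores[k] / 5 for k in WEIGHTS)
    floors_ok = all(scores[k] / 5 >= floor for k, floor in FLOORS.items())
    return total, (total >= pass_threshold and floors_ok)
```

Note how a run can exceed the overall threshold yet still fail: dropping interaction fidelity to 3/5 costs only 4 points of weighted total but trips the 70% floor.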

Workflow Architecture

Multi-step conversion pipeline from WordPress input to validated static HTML output.

Input

WordPress page

Content Extraction

Parse & structure

HTML Generation

Orchestrator prompt

Judge Evaluation

Independent LLM judge

Score / Pass / Fail

Weighted rubric

Human Review

Manual QA check

Final Output

Production-ready HTML

5. Key Decisions & Tradeoffs

Content-First vs. Styling-First Generation

Decision: I prioritized extracting and structuring content before attempting to replicate styling.

Tradeoff: While this meant early outputs looked visually broken, it ensured that semantic structure, accessibility, and SEO metadata were preserved. Trying to do both simultaneously caused the models to hallucinate classes and drop critical content.

Separate Judge Step vs. Self-Evaluation

Decision: I implemented a separate LLM call specifically to act as a judge, evaluating the output of the generator model.

Tradeoff: This increased latency and API costs, but significantly reduced the “yes-man” effect where a model would blindly approve its own flawed output. The independent judge caught structural errors the generator missed.
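One way to wire the independent judge with the ×3 median mentioned in the pipeline overview: call the judge model three times per output and keep the per-dimension median. `call_judge` below is a deterministic stand-in for a real LLM API call, so the sketch runs without a key; its scores are fabricated for illustration.

```python
from statistics import median

def call_judge(html: str, run: int) -> dict[str, float]:
    """Placeholder for a separate LLM judge call returning 0-5 scores.
    The varying value simulates run-to-run scoring noise."""
    return {"visual_likeness": 4.0 + 0.5 * (run % 2), "content_likeness": 5.0}

def judge_with_median(html: str, runs: int = 3) -> dict[str, float]:
    """Score the output `runs` times and keep the per-dimension median,
    damping single-run variance in the judge's scoring."""
    samples = [call_judge(html, i) for i in range(runs)]
    return {dim: median(s[dim] for s in samples) for dim in samples[0]}
```

Taking the median rather than the mean means a single outlier run (overly harsh or overly generous) cannot move the recorded score.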

Weighted Scoring with Hard Thresholds

Decision: Instead of a simple pass/fail, I used a weighted rubric where certain failures (like missing alt text or broken links) resulted in an automatic fail regardless of visual fidelity.

Tradeoff: This made the evaluation much stricter and harder to pass, but it aligned the tool’s success metrics with actual production readiness rather than superficial similarity.

6. What I Learned

The biggest failure was trusting visual similarity as a success metric. Early outputs looked polished but silently dropped SEO metadata, produced div-heavy markup, and broke accessibility. It was a clear lesson in how easy it is to overestimate AI output quality when you only evaluate the surface.

Staged workflow design matters more than prompting skill. A multi-step pipeline — with separate generation, independent judging, structured logging, and prompt versioning — consistently outperformed even well-crafted single-shot prompts. The workflow is the product, not the model.

That reframed the entire project for me. This is not a code generation problem. It is a fidelity problem: preserving content, layout, metadata, accessibility, assets, and interactions all at once. Getting any one of those right is straightforward. Getting all of them right, reliably, is where the real product challenge lives.

7. Outcome & Next Steps

This investigation showed that while AI cannot yet deliver a reliable “one-click” migration, it can be highly effective as a guided migration workflow or semi-automated internal tool when paired with clear QA steps.

More importantly, it changed how I think about AI product design beyond this specific use case. I now treat evaluation, thresholds, and repair loops as core product components rather than post-processing steps, designing systems that self-check against production criteria, iterate until they meet a defined quality bar, and only escalate to human review once that bar is met.

Next, I plan to narrow the scope to a smaller set of page types, improve interaction evaluation, and test lightweight repair loops to determine whether the system can automatically fix issues identified during the judge step.
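A lightweight repair loop of the kind described could take the shape below: feed judge failures back to the generator until the run passes or an iteration budget is exhausted. This is a sketch of the planned mechanism, not built code; the stub functions simulate a single successful repair.

```python
# Stubs simulating one generate -> judge -> repair cycle.
def generate(page, feedback=None):
    """Stand-in generator: produces a fixed output when given feedback."""
    return "fixed-html" if feedback else "draft-html"

def judge(html):
    """Stand-in judge: returns (failures, passed)."""
    return ([], True) if html == "fixed-html" else (["missing meta description"], False)

def repair_loop(page, max_iters=3):
    """Regenerate with judge feedback until passing or out of budget."""
    feedback = None
    for attempt in range(max_iters):
        html = generate(page, feedback)
        feedback, passed = judge(html)
        if passed:
            return html, attempt + 1   # output plus attempts used
    return html, max_iters             # budget spent: escalate to human review
```

Capping the loop matters: without a budget, a model that cannot satisfy a hard floor would burn API calls indefinitely instead of escalating.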

8. Links & Resources

  • Prompt Version History (Notion)
  • Test Runbook & Scoring Table (Google Sheets)
  • Sample Outputs Repository (GitHub)