Design to Code
Scenario: Employee - Corporate UX & Universal Design System
Status: Active - Phase 2 in progress
Started: 2025 · Last updated: May 2026
Part of: The Studio of Two - Use Cases
This is a living record. It documents an ongoing use case - not a completed experiment, but an active exploration that will continue to evolve as each phase is implemented, validated, and extended.
The scenario: A UX designer working within a large corporate environment, with access to Figma and AI tooling but without dedicated development resources. The ambition is not to generate a component library - it is to build an automated pipeline that powers a universal design system spanning web, mobile, and embedded surfaces. Figma is the available tool, not the chosen one. The constraints are the environment.
The problem being solved: The gap between design intent and running component has always required translation through other people. The Studio of Two closes that gap - but only if the AI's context window receives high-signal, low-noise input. Raw Figma extraction fails that requirement. This use case is the ongoing record of building a pipeline that meets it.
For the personal journey behind this use case, see Horizon - The Thing I Was Always Building.
Current State
Phase 1 - Complete. The extraction-to-component pipeline is defined and operational. Quality contracts are established at each gate. The native MCP refactor is complete - wrapper scripts replaced by first-class pipe nodes. The correction feedback loop is live.
Phase 2 - In progress. Self-healing conditional routing (validator nodes with branches) is the primary target. Banner tolerance for non-conformant MCP servers is the secondary target. The longer horizon: expanding the pipeline beyond Figma toward the universal design system - web, mobile, and embedded surfaces under one token vocabulary.
Phase 1: Establishing the Pipeline
Objective: Define a reliable extraction-to-component pipeline with explicit quality gates.
Status: Complete
The Landscape: Why Build This?
Several tools already address design-to-code generation. Figma has native Dev Mode and code export built in. Claude Code, Lovable, Google Stitch, and others can take a design and produce a working component with impressive speed. The Figma MCP server itself provides direct API access from within an AI agent context.
These tools are genuinely valuable. They are optimised for a specific outcome: speed to a working component. You provide the design; the tool generates something that renders.
This pipeline is optimised for a different outcome: design intent preservation at scale - building a universal design system where the same design decision generates reliable, parity-accurate components across multiple platforms and runtimes, with a compounding knowledge base that improves with every correction.
The distinction matters in practice:
- Most generation tools are black boxes - you cannot inspect what context reaches the model, cannot apply correction rules, cannot build an auditable artifact trail.
- Most are optimised for one-shot generation, not for a pipeline that runs hundreds of times across a growing component library.
- Most assume full access to the design system's variable definitions - which, as the next section describes, is not always the case in a constrained corporate environment.
An open-source alternative design tool was also evaluated as a proof-of-concept extraction source. It is not a production path when the canonical design file lives elsewhere, but the experiment produced something concrete: it validated the correct target format for a component spec. Every bound property is expressed as a semantic token path paired with a resolved fallback value, with no internal tool IDs anywhere in the spec. The format proved the architecture before the production pipeline existed. The lesson was not "use a different tool" but "this is what clean extraction looks like; now produce it from the canonical source."
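As a rough illustration of that target format, the sketch below models one bound property as a token path plus a fallback and flags any binding that still carries an internal ID. The field names, the example token paths, and the `VariableID:` pattern are assumptions made for this example, not the actual spec schema.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern for an internal tool ID leaking into a spec.
INTERNAL_ID = re.compile(r"VariableID:|^\d+:\d+$")


@dataclass(frozen=True)
class Binding:
    """One bound property: semantic token path plus resolved fallback."""
    prop: str
    token: str      # e.g. "color.action.primary.background" (made-up path)
    fallback: str   # resolved value used if the token cannot be resolved


def leaked_ids(bindings: list[Binding]) -> list[str]:
    """Return the property names whose token field still holds an internal ID."""
    return [b.prop for b in bindings if INTERNAL_ID.search(b.token)]


spec = [
    Binding("background", "color.action.primary.background", "#0055ff"),
    Binding("padding", "space.inset.md", "12px"),
    Binding("icon-color", "VariableID:1234:56", "#ffffff"),  # not clean
]

print(leaked_ids(spec))  # ['icon-color']
```

A check like this is cheap enough to run on every spec, which is what makes "no internal tool IDs anywhere" an enforceable property rather than a convention.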
This use case is not a replacement for those tools. It is what the Studio of Two approach looks like for a designer who needs the pipeline to be transparent, controllable, and cumulative.
The Problem: Figma Extraction Is Context-Expensive
Design-to-code extraction is powerful. It is also one of the most context-expensive operations in an AI-assisted workflow.
Raw Figma API outputs are large. They are structured for design tools, not for reasoning engines. A single extraction call can return tens of kilobytes of nested JSON - layer trees, variant definitions, component metadata, code connect references - most of which the implementation step will never use.
In practice this creates three compounding problems:
- Context bloat: the payload arriving at the reasoning stage contains far more structure than is needed.
- Token waste: expensive context windows are consumed by fields that never influence a decision.
- Fragile handoffs: each stage forwards too much to the next, and the accumulated noise degrades reasoning quality progressively.
For parity-focused design-to-code work - where the goal is a component that faithfully reflects the design system's token vocabulary and behavioral intent, not just one that renders - this problem appears quickly once workflows become multi-step.
Additional friction points that compounded the problem:
- Spec drift: extraction output and implementation diverge without explicit contracts to hold them together.
- Verification ambiguity: easy to validate that a component looks right; hard to validate that it behaves right across states.
- Retry cost: full reruns are expensive when only one gate fails. No concept of resuming from a known good point.
- Tier-locked variable resolution: The MCP tool that provides direct access to the design tool's variable definitions and their token bindings was restricted to a higher subscription tier. Without it, variable-to-token mapping could not be automated. This forced a manual reference mapping approach: a curated file documenting the relationship between variable names and the design system's canonical token paths, maintained by hand and applied through the correction contract. The constraint shaped the architecture more than any deliberate design decision.
- Layered corporate tool restrictions: Variable resolution was not the only blocked path. A second extraction tool - restricted at the organisation level rather than by subscription tier - would have provided richer component context for the same operation. Both restrictions arrived from different sources and were invisible until the extraction pipeline was being built. In a corporate environment, the available toolset is not the canonical toolset. The pipeline had to be designed around what was actually accessible.
- Source design quality debt: The design file is a corporate asset maintained across a team over time. Some layers carry generic names rather than semantic ones. Some property types are used for purposes outside their intended semantic scope. Some components have incomplete state coverage. These are not errors - they are the natural result of a file that evolved faster than its maintenance discipline. The pipeline must compensate for upstream quality issues that the extraction author does not fully control.
Each is a form of the same underlying problem: the cognitive supply chain between the design file and the running component had no quality gates, no compression, and no evidence trail.
What Failed Before
Before the current approach, multiple extraction and generation strategies were attempted. Each failure is documented - not as a cautionary tale but as direct evidence for why each contract in the current system exists.
Direct variable ID mapping. Raw variable identifiers from the extraction output were mapped directly to resolved style values. This worked until any variable changed in the design file - at which point every mapping that referenced that variable broke simultaneously. Maintenance cost grew with the size of the design system.
No intermediate spec layer. Components were generated directly from raw extraction output, skipping any platform-agnostic intermediate format. Implementation inherited the extraction's noise: wrong token references, missing nested components, inconsistent state handling. Every run required manual cleanup.
Hardcoded resolved values in generated code. Components built with literal resolved values - colours, dimensions - could not adapt to a second brand, platform, or theme. What passed visual review was not maintainable at scale.
Competing transformation pipelines. Multiple parallel paths for token transformation ran independently, produced different output shapes, and diverged over time. Debugging required knowing which pipeline had run last, not whether the logic was correct.
One-shot generation tools. Automated generation tools produced output that looked correct at first glance but was not maintainable. Manual correction followed every run. The same problems recurred component by component with no compounding improvement.
Each failure mode maps directly to a contract in the current system. Numbering the five failures above in order: the spec layer prevents (2), generation straight from raw extraction. The validation gates prevent (1), brittle variable ID mappings, and (3), hardcoded resolved values. A consolidated token pipeline prevents (4), competing transformations. The correction feedback loop transforms (5), one-shot generation with recurring manual cleanup, from a recurring cost into a compounding asset.
None of these failures were unique to this project. They are the predictable result of starting where most workflows start - with a sequential script and the assumption that a better prompt or a faster model will eventually close the gap. Neither will. The gap is structural. The next section shows what that looks like in practice.
The Naive Baseline: Sequential Scripts
The baseline is sequential and obvious: Step A → Step B → Step C in one script. Most workflows start here.
The hidden cost is opacity. The script runs. It produces output. The output looks reasonable. But when component tests fail - a token path unresolved, a fallback value wrong, an icon slot wired to a deprecated name - there is no trace of where in the chain signal was lost. Everything reruns from the beginning.
Where sequential scripts systematically break down:
- Cross-runtime composition is awkward - Node, Python, and MCP tooling each have their own invocation patterns.
- Compression is applied at the end, if at all - the AI reasons through the noise before cleanup happens.
- Reuse is low because orchestration logic is embedded in scripts rather than declared.
- Workflow intent is invisible to anyone who didn't write the code - including future self.
Agentic sequential baseline. The natural evolution is an agent that orchestrates the same linear flow. This improves flexibility but the structural problems remain. Orchestration logic moves into prompts and transient runtime state - neither auditable nor reproducible. When the agent drifts on a model upgrade, there is no way to inspect why. When it fails a validation gate, there is no way to resume from that gate; the only option is to start over.
More importantly: the agent's reasoning is only as good as what arrives in its context. Running an LLM over noisy, un-sifted extraction output is like asking a surgeon to operate in poor light. The capability is there. The environment is working against it.
The Pipe Insight: Context Is a Budget
Context-Pipe was not built for design-to-code workflows. It was built earlier to solve a different problem: context bloat in local AI inference, where small models operating on limited context windows needed high-signal input to reason accurately. The core idea was a Unix-inspired chain - each node reads from stdin, transforms, writes to stdout - applied to the problem of refining raw data before it reached the model.
What became apparent during that development was that the same pipeline model had a second property: determinism. A declared chain of named nodes, each with a fixed contract, produces the same output for the same input every time. That is not a context management feature - it is an automation primitive. The pipeline does not just compress; it orchestrates reliably.
That observation prompted a deliberate test: apply the same infrastructure to the Figma-to-code workflow, where the extraction-to-component chain had all the same symptoms - noise accumulation, opaque failures, no resume points - and where deterministic, auditable orchestration was exactly what was missing. The result is documented here.
The shift that changed how the workflow operated: treating context not as a space to fill but as a budget to spend deliberately.
Once framed that way, the redesign became obvious. Every stage should receive only what it needs to do its job. No stage should forward accumulated noise to the next. Compression should happen at the boundary between stages - not as an afterthought at the end.
The pipeline shape:
1. preflight - validate the source, establish canonical artifact paths
2. extract - call Figma MCP tools, produce structured extraction manifest
3. spec-seed - assemble component spec input, apply any recorded correction rules
4. run - validate spec against hard gates, promote to canonical path
5. query summary - extract only the fields the next reasoning step needs
6. sift - compress for handoff
Each node reads from stdin and writes to stdout. Each is independently testable. Compression at step 6 applies to the summarised output of step 5, not to the raw extraction of step 2. The AI that performs implementation work receives a high-signal payload, not a transcript of everything the pipeline touched.
This provides:
- Composable nodes through stdin/stdout contracts - language-agnostic, runtime-agnostic.
- Cross-runtime interoperability - scripts, sift tools, and MCP servers in the same chain.
- Per-node telemetry - input size, output size, reduction percentage, latency at every boundary.
- Portable declarations in pipes.json - readable by humans and executable by machines.
Most importantly: compression becomes part of the workflow design, not a cleanup step applied after the fact.
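The chain-plus-telemetry idea can be sketched in a few lines. This is a minimal in-process stand-in, assuming each node is a plain function from string to string rather than a real stdin/stdout process; the node names and the toy transforms are invented for illustration and are not part of Context-Pipe.

```python
import time
from typing import Callable

Node = Callable[[str], str]  # stand-in for a stdin -> stdout process


def run_chain(nodes: list[tuple[str, Node]], payload: str):
    """Run nodes in order, recording size and latency at every boundary."""
    trace = []
    for name, node in nodes:
        start = time.perf_counter()
        out = node(payload)
        reduction = 100 * (1 - len(out) / len(payload)) if payload else 0.0
        trace.append({
            "node": name,
            "in_chars": len(payload),
            "out_chars": len(out),
            "reduction_pct": round(reduction, 1),
            "latency_ms": round((time.perf_counter() - start) * 1000, 3),
        })
        payload = out  # each node sees only what the previous one emitted
    return payload, trace


def strip_blank(text: str) -> str:
    """Toy node: drop empty lines."""
    return "\n".join(line for line in text.splitlines() if line.strip())


def keep_signal(text: str) -> str:
    """Toy node: keep only lines marked with '!' as signal."""
    return "\n".join(line for line in text.splitlines() if line.startswith("!"))


result, trace = run_chain(
    [("strip-blank", strip_blank), ("keep-signal", keep_signal)],
    "!token ok\n\nlayer 1\nlayer 2\n!gap: hover state\n",
)
print(result)
for entry in trace:
    print(entry)
```

Because the trace is recorded at each boundary, a drop in signal can be attributed to a named node instead of to the run as a whole.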
The separation this creates. The extraction phase - preflight through sift - is fully deterministic. No reasoning required. Every step is a transformation with a fixed contract: validate, extract, assemble, gate, summarise, compress. This work can be delegated entirely to the pipeline, and within the pipeline, individual nodes can be handed off to lighter models, subagents, or specialised workers. The pipeline does not care what executes each node. It cares only that the contract is honoured.
In the Context Design framework, this extraction phase is Context Priming in automated form. Context Priming is the discipline of building and maintaining the structured knowledge base the agent operates within - design system tokens, component semantics, correction rules, validated specs. Manually, this is a curation task. Through the pipeline, it becomes a reproducible, auditable process that runs on demand and compounds over time. The agent is not discovering the design system during implementation. It arrives primed with a validated, compressed version of it - built and maintained by the extraction phase before reasoning begins.
The implementation phase is different. Assembling a correct, platform-appropriate component from a validated spec requires high reasoning: understanding the target platform's idioms, applying token semantics correctly, handling state logic, and producing code that is not just syntactically valid but architecturally sound. That is where a capable model is necessary - and where the quality of what the pipeline delivers becomes the primary variable. A high-reasoning model operating on a well-shaped, validated spec produces reliable output. The same model operating on raw extraction noise does not.
The pipe works well in the first phase precisely because the first phase is a problem of structure, not meaning. The second phase is a problem of meaning. Keeping them separate - and routing each to the right kind of executor - is what makes the full workflow tractable.
MCP Evolution: Wrapper Scripts → Native Nodes
The practical progression:
- Start with custom runner scripts - effective, but bespoke.
- Introduce declarative pipes for readability and reuse.
- Move MCP calls into native pipe nodes where the infrastructure supports it.
This evolution is complete. The bridge scripts that previously wrapped Figma MCP calls have been replaced by native type: "mcp" nodes in pipes.json. Figma extraction and context summary both now run as first-class pipe nodes - no custom glue code, no adapter scripts, no wrapper layer between the data and the pipe.
Two production findings surfaced during this refactor:
Banner tolerance. Some MCP servers emit a startup message to stdout before any JSON-RPC communication begins. Invisible when a wrapper absorbs it. The moment you remove the wrapper and communicate with the server directly over the raw protocol, it breaks the JSON-RPC reader. The pipe infrastructure needs to tolerate non-conformant server output gracefully - silently by default, visible with verbose: true on the server config.
Self-healing branches. Removing wrapper scripts exposed conditional routing requirements that had previously been handled by imperative if/else logic inside the scripts. A create run that detects the output artifact already exists should not fail - it should route to an update sequence automatically. An update run that cannot find the source artifact should route back to create. These are state transitions, not errors. The pipe format currently cannot express this. Adding type: "validator" nodes with branches is Phase 2's primary target.
On independent axes of change. The refactor clarified how this system scales. A CPP pipeline has three independently swappable layers:
- Nodes - what each step does. A dumb stdin/stdout tool, unaware of the pipeline around it.
- pipes.json - the topology. How steps connect, branch, and route. Changed by editing the map - no code changes, no recompile, no redeploy.
- MCP servers - the capability behind each tool call. Swapped by changing a server key - no imports, no dependency declarations, no build cycle.
These three layers evolve on entirely independent cycles. For MCP nodes specifically: every MCP server speaks the same protocol - JSON-RPC, tools/call, text response. The pipe depends on what speaks the protocol, not on what implements the service.
In a script, you depend on what you import. In a pipe, you depend on what speaks the protocol.
The Workflow Shape
For design and design-system teams, the concrete node sequence:
1. component-preflight - parse design file URL, establish artifact paths
2. component-extract - call Figma MCP tools, produce extraction manifest
3. component-spec-seed - assemble spec input, apply correction rules
4. component-run - validate hard gates, promote to canonical spec
5. context-summary-query - retain only confidence, validation status, gaps
6. semantic-sift-cli semantic --rate <target> - compress for implementation handoff
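The summary-query step in that sequence is essentially a projection. A minimal sketch, assuming a run artifact shaped like the dictionary below; the field names and the noise fields are hypothetical, not the real artifact schema.

```python
import json

KEEP = ("confidence", "validation_status", "gaps")  # hypothetical field names


def summarise(run_artifact: dict) -> dict:
    """Project a run artifact down to the fields the next step reasons over."""
    return {key: run_artifact[key] for key in KEEP if key in run_artifact}


artifact = {
    "confidence": 0.92,
    "validation_status": "passed",
    "gaps": ["hover state not covered"],
    "layer_tree": {"children": ["..."] * 50},   # noise for the implementation step
    "code_connect": {"refs": ["..."] * 20},     # noise for the implementation step
}

summary = summarise(artifact)
before, after = len(json.dumps(artifact)), len(json.dumps(summary))
print(summary)
print(f"{before} -> {after} characters")
```

Doing the projection before compression means the sift step works on fields that were chosen deliberately, not on whatever survived from the raw extraction.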
Minimal reproducible example - a three-node flow without any design tooling dependency:
    {
      "name": "context-summarize-demo",
      "nodes": [
        { "type": "mcp", "server": "your-server", "tool": "search_or_query",
          "args": { "queries": ["validation", "gaps"] } },
        { "cmd": "your-transform-step", "args": ["--normalize"] },
        { "cmd": "semantic-sift-cli", "args": ["semantic", "--rate", "0.4"] }
      ]
    }

Success criteria: output remains useful for the next step; reduction metrics are visible in the trace; the same pipe definition is reusable across components.
Parity-critical extension for complex components:
- Extract both the component-set node and at least one real composition or example node.
- Validate not only visual structure but behavioral state - slot visibility, grouped row logic, pagination boundaries.
- Keep a visual QA pass separate from the main generation run so deterministic spec production is never blocked by screenshot availability.
The Contracts That Made It Reliable
The largest quality gains did not come from a better model. They came from explicit contracts.
Before contracts: results that were mostly right - every output manually verified because any part could have drifted. After contracts: results that are either provably right or explicitly failed at a named gate. The difference in cognitive experience is significant. "Mostly right" requires constant vigilance. "Provably right or explicitly failed" frees attention - the mental bandwidth to focus on the design, not the pipeline.
Extraction contract - required artifacts are fixed. Missing required extraction signals block promotion. The AI cannot proceed to spec authoring on incomplete data.
Validation contract - schema validity, token resolvability, fallback consistency, and unresolved bindings are hard gates. Implementation starts only after all gates pass. This is the equivalent of a design review sign-off - but deterministic.
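A hard gate of this kind might look like the sketch below. It covers only two of the four checks named above (token resolvability and fallback consistency), and the spec shape, the token table, and the gate names are assumptions for illustration rather than the pipeline's real validators.

```python
from typing import Callable

Spec = dict
Gate = Callable[[Spec], list[str]]  # returns a list of problems, empty if passed

# Hypothetical token table: canonical token path -> resolved value.
KNOWN_TOKENS = {"color.action.primary.background": "#0055ff"}


def token_resolvable(spec: Spec) -> list[str]:
    """Every token path in the spec must exist in the token table."""
    return [f"unknown token {b['token']}" for b in spec["bindings"]
            if b["token"] not in KNOWN_TOKENS]


def fallback_consistent(spec: Spec) -> list[str]:
    """A recorded fallback must equal the value the token resolves to."""
    return [f"fallback mismatch for {b['token']}" for b in spec["bindings"]
            if b["token"] in KNOWN_TOKENS and KNOWN_TOKENS[b["token"]] != b["fallback"]]


def run_gates(spec: Spec, gates: list[tuple[str, Gate]]) -> tuple[bool, dict]:
    """Run every gate; promotion is allowed only if all of them pass."""
    report = {name: gate(spec) for name, gate in gates}
    return all(not problems for problems in report.values()), report


spec = {"bindings": [
    {"token": "color.action.primary.background", "fallback": "#0055ff"},
    {"token": "color.icon.unknown", "fallback": "#ffffff"},
]}
ok, report = run_gates(spec, [("token-resolvability", token_resolvable),
                              ("fallback-consistency", fallback_consistent)])
print(ok)      # False
print(report)
```

The report names the gate that failed, which is the property that later makes resuming from a gate possible.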
Correction contract - manual parity fixes are recorded as deterministic rules and reapplied automatically on every subsequent run. When a token binding is corrected - an icon color that mapped to a hardcoded literal instead of a design token - the correction is recorded once. Every run after that applies it without involvement.
This is the contract with the most compounding value. The first correction costs full attention. The tenth is automatic. The fiftieth means the pipeline understands the design system's token vocabulary better than a developer who just joined the project. The correction rules are a codified record of domain knowledge - auditable, versioned, and permanently applied. You are not fixing bugs. You are teaching the system the specific priors it needs to reason accurately about this design system, one correction at a time, permanently.
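One way to picture a recorded correction is as a small data rule that the spec-seed step replays. The rule format below (component, property, matched literal, target token) is a hypothetical shape chosen for this sketch; the actual rule files are not shown in this record.

```python
import json

# Hypothetical rule file content: each rule rewrites one wrong binding.
RULES_JSON = """
[
  {"component": "Button", "prop": "icon-color",
   "match": "#ffffff", "token": "color.icon.on-action"}
]
"""


def apply_corrections(component: str, bindings: list[dict], rules: list[dict]) -> list[dict]:
    """Return a copy of the bindings with every matching correction applied."""
    corrected = []
    for binding in bindings:
        fixed = dict(binding)
        for rule in rules:
            if (rule["component"] == component
                    and rule["prop"] == binding["prop"]
                    and binding.get("token") is None      # extractor found only a literal
                    and binding["fallback"] == rule["match"]):
                fixed["token"] = rule["token"]
        corrected.append(fixed)
    return corrected


rules = json.loads(RULES_JSON)
extracted = [{"prop": "icon-color", "token": None, "fallback": "#ffffff"}]
print(apply_corrections("Button", extracted, rules))
```

Because the rule only fires when the token is still missing, replaying it on an already corrected spec changes nothing, which keeps repeated runs deterministic.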
The variable resolution problem is structural to how this class of design tool represents bindings in its extraction output. A component node's bound properties reference internal variable IDs - not semantic names. Resolving an ID to a semantic token path requires a multi-step join: the ID maps to a variable definition in a separate collection; the definition maps to a value; that value must be resolved per combination of brand, platform, and theme. For a design system targeting multiple brands and platforms simultaneously, this is a large combinatorial join with no single authoritative path. The tool that automates this join was tier-locked.
The tier-locked constraint made this contract load-bearing rather than supplementary. Without automated variable resolution, every token binding the extractor got wrong had to be caught and recorded manually. What started as a workaround became a design principle: a hand-crafted layer of domain knowledge that no automated tool would have produced with the same precision or specificity. The constraint forced a more durable solution than the shortcut would have allowed.
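The multi-step join described above, with the hand-maintained reference mapping standing in for the tier-locked tool, can be sketched as follows. Every ID, variable name, mode, and value here is made up for illustration.

```python
# Variable ID -> definition, as it might appear in a separate collection.
VARIABLES = {
    "VariableID:1:10": {
        "name": "action/primary/bg",
        "values": {
            ("brand-a", "web", "light"): "#0055ff",
            ("brand-a", "web", "dark"): "#4d8dff",
        },
    },
}

# Hand-maintained reference mapping: variable name -> canonical token path.
TOKEN_PATHS = {"action/primary/bg": "color.action.primary.background"}


def resolve(variable_id: str, brand: str, platform: str, theme: str) -> dict:
    """Join ID -> definition -> token path, then pick the value for one mode."""
    definition = VARIABLES[variable_id]                      # step 1: ID -> definition
    token = TOKEN_PATHS[definition["name"]]                  # step 2: name -> token path
    value = definition["values"][(brand, platform, theme)]   # step 3: per-mode value
    return {"token": token, "fallback": value}


print(resolve("VariableID:1:10", "brand-a", "web", "dark"))
```

A missing entry at any step raises a `KeyError`, which in a gated pipeline would surface as an unresolved binding rather than a silently wrong token.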
Handoff contract - every high-volume boundary includes a compression step. Agent-to-agent handoff uses explicit compression policy, not implicit prompt trimming.
Verification contract - behavior and state checks, not only screenshots, are mandatory. Structural parity stories are required for composite components.
Resume contract - resume from a failed gate by default. Full rerun is reserved for missing or corrupted core artifacts. This alone eliminated most of the retry cost.
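The resume rule reduces to a small decision over recorded gate status. A sketch, assuming gate results are stored as a name-to-status mapping; the status values and the `core_artifacts_ok` flag are hypothetical.

```python
from typing import Optional

GATES = ["preflight", "extract", "spec-seed", "run", "query-summary", "sift"]


def resume_point(status: dict[str, str]) -> Optional[str]:
    """Return the first gate that has not passed, or None if all passed."""
    for gate in GATES:
        if status.get(gate) != "passed":
            return gate
    return None


def gates_to_run(status: dict[str, str], core_artifacts_ok: bool) -> list[str]:
    """Resume from the failed gate unless core artifacts are missing or corrupted."""
    if not core_artifacts_ok:
        return list(GATES)  # full rerun
    start = resume_point(status)
    return [] if start is None else GATES[GATES.index(start):]


status = {"preflight": "passed", "extract": "passed",
          "spec-seed": "passed", "run": "failed"}
print(gates_to_run(status, core_artifacts_ok=True))   # ['run', 'query-summary', 'sift']
print(gates_to_run(status, core_artifacts_ok=False))  # all six gates
```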
Architecture-First Parity
A pattern that proved useful across components: map the UI into semantic runtime roles rather than design-tool names. Define a deterministic state matrix before writing a line of implementation:
- Segment selection → active content set
- Filter toggles → eligibility rules
- Pagination → page count and clamping rules
- Empty state → fallback rendering behaviour
This makes parity measurable and implementation-agnostic. It also makes it possible to validate not just that a component renders but that it behaves correctly across all states - which is where most parity failures actually hide.
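The pagination row of such a matrix can be written down as executable expectations before any component code exists. The rule that an empty list still has one page is an assumption made for this sketch; a real design system may define the empty state differently.

```python
import math


def page_count(total_items: int, page_size: int) -> int:
    """At least one page, so the empty state still has a page to render on."""
    return max(1, math.ceil(total_items / page_size))


def clamp_page(requested: int, total_items: int, page_size: int) -> int:
    """Clamp a 1-based page index into the valid range."""
    return min(max(1, requested), page_count(total_items, page_size))


# (total_items, page_size, requested_page) -> page actually shown
STATE_MATRIX = {
    (0, 10, 1): 1,    # empty state
    (25, 10, 3): 3,   # last page, partially filled
    (25, 10, 4): 3,   # past the end, clamps down
    (25, 10, 0): 1,   # below the start, clamps up
}

for (total, size, requested), expected in STATE_MATRIX.items():
    assert clamp_page(requested, total, size) == expected
print("state matrix holds")
```

The same matrix can be checked against any platform's implementation, which is what makes the parity claim independent of a particular runtime.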
Token Economics and Phase 1 Results
Token savings are not only a cost metric. Paired with context design, they are a quality metric.
Cost and scale effect - fewer tokens across extraction, summarisation, and handoff stages. Lower per-run cost as premium request budgets tighten.
Reasoning and performance effect - less low-signal noise in context windows. Higher density of relevant signals per token. Better downstream reasoning consistency across multi-agent handoffs.
From the context-pipe balance sheet during the period documented in Phase 1:
| Metric | Value |
|---|---|
| Noise reduced | 154,893 characters |
| Signal injected | 151 characters |
| Net context saved | 154,742 characters |
| Events | 262 |
| Approximate tokens saved | ~38,000-44,000 |
With retry factor (1.5×-2.0× for real agentic workflows): 57k-88k avoided tokens for the same period (38k × 1.5 at the low end, 44k × 2.0 at the high end).
The avoided tokens are not savings on neutral content. They are low-signal payload that would otherwise compete with the reasoning context that drives correct output. Better context means fewer retries. Fewer retries mean better context. The savings compound.
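The character-to-token estimate in the balance sheet can be reproduced with simple arithmetic. The character counts come from the table; the conversion ratios of 4.0 and 3.5 characters per token are assumptions, since real tokenizers vary by model and content.

```python
noise_reduced = 154_893    # characters removed, from the balance sheet
signal_injected = 151      # characters added, from the balance sheet
net_saved = noise_reduced - signal_injected

print(f"net context saved: {net_saved:,} characters")

# Assumed conversion ratios; real tokenizers vary by model and content.
for chars_per_token in (4.0, 3.5):
    tokens = net_saved / chars_per_token
    print(f"{chars_per_token} chars/token -> ~{tokens:,.0f} tokens")
```

Under those two assumed ratios the estimate lands at roughly 38.7k and 44.2k tokens, which is consistent with the approximate range reported in the table.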
Before and after:
| Dimension | Before | After | Why it matters |
|---|---|---|---|
| Context size | Large payloads pass through unchanged | Boundary compression applied intentionally | Lower token spend, fewer overflows |
| Failure isolation | Failures opaque and cross-step | Node-level gates with deterministic artifacts | Faster triage, safer retries |
| Rerun cost | Frequent full reruns | Resume from failed gate by default | Less compute and time waste |
| Auditability | Logic hidden in scripts and prompts | Pipe and artifact trail is inspectable | Better governance and reproducibility |
| Designer trust | Visual checks miss behavioral drift | Explicit parity contracts and state matrix | Higher confidence in intent preservation |
| Agent handoff quality | Ad-hoc trimming between agents | Explicit A2A compression policy | Better reasoning continuity |
| Cognitive load | Constant vigilance over "mostly right" | Trust gates, investigate failures | Attention back on the design, not the pipeline |
Phase 2: Self-Healing Pipeline
Objective: Add conditional routing, validator nodes, and banner tolerance. Begin expanding toward the universal design system.
Status: In progress
Validator Nodes and Self-Healing Branches
The primary target for Phase 2 is conditional routing expressed declaratively in pipes.json.
The production pattern that exposed the need: a create pipe that detects the output artifact already exists should not fail - it should route to an update sequence. An update pipe that cannot find the source artifact should route back to create. These are state transitions. A script handles them with if/else embedded in code. A pipe should handle them by declaring the branches - so the recovery logic is readable, auditable, and modifiable without touching any node.
The design: type: "validator" nodes with a branches key. The orchestrator evaluates exit codes and routes to named sequences. No subprocess logic. No embedded conditionals. The topology owns the routing; the nodes remain unaware.
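Since this feature is still a Phase 2 target, the following is only a sketch of how exit-code routing could behave, not the pipes.json syntax or the Context-Pipe orchestrator. The declaration shape, the check names, the terminal sequence names, and the hop limit are all hypothetical.

```python
from typing import Callable

# Hypothetical declaration: validator name -> check to run and exit-code branches.
# In a JSON file the branch keys would be strings; ints are used here for brevity.
PIPE = {
    "entry": "create",
    "validators": {
        "create": {"type": "validator", "check": "artifact_absent",
                   "branches": {0: "run-create", 1: "update"}},
        "update": {"type": "validator", "check": "source_present",
                   "branches": {0: "run-update", 1: "create"}},
    },
}


def route(pipe: dict, checks: dict[str, Callable[[], int]], max_hops: int = 4) -> str:
    """Follow validator branches from the entry point to a work sequence."""
    current = pipe["entry"]
    for _ in range(max_hops):
        validator = pipe["validators"].get(current)
        if validator is None:        # not a validator: a terminal work sequence
            return current
        exit_code = checks[validator["check"]]()
        current = validator["branches"][exit_code]
    raise RuntimeError("routing did not settle; check the declared branches")


# Simulated state: the output artifact already exists and the source is available.
checks = {"artifact_absent": lambda: 1, "source_present": lambda: 0}
print(route(PIPE, checks))  # run-update
```

The checks return plain exit codes and know nothing about where they lead; all routing knowledge sits in the declaration. The hop limit is a guard added for this sketch so that contradictory state (artifact present, source missing) fails loudly instead of looping.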
A script embeds its recovery. A pipe declares it.
Banner Tolerance
The secondary target: graceful handling of MCP servers that emit non-JSON-RPC output to stdout before the protocol handshake. Silently skipped by default. Visible via verbose: true on the server config. No pipe or node changes required - the tolerance lives in the MCP node runner.
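A tolerant reader might look like the sketch below. It assumes newline-delimited JSON-RPC 2.0 messages on stdout and skips any line that is not one, wherever it appears; the function name and the shape of the verbose output are illustrative, not the actual MCP node runner.

```python
import json
import sys
from typing import Iterable, Iterator


def jsonrpc_messages(lines: Iterable[str], verbose: bool = False) -> Iterator[dict]:
    """Yield JSON-RPC messages, skipping any line that is not one."""
    for line in lines:
        text = line.strip()
        if not text:
            continue
        try:
            message = json.loads(text)
        except json.JSONDecodeError:
            message = None
        if isinstance(message, dict) and message.get("jsonrpc") == "2.0":
            yield message
        elif verbose:
            # Surface what was skipped without polluting the protocol stream.
            print(f"[skipped non-protocol output] {text}", file=sys.stderr)


server_stdout = [
    "Example MCP Server starting...\n",              # banner, not JSON
    '{"jsonrpc": "2.0", "id": 1, "result": {}}\n',   # first real message
]
print(list(jsonrpc_messages(server_stdout)))
```

Passing `verbose=True` writes the skipped lines to stderr, which mirrors the silent-by-default, visible-on-request behaviour described above.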
Expanding Toward the Universal Design System
The longer Phase 2 target is expanding the pipeline toward the universal design system - the same token vocabulary generating components across web, mobile, and embedded platform targets from a single design specification.
The current source is a single design tool. The pipeline is the mechanism. The design system is the destination. The token contract that makes platform-agnostic generation possible, and the node additions required to route from one extraction source to multiple platform targets, will be documented here as Phase 2 progresses.
Evidence and Replicability
The methodology in this use case is replicable now for:
- Pipe composition and boundary-first compression.
- Artifact-first audits with deterministic gates.
- Correction feedback loops that teach the system domain-specific priors.
What requires abstraction before general sharing:
- Domain-specific scripts and naming conventions.
- Design-system token vocabulary and correction rules.
High-value standardisation targets for the community:
- Pipe run artifact schema: nodes, status, metrics, outputs.
- Gate taxonomy: extraction, validation, implementation, verification.
- Handoff metadata: from-agent, to-agent, compression profile, reduction.
- Minimal parity matrix schema for design-to-code workflows.
The Correction Loop as Studio of Two in Practice
The correction contract is where the framework becomes concrete in this use case.
The AI's extraction priors - its trained expectation of what a design tool's token path looks like - do not always match a specific design system's vocabulary. The correction rules are not bug fixes. They are expectation alignments: teaching the system the specific priors it needs to reason accurately about this design system.
The first correction costs full attention. The pipeline improves permanently. Over time, the system's understanding of the design system compounds - not because the model was retrained, but because domain knowledge was codified into the pipeline itself.
This is what "Systems, not Patches" means in practice. Not a correction for each component - a correction system that compounds across every component. Not an optimised prompt - a pipeline where the signal arriving at each reasoning step is already dense and verified.
The tool is finally fast enough to keep up with the thought.
Open source infrastructure: Context-Pipe · Semantic-Sift