The ai-executive spec: cross-review system
The spec prompt to paste into Claude Code Desktop during Part 1. Three phases — independent evaluation, peer cross-review, synthesis. The sample output on the 300-million-yen MMORPG investment question shows what the tool actually produces in about forty-five seconds.
Why "generate from a spec"
The original plan was to clone ~/Products/ai-executive onto each executive's laptop and run docker compose up. But the laptops MIXI hands to executives aren't signed into GitHub. Neither git clone nor gh repo clone will run. So we flipped it: Part 1 of the workshop now has Claude Code Desktop generate the whole app from an empty folder.
- No GitHub account needed — the device policy wall goes away, and every executive's MacBook boots the same way
- The demo lands harder — watching Claude draw the app in front of you beats running a pre-made repo for "I could do this too" impact
- If you can read and write the spec, you have the tool — a small preview of what 2026 software development is converging toward: "spec → generate → adjust"
- It reproduces easily — paste the §5 prompt at home, and the executive re-runs the whole thing solo
Where ai-executive (reference) fits
The full version under ~/Products/ai-executive is a multi-agent system for MIXI strategy work: 17 agents (16 standard + 1 company-specific), 12 strategy frameworks, 3 analysis modes, 4 interfaces (Web / CLI / REST / MCP). That's the reference implementation — it's there to show the summit, the direction this can grow in. We don't clone it during the workshop. It's an inspiration source, nothing more.
The difference between the reference and the mini version we generate live:
| Item | Reference (full) | Workshop (mini) |
|---|---|---|
| Agents | 17 | 4 (CEO / CFO / CTO + Facilitator) |
| File count | 150+ | ≤ 16 files |
| Languages / stack | Python + TypeScript (Next.js) | Python only (+ one HTML file) |
| Frameworks | 12 (SWOT, Porter, 3C, and so on) | None — free-form + a JSON schema |
| Flow | 3 modes | 3 phases (independent eval → cross-review → synthesis) |
| Transport | REST + WebSocket | Server-Sent Events streaming |
| Startup | docker compose up (backend + frontend) | docker compose up (one container) |
What we're generating (executive-facing summary)
The outline of the mini version, at the level of detail you'd use to brief an executive:
- Goal — a very simple AI board tool that evaluates an item from the CEO / CFO / CTO perspectives in parallel
- Input — one agenda item (e.g. "Should we invest 300 million yen in new game X?")
- Processing — three agents running in parallel on the OpenAI API (gpt-4o-mini), each producing a 3–5 line assessment
- Output — a Markdown report with three sections side by side (## CEO / ## CFO / ## CTO)
- Runtime — one docker compose up, served on localhost:8000 (backend and static frontend from the same container)
- Execution time — 40–70 seconds per agenda item (seven gpt-4o-mini calls across three phases)
Sample output (should we invest 300M yen in new game X?)
What the tool actually produces — the concrete example that converts a skeptical executive in 30 seconds. Below is a dummy sample of what you'd expect from gpt-4o-mini (don't use this for real investment decisions).
Phase 1: Independent evaluation (completed in 14.2s)
CEO (stance: conditional yes)
MMORPG sits well alongside MIXI's existing IP, and the three-year portfolio diversification plan argues for the move. But 300M yen at a moment when competitor A could ship around the same time means the differentiation story gets weaker. It's a conditional yes — only if we can front-load two existing IP characters and build brand advantage early.
Key points: reuse of existing IP, first-mover window vs competitor A, portfolio expansion.
Risks: launch timing, brand dilution.
CFO (stance: no)
A 24-month payback window on an MMORPG is optimistic. Industry average is 36–48 months, and for Japanese titles at this scale, over 60% miss their DAU targets. Also, 300M yen covers dev only — post-launch marketing plus ops will reliably pull in another 200–300M yen. Total investment of 500–600M yen, and IRR likely drops below our 15% company floor. The numbers are loose.
Key points: payback assumption, hidden costs, IRR below hurdle.
Risks: snowballing add-on investment, opportunity cost.
Quantitative note: 24-month payback assumed, industry median is 42 months.
CTO (stance: holding judgment)
MMORPG's technical stack diverges from MIXI's existing assets (server sync, world persistence, anti-cheat). Twelve months assumes 8 in-house engineers full-time, but that collides with our existing project priorities. Contracting an external studio is possible, but MIXI inherits the operational technical debt afterward.
Key points: stack divergence, resource priority clash, operational debt inheritance.
Risks: 12-month schedule gap, long-term technical debt.
Phase 2: Cross-review (completed in 18.4s)
CEO (reviewing CFO and CTO)
Agreements: CFO is right — 24-month payback is optimistic, it's out of line with industry benchmarks.
Blind spots: CFO doesn't model CAC reduction from existing IP reuse. If even 10% of MIXI's current users convert to the MMORPG, CAC drops ~60%.
Counterpoints: CTO's "external studio only" stance is too cautious. MIXI's technical brand depends on owning this kind of work in-house and taking on the debt.
Question: Can we actually imagine a 3-year IP strategy that doesn't include an MMORPG?
CFO (reviewing CEO and CTO)
Agreements: CEO's "front-load existing IP" is a real differentiation lever. CTO's point about long-term tech debt being a direct cost driver is important.
Blind spots: CEO hasn't priced in macro risks beyond competitor A (weak yen, PC/console user growth slowing). CTO hasn't put numbers on contract costs or IP licensing fees for an external studio.
Counterpoints: CEO's "portfolio diversification" is abstract. Go back to the corporate plan and point at the evidence that current businesses are decelerating.
Question: What's the IRR if we put the same 300M yen into improving existing titles' operations instead?
CTO (reviewing CEO and CFO)
Agreements: CFO's "hidden costs" is realistic. MMORPG ops from year three onward is maintenance-heavy. Agree with CEO on front-loading IP.
Blind spots: Neither CEO nor CFO accounts for AI coding assistants cutting development time (tools like today's Claude Code). In 2026 you can expect ~30% improvement in dev efficiency.
Counterpoints: CEO's "we'll bring it in-house" is decided without looking at current team utilization.
Question: With the 8 engineers already fully loaded, how does this rank against other large projects (core system replacement, etc.)?
Phase 3: Synthesis (completed in 8.1s)
Consensus:
1. MMORPG as new IP has market potential.
2. Existing IP can drive differentiation (all three agree).
3. The current 300M-yen, 24-month-payback assumption is open to question.
CFO "optimistic payback / IRR miss" vs CEO "strategic necessity" → restructure as staged investment (Phase 1 prototype 50M yen → KPI-gated 200M yen GO/NO-GO → remaining 50M yen for launch).
CTO "stack divergence" vs CEO "bring it in-house" → hybrid: outsource core components, MIXI builds an in-house wrapper.
Actions:
- Rebuild the staged-investment numbers — Owner: CFO. Deadline: 2026 Q3. Success metric: 3-phase investment scenario + IRR sensitivity approved at the management meeting.
- Technical stack survey and due diligence on 2 outsourcing candidates — Owner: CTO. Deadline: 2026 Q3. Success metric: architecture option + 2-candidate RFP complete.
- Simulation of existing-user conversion to the new title — Owner: Marketing. Deadline: 2026 Q4. Success metric: survey of 100 current users + quantified CAC reduction estimate.
Residual risks and monitoring:
1. Competitor A launches earlier than 2027 Q2 → monthly competitive research.
2. In-house engineer priority collision → quarterly EM utilization review.
3. AI dev-efficiency gains come in below estimate → measure Claude Code hours and PRs merged over 3 months.
Decision recommendation:
Strategic intent is agreed by all three. But "300M yen, 24-month payback" as currently specified is a NO-GO. Once the three conditions — redesigned staged investment, technical DD completed, IP conversion quantified — are met, bring it back for a formal GO/NO-GO.
Why this sample works:
- Taken alone, the three stances (CEO "conditional yes" / CFO "no" / CTO "holding judgment") deadlock the discussion.
- Phase 2 cross-review surfaces what none of them saw alone (CAC reduction from IP conversion, macro risks, AI dev efficiency).
- Phase 3 synthesis reframes "yes vs no" as "staged investment + three conditions" — the discussion now has forward motion.
- A debate that usually takes 60–90 minutes of board meeting lands in 40 seconds of runtime + 3–5 minutes of reading.
File layout (expected output)
The directory tree Claude produces should land close to this. The spec caps it at "≤ 16 files, ≤ 700 lines", so results converge here.
ai-executive/
├── .env.example
├── .env               # written by the app via /settings, gitignored
├── .gitignore
├── README.md
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── app/
│   ├── __init__.py
│   ├── main.py        # FastAPI entrypoint + SSE streaming
│   ├── agents.py      # the 4 agents (CEO/CFO/CTO/Facilitator)
│   ├── pipeline.py    # 3-phase pipeline orchestration
│   ├── settings.py    # settings API (/api/settings/*)
│   ├── env_writer.py  # atomic .env write + gitignore check
│   ├── models.py      # Pydantic schemas
│   └── prompts.py     # system prompts in one place
└── static/
    ├── index.html     # single-page UI (SSE receiver + Markdown rendering)
    └── settings.html  # API key settings UI
Total ≤ 16 generated files (.env is created by the app at runtime), ≤ 700 lines of code. app/pipeline.py is the core — within Phase 1, and again within Phase 2, asyncio.gather fans the three agents out in parallel; Phase 3 runs serially through the Facilitator. static/index.html receives Server-Sent Events and lights up each phase as it completes; the final report is rendered with marked.js.
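For orientation (this is not part of the paste-able spec), here is a minimal sketch of the shape app/pipeline.py tends to converge to. The ask helper and the event dicts are illustrative assumptions; only the event names come from the SSE spec in §5.

```python
# Illustrative sketch of app/pipeline.py's 3-phase orchestration, not the
# generated file verbatim. Assumes agents.py exposes an async helper
# `ask(role, payload) -> dict` that calls AsyncOpenAI with that role's
# system prompt and returns the parsed JSON output.
import asyncio
from typing import AsyncIterator, Awaitable, Callable

ROLES = ["CEO", "CFO", "CTO"]
Ask = Callable[[str, dict], Awaitable[dict]]

async def run_pipeline(agenda: str, ask: Ask) -> AsyncIterator[dict]:
    # Phase 1: three independent evaluations, fanned out in parallel.
    yield {"event": "phase1_start"}
    evals = dict(zip(ROLES, await asyncio.gather(
        *(ask(role, {"agenda": agenda}) for role in ROLES))))
    for role, ev in evals.items():
        yield {"event": "agent_eval", "role": role, "data": ev}
    yield {"event": "phase1_done"}

    # Phase 2: each agent reviews the other two, again in parallel.
    yield {"event": "phase2_start"}
    reviews = dict(zip(ROLES, await asyncio.gather(*(
        ask(role, {"agenda": agenda,
                   "peers": {r: evals[r] for r in ROLES if r != role}})
        for role in ROLES))))
    for role, rv in reviews.items():
        yield {"event": "cross_review", "role": role, "data": rv}
    yield {"event": "phase2_done"}

    # Phase 3: the Facilitator synthesizes everything, serially.
    yield {"event": "phase3_start"}
    synthesis = await ask("Facilitator",
                          {"agenda": agenda, "evals": evals, "reviews": reviews})
    yield {"event": "synthesis", "data": synthesis}
    yield {"event": "phase3_done"}
    yield {"event": "complete"}
```

In main.py, FastAPI wraps this generator in a StreamingResponse with media_type="text/event-stream"; the frontend's EventSource does the rest.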
Claude sometimes proposes a tests/ directory or a more elaborate package split. Given the time budget, say "skip that for now, keep to MVP" in Plan mode. Extensions can wait until the executive gets home.
The spec prompt (copy and paste)
This block is the single most important artifact in the workshop. Open Claude Code Desktop's Code tab, set the project folder to ~/workshop-demo/, and paste the whole thing. Turn Plan mode ON before sending, so it pauses for file-list review. Approve, and generation begins.
Create a new empty folder at ~/workshop-demo/ai-executive/, and build the MIXI "AI Executive cross-review agenda-evaluation system" inside it.
## Product goal
A practical management-support tool that evaluates agenda items from three executive perspectives (CEO / CFO / CTO), runs peer cross-review, then synthesizes. The core value: surface blind spots that individual evaluations miss, through executive cross-review.
## Three-phase processing
### Phase 1: Independent evaluation (parallel)
Each agent (CEO / CFO / CTO) evaluates the agenda item independently, without seeing the others' output.
Output fields:
- stance: "yes" / "no" / "conditional yes" / "holding judgment"
- summary: 2–3 lines of assessment
- key_points: main points (3 bullets)
- risks: concerning risks (2 bullets)
- quantitative_notes: quantitative notes (required for CFO, optional for others)
### Phase 2: Cross-review (parallel)
Each agent reads the other two evaluations from Phase 1 and produces a peer review covering:
- agreements: points of agreement (1–2)
- blind_spots: blind spots and pointed questions (2–3) ← the core value of this tool
- counter_arguments: counterpoints (0–2)
- questions: follow-up questions / information requests (1–2)
### Phase 3: Synthesis (serial)
A Facilitator agent reads all Phase 1 + Phase 2 output and produces:
- consensus: points of agreement (2–4)
- conflicts: points of conflict + recommended resolution
- actions: 3 actionable items (title / owner / deadline / success_metric)
- residual_risks: residual risks and monitoring metrics
- decision_recommendation: "GO" / "NO-GO" / "conditional GO" + one-line reason
## Agent specs
### CEO Agent (key: OPENAI_API_KEY_CEO)
System prompt: "You are MIXI's CEO. Your evaluation perspective covers (1) alignment with the 3-year vision, (2) building competitive advantage, (3) long-term shareholder value. Weight 'why now' over the numbers. Put executive intuition into words. Always respond in the specified JSON schema."
### CFO Agent (key: OPENAI_API_KEY_CFO)
System prompt: "You are MIXI's CFO. Your evaluation perspective covers (1) payback period (ROI, IRR), (2) quantifying business risks, (3) financial discipline. Strip out unfounded optimism and speak in numbers. Always cover the pessimistic scenario. Always respond in the specified JSON schema."
### CTO Agent (key: OPENAI_API_KEY_CTO)
System prompt: "You are MIXI's CTO. Your evaluation perspective covers (1) technical feasibility, (2) scalability ceilings, (3) load on the engineering org. Imagine the ops load six months in. Be honest about technical debt. Always respond in the specified JSON schema."
### Facilitator Agent (key: OPENAI_API_KEY)
System prompt: "You are the strategy-meeting facilitator at MIXI. Read the three executives' independent evaluations + cross-reviews and produce a synthesized recommendation. State points of agreement, points of conflict, and action items clearly. When opinions split, don't force consensus — surface the conflict as-is and propose a resolution direction. Always respond in the specified JSON schema."
## API key management: web-UI settings page
### Important: don't ask anyone to edit .env directly
- Neither the facilitator nor the executives open .env by hand
- API keys are entered via the browser /settings page
- The app persists them safely to .env behind the scenes
### Settings page requirements
GET /settings returns HTML (or static/settings.html):
- Title "API key settings"
- 4 password fields (all <input type="password">):
- OPENAI_API_KEY (default / for the Facilitator)
- OPENAI_API_KEY_CEO / _CFO / _CTO
- A show/hide toggle next to each field, and a "Validate & Save" button below
- Existing keys show only ••••abc1 (last 4 chars), masked
- A rotate button for each field
- .gitignore status at the bottom (green check = .env excluded)
- The 6 principles quick-reference at the bottom
### API endpoints
- GET /api/settings/status → { all_keys_set, masked, gitignore_ok }
- POST /api/settings → validate (OpenAI test call per key) → atomic .env write → auto-append to .gitignore → hot-apply in memory
- POST /api/settings/rotate → replace only the specified key
- GET /api/settings/gitignore-check → is .env gitignored?
### Startup flow
1. GET / calls /api/settings/status
2. If all_keys_set is false, redirect to /settings?first_run=true
3. After setup, return to /
### Security
- /settings endpoints accept only localhost (docker binds 127.0.0.1:8000:8000)
- Never log or render key values (masked last-4 chars only)
- Regex-check the OpenAI key format
- chmod 600 on .env
- Reject keys that fail validation
### Why this approach
- Executive UX first. Especially on Windows, eliminate the friction of editing dotfiles
- .env stays compatible with Docker Compose as the standard
- All 6 principles are honored (UI design makes them easier to honor)
## Technical requirements
### Backend
- Python 3.11+
- FastAPI (async, asyncio.gather for parallel Phase 1 / Phase 2)
- OpenAI SDK (openai>=1.30) via AsyncOpenAI
- Pydantic v2 for typed models
- python-dotenv to load .env
- Use response_format={"type":"json_object"} for structured OpenAI output
### Frontend
- Single static/index.html
- Vanilla JS + fetch()
- SSE (Server-Sent Events) streaming
- UI updates as each phase/agent completes (progress bar + agent card lights up)
- Markdown rendering: marked.js via CDN
- "Copy to clipboard" button
### API spec (POST /api/review, SSE)
Request: { "agenda": "Should we invest 300M yen in new game X?" }
Response (text/event-stream):
event: phase1_start / agent_eval(x3) / phase1_done
event: phase2_start / cross_review(x3) / phase2_done
event: phase3_start / synthesis / phase3_done
event: complete / error
### Error handling
- OpenAI rate limit → exponential backoff, max 3 retries (1s, 2s, 4s)
- Missing API key → detect at startup and report which keys are missing
- Per-phase timeout: 30 seconds
- Broken JSON output → one retry with "the format was invalid, please regenerate"
### Logging
- Save raw I/O to logs/{timestamp}-{request_id}.json
- Record per-phase duration and input/output token counts
### Docker
- One command: docker compose up -d
- localhost:8000 serves both the frontend and the API
- Based on python:3.11-slim
## File layout
ai-executive/
├── .env.example
├── .gitignore
├── README.md
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── app/
│   ├── __init__.py
│   ├── main.py        # FastAPI + SSE + redirect logic
│   ├── agents.py      # 4 agents
│   ├── pipeline.py    # 3-phase orchestration
│   ├── settings.py    # Settings API (/api/settings/*)
│   ├── env_writer.py  # atomic .env write + gitignore check
│   ├── models.py      # Pydantic schemas
│   └── prompts.py     # system prompts
└── static/
    ├── index.html     # agenda-evaluation UI
    └── settings.html  # API key settings UI
Total ≤ 16 files.
## Constraints
### File count and code volume
- Total files: ≤ 16
- Total code: ≤ 700 lines
### Execution time (measured)
- Full processing per agenda: 40–70 seconds typical (hard cap 90s total; per-phase timeout 30s as specified under error handling)
- Phase 1 (parallel): 8–20s
- Phase 2 (parallel, longer prompts): 10–25s
- Phase 3 (serial): 5–15s
### Cost (7 OpenAI calls: Phase 1 x3 + Phase 2 x3 + Phase 3 x1)
- gpt-4o-mini (default): $0.01–0.02 per agenda (~15–25k in / ~5–8k out tokens)
- gpt-4o: $0.25–0.50 per agenda
- DEFAULT_MODEL env var + /settings dropdown to switch models
- Per-agent model override (MODEL_CEO, MODEL_CFO, etc.)
### Phase 2 partial-failure handling (required)
If 1–2 agents fail in Phase 2 (broken JSON / timeout / rate limit):
1. One retry (with backoff)
2. If still failing, record as {"status": "unavailable", "reason": "..."}
3. Phase 3 Facilitator synthesizes with whatever cross-reviews are available
4. residual_risks must state "Only N/3 cross-reviews obtained: {missing role}'s view not reflected"
5. Mark decision_recommendation confidence as "weak"
6. UI: grey out the missing card, show a warning icon, offer retry
Never issue GO/NO-GO on 2/3 opinions silently. The missing view must be visible.
### README.md required contents
Purpose / 4-step startup / .env setup (via web UI) / API spec / license
### First-run UX target
docker compose up → /settings → evaluation screen reachable in under 60 seconds
## What to show in Plan mode
1. File list + one-line responsibility per file
2. Pseudocode for each phase (≤ 30 lines, Python-flavored)
3. API endpoint spec (SSE event list included)
4. /settings UI flow (state machine: unset / partial / all-set)
5. Settings API endpoints + validation flow
6. .env atomic write + gitignore check implementation approach
7. Error-handling strategy (4 error sources × response)
8. Pydantic model skeletons (4–5 models)
9. requirements.txt equivalent
10. Startup smoke-test procedure
After approval, generate all files. API keys will be placed later — for now, only create .env.example.
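Before moving on, it helps to see how little schema the spec actually implies. Here is a sketch of the Pydantic skeletons Plan mode step 8 should roughly produce: the model names and the Literal stance values mirror the field lists above, everything else is an assumption.

```python
# Sketch of the app/models.py skeletons implied by the spec's field lists.
from typing import Literal, Optional
from pydantic import BaseModel

Stance = Literal["yes", "no", "conditional yes", "holding judgment"]

class Evaluation(BaseModel):        # Phase 1 output, one per agent
    stance: Stance
    summary: str                    # 2–3 lines
    key_points: list[str]           # 3 bullets
    risks: list[str]                # 2 bullets
    quantitative_notes: Optional[str] = None  # enforced for the CFO in its prompt

class CrossReview(BaseModel):       # Phase 2 output, one per agent
    agreements: list[str]           # 1–2
    blind_spots: list[str]          # 2–3, the core value
    counter_arguments: list[str] = []
    questions: list[str]            # 1–2

class ActionItem(BaseModel):
    title: str
    owner: str
    deadline: str
    success_metric: str

class Synthesis(BaseModel):         # Phase 3 output from the Facilitator
    consensus: list[str]
    conflicts: list[str]
    actions: list[ActionItem]       # exactly 3 in the spec
    residual_risks: list[str]
    decision_recommendation: Literal["GO", "NO-GO", "conditional GO"]
    reason: str                     # the one-line reason
```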
Placing the API keys (6 principles in practice) — web UI approach
To remove the friction of executives (especially Windows users) editing dotfiles directly, the generated app includes a web-UI settings page (/settings). Workshop step P1-02 walks everyone through that page.
1. First launch: docker compose up -d → open http://localhost:8000 → the app auto-redirects to /settings?first_run=true
2. Paste the pre-issued keys into the four <input type="password"> fields
3. Click "Validate & Save" — the app test-calls OpenAI (models.list or similar) per key
4. All keys validate → atomic write to .env, auto-check and append to .gitignore, hot-apply in memory (no restart)
5. Success toast + a "Go to evaluation" button returns you to /
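The atomic write behind "Validate & Save" is small. Here is a stdlib-only sketch of what app/env_writer.py might look like under the spec's requirements (temp file in the same directory, chmod 600, os.replace, gitignore guard); the function names are assumptions, not the generated code.

```python
# Sketch of app/env_writer.py: atomic .env write with 0600 perms,
# plus a .gitignore guard. Names are illustrative.
import os
import tempfile
from pathlib import Path

def write_env(values: dict[str, str], env_path: Path = Path(".env")) -> None:
    body = "".join(f"{key}={value}\n" for key, value in values.items())
    # Write to a temp file on the same filesystem, then atomically swap it in.
    fd, tmp = tempfile.mkstemp(dir=env_path.parent, prefix=".env.tmp.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(body)
        os.chmod(tmp, 0o600)       # keys readable by the owner only
        os.replace(tmp, env_path)  # atomic on POSIX
    finally:
        if os.path.exists(tmp):    # clean up only if the swap never happened
            os.remove(tmp)

def ensure_gitignored(repo: Path = Path(".")) -> None:
    gitignore = repo / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if ".env" not in lines:
        with gitignore.open("a") as f:
            f.write(".env\n")
```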
How the 6 principles are honored
| # | Principle | How the web UI delivers it |
|---|---|---|
| 1 | Don't paste in chat | Password input — not visible to eyes or screen capture |
| 2 | Via .env | App writes .env atomically behind the scenes (same mechanism) |
| 3 | .gitignore check | Auto-check and append on Save, green check displayed |
| 4 | Rotation | Dedicated "Rotate" button in /settings |
| 5 | Least privilege | OpenAI-side scoping (UI doesn't touch this) |
| 6 | Prod / dev split | dev/prod tabs in /settings (planned extension) |
Startup and verification
Generation and key placement done — time to launch.
1. Start with Docker Compose:
   cd ~/workshop-demo/ai-executive
   docker compose up -d
   open http://localhost:8000
2. First run auto-redirects to /settings — enter keys into the 4 fields → Validate & Save → on success, the app writes .env, auto-checks .gitignore, and returns you to the evaluation screen.
3. Drop in a sample agenda — paste "Should we invest 300M yen in new game X?" and run. SSE streams Phase 1 → 2 → 3 across the screen in 40–45 seconds.
4. Watch Claude self-repair — first runs almost always hit a minor error (an async implementation slip, missing SSE config, broken JSON output). Paste the log to Claude with "we're getting an error" and auto-repair runs. That's the highlight of the demo.
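If the browser UI stalls, the stream itself can be checked from a terminal. Here is a stdlib-only smoke test against the spec's POST /api/review endpoint (the agenda is the sample one; adjust the port if you remapped it):

```python
# Minimal SSE smoke test for POST /api/review: prints raw event frames.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8000/api/review",
    data=json.dumps({"agenda": "Should we invest 300M yen in new game X?"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for raw in resp:  # SSE frames arrive as newline-delimited lines
        line = raw.decode("utf-8").rstrip()
        if line.startswith(("event:", "data:")):
            print(line)
```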
Typical first-run errors and how Claude responds:
| Error | Cause | Claude's response |
|---|---|---|
| ModuleNotFoundError: openai | requirements.txt missing it | Adds openai>=1.30 and proposes re-running docker compose build |
| Port 8000 is already in use | Another app is holding the port | Rewrites docker-compose.yml to 8001:8000 |
| openai.AuthenticationError | .env not loaded / key wrong | Adds python-dotenv or adds env_file: .env to compose |
| json.decoder.JSONDecodeError | Broken JSON from OpenAI | Makes response_format explicit and adds one-retry logic |
| SSE EventSource error in Chrome | CORS or buffering issue | FastAPI StreamingResponse with correct media_type |
| Phase 2 has one slow agent | No gather timeout | Adds asyncio.wait_for with a 30-second limit |
| Settings page doesn't appear | Redirect logic missing | Adds a status check on GET /, redirects to /settings when all_keys_set is false |
| .env write fails | Volume not mounted / permission | Adds .env:/app/.env (rw) volume to docker-compose.yml |
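Several of these fixes reduce to the same two patterns: exponential backoff around the OpenAI call, and asyncio.wait_for to keep one slow agent inside the phase budget. Here is a sketch of the shape Claude usually lands on in app/agents.py; the client wiring and per-role keys are simplified, and the broken-JSON path here just retries, where the spec asks for a corrective "please regenerate" message.

```python
# Sketch: backoff-wrapped structured OpenAI call with a per-call timeout.
import asyncio
import json

import openai
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY; per-role keys omitted here

async def ask(role: str, system_prompt: str, payload: dict,
              model: str = "gpt-4o-mini") -> dict:
    delays = [1, 2, 4]  # max 3 retries: 1s, 2s, 4s
    for attempt, delay in enumerate([0] + delays):
        if delay:
            await asyncio.sleep(delay)
        try:
            resp = await asyncio.wait_for(
                client.chat.completions.create(
                    model=model,
                    response_format={"type": "json_object"},
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": json.dumps(payload)},
                    ],
                ),
                timeout=30,  # keeps one slow agent inside the phase budget
            )
            return json.loads(resp.choices[0].message.content)
        except (openai.RateLimitError, asyncio.TimeoutError, json.JSONDecodeError):
            if attempt == len(delays):
                raise  # out of retries; the caller marks this agent unavailable
```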
Natural-language feature addition (P1-04): a devil's advocate field
Once startup is verified, the Part 1 climax is adding a feature by typing English. New spec: add a "devil's advocate question" to Phase 2 cross-review, tightening the mutual check against optimism bias.
Add one more item to Phase 2 cross-review:
a "devil's advocate question".
Each agent, when reviewing the other agents' evaluations,
adds one question: "Does this still hold in the worst case?"
Purpose: strengthen mutual checks on optimism bias,
with one added field.
Display the new field in the frontend UI as well.
Claude shows the diff in the Visual Diff preview: a new field on CrossReview in app/models.py, an extended prompt template in app/prompts.py, a new card in static/index.html. Click Accept → rebuild and restart with docker compose up -d --build (a plain restart won't pick up code baked into the image) → re-run the same agenda → the three cross-reviews now each carry the "worst-case" question. Done.
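On the models side, the accepted diff is literally one field. Here is a sketch against the CrossReview skeleton shown earlier; the field name devils_advocate_question is an assumption, since the spec text only names a "devil's advocate question".

```python
# Sketch of the app/models.py side of the P1-04 diff.
from pydantic import BaseModel, Field

class CrossReview(BaseModel):
    agreements: list[str]
    blind_spots: list[str]
    counter_arguments: list[str] = []
    questions: list[str]
    # New in P1-04: one worst-case challenge per cross-review.
    devils_advocate_question: str = Field(
        description="Does this still hold in the worst case?")
```

app/prompts.py gains the matching instruction and static/index.html the matching card, as the Visual Diff shows.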
Extension ideas (to try at home)
A list of "what's next" the executive can try at home. All of these need one more line in the spec prompt and Claude handles it.
- Add more agents — CMO / COO / CLO / CHRO. One line of system_prompt and Claude adds it to the agent array.
- Persist past agendas and outcomes to a DB — add SQLite, store agenda + 3 evaluations + action items. Full-text or embedding search across "similar investments in the last 3 years".
- MCP to post agendas to Slack / Drive — via MCP, post agendas to a Slack channel / Google Drive / Notion, and send results back. See MCP deep dive.
- Route the output to Claude Design for board slides — three evaluations + action items auto-flow into a slide template. See Claude Design deep dive.
- Derive system prompts from actual MIXI materials — extract phrasing and perspectives from shareholder letters, AGM notices, and integrated reports, and fold them into each agent's system_prompt. The "it sounds like us" factor goes up significantly.
- Highlight conflict points — pull the contradictory claims across the three and surface them as a separate "to-debate" section. Port the synthesizer.py idea from the reference back down into the mini.
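For instance, the first idea on the list can be as short as the block below, worded in the same style as the §5 agent specs; the CMO perspective bullets are illustrative, not taken from the workshop materials:

Add a CMO agent (key: OPENAI_API_KEY_CMO).
System prompt: "You are MIXI's CMO. Your evaluation perspective covers (1) brand fit, (2) customer acquisition cost, (3) community growth. Always respond in the specified JSON schema."
Include the CMO in Phase 1 and Phase 2, and add a fourth card to the UI.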
Troubleshooting
What tends to jam the day of, and how to handle it on the spot.
| Symptom | Cause | Fix |
|---|---|---|
| Plan mode doesn't appear | Selected model doesn't support Plan mode | Pick a model marked "Plan mode supported" (Sonnet 4.7 / Opus 4.7) |
| OpenAI AuthenticationError | .env isn't being read (source .env doesn't reach the container) | Load via python-dotenv inside app/main.py, or add env_file: .env to docker-compose.yml |
| Port 8000 already in use | Another app holding the port (usually the reference running in the background) | Change docker-compose.yml ports to 8001:8000, use localhost:8001 |
| Docker is slow to start | First-time image pull (python:3.11-slim is ~150MB) | Wait 2–3 minutes. Running docker pull python:3.11-slim ahead of time makes it instant. |
| CORS error on the frontend | Static mount misconfigured, or frontend calling a different origin | Use FastAPI's StaticFiles as app.mount("/", StaticFiles(..., html=True)) to unify origin |
| Agents run sequentially instead of in parallel | Sync for loop instead of asyncio.gather | Tell Claude "it's not parallel" — it rewrites with asyncio.gather(*tasks) |
Official links
- Claude Code Desktop Documentation
- OpenAI API Reference
- FastAPI Documentation
- Docker Compose Documentation
- Reference implementation (local): ~/Products/ai-executive — inspiration source only
- Internal: API key hygiene — the complete guide