Why-Postmortem — The Field Manual · 9 stages, one funnel, blameless by default

Editor's note

Backward-looking sibling of PM Copilot

Postmortems fail in three ways. They get skipped because writing them hurts. They stop at "human error" and miss the systemic cause. And the action items rot in a doc nobody opens again.

Why-Postmortem is a Claude Code skill wrapped in a guided web flow. Nine stages, one incident at a time. AI plays the facilitator — proposes timelines, challenges weak whys, drafts the report — but it never authors anything alone. Every stage has an accept / reject / edit gate.

The blameless guardrail is non-negotiable. The 5-Whys stage intercepts answers that mention a person by name and forces a reframe to a role, a control, or a process gap. It does this twice: once in the UI, once at the prompt layer. Defense in depth.

This is the backward-looking sibling of PM Copilot. Same machine, opposite end of the build–break–learn loop. PM Copilot stress-tests the spec before the work; Why-Postmortem stress-tests the system after the incident. They share an engine: context layer + AI-stress-tested funnel + structured artifact.

One important difference. PM Copilot's POC runs on localStorage. Why-Postmortem can't. It writes files to disk, validates against schemas, streams AI output through a typed-output layer (BAML), and round-trips with a CLI skill. Browser alone is not enough. The backend story is the second half of this manual.

The shape

Context · funnel · artifact

Shared with PM Copilot
Different direction in time
Same engine

Three ingredients, in this order. Context layer loaded once. A funnel that's hard to skip. A structured artifact at the end.

Generic prompts produce generic reports. Grounded prompts — fed your services catalog, your severity matrix, your team roster — produce reports a senior practitioner would recognise. That grounding is the context layer, and it lives in git, not in the app.

Why this works · z-dna.md

Context layer carries the senior judgment.
User just answers each stage honestly.
AI challenges, never authors.
Output is queryable (enums + schema), not just readable.
Markdown + git as substrate — no SaaS lock-in.

PM Copilot looks forward. Why-Postmortem looks back. Same engine.

Side by side — the DNA shared

Layer	PM Copilot	Why-Postmortem
Time direction	Forward — define the right product	Backward — learn from the incident
Context	Personas, triggers, success-metric rules	org · services · teams · severity · enums · glossary · history
Funnel	Why → Personas → Edges → SMART → Trace	Timeline → Fishbone → 5-Whys → Draft → Actions
Stress test	Recursive whys on problem statement	5-Whys per fishbone branch (parallel)
Hard guardrail	SMART validation	Blameless name-scrub (UI + prompt)
Output	BRD / PRD markdown	Postmortem folder + metadata.yaml

00

Context

Load org · validate · show TODOs

/postmortem:context
Read on every invocation
Lives in git

The most important stage. Generic prompts produce generic reports. Grounded prompts produce something a senior practitioner would recognise.

"Where does this org's rulebook live?"

What this stage does: reads seven files from _context/, along with the runbooks/ pointers listed below. Validates them against schemas. Lists missing required keys as TODOs. Surfaces counts (X services, Y teams, Z severity bands). Never overwrites filled values silently.

Bootstrap rules

/postmortem:init scaffolds _context/ with template files.
Each file ships with inline TODO markers.
Team fills in once, reuses forever.
Other commands warn if required context keys are missing.
If org.yaml lists SOC2/PCI/HIPAA → drafter auto-adds Compliance Impact subsection.

The seven files · plus runbooks/

org.yaml

Company name · domain · compliance regimes · ticket system · on-call tool · business hours.

used by · all phases

services.yaml

Service catalog: name · owner team · tier · SLOs · dependencies · runbook link.

timeline · fishbone · actions

teams.yaml

Team roster: name · Slack channel · on-call rotation · lead · escalation path.

actions · drafter

severity-matrix.yaml

SEV1–4 thresholds: users impacted · revenue/hr · SLO breach · compliance trigger.

new · drafter

enums.yaml

Allowed values: root_cause_category · contributing_factors · tags. The vocabulary.

metadata · pattern miner

glossary.md

Internal terms · acronyms · system codenames so AI doesn't expand "TRP" wrong.

drafter · blameless editor

history.md

Known recurring issues. Prior root causes. "Don't re-discover" list.

fishbone · whys · miner

runbooks/

Pointers to per-service runbooks. Linked from timeline mitigation steps.

timeline

Kill criterion — Missing org.yaml or services.yaml → skill refuses to run. Context is not optional.
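The refuse-or-warn behaviour of this stage can be sketched roughly as follows, assuming the context files have already been parsed into plain objects. The `REQUIRED_KEYS` map, the `name` key, and the `ContextReport` shape are hypothetical; the skill's real schemas are not shown in this manual.

```typescript
// Hypothetical shapes: the skill's actual schemas are not shown in this manual.
type ParsedContext = Record<string, Record<string, unknown> | undefined>;

interface ContextReport {
  canRun: boolean; // false when a hard-required file is absent
  todos: string[]; // "file: key" entries still to fill in
}

// Files whose absence stops the skill (from the kill criterion above).
const HARD_REQUIRED = ["org.yaml", "services.yaml"];

// Assumed required keys per file, for illustration only.
const REQUIRED_KEYS: Record<string, string[]> = {
  "org.yaml": ["name", "business_hours_tz"],
  "services.yaml": ["services"],
};

function checkContext(ctx: ParsedContext): ContextReport {
  const todos: string[] = [];
  let canRun = true;

  for (const file of HARD_REQUIRED) {
    if (ctx[file] === undefined) {
      canRun = false;
      todos.push(`${file}: file missing`);
    }
  }

  for (const [file, keys] of Object.entries(REQUIRED_KEYS)) {
    const parsed = ctx[file];
    if (parsed === undefined) continue; // nothing to inspect; hard-required files were reported above
    for (const key of keys) {
      const value = parsed[key];
      if (value === undefined || value === null || value === "") {
        todos.push(`${file}: ${key}`);
      }
    }
  }
  return { canRun, todos };
}
```

Other commands could call the same check and only warn on `todos`, while treating `canRun === false` as a hard stop.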

01·02

New incident · raw

Slug · severity · drop the artifacts

/postmortem:new <slug>
Severity auto-suggested
50 MB raw cap

Slug + date form the folder name. Severity is auto-suggested from the matrix once you enter impact fields. Services autocomplete from your catalog. Roles, not names.

"What happened, when, and to whom?"

Then drop the artifacts. Slack export. PagerDuty payload. Deploy log. Screenshots. Drag-drop into the raw box; files are stored under raw/, matching the skill's folder layout exactly. Limits: 10 MB per file, 50 MB total per incident.
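The two size caps can be enforced with a small check before anything is written. This is a sketch only: the `RawFile` shape and function name are hypothetical, and it assumes 1 MB means 1024 × 1024 bytes.

```typescript
const MB = 1024 * 1024;       // assumption: binary megabytes
const PER_FILE_CAP = 10 * MB; // from the text: 10 MB per file
const TOTAL_CAP = 50 * MB;    // from the text: 50 MB total per incident

// Hypothetical shape for a dropped file.
interface RawFile {
  name: string;
  bytes: number;
}

// Returns human-readable rejections; an empty array means the drop is accepted.
function checkRawDrop(existing: RawFile[], incoming: RawFile[]): string[] {
  const errors: string[] = [];
  for (const file of incoming) {
    if (file.bytes > PER_FILE_CAP) errors.push(`${file.name} exceeds 10 MB`);
  }
  const total = [...existing, ...incoming].reduce((sum, f) => sum + f.bytes, 0);
  if (total > TOTAL_CAP) errors.push("raw/ would exceed 50 MB for this incident");
  return errors;
}
```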

Form rules · FR-2, FR-3

Slug + date auto-form YYYY-MM-DD-<slug>.
Severity auto-suggested from severity-matrix.yaml; user overrides with reason.
Services autocomplete from services.yaml.
IC + scribe shown by role, never by name.
Detection + resolution timestamps → auto MTTD/MTTR.
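A rough sketch of the folder-name and MTTD/MTTR rules above. The slug normalisation, the use of the UTC date, and the choice to measure both metrics from the incident start are assumptions for illustration, not confirmed behaviour of the skill.

```typescript
// Builds the YYYY-MM-DD-<slug> folder name. Assumption: the date part is taken in UTC.
function incidentFolderName(date: Date, slug: string): string {
  const day = date.toISOString().slice(0, 10); // "YYYY-MM-DD"
  const cleanSlug = slug
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything else into single dashes
    .replace(/^-+|-+$/g, "");    // trim leading and trailing dashes
  return `${day}-${cleanSlug}`;
}

// Whole minutes between two instants.
function minutesBetween(from: Date, to: Date): number {
  return Math.round((to.getTime() - from.getTime()) / 60000);
}

// Assumption: both metrics are measured from the incident start time.
function detectionMetrics(start: Date, detected: Date, resolved: Date) {
  return {
    mttdMinutes: minutesBetween(start, detected),
    mttrMinutes: minutesBetween(start, resolved),
  };
}
```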

Sample severity matrix output

# entered impact
users_impacted: 7500
tier_0_service_degraded: true
duration_minutes: 75

# matched against severity-matrix.yaml →
suggested: SEV2
matched_rule: "users_impacted 1000-10000 OR tier_0_service_degraded"
override: null
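One way the match above might work, as a sketch. The rule shape, the SEV1 threshold, and the SEV4 fallback are hypothetical; the real severity-matrix.yaml format is not shown in this manual. Only the SEV2 rule text is taken from the sample.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface Impact {
  usersImpacted: number;
  tier0ServiceDegraded: boolean;
}

// Hypothetical rule shape for illustration.
interface SeverityRule {
  severity: Severity;
  label: string; // echoed back as matched_rule
  matches: (impact: Impact) => boolean;
}

// Rules are expected in order from most to least severe; the first match wins.
function suggestSeverity(impact: Impact, rules: SeverityRule[]): { suggested: Severity; matchedRule: string } {
  const hit = rules.find((rule) => rule.matches(impact));
  // Assumption: fall back to the lowest severity when nothing matches.
  return hit
    ? { suggested: hit.severity, matchedRule: hit.label }
    : { suggested: "SEV4", matchedRule: "default" };
}

const exampleRules: SeverityRule[] = [
  // The SEV1 threshold is assumed for the example.
  { severity: "SEV1", label: "users_impacted > 10000", matches: (i) => i.usersImpacted > 10000 },
  {
    severity: "SEV2",
    label: "users_impacted 1000-10000 OR tier_0_service_degraded",
    matches: (i) => (i.usersImpacted >= 1000 && i.usersImpacted <= 10000) || i.tier0ServiceDegraded,
  },
];
```

A user override would sit on top of this result and carry its own reason, leaving the suggestion itself untouched.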

Kill criterion — No raw materials → AI has nothing to build a timeline from. Don't skip this.

03

Timeline

Raw left · AI right · roles not names

/postmortem:timeline
Streaming · gap detection
Writes timeline.md

Two-pane working surface. Raw materials on the left. AI-generated chronology on the right, streaming. Inline edits stick. Re-runs preserve human edits.

"What happened, in what order, with what gap?"

What the AI does: reads your raw drop and builds a chronology in markdown. Highlights decision points. Flags any stretch of 10 minutes or more with no recorded activity during the active incident; those are gaps the team will want to explain.

Hard rules · FR-4

Roles only. Never personal names. "On-call SRE", not "Sarah".
Timestamps in business_hours_tz from org.yaml.
Gap flag = ≥ 10 min of silence during active state.
Inline edits preserved on re-run unless user explicitly resets.
Mitigation steps link to runbooks/ when possible.
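The gap rule above is simple enough to sketch directly. The `TimelineEvent` and `Gap` shapes are hypothetical, and the sketch assumes the caller passes only events that fall inside the active incident window.

```typescript
// Hypothetical shapes for illustration.
interface TimelineEvent {
  at: Date;
  text: string;
}

interface Gap {
  from: Date;
  to: Date;
  minutes: number;
}

const GAP_THRESHOLD_MIN = 10; // from the hard rules: 10 minutes or more of silence

// Returns every pair of consecutive events separated by at least the threshold.
function findGaps(events: TimelineEvent[]): Gap[] {
  const sorted = [...events].sort((a, b) => a.at.getTime() - b.at.getTime());
  const gaps: Gap[] = [];
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const next = sorted[i];
    const minutes = (next.at.getTime() - prev.at.getTime()) / 60000;
    if (minutes >= GAP_THRESHOLD_MIN) {
      gaps.push({ from: prev.at, to: next.at, minutes });
    }
  }
  return gaps;
}
```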

Sample timeline output

## Timeline (America/New_York)

- 14:02 · Deploy SHA `a1f3b2c` to checkout (tier-0). On-call SRE confirms green.
- 14:09 · Synthetic monitor `/api/checkout` → p99 latency climbs from 180ms → 1.2s.
- 14:11 · PagerDuty fires SEV2. On-call SRE acks.
- 14:14 · 5xx rate on checkout reaches 4.3%. Customer support tickets begin queueing.
- 14:26 · [GAP — 12 min] No action recorded.
- 14:38 · Incident commander joins. Rollback initiated.
- 15:17 · Deploy reverted. 5xx returns to baseline.

Kill criterion — A gap with no explanation in the report is a missing on-call follow-up. Surface it now.

04

Fishbone

Six categories · accept · reject · edit

/postmortem:fishbone
2–4 candidates per branch
Enum-mapped causes

Six fixed columns. AI proposes 2–4 candidate causes per column with quoted evidence from the timeline. You accept, reject (with reason), or edit. Accepted causes become branches for the 5-Whys stage.

"Across all six categories, what could have let this happen?"

Six categories

People — roles, on-call coverage, escalation.
Process — change mgmt, runbook gaps, comms.
Tooling — observability, deploy infra, alerting.
Code — bugs, regressions, missing tests.
Infra — capacity, dependencies, configuration drift.
External — vendors, upstream APIs, customer behavior.

Sample fishbone candidates (for the deploy 5xx incident)

Process

"No canary stage between green-light and 100% rollout."

evidence: timeline 14:02 → 14:09. maps to: change_mgmt_gap

Tooling

"Synthetic monitor caught it but page took 2 min to fire."

evidence: 14:09 → 14:11. maps to: alert_latency

Process

"12 min gap with no incident commander assigned."

evidence: 14:26 → 14:38. maps to: ic_handoff_unclear · recurrence flag from history.md

Recurrence flag from history.md = "you've seen this before".

Kill criterion — All six categories empty is suspicious. At least People + Process should always have a candidate.
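The accept / reject / edit gate and the kill criterion could be modelled roughly like this. All type and field names are hypothetical; only the six category names and the three decisions come from the text.

```typescript
const CATEGORIES = ["People", "Process", "Tooling", "Code", "Infra", "External"] as const;
type Category = (typeof CATEGORIES)[number];

// The three human decisions; a rejection must carry a reason.
type Decision =
  | { kind: "accept" }
  | { kind: "reject"; reason: string }
  | { kind: "edit"; text: string };

// Hypothetical candidate shape.
interface Candidate {
  category: Category;
  text: string;
  evidence: string; // quoted timeline evidence
  mapsTo: string;   // enum value from enums.yaml
}

// Applies one decision; returns the branch to carry into 5-Whys, or null if rejected.
function applyDecision(candidate: Candidate, decision: Decision): Candidate | null {
  switch (decision.kind) {
    case "accept":
      return candidate;
    case "edit":
      return { ...candidate, text: decision.text };
    case "reject":
      if (decision.reason.trim() === "") throw new Error("reject requires a reason");
      return null;
  }
}

// Categories named in the kill criterion that still lack a candidate.
function missingCoreCategories(candidates: Candidate[]): Category[] {
  const core: Category[] = ["People", "Process"];
  return core.filter((c) => !candidates.some((cand) => cand.category === c));
}
```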

05

5-Whys

Per branch · blameless · systemic stop

/postmortem:whys
Typed BAML output · WhysTurn
NEXT_WHY · REFRAME · SYSTEMIC · TOO_DEEP

One thread per accepted fishbone branch. Conversational: you answer, AI challenges, AI asks the next "why". Lands on an enum-tagged systemic cause. Stops when systemic — not when the chain reaches five.

"Why did the system let this happen — keep going until you hit a control gap?"

The blameless guardrail. Before your answer is submitted, a UI heuristic blocks any answer containing a person's name, an @mention, or a known roster name from teams.yaml, and prompts you to reframe to a role or system gap. The server-side prompt then re-checks the answer. Defense in depth.

Hard rules · FR-6

One thread per accepted fishbone branch.
UI-layer name detector: capitalised first-name token · @mention · roster match.
Blocked submit → reframe prompt ("what role / control gap allowed this?").
AI emits NEXT_WHY · REFRAME · SYSTEMIC · TOO_DEEP via typed output.
Terminate on systemic (process gap, missing control, design choice).
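A minimal sketch of the UI-layer name detector, covering the three signals listed above. The first-name list is a tiny hypothetical stand-in, and the roster is assumed to be a flat list of full names read from teams.yaml. The heuristic deliberately errs toward over-blocking, since a false positive only costs a reframe and the prompt layer re-checks anyway.

```typescript
interface NameCheck {
  blocked: boolean;
  reasons: string[];
}

// Hypothetical first-name list; a real detector would need a much larger one.
const COMMON_FIRST_NAMES = new Set(["Sarah", "James", "Priya", "Wei"]);

function detectNames(answer: string, rosterNames: string[]): NameCheck {
  const reasons: string[] = [];

  // 1. @mentions such as "@sarah" (start of text or after whitespace, so emails are skipped).
  if (/(^|\s)@\w+/.test(answer)) reasons.push("@mention");

  // 2. Capitalised tokens that appear in the first-name list.
  const capitalised = answer.match(/\b[A-Z][a-z]+\b/g) ?? [];
  if (capitalised.some((token) => COMMON_FIRST_NAMES.has(token))) {
    reasons.push("first-name token");
  }

  // 3. Any part of a roster name appearing as a whole word, case-insensitive.
  const words = new Set(answer.toLowerCase().match(/[a-z']+/g) ?? []);
  const rosterHit = rosterNames.some((full) =>
    full
      .toLowerCase()
      .split(/\s+/)
      .some((part) => part.length > 0 && words.has(part)),
  );
  if (rosterHit) reasons.push("roster match");

  return { blocked: reasons.length > 0, reasons };
}
```

When `blocked` is true, the submit handler would show the reframe prompt instead of sending the answer.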

Sample 5-Whys thread (blameless reframe in action)

Branch: "12 min gap with no incident commander"

Why #1. Why was no incident commander assigned during the 12-minute window?
→ Sarah was on call but didn't see the page.
⚠ blocked · personal name detected · please reframe to a role / control
Why #1 (reframed). Why was no incident commander assigned during the 12-minute window?
→ The primary on-call SRE missed the page; the escalation policy waits 15 min before firing the secondary.
Why #2. Why does the escalation policy wait 15 min before firing the secondary on-call?
→ Original config from 2 years ago, never re-tuned after we cut on-call rotation from 3 → 2 engineers.
SYSTEMIC. Process gap: escalation timeouts are not reviewed when on-call topology changes. Maps to: missing_periodic_review.

Kill criterion — A chain that stops at a person is not done. The system must be the answer.
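The four turn kinds map naturally onto a discriminated union. The sketch below is not the BAML-generated type: the field names and the `Thread` shape are assumed, and TOO_DEEP is read here as "the chain has gone past an actionable cause", which the manual does not spell out.

```typescript
// Assumed field names; only the four turn kinds come from the text.
type WhysTurn =
  | { kind: "NEXT_WHY"; question: string }
  | { kind: "REFRAME"; reason: string }
  | { kind: "SYSTEMIC"; cause: string; mapsTo: string }
  | { kind: "TOO_DEEP"; note: string };

// Hypothetical per-branch thread state.
interface Thread {
  branch: string;
  questions: string[];
  status: "open" | "systemic" | "stopped";
  rootCause?: string; // enum value once the thread lands on a systemic cause
}

function applyTurn(thread: Thread, turn: WhysTurn): Thread {
  switch (turn.kind) {
    case "NEXT_WHY":
      return { ...thread, questions: [...thread.questions, turn.question] };
    case "REFRAME":
      // The same question stays open; the user answers again without naming a person.
      return thread;
    case "SYSTEMIC":
      // Termination depends on reaching a systemic cause, not on the number of whys.
      return { ...thread, status: "systemic", rootCause: turn.mapsTo };
    case "TOO_DEEP":
      return { ...thread, status: "stopped" };
  }
}
```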

06

Draft · blameless audit

Auto-assemble · diff · typed audit

/postmortem:draft
+ /postmortem:blameless-check
HIGH violations block save

The drafter auto-assembles the report from stages 03–05 using the skill's _template/report.md. Eight sections in canonical order. Streaming markdown, rendered preview, toggle to raw textarea.

"Is the root cause a control gap — or did we just rename the person?"

Inline blameless audit. Runs a typed BAML check returning a list of violations: severity (HIGH/MED/LOW) · kind (name / blame-language / passive-voice gap) · location (line ranges) · original · suggested fix · why. Per-violation: apply suggested · mark fixed · override with reason. HIGH violations block save.

Eight report sections

Executive Summary
Impact
Timeline
Root Causes (plural)
Contributing Factors
Lessons Learned
Action Items
Appendix / raw links

Sample blameless audit output

{
  "violations": [
    {
      "severity": "HIGH",
      "kind": "personal_name",
      "location": { "section": "root_causes", "line": 4 },
      "original": "Sarah forgot to update the escalation policy.",
      "suggested": "The escalation policy was not reviewed after the on-call topology change.",
      "why": "People appear by role only, even in the timeline. Root cause must be systemic."
    },
    {
      "severity": "MED",
      "kind": "blame_language",
      "original": "The team failed to...",
      "suggested": "The process did not surface..."
    }
  ]
}

Kill criterion — Any HIGH violation unresolved → save blocked. Names never reach the root cause section.
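The save gate can be sketched as a pure function over the audit result. The `resolution` and `overrideReason` fields are hypothetical additions for tracking the per-violation choice, and the sketch assumes that an override with a non-empty reason counts as resolving a HIGH violation.

```typescript
type AuditSeverity = "HIGH" | "MED" | "LOW";
type Resolution = "open" | "applied" | "fixed" | "overridden";

interface Violation {
  severity: AuditSeverity;
  kind: string;
  original: string;
  suggested: string;
  resolution: Resolution;  // hypothetical: which per-violation action was taken
  overrideReason?: string; // required when resolution is "overridden"
}

function isResolved(v: Violation): boolean {
  if (v.resolution === "overridden") return (v.overrideReason ?? "").trim() !== "";
  return v.resolution !== "open";
}

// Save is allowed only when no HIGH violation is left unresolved.
function canSaveDraft(violations: Violation[]): boolean {
  return violations.every((v) => v.severity !== "HIGH" || isResolved(v));
}
```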

07·08

Actions · Export

Tickets · close-out · round-trip with CLI

/postmortem:actions
SEV-based due dates
Folder shape ≡ CLI skill

Stage 07 parses the action items table from the draft's section 7. Each row becomes editable: ID · title · owner (role from teams.yaml) · due date (default by severity SLA) · what it addresses (enum from enums.yaml) · status.

Stage 08 exports the folder. Identical shape to a skill-authored postmortem. A CLI user can pull the export and continue with any /postmortem:* command — no migration needed.

Due-date defaults by severity

SEV1 — within 14 days
SEV2 — within 30 days
SEV3 — within 60 days
SEV4 — within 90 days

Finalize → flips status: closed, rebuilds postmortems/README.md index.
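The due-date defaults above reduce to a lookup plus date arithmetic. A sketch, assuming the window is counted in calendar days from the incident date (the manual does not say whether the clock starts at the incident or at finalize):

```typescript
type Sev = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Day counts taken from the list above.
const DUE_IN_DAYS: Record<Sev, number> = { SEV1: 14, SEV2: 30, SEV3: 60, SEV4: 90 };

// Takes and returns "YYYY-MM-DD"; works in UTC so daylight-saving shifts cannot move the date.
function defaultDueDate(incidentDate: string, severity: Sev): string {
  const [year, month, day] = incidentDate.split("-").map(Number);
  // Date.UTC normalises day overflow, so day + 30 rolls into the next month correctly.
  const due = new Date(Date.UTC(year, month - 1, day + DUE_IN_DAYS[severity]));
  return due.toISOString().slice(0, 10);
}
```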

Final folder shape (round-trips with CLI)

postmortems/2026-05-08-checkout-5xx-fraud-redos/
├── metadata.yaml          # typed: severity, services, MTTR, root_cause enums
├── timeline.md            # stage 03 output
├── fishbone.md            # stage 04 output
├── whys.md                # stage 05 — one section per accepted branch
├── report.md              # stage 06 — blameless-audited
├── actions.json           # stage 07 sidecar — ticket IDs once created
└── raw/
    ├── slack.txt
    ├── deploy.log
    └── pagerduty.json

Definition of done — /postmortem:blameless-check on the exported folder returns clean. README.md index regenerated. Status closed.

Why this needs a backend

Six structural reasons

PM Copilot ran in localStorage
This one can't
Reasons below

PM Copilot ships as a forward-looking spec tool. localStorage holds the doc. No server-side state, no file writes, no schema validation against a vocabulary. Why-Postmortem is different in six ways. Each one drags it out of the browser.

01

The artifact is a folder, not a JSON blob.

Output round-trips with the CLI skill: exact folder shape, multiple files (report.md, timeline.md, fishbone.md, whys.md, metadata.yaml, actions.json, raw/*). Browser file APIs can't write folders to arbitrary paths.

Browser-only: impossible. Backend writes the folder.

02

Prompts are version-controlled in the skill, not in the app.

Single source of truth = _prompts/*.md in the skill repo. Backend reads them at runtime, substitutes {{slug}} + {{tz}}, appends incident artifacts + context, passes to BAML. Re-hardcoding them client-side guarantees drift.

Browser-only: drift inevitable. Backend keeps parity.
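The substitution step could look something like this. It is a sketch, not the backend's actual code: the function name is hypothetical, and throwing on an unknown placeholder is an assumed design choice so that a typo in a prompt file fails loudly rather than reaching the model.

```typescript
// Replaces {{name}} placeholders with values from vars.
function renderPrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match, name: string) => {
    const value = vars[name];
    if (value === undefined) throw new Error(`missing prompt variable: ${name}`);
    return value;
  });
}
```

For example, `renderPrompt("Incident {{slug}}, times in {{tz}}", { slug: "checkout-5xx", tz: "America/New_York" })` yields `"Incident checkout-5xx, times in America/New_York"`.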

03

Typed AI output via BAML.

Stage 05 needs WhysTurn (NEXT_WHY / REFRAME / SYSTEMIC / TOO_DEEP). Stage 06 needs BlamelessAuditResult (typed violations). BAML generates a typed client + validator; runs server-side. No browser equivalent without exposing keys.

Browser-only: leaks keys. Backend keeps BYOK safe.

04

Schema validation against `enums.yaml`.

Every root cause + contributing factor must match an allowed enum. metadata.yaml must validate against the skill's schema before save. Validation logic lives once on the server; not duplicated and de-synced in the SPA.

Browser-only: validation drift. Backend = one source.
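A sketch of the enum check, assuming enums.yaml has already been parsed. The `root_cause_category` and `contributing_factors` keys come from the enums.yaml description earlier in this manual; the metadata field names are hypothetical, since the real metadata.yaml schema is not shown.

```typescript
interface AllowedEnums {
  root_cause_category: string[];
  contributing_factors: string[];
}

// Hypothetical slice of metadata.yaml for illustration.
interface IncidentMetadata {
  root_cause_category: string[];
  contributing_factors: string[];
}

// Returns one message per value that is not in the allowed vocabulary.
function validateEnums(meta: IncidentMetadata, allowed: AllowedEnums): string[] {
  const errors: string[] = [];
  const fields = ["root_cause_category", "contributing_factors"] as const;
  for (const field of fields) {
    for (const value of meta[field]) {
      if (!allowed[field].includes(value)) {
        errors.push(`${field}: "${value}" is not in enums.yaml`);
      }
    }
  }
  return errors;
}
```

A save route would refuse to write metadata.yaml whenever the returned list is non-empty.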

05

The blameless guardrail is defense-in-depth.

UI heuristic blocks names client-side. Prompt re-checks server-side. If the UI is bypassed (devtools, paste, copy-edit), the prompt layer still intercepts. The audit re-runs server-side on save. Two layers — neither is optional.

Browser-only: single point of failure. Backend = defense in depth.

06

BYOK + provider swap.

Default provider = OpenAI. Swap to Anthropic by editing one BAML client file + npm run baml:generate. No frontend change. API keys live in .env, never in the browser. Keys + provider routing belong on the server.

Browser-only: keys in the wild. Backend = keys stay home.

PM Copilot can be a static page. Why-Postmortem cannot.

Stack & API surface

Vite + React + Express + BAML

Frontend :5173
API :3001
BAML for typed prompts

Layer 01 — frontend

Vite + React + TS

Stage rail · two-pane working surface
:5173 dev server
Proxies /api/* → :3001
Streaming UI: fetch → reader → render
Brand tokens shared with PM Copilot
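The fetch → reader → render loop can be sketched with the standard Streams API. The function name and the `onChunk` callback are hypothetical; in the SPA the callback would set React state, and the body would come from a streaming route such as /api/timeline/generate.

```typescript
// Reads a streamed response body and reports the accumulated text after each chunk.
async function streamText(
  body: ReadableStream<Uint8Array>,
  onChunk: (textSoFar: string) => void,
): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let text = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    text += decoder.decode(value, { stream: true });
    onChunk(text);
  }
  text += decoder.decode(); // flush any buffered bytes
  return text;
}
```

Passing the accumulated text rather than the delta keeps the render side trivial: each call simply replaces the preview content.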

Layer 02 — API

Express on :3001

Reads _context/ on every call
Writes incident folder to disk
Loads _prompts/*.md from skill repo
Substitutes vars, calls BAML
Streams markdown back · saves on commit

Layer 03 — AI

BAML · typed prompts

Generate(prompt) for free-form markdown
WhysTurn typed for stage 05
BlamelessAuditResult typed for stage 06
Provider swap via one client file
BYOK in .env

API surface — 19 routes

Method	Path	Purpose
GET	/api/health	Liveness
POST	/api/context/validate	Read + validate `_context/`
POST	/api/context/data	Parsed services / teams / severity / enums
POST	/api/severity/suggest	Match impact inputs against matrix
POST	/api/incident/create	Write `metadata.yaml` skeleton + folder
POST	/api/incident/raw	Write `raw/` files (utf8 / base64)
GET	/api/incident/files	List incident folder for export view
POST	/api/timeline/generate	Streaming markdown
POST	/api/timeline/save	Write `timeline.md`
POST	/api/fishbone/generate	Streaming markdown
POST	/api/fishbone/save	Write `fishbone.md`
POST	/api/whys/next	Typed `WhysTurn` per round
POST	/api/whys/save	Write `whys.md`
POST	/api/draft/generate	Streaming full report
POST	/api/draft/save	Write `report.md`
POST	/api/blameless/check	Typed `BlamelessAuditResult`
POST	/api/actions/parse	Extract table from report § 7
POST	/api/actions/save	Sidecar + metadata patch + report refresh
POST	/api/export/finalize	Status closed · rebuild README index

Deploy reality

What survives CF Pages · what doesn't

CF Pages = static + Functions
Backend = Workers / Fly / Render
This manual = static. Ship it.

The webapp is local-only by design. The author runs npm run dev on a laptop. Hosting the full POC needs more than a static CDN. But not everything is dynamic.

Ships on CF Pages today

Static only · drop & go

This why-postmortem-magazine.html
one-pager.html · printable
infographic.html · POC overview
deck.html · pitch slides
demo-video.html + .mp4
All brd-*.md rendered via any static MD host
Marketing surfaces — landing, OG card, launch post

Needs runtime — not CF Pages

Express + BAML + file I/O

webapp/ — Vite SPA needs paired backend
webapp/server/ — Express on :3001
BAML runtime + provider client
File-system writes (raw/, report.md, etc.)
_context/ file reads
YAML schema validation
BYOK secret storage

Three credible deploy paths

Path A · marketing only

Ship the magazine + landing pages to CF Pages. Keep webapp/ local. Demo via screencast.

Cost: free · risk: none · gets the story out today.

Path B · CF Pages + Workers

Convert Express routes to functions/api/*.ts. Use R2 for raw artifacts. KV for context cache. BAML calls via Worker secrets.

Cost: ~free at POC scale · risk: rewrite + cold starts.

Path C · Pages + Fly/Render API

SPA on CF Pages. Express on Fly.io or Render with a persistent volume for incident folders. BAML server-side.

Cost: $5/mo · risk: lowest, matches local-dev shape exactly.

Ship the magazine today. Decide on the runtime later.

Recommended for this POC — Path A

# prep folder
mkdir -p dist-why-postmortem
cp why-postmortem-magazine.html  dist-why-postmortem/index.html
cp one-pager.html infographic.html deck.html demo-video.html  dist-why-postmortem/
cp demo-video.mp4 favicon.svg why-postmortem.png            dist-why-postmortem/

# deploy
npx wrangler@latest pages deploy dist-why-postmortem \
  --project-name why-postmortem-checks \
  --branch main

# →  https://why-postmortem-checks.pages.dev

Static now, dynamic later — The story doesn't need the runtime. The runtime needs the story.