Issue 01 · 2026 · 9 stages · blameless by default · Backend required · BYOK

Stress-test the system, not the person.

From raw chat to committed report in 60 minutes. Nine stages. One funnel. AI as facilitator, never sole author. Names never reach the root cause.

p. 02 · editor's note

Editor's note

Postmortems fail in three ways. They get skipped because writing them hurts. They stop at "human error" and miss the systemic cause. And the action items rot in a doc nobody opens again.

Why-Postmortem is a Claude Code skill wrapped in a guided web flow. Nine stages, one incident at a time. AI plays the facilitator — proposes timelines, challenges weak whys, drafts the report — but it never authors anything alone. Every stage has an accept / reject / edit gate.

The blameless guardrail is non-negotiable. The 5-Whys stage intercepts answers that mention a person by name and forces a reframe to a role, a control, a process gap. Twice — once in the UI, once at the prompt layer. Defense in depth.

This is the backward-looking sibling of PM Copilot. Same machine, opposite end of the build–break–learn loop. PM Copilot stress-tests the spec before the work; Why-Postmortem stress-tests the system after the incident. They share an engine: context layer + AI-stress-tested funnel + structured artifact.

One important difference. PM Copilot's POC runs on localStorage. Why-Postmortem can't. It writes files to disk, validates against schemas, streams AI output through a typed-output layer (BAML), and round-trips with a CLI skill. Browser alone is not enough. The backend story is the second half of this manual.

p. 03 · the shape

The shape

Context · funnel · artifact
Shared with PM Copilot
Different direction in time
Same engine

Three ingredients, in this order. Context layer loaded once. A funnel that's hard to skip. A structured artifact at the end.

Generic prompts produce generic reports. Grounded prompts — fed your services catalog, your severity matrix, your team roster — produce reports a senior practitioner would recognise. That grounding is the context layer, and it lives in git, not in the app.

Why this works · z-dna.md

  1. Context layer carries the senior judgment.
  2. User just answers each stage honestly.
  3. AI challenges, never authors.
  4. Output is queryable (enums + schema), not just readable.
  5. Markdown + git as substrate — no SaaS lock-in.
PM Copilot looks forward. Why-Postmortem looks back. Same engine.

Side by side: the shared DNA

| Layer | PM Copilot | Why-Postmortem |
| --- | --- | --- |
| Time direction | Forward: define the right product | Backward: learn from the incident |
| Context | Personas, triggers, success-metric rules | org · services · teams · severity · enums · glossary · history |
| Funnel | Why → Personas → Edges → SMART → Trace | Timeline → Fishbone → 5-Whys → Draft → Actions |
| Stress test | Recursive whys on problem statement | 5-Whys per fishbone branch (parallel) |
| Hard guardrail | SMART validation | Blameless name-scrub (UI + prompt) |
| Output | BRD / PRD markdown | Postmortem folder + metadata.yaml |
p. 05 · stage 00
00

Context

Load org · validate · show TODOs
/postmortem:context
Read on every invocation
Lives in git

The most important stage. Generic prompts produce generic reports. Grounded prompts produce something a senior practitioner would recognise.

"Where does this org's rulebook live?"

What this stage does: reads seven files from _context/. Validates them against schemas. Lists missing required keys as TODOs. Surfaces counts (X services, Y teams, Z severity bands). Never overwrites filled values silently.

Bootstrap rules

  1. /postmortem:init scaffolds _context/ with template files.
  2. Each file ships with inline TODO markers.
  3. Team fills in once, reuses forever.
  4. Other commands warn if required context keys are missing.
  5. If org.yaml lists SOC2/PCI/HIPAA → drafter auto-adds Compliance Impact subsection.

The seven files, plus runbooks/

org.yaml
Company name · domain · compliance regimes · ticket system · on-call tool · business hours.
used by · all phases

services.yaml
Service catalog: name · owner team · tier · SLOs · dependencies · runbook link.
used by · timeline · fishbone · actions

teams.yaml
Team roster: name · Slack channel · on-call rotation · lead · escalation path.
used by · actions · drafter

severity-matrix.yaml
SEV1–4 thresholds: users impacted · revenue/hr · SLO breach · compliance trigger.
used by · new · drafter

enums.yaml
Allowed values: root_cause_category · contributing_factors · tags. The vocabulary.
used by · metadata · pattern miner

glossary.md
Internal terms · acronyms · system codenames so AI doesn't expand "TRP" wrong.
used by · drafter · blameless editor

history.md
Known recurring issues. Prior root causes. "Don't re-discover" list.
used by · fishbone · whys · miner

runbooks/
Pointers to per-service runbooks. Linked from timeline mitigation steps.
used by · timeline
Kill criterion — Missing org.yaml or services.yaml → skill refuses to run. Context is not optional.
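
The checks in this stage could be sketched roughly as below. This is a minimal sketch, not the skill's implementation: it assumes the context files are already parsed into plain objects, and the required-key lists, `validateContext`, and `ContextReport` are hypothetical names for illustration. Only the two mandatory filenames come from the kill criterion.

```typescript
// Parsed YAML per context file, keyed by filename; undefined means the file is absent.
type ParsedContext = { [file: string]: { [key: string]: unknown } | undefined };

// Hypothetical required keys. The real lists live in the skill's schemas.
const REQUIRED_KEYS: { [file: string]: string[] } = {
  "org.yaml": ["company", "domain", "compliance"],
  "services.yaml": ["services"],
  "teams.yaml": ["teams"],
};

// From the kill criterion: without these two files the skill refuses to run.
const MANDATORY_FILES = ["org.yaml", "services.yaml"];

interface ContextReport {
  canRun: boolean;
  missingFiles: string[];
  todos: string[]; // "file: key" entries for required keys that are missing or empty
}

function validateContext(ctx: ParsedContext): ContextReport {
  const missingFiles = Object.keys(REQUIRED_KEYS).filter((f) => ctx[f] === undefined);
  const todos: string[] = [];
  Object.keys(REQUIRED_KEYS).forEach((file) => {
    const parsed = ctx[file];
    if (parsed === undefined) return; // already listed in missingFiles
    REQUIRED_KEYS[file].forEach((key) => {
      const value = parsed[key];
      if (value === undefined || value === null || value === "") {
        todos.push(file + ": " + key);
      }
    });
  });
  const canRun = MANDATORY_FILES.every((f) => ctx[f] !== undefined);
  return { canRun, missingFiles, todos };
}
```

The function only reads and reports. It never writes back, which matches the rule that filled values are never overwritten silently.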
p. 08 · stages 01–02
01·02

New incident · raw

Slug · severity · drop the artifacts
/postmortem:new <slug>
Severity auto-suggested
50 MB raw cap

Slug + date form the folder name. Severity is auto-suggested from the matrix once you enter impact fields. Services autocomplete from your catalog. Roles, not names.

"What happened, when, and to whom?"

Then drop the artifacts. Slack export. PagerDuty payload. Deploy log. Screenshots. Drag-drop into the raw box; files are stored under raw/, matching the skill's folder layout exactly. 10 MB per file, 50 MB total per incident.

Form rules · FR-2, FR-3

  1. Slug + date auto-form YYYY-MM-DD-<slug>.
  2. Severity auto-suggested from severity-matrix.yaml; user overrides with reason.
  3. Services autocomplete from services.yaml.
  4. IC + scribe shown by role, never by name.
  5. Detection + resolution timestamps → auto MTTD/MTTR.
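
Rules 1 and 5 above can be sketched as two small helpers. This is an illustration under stated assumptions: the function names are hypothetical, the date is read in UTC, and MTTD and MTTR are both measured from an incident start timestamp, which the form rules do not spell out.

```typescript
// Zero-pad to two digits.
function pad2(n: number): string {
  return n < 10 ? "0" + n : String(n);
}

// Rule 1: folder name is YYYY-MM-DD-<slug>. Reading the date in UTC is an assumption.
function incidentFolder(date: Date, slug: string): string {
  const day =
    date.getUTCFullYear() + "-" + pad2(date.getUTCMonth() + 1) + "-" + pad2(date.getUTCDate());
  return day + "-" + slug;
}

// Whole minutes between two instants.
function minutesBetween(from: Date, to: Date): number {
  return Math.round((to.getTime() - from.getTime()) / 60000);
}

interface IncidentTimes {
  startedAt: Date;  // assumed extra input: when the incident began
  detectedAt: Date;
  resolvedAt: Date;
}

// Rule 5. Assumption: MTTD = start to detection, MTTR = start to resolution.
function mttdMttr(t: IncidentTimes): { mttdMinutes: number; mttrMinutes: number } {
  return {
    mttdMinutes: minutesBetween(t.startedAt, t.detectedAt),
    mttrMinutes: minutesBetween(t.startedAt, t.resolvedAt),
  };
}
```

If your org measures MTTR from detection rather than from start, only the second line of `mttdMttr` changes.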

Sample severity matrix output

# entered impact
users_impacted: 7500
tier_0_service_degraded: true
duration_minutes: 75

# matched against severity-matrix.yaml →
suggested: SEV2
matched_rule: "users_impacted 1000-10000 OR tier_0_service_degraded"
override: null
Kill criterion — No raw materials → AI has nothing to build a timeline from. Don't skip this.
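
The severity suggestion shown in the sample could work along these lines. A minimal sketch: only the SEV2 rule mirrors the sample output; the SEV1 and SEV3 bands, the fallback to SEV4, and all type and function names are invented for illustration. A real matrix would also weigh revenue/hr, SLO breach, and compliance triggers.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

interface ImpactInput {
  usersImpacted: number;
  tier0ServiceDegraded: boolean;
}

interface SeverityRule {
  severity: Severity;
  description: string;
  matches: (i: ImpactInput) => boolean;
}

// Hypothetical matrix, ordered most severe first. Only the SEV2 rule comes from the sample.
const SEVERITY_RULES: SeverityRule[] = [
  {
    severity: "SEV1",
    description: "users_impacted > 10000",
    matches: (i) => i.usersImpacted > 10000,
  },
  {
    severity: "SEV2",
    description: "users_impacted 1000-10000 OR tier_0_service_degraded",
    matches: (i) =>
      (i.usersImpacted >= 1000 && i.usersImpacted <= 10000) || i.tier0ServiceDegraded,
  },
  {
    severity: "SEV3",
    description: "users_impacted 100-999",
    matches: (i) => i.usersImpacted >= 100 && i.usersImpacted < 1000,
  },
];

// First matching rule wins; nothing matching falls through to SEV4 (an assumption).
function suggestSeverity(input: ImpactInput): { suggested: Severity; matchedRule: string | null } {
  for (let k = 0; k < SEVERITY_RULES.length; k++) {
    if (SEVERITY_RULES[k].matches(input)) {
      return { suggested: SEVERITY_RULES[k].severity, matchedRule: SEVERITY_RULES[k].description };
    }
  }
  return { suggested: "SEV4", matchedRule: null };
}
```

Returning the matched rule alongside the suggestion is what lets the user see why a severity was proposed before overriding it with a reason.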
p. 10 · stage 03
03

Timeline

Raw left · AI right · roles not names
/postmortem:timeline
Streaming · gap detection
Writes timeline.md

Two-pane working surface. Raw materials on the left. AI-generated chronology on the right, streaming. Inline edits stick. Re-runs preserve human edits.

"What happened, in what order, with what gap?"

What the AI does: reads your raw drop, builds a chronology in markdown. Highlights decision points. Flags any window ≥ 10 min of inactivity during the active incident — those are gaps the team will want to explain.

Hard rules · FR-4

  1. Roles only. Never personal names. "On-call SRE", not "Sarah".
  2. Timestamps in business_hours_tz from org.yaml.
  3. Gap flag = ≥ 10 min of silence during active state.
  4. Inline edits preserved on re-run unless user explicitly resets.
  5. Mitigation steps link to runbooks/ when possible.
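
Rule 3 is simple enough to sketch directly. The snippet below assumes events are already parsed into minutes since midnight in the org timezone and that every event passed in falls inside the active incident window; `TimelineEvent`, `Gap`, and `findGaps` are hypothetical names.

```typescript
interface TimelineEvent {
  minute: number; // minutes since midnight, org timezone (assumed representation)
  text: string;
}

interface Gap {
  fromMinute: number;
  toMinute: number;
  lengthMinutes: number;
}

// Flag every pair of consecutive events separated by at least the threshold.
function findGaps(events: TimelineEvent[], thresholdMinutes: number = 10): Gap[] {
  const sorted = events.slice().sort((a, b) => a.minute - b.minute);
  const gaps: Gap[] = [];
  for (let k = 1; k < sorted.length; k++) {
    const lengthMinutes = sorted[k].minute - sorted[k - 1].minute;
    if (lengthMinutes >= thresholdMinutes) {
      gaps.push({
        fromMinute: sorted[k - 1].minute,
        toMinute: sorted[k].minute,
        lengthMinutes: lengthMinutes,
      });
    }
  }
  return gaps;
}
```

The comparison is `>=`, so a silence of exactly 10 minutes is flagged, matching the "≥ 10 min" wording.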

Sample timeline output

## Timeline (America/New_York)

- 14:02 · Deploy SHA `a1f3b2c` to checkout (tier-0). On-call SRE confirms green.
- 14:09 · Synthetic monitor `/api/checkout` → p99 latency climbs from 180ms → 1.2s.
- 14:11 · PagerDuty fires SEV2. On-call SRE acks.
- 14:14 · 5xx rate on checkout reaches 4.3%. Customer support tickets begin queueing.
- 14:26 · [GAP — 12 min] No action recorded.
- 14:38 · Incident commander joins. Rollback initiated.
- 15:17 · Deploy reverted. 5xx returns to baseline.
Kill criterion — A gap with no explanation in the report is a missing on-call follow-up. Surface it now.
p. 12 · stage 04
04

Fishbone

Six categories · accept · reject · edit
/postmortem:fishbone
2–4 candidates per branch
Enum-mapped causes

Six fixed columns. AI proposes 2–4 candidate causes per column with quoted evidence from the timeline. You accept, reject (with reason), or edit. Accepted causes become branches for the 5-Whys stage.

"Across all six categories, what could have let this happen?"

Six categories

  1. People — roles, on-call coverage, escalation.
  2. Process — change mgmt, runbook gaps, comms.
  3. Tooling — observability, deploy infra, alerting.
  4. Code — bugs, regressions, missing tests.
  5. Infra — capacity, dependencies, configuration drift.
  6. External — vendors, upstream APIs, customer behavior.

Sample fishbone candidates (for the deploy 5xx incident)

Process

"No canary stage between green-light and 100% rollout."

evidence: timeline 14:02 → 14:09. maps to: change_mgmt_gap

Tooling

"Synthetic monitor caught it but page took 2 min to fire."

evidence: 14:09 → 14:11. maps to: alert_latency

Process

"12 min gap with no incident commander assigned."

evidence: 14:26 → 14:38. maps to: ic_handoff_unclear · recurrence flag from history.md

Recurrence flag from history.md = "you've seen this before".
Kill criterion — All six categories empty is suspicious. At least People + Process should always have a candidate.
p. 14 · stage 05
05

5-Whys

Per branch · blameless · systemic stop
/postmortem:whys
Typed BAML output · WhysTurn
NEXT_WHY · REFRAME · SYSTEMIC · TOO_DEEP

One thread per accepted fishbone branch. Conversational: you answer, AI challenges, AI asks the next "why". Lands on an enum-tagged systemic cause. Stops when systemic — not when the chain reaches five.

"Why did the system let this happen — keep going until you hit a control gap?"

The blameless guardrail. Before your answer is submitted, the UI heuristic blocks any answer mentioning a person by name, an @mention, or a known roster name from teams.yaml. Prompts you to reframe to a role / system gap. Server-side prompt re-checks. Defense in depth.

Hard rules · FR-6

  1. One thread per accepted fishbone branch.
  2. UI-layer name detector: capitalised first-name token · @mention · roster match.
  3. Blocked submit → reframe prompt ("what role / control gap allowed this?").
  4. AI emits NEXT_WHY · REFRAME · SYSTEMIC · TOO_DEEP via typed output.
  5. Terminate on systemic (process gap, missing control, design choice).
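
The UI-layer detector in rule 2 might look something like this. It is a heuristic sketch, not the app's code: the first-name seed list is a hypothetical stand-in for a much larger one, roster names are assumed to arrive as plain "First Last" strings from teams.yaml, and all identifiers are illustrative.

```typescript
type NameHitKind = "mention" | "roster" | "first_name";

interface NameHit {
  kind: NameHitKind;
  token: string;
}

// Hypothetical seed list; a real detector would need far more names.
const COMMON_FIRST_NAMES = ["sarah", "james", "maria", "david"];

function detectNames(answer: string, rosterNames: string[]): NameHit[] {
  const hits: NameHit[] = [];

  // 1. @mentions such as "@sarah.lee".
  const mentions: string[] = answer.match(/@[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*/g) || [];
  mentions.forEach((m) => hits.push({ kind: "mention", token: m }));

  // Lower-cased name parts from the roster ("Sarah Lee" gives "sarah", "lee").
  const rosterParts: string[] = [];
  rosterNames.forEach((full) => {
    full.toLowerCase().split(/\s+/).forEach((part) => {
      if (part) rosterParts.push(part);
    });
  });

  // 2 + 3. Scan the remaining words, with mentions stripped so they are not counted twice.
  const rest = answer.replace(/@[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*/g, " ");
  const words: string[] = rest.match(/[A-Za-z]+/g) || [];
  words.forEach((word) => {
    const lower = word.toLowerCase();
    if (rosterParts.indexOf(lower) !== -1) {
      hits.push({ kind: "roster", token: word }); // roster match, any casing
    } else if (/^[A-Z][a-z]+$/.test(word) && COMMON_FIRST_NAMES.indexOf(lower) !== -1) {
      hits.push({ kind: "first_name", token: word }); // capitalised first-name token
    }
  });
  return hits;
}
```

A non-empty result would block submit and trigger the reframe prompt. Expect false positives (a roster member named "Will", say), which is one reason the prompt layer re-checks server-side.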

Sample 5-Whys thread (blameless reframe in action)

Branch: "12 min gap with no incident commander"

  1. Why #1. Why was no incident commander assigned during the 12-minute window?
    → Sarah was on call but didn't see the page.
    ⚠ blocked · personal name detected · please reframe to a role / control
  2. Why #1 (reframed). Why was no incident commander assigned during the 12-minute window?
    → The primary on-call SRE missed the page; the escalation policy waits 15 min before firing the secondary.
  3. Why #2. Why does the escalation policy wait 15 min before firing the secondary on-call?
    → Original config from 2 years ago, never re-tuned after we cut on-call rotation from 3 → 2 engineers.
  4. SYSTEMIC. Process gap: escalation timeouts are not reviewed when on-call topology changes. Maps to: missing_periodic_review.
Kill criterion — A chain that stops at a person is not done. The system must be the answer.
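
One way to picture the typed output is as a discriminated union driving a small state machine per thread. The four kind names come from this stage; everything else here is assumed for illustration, including the field names, the thread statuses, and the reading of TOO_DEEP as "hand back for human review". The actual BAML-generated types may differ.

```typescript
type WhysTurn =
  | { kind: "NEXT_WHY"; question: string }
  | { kind: "REFRAME"; reason: string }
  | { kind: "SYSTEMIC"; cause: string; enumTag: string }
  | { kind: "TOO_DEEP"; note: string };

type ThreadStatus = "open" | "awaiting_reframe" | "closed_systemic" | "needs_review";

interface WhysThread {
  branch: string;
  questions: string[];
  status: ThreadStatus;
  rootCause?: { cause: string; enumTag: string };
}

// Pure transition: returns a new thread, never mutates the old one.
function applyTurn(thread: WhysThread, turn: WhysTurn): WhysThread {
  switch (turn.kind) {
    case "NEXT_WHY":
      return { ...thread, questions: thread.questions.concat([turn.question]), status: "open" };
    case "REFRAME":
      // The last answer named a person or stopped at blame; ask again.
      return { ...thread, status: "awaiting_reframe" };
    case "SYSTEMIC":
      // Terminate on a systemic cause, regardless of how many whys were asked.
      return {
        ...thread,
        status: "closed_systemic",
        rootCause: { cause: turn.cause, enumTag: turn.enumTag },
      };
    case "TOO_DEEP":
      // Assumed meaning: the chain has drifted past a useful depth.
      return { ...thread, status: "needs_review" };
  }
}
```

Because the switch is exhaustive over `kind`, adding a fifth turn type later would surface as a compile error rather than a silent fall-through.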
p. 17 · stage 06
06

Draft · blameless audit

Auto-assemble · diff · typed audit
/postmortem:draft
+ /postmortem:blameless-check
HIGH violations block save

The drafter auto-assembles the report from stages 03–05 using the skill's _template/report.md. Eight sections in canonical order. Streaming markdown, rendered preview, toggle to raw textarea.

"Is the root cause a control gap — or did we just rename the person?"

Inline blameless audit. Runs a typed BAML check returning a list of violations: severity (HIGH/MED/LOW) · kind (name / blame-language / passive-voice gap) · location (line ranges) · original · suggested fix · why. Per-violation: apply suggested · mark fixed · override with reason. HIGH violations block save.

Eight report sections

  1. Executive Summary
  2. Impact
  3. Timeline
  4. Root Causes (plural)
  5. Contributing Factors
  6. Lessons Learned
  7. Action Items
  8. Appendix / raw links

Sample blameless audit output

{
  "violations": [
    {
      "severity": "HIGH",
      "kind": "personal_name",
      "location": { "section": "root_causes", "line": 4 },
      "original": "Sarah forgot to update the escalation policy.",
      "suggested": "The escalation policy was not reviewed after the on-call topology change.",
      "why": "Names belong in the timeline by role only. Root cause must be systemic."
    },
    {
      "severity": "MED",
      "kind": "blame_language",
      "original": "The team failed to...",
      "suggested": "The process did not surface..."
    }
  ]
}
Kill criterion — Any HIGH violation unresolved → save blocked. Names never reach the root cause section.
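
The save gate reduces to a filter over the audit result. A sketch under assumptions: the `resolution` and `overrideReason` fields are hypothetical bookkeeping added on top of the violation shape shown in the sample, and an override only counts as resolved when it carries a non-empty reason.

```typescript
type ViolationSeverity = "HIGH" | "MED" | "LOW";
type Resolution = "open" | "applied" | "fixed" | "overridden";

interface AuditViolation {
  severity: ViolationSeverity;
  kind: string;
  original: string;
  suggested: string;
  resolution: Resolution;  // hypothetical: tracks the per-violation choice
  overrideReason?: string; // expected when resolution is "overridden"
}

// HIGH violations that are still open, or overridden without a reason.
function blockingViolations(violations: AuditViolation[]): AuditViolation[] {
  return violations.filter((v) => {
    if (v.severity !== "HIGH") return false;
    if (v.resolution === "open") return true;
    if (v.resolution === "overridden") {
      return !v.overrideReason || v.overrideReason.trim() === "";
    }
    return false;
  });
}

function canSaveDraft(violations: AuditViolation[]): boolean {
  return blockingViolations(violations).length === 0;
}
```

MED and LOW violations never block in this sketch; they stay visible in the audit list but do not gate the save.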
p. 20 · stages 07–08
07·08

Actions · Export

Tickets · close-out · round-trip with CLI
/postmortem:actions
SEV-based due dates
Folder shape ≡ CLI skill

Stage 07 parses the action items table from the draft's section 7. Each row becomes editable: ID · title · owner (role from teams.yaml) · due date (default by severity SLA) · what it addresses (enum from enums.yaml) · status.

Stage 08 exports the folder. Identical shape to a skill-authored postmortem. A CLI user can pull the export and continue with any /postmortem:* command — no migration needed.

Due-date defaults by severity

  1. SEV1 — within 14 days
  2. SEV2 — within 30 days
  3. SEV3 — within 60 days
  4. SEV4 — within 90 days
  5. Finalize → flips status: closed, rebuilds postmortems/README.md index.
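
The severity defaults translate into a small lookup. This sketch assumes the clock starts at the incident date and that dates are handled in UTC; `defaultDueDate` and `DUE_DAYS` are hypothetical names.

```typescript
type Sev = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Day counts from the list above.
const DUE_DAYS: { [S in Sev]: number } = { SEV1: 14, SEV2: 30, SEV3: 60, SEV4: 90 };

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the default due date as YYYY-MM-DD (UTC).
// Assumption: counted from the incident date, not from report sign-off.
function defaultDueDate(severity: Sev, from: Date): string {
  const due = new Date(from.getTime() + DUE_DAYS[severity] * MS_PER_DAY);
  return due.toISOString().slice(0, 10);
}
```

For example, a SEV2 incident dated 2026-05-08 gets a default due date of 2026-06-07. The owner can still edit the date per row.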

Final folder shape (round-trips with CLI)

postmortems/2026-05-08-checkout-5xx-fraud-redos/
├── metadata.yaml          # typed: severity, services, MTTR, root_cause enums
├── timeline.md            # stage 03 output
├── fishbone.md            # stage 04 output
├── whys.md                # stage 05 — one section per accepted branch
├── report.md              # stage 06 — blameless-audited
├── actions.json           # stage 07 sidecar — ticket IDs once created
└── raw/
    ├── slack.txt
    ├── deploy.log
    └── pagerduty.json
Definition of done — /postmortem:blameless-check on the exported folder returns clean. README.md index regenerated. Status closed.
p. 23 · why backend

Why this needs a backend

Six structural reasons
PM Copilot ran in localStorage
This one can't
Reasons below

PM Copilot ships as a forward-looking spec tool. localStorage holds the doc. No server-side state, no file writes, no schema validation against a vocabulary. Why-Postmortem is different in six ways. Each one drags it out of the browser.

01

The artifact is a folder, not a JSON blob.

Output round-trips with the CLI skill — exact folder shape, multiple files (report.md, timeline.md, fishbone.md, whys.md, metadata.yaml, raw/*). Browser file APIs can't write folders to arbitrary paths.

Browser-only: impossible. Backend writes the folder.
02

Prompts are version-controlled in the skill, not in the app.

Single source of truth = _prompts/*.md in the skill repo. Backend reads them at runtime, substitutes {{slug}} + {{tz}}, appends incident artifacts + context, passes to BAML. Re-hardcoding them client-side guarantees drift.

Browser-only: drift inevitable. Backend keeps parity.
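
The substitution step could be as small as the sketch below. It assumes `{{name}}` placeholders as in `{{slug}}` and `{{tz}}`; the function names, the section labels, and the choice to throw on an unknown variable are illustrative, not the backend's confirmed behaviour.

```typescript
// Replace {{name}} placeholders; fail loudly if a variable is missing.
function renderPrompt(template: string, vars: { [name: string]: string }): string {
  return template.replace(
    /\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    (whole: string, key: string) => {
      if (!Object.prototype.hasOwnProperty.call(vars, key)) {
        throw new Error("Missing prompt variable: " + key);
      }
      return vars[key];
    }
  );
}

// Assemble the final prompt: rendered template, then artifacts, then context.
// The section labels are hypothetical.
function buildPrompt(
  template: string,
  vars: { [name: string]: string },
  artifacts: string[],
  context: string
): string {
  return [
    renderPrompt(template, vars),
    "## Incident artifacts",
    artifacts.join("\n\n"),
    "## Context",
    context,
  ].join("\n\n");
}
```

Throwing on a missing variable keeps a half-rendered prompt, with a literal `{{tz}}` in it, from ever reaching the model.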
03

Typed AI output via BAML.

Stage 05 needs WhysTurn (NEXT_WHY / REFRAME / SYSTEMIC / TOO_DEEP). Stage 06 needs BlamelessAuditResult (typed violations). BAML generates a typed client + validator; runs server-side. No browser equivalent without exposing keys.

Browser-only: leaks keys. Backend keeps BYOK safe.
04

Schema validation against enums.yaml.

Every root cause + contributing factor must match an allowed enum. metadata.yaml must validate against the skill's schema before save. Validation logic lives once on the server; not duplicated and de-synced in the SPA.

Browser-only: validation drift. Backend = one source.
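
Server-side enum validation might look like this. The three enum list names follow enums.yaml as described in the context stage; the metadata field names and the sample values' placement in each list are assumptions for illustration.

```typescript
interface Enums {
  root_cause_category: string[];
  contributing_factors: string[];
  tags: string[];
}

// Hypothetical subset of metadata.yaml relevant to enum checks.
interface IncidentMetadata {
  root_causes: string[];
  contributing_factors: string[];
  tags: string[];
}

// Collect every value in `values` that is not in the allowed list.
function notAllowed(field: string, values: string[], allowed: string[]): string[] {
  return values
    .filter((v) => allowed.indexOf(v) === -1)
    .map((v) => field + ': "' + v + '" is not in enums.yaml');
}

// Returns an empty array when the metadata is valid.
function validateAgainstEnums(meta: IncidentMetadata, enums: Enums): string[] {
  return notAllowed("root_causes", meta.root_causes, enums.root_cause_category)
    .concat(notAllowed("contributing_factors", meta.contributing_factors, enums.contributing_factors))
    .concat(notAllowed("tags", meta.tags, enums.tags));
}
```

Returning all errors at once, rather than stopping at the first, lets the save dialog show every off-vocabulary value in one pass.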
05

The blameless guardrail is defense-in-depth.

UI heuristic blocks names client-side. Prompt re-checks server-side. If the UI is bypassed (devtools, paste, copy-edit), the prompt layer still intercepts. The audit re-runs server-side on save. Two layers — neither is optional.

Browser-only: single point of failure. Backend = defense in depth.
06

BYOK + provider swap.

Default provider = OpenAI. Swap to Anthropic by editing one BAML client file + npm run baml:generate. No frontend change. API keys live in .env, never in the browser. Keys + provider routing belong on the server.

Browser-only: keys in the wild. Backend = keys stay home.
PM Copilot can be a static page. Why-Postmortem cannot.
p. 26 · stack & api

Stack & API surface

Vite + React + Express + BAML
Frontend :5173
API :3001
BAML for typed prompts
Layer 01 — frontend

Vite + React + TS

  • Stage rail · two-pane working surface
  • :5173 dev server
  • Proxies /api/* → :3001
  • Streaming UI: fetch → reader → render
  • Brand tokens shared with PM Copilot
Layer 02 — API

Express on :3001

  • Reads _context/ at every call
  • Writes incident folder to disk
  • Loads _prompts/*.md from skill repo
  • Substitutes vars, calls BAML
  • Streams markdown back · saves on commit
Layer 03 — AI

BAML · typed prompts

  • Generate(prompt) for free-form markdown
  • WhysTurn typed for stage 05
  • BlamelessAuditResult typed for stage 06
  • Provider swap via one client file
  • BYOK in .env

API surface — 19 routes

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/health | Liveness |
| POST | /api/context/validate | Read + validate _context/ |
| POST | /api/context/data | Parsed services / teams / severity / enums |
| POST | /api/severity/suggest | Match impact inputs against matrix |
| POST | /api/incident/create | Write metadata.yaml skeleton + folder |
| POST | /api/incident/raw | Write raw/ files (utf8 / base64) |
| GET | /api/incident/files | List incident folder for export view |
| POST | /api/timeline/generate | Streaming markdown |
| POST | /api/timeline/save | Write timeline.md |
| POST | /api/fishbone/generate | Streaming markdown |
| POST | /api/fishbone/save | Write fishbone.md |
| POST | /api/whys/next | Typed WhysTurn per round |
| POST | /api/whys/save | Write whys.md |
| POST | /api/draft/generate | Streaming full report |
| POST | /api/draft/save | Write report.md |
| POST | /api/blameless/check | Typed BlamelessAuditResult |
| POST | /api/actions/parse | Extract table from report § 7 |
| POST | /api/actions/save | Sidecar + metadata patch + report refresh |
| POST | /api/export/finalize | Status closed · rebuild README index |
p. 29 · deploy reality

Deploy reality

What survives CF Pages · what doesn't
CF Pages = static + Functions
Backend = Workers / Fly / Render
This manual = static. Ship it.

The webapp is local-only by design. The author runs npm run dev on a laptop. Hosting the full POC needs more than a static CDN. But not everything is dynamic.

Ships on CF Pages today

Static only · drop & go
  • This why-postmortem-magazine.html
  • one-pager.html · printable
  • infographic.html · POC overview
  • deck.html · pitch slides
  • demo-video.html + .mp4
  • All brd-*.md rendered via any static MD host
  • Marketing surfaces — landing, OG card, launch post

Needs runtime — not CF Pages

Express + BAML + file I/O
  • webapp/ — Vite SPA needs paired backend
  • webapp/server/ — Express on :3001
  • BAML runtime + provider client
  • File-system writes (raw/, report.md, etc.)
  • _context/ file reads
  • YAML schema validation
  • BYOK secret storage

Three credible deploy paths

Path A · marketing only

Ship the magazine + landing pages to CF Pages. Keep webapp/ local. Demo via screencast.

Cost: free · risk: none · gets the story out today.

Path B · CF Pages + Workers

Convert Express routes to functions/api/*.ts. Use R2 for raw artifacts. KV for context cache. BAML calls via Worker secrets.

Cost: ~free at POC scale · risk: rewrite + cold starts.

Path C · Pages + Fly/Render API

SPA on CF Pages. Express on Fly.io or Render with a persistent volume for incident folders. BAML server-side.

Cost: $5/mo · risk: lowest, matches local-dev shape exactly.

Ship the magazine today. Decide on the runtime later.

Recommended for this POC — Path A

# prep folder (-p: no error if it already exists)
mkdir -p dist-why-postmortem
cp why-postmortem-magazine.html dist-why-postmortem/index.html
cp one-pager.html infographic.html deck.html demo-video.html dist-why-postmortem/
cp demo-video.mp4 favicon.svg why-postmortem.png dist-why-postmortem/

# deploy the static folder to Cloudflare Pages
npx wrangler@latest pages deploy dist-why-postmortem \
  --project-name why-postmortem-checks \
  --branch main

# →  https://why-postmortem-checks.pages.dev
Static now, dynamic later — The story doesn't need the runtime. The runtime needs the story.

"Same machine.
Opposite end of the build–break–learn loop."

This manual mirrors the why-postmortem webapp POC at webapp/ — same nine stages, same skill commands, same context layer, same blameless guardrail. Built in the dark-canvas brand from brand-kit.md, paired with the forward-looking PM Copilot manual at pdm-magazine.html.

The story is shippable to CF Pages today. The webapp is local-only by design — see page 23 for the six structural reasons. When the runtime is ready, paths B and C on page 29 are waiting.

Why-Postmortem · Field Manual
Static single-file HTML
Brand: brand-kit.md
Sibling: pdm-magazine.html