We already have hundreds of metrics, but most are aggregates that hide what breaks. Overall Visa looks fine while Visa × JPM is leaking. We're adding specific segmentations (brand × provider × region, merchant × APM × device), each with a named watcher and its own threshold. Sharper signals, not more noise.
The signal before the leak.
Auth, Apple Pay, 3DS, refunds: the metrics already exist. What's uneven is the loop: react, investigate, escalate. Monitors makes it the same path for every signal.
“The data is talking. What we don't always do is the same thing each time — who reacts first, how we dig in, when it crosses teams.”
Payment is failing. The dashboard is green.
One segment, four payment outcomes, and every technical metric looks good. Visa × JPMorgan drifts for 42 minutes while the blended Visa number sits at 96%, until a merchant calls.
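The arithmetic behind a green dashboard can be sketched in a few lines. The volumes below are hypothetical, chosen only so the blended number lands on the 96% quoted above; `SegmentStats` and `blended_rate` are illustrative names, not existing tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SegmentStats:
    name: str
    attempts: int
    approved: int

    @property
    def auth_rate(self) -> float:
        return self.approved / self.attempts


def blended_rate(segments: List[SegmentStats]) -> float:
    """Volume-weighted rate across segments, i.e. what an aggregate tile shows."""
    total_attempts = sum(s.attempts for s in segments)
    total_approved = sum(s.approved for s in segments)
    return total_approved / total_attempts


# Hypothetical volumes: one small segment is badly off, the rest is healthy.
visa = [
    SegmentStats("Visa × all other providers", attempts=9000, approved=8730),
    SegmentStats("Visa × JPM", attempts=500, approved=390),
]

print(f"blended Visa: {blended_rate(visa):.0%}")
for s in visa:
    print(f"{s.name}: {s.auth_rate:.0%}")
```

With these assumed volumes the small segment carries too little weight to move the aggregate, which is why the blended tile stays green while one provider pairing is far below it.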
Detect. React. Investigate. Escalate.
The dashboards are there. The loop is there too — in pockets, on some teams, for some rails. What we're adding is discipline: the same path to resolution for every signal, every time. Tools to sharpen, process to write down, and attention paid to both.
We're porting SRE discipline to business signals. When a segment trips its threshold, a named team gets notified on a known channel. First touch in hours, not days — same every time: ack, classify (real / noise / in-progress), park or escalate. No more waiting for a merchant to notice before we do.
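A minimal sketch of that first touch, assuming the simplest possible mapping: only a signal classified as real escalates, while noise and in-progress signals are parked. That mapping, the `Signal` fields, and the owner name are assumptions for illustration, not a description of an existing system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Classification(Enum):
    REAL = "real"
    NOISE = "noise"
    IN_PROGRESS = "in-progress"


class Action(Enum):
    PARK = "park"
    ESCALATE = "escalate"


@dataclass
class Signal:
    segment: str
    owner: str  # the named team notified on a known channel
    acked: bool = False
    classification: Optional[Classification] = None


def first_touch(signal: Signal, classification: Classification) -> Action:
    """Ack, classify, then park or escalate: the same three steps every time."""
    signal.acked = True
    signal.classification = classification
    # Assumption: noise and already-in-progress signals are parked.
    if classification is Classification.REAL:
        return Action.ESCALATE
    return Action.PARK


signal = Signal(segment="Visa × JPM", owner="hypothetical-cards-team")
print(first_touch(signal, Classification.REAL))
```

Keeping the classification on the signal itself means the record of who acknowledged what is a side effect of doing the work rather than a separate write-up.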
Monitoring, alerts, dashboards — already in place. What we're adding: a runbook per situation that names the drill (aggregate → segment → merchant → provider), the expected evidence, and the verdict shape. Once the runbook exists, the investigation becomes scriptable — an agent walks the steps and drops a case file before on-call opens the laptop.
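One way to picture a scriptable runbook is as plain data plus a walker. The sketch below assumes each drill level has a `collect` hook that fetches evidence and a `decide` function that turns the evidence into a verdict; both hooks, the `CaseFile` shape, and the placeholder findings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Drill order named in the text.
DRILL = ["aggregate", "segment", "merchant", "provider"]


@dataclass
class Step:
    level: str                  # one of DRILL
    expected_evidence: str      # what the runbook says we should find
    collect: Callable[[], str]  # hypothetical hook that fetches the evidence


@dataclass
class CaseFile:
    situation: str
    evidence: Dict[str, str] = field(default_factory=dict)
    verdict: str = "undetermined"


def walk(situation: str, steps: List[Step],
         decide: Callable[[Dict[str, str]], str]) -> CaseFile:
    """Walk the drill in order and return a case file for on-call."""
    if [s.level for s in steps] != DRILL:
        raise ValueError("runbook must follow the drill: " + " -> ".join(DRILL))
    case = CaseFile(situation)
    for step in steps:
        case.evidence[step.level] = step.collect()
    case.verdict = decide(case.evidence)
    return case


# Placeholder hooks; a real runbook would query metrics at each level.
steps = [
    Step(level, f"expected evidence at {level} level",
         lambda level=level: f"placeholder finding at {level} level")
    for level in DRILL
]
case = walk("auth-rate dip", steps, decide=lambda evidence: "real")
print(case.verdict, list(case.evidence))
```

Rejecting out-of-order steps is a deliberate choice in this sketch: the value of the runbook is that every investigation takes the same path.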
Projected, not cumulative — we escalate on the forecast, before the money is gone. $10k projected GMV leak → team lead. $50k → Stream Lead. $250k or confirmed merchant impact → cross-team. Same unit as the North Star. The number decides, and the record writes itself.
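The tiers above translate directly into a lookup. In this sketch the dollar thresholds and targets come from the text; the inclusive boundaries, the `None` result below $10k, and the naive linear forecast in `projected_leak_usd` are assumptions, and a real forecast would likely be more careful.

```python
from typing import Optional

# Tiers from the text, highest first. Inclusive (>=) boundaries are an assumption.
TIERS = [
    (250_000, "cross-team"),
    (50_000, "Stream Lead"),
    (10_000, "team lead"),
]


def projected_leak_usd(leak_per_minute_usd: float, horizon_minutes: float) -> float:
    """Naive linear forecast: assumes the leak continues at its current rate."""
    return leak_per_minute_usd * horizon_minutes


def escalation_target(projected_usd: float,
                      merchant_impact_confirmed: bool = False) -> Optional[str]:
    """Return who to escalate to, or None if the signal stays with its watcher."""
    if merchant_impact_confirmed:
        return "cross-team"
    for floor, target in TIERS:
        if projected_usd >= floor:
            return target
    return None


# Hypothetical input: $250/min of lost GMV, projected over four hours.
forecast = projected_leak_usd(250.0, 240)
print(forecast, escalation_target(forecast))
```

Because the function takes the forecast rather than the cumulative loss, the escalation can fire while most of the projected amount is still recoverable.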
Three honest objections. Three honest answers.
Isn't this just SRE?
SRE watches reliability — is the API up, is latency within SLO. This watches outcomes — is Visa × JPM actually authorising, is merchant-X converting, is Apple Pay earning its share. Same discipline, different question. Together they close the gap tests leave open.
Won't this drown us in dashboards?
Segments aren't dashboards. A specific segment (Visa × JPM × EU) only surfaces when it trips its threshold — it becomes a signal pushed to an owner, not a tile on a screen. More segments don't mean more dashboards; they mean more precision when something actually breaks.
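As a rough sketch of "signal, not tile": every segment is watched, but only the ones that breach their threshold produce anything at all. The watcher names, floors, and observed rates below are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Segment = Tuple[str, str, str]  # (brand, provider, region)


@dataclass(frozen=True)
class Watch:
    watcher: str          # named owner who receives the signal
    min_auth_rate: float  # per-segment threshold


def tripped(watches: Dict[Segment, Watch],
            observed: Dict[Segment, float]) -> List[str]:
    """Return one message per segment below its floor; healthy segments stay silent."""
    signals = []
    for segment, watch in watches.items():
        rate = observed.get(segment)
        if rate is not None and rate < watch.min_auth_rate:
            signals.append(
                f"{' × '.join(segment)} at {rate:.0%} "
                f"(floor {watch.min_auth_rate:.0%}) -> {watch.watcher}"
            )
    return signals


# Hypothetical configuration and readings.
watches = {
    ("Visa", "JPM", "EU"): Watch("hypothetical-cards-eu", 0.90),
    ("Visa", "JPM", "US"): Watch("hypothetical-cards-us", 0.90),
}
observed = {("Visa", "JPM", "EU"): 0.78, ("Visa", "JPM", "US"): 0.97}

for message in tripped(watches, observed):
    print(message)
```

Adding a thousand more entries to `watches` adds nothing to anyone's screen until one of them actually trips.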
Production signals after the fact — isn't that too late?
The first time, yes. But every signal we learn becomes a gate before the next release, the same contract as turning a post-mortem into a test. Today's Visa × JPM dip becomes a pre-merge check for the next release. The loop closes.