Cut flaky-test noise by ranking tests on failure history
Updated June 2026 · 6 min read
A suite of 2,000 tests treats every test as equally informative, but the tests are not. A handful fail constantly for environmental reasons; a different handful actually catch regressions in the area you just changed. Failure history tells them apart.
Why "run everything and read the red" stops working
Past a certain size, the suite's failures become noise. Engineers learn that checkout-timeout.spec is "just flaky," start ignoring reds, and eventually miss a real one hiding in the pile. Retries paper over it; the signal keeps degrading. The problem isn't too few tests — it's no prioritization.
Two numbers worth tracking per test
- Base failure rate — how often this test fails across all recent runs. A test that fails 30% of the time regardless of the change is flaky and should be quarantined or fixed, not trusted.
- Conditional failure rate — how often it fails on changes touching a given area. A test with a low base rate that spikes when a specific module changes is a high-signal test for that module.
Together they let you rank: for this PR, which tests are most likely to fail for a real reason? That's the list worth running first and reading first.
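As a rough illustration of the two rates, here is a minimal Python sketch. The `Run` record, the area labels, and the scoring rule (conditional rate minus base rate) are assumptions made for this example; the article does not specify how the ranking score is actually computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One recorded test outcome (hypothetical record shape)."""
    test: str          # test identifier
    areas: frozenset   # areas touched by the change that triggered the run
    failed: bool


def base_rate(runs, test):
    """Share of all recorded runs of `test` that failed."""
    outcomes = [r.failed for r in runs if r.test == test]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def conditional_rate(runs, test, area):
    """Share of runs of `test` that failed when the change touched `area`."""
    outcomes = [r.failed for r in runs if r.test == test and area in r.areas]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def rank_for_change(runs, area):
    """Order tests by how much more often they fail when `area` changes.

    Assumed score: conditional rate minus base rate, so a test that fails
    equally often everywhere (likely flaky) scores near zero.
    """
    tests = {r.test for r in runs}
    scored = [(t, conditional_rate(runs, t, area) - base_rate(runs, t)) for t in tests]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


checkout, auth = frozenset({"checkout"}), frozenset({"auth"})
history = (
    [Run("applies coupon", checkout, True)] * 2
    + [Run("applies coupon", auth, False)] * 6
    + [Run("checkout-timeout", checkout, True), Run("checkout-timeout", checkout, False)]
    + [Run("checkout-timeout", auth, True), Run("checkout-timeout", auth, False)]
)

print(rank_for_change(history, "checkout"))
# [('applies coupon', 0.75), ('checkout-timeout', 0.0)]
```

In this made-up history, "checkout-timeout" fails half the time no matter what changed, so it scores zero for a checkout change, while "applies coupon" fails only when checkout changes and rises to the top.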
Where the history comes from
You already produce it — every CI run emits JUnit (or similar) results. The trick is collecting them over time and attributing each pass/fail to the change that triggered it. With that record you can compute per-test failure rates and surface them where decisions happen: the pull request.
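Collecting that record can be sketched with Python's standard library. This is a minimal example, assuming the common JUnit XML layout (`<testcase>` elements with optional `<failure>`, `<error>`, or `<skipped>` children); the record fields and the commit SHA are hypothetical, and real reports vary between test runners.

```python
import xml.etree.ElementTree as ET


def parse_junit(xml_text, commit_sha):
    """Return one record per executed <testcase>, tagged with the triggering commit."""
    root = ET.fromstring(xml_text)
    records = []
    # iter() finds <testcase> at any depth, so a <testsuites> wrapper
    # and a bare <testsuite> root are both handled.
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # a skipped test neither passed nor failed
        failed = case.find("failure") is not None or case.find("error") is not None
        records.append({
            "commit": commit_sha,
            "test": f'{case.get("classname", "")}::{case.get("name", "")}',
            "failed": failed,
        })
    return records


SAMPLE = """<testsuite name="checkout" tests="3">
  <testcase classname="checkout.spec" name="applies coupon">
    <failure message="expected 90, got 100"/>
  </testcase>
  <testcase classname="checkout.spec" name="empty cart"/>
  <testcase classname="checkout.spec" name="gift card"><skipped/></testcase>
</testsuite>"""

for record in parse_junit(SAMPLE, commit_sha="abc123"):
    print(record)
```

Appending these records to a store on every CI run, keyed by commit, is what makes the per-test failure rates computable later.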
How Testward uses it
Testward ingests your CI test results through a GitHub Action that uploads JUnit output, authenticated with GitHub's OIDC token — no API keys to manage. It builds per-test failure history and, on each PR, ranks the tests most likely to fail on that change:
Run these tests first (ranked by failure history):
- checkout.spec.ts › applies coupon — 72% likely to fail
- auth.spec.ts › SSO redirect — 38% likely to fail
Combine that with impact analysis — which tests this PR structurally touches — and you get both signals at once: what the change is wired to break, ranked by what actually breaks in practice.
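One way to picture the combination is a two-part sort: tests the change structurally touches come first, and within each group tests are ordered by historical failure likelihood. The sketch below is an assumption about how the two signals could be merged, not a description of Testward's internals; the `prioritize` function, the test names, and the third score are hypothetical.

```python
def prioritize(impacted, fail_likelihood):
    """Order tests: structurally impacted first, then by failure likelihood (descending).

    impacted        -- set of test names the change is wired to (from impact analysis)
    fail_likelihood -- dict of test name -> estimated probability of failing on this change
    """
    def sort_key(test):
        # False sorts before True, so impacted tests come first.
        return (test not in impacted, -fail_likelihood[test], test)

    return sorted(fail_likelihood, key=sort_key)


scores = {
    "checkout.spec.ts: applies coupon": 0.72,
    "auth.spec.ts: SSO redirect": 0.38,
    "search.spec.ts: empty query": 0.05,  # hypothetical third test
}
touched = {"auth.spec.ts: SSO redirect", "search.spec.ts: empty query"}

for name in prioritize(touched, scores):
    print(name)
```

With these inputs the two impacted tests are listed first, ordered by their scores, and the coupon test follows even though its historical likelihood is highest. A stricter variant could drop non-impacted tests instead of demoting them.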
Fix the flaky ones too
Ranking surfaces chronically-flaky tests so you can fix or quarantine them. A large share trace back to brittle selectors — see selectors that don't break.
Install Testward, add the upload Action, and get failure-ranked tests on every PR.