FLAKY TESTS

Cut flaky-test noise by ranking tests on failure history

Updated June 2026 · 6 min read

Running a 2,000-test suite as a flat list treats every test as equally informative. The tests aren't. A handful fail constantly for environmental reasons; a different handful are the ones that actually catch regressions in the area you just changed. Failure history tells them apart.

Why "run everything and read the red" stops working

Past a certain size, the suite's failures become noise. Engineers learn that checkout-timeout.spec is "just flaky," start ignoring reds, and eventually miss a real failure hiding in the pile. Retries paper over the flakiness while the signal keeps degrading. The problem isn't too few tests; it's the lack of prioritization.

Two numbers worth tracking per test

- Base failure rate: how often this test fails across all recent runs. A test that fails 30% of the time regardless of the change is flaky and should be quarantined or fixed, not trusted.
- Conditional failure rate: how often it fails on changes touching a given area. A test with a low base rate that spikes when a specific module changes is a high-signal test for that module.

Together they let you rank: for this PR, which tests are most likely to fail for a real reason? That's the list worth running first and reading first.
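
As a rough illustration, the two rates and a ranking built on them might be computed as in the sketch below. The Run record, the function names, and the choice to rank by how far the conditional rate exceeds the base rate are all assumptions made for this example, and the 30% cutoff simply reuses the figure mentioned above. None of it describes how any particular tool scores tests.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One historical test result (hypothetical record shape)."""
    test: str         # test identifier
    failed: bool      # True if the test failed in this run
    areas: frozenset  # areas (modules, paths) touched by the triggering change


def failure_rates(runs, area):
    """Return {test: (base_rate, conditional_rate)} for one area."""
    totals = defaultdict(lambda: [0, 0])   # test -> [failures, runs]
    in_area = defaultdict(lambda: [0, 0])  # same, only for changes touching `area`
    for r in runs:
        totals[r.test][0] += r.failed
        totals[r.test][1] += 1
        if area in r.areas:
            in_area[r.test][0] += r.failed
            in_area[r.test][1] += 1
    rates = {}
    for test, (fails, n) in totals.items():
        base = fails / n
        a_fails, a_n = in_area[test]
        # Assumption: with no history for this area, fall back to the base rate.
        cond = a_fails / a_n if a_n else base
        rates[test] = (base, cond)
    return rates


def rank_for_area(runs, area, flaky_threshold=0.30):
    """Rank tests by how far the conditional rate exceeds the base rate."""
    ranked = []
    for test, (base, cond) in failure_rates(runs, area).items():
        if base >= flaky_threshold:
            continue  # fails regardless of the change: quarantine candidate
        ranked.append((test, base, cond))
    ranked.sort(key=lambda t: t[2] - t[1], reverse=True)
    return ranked
```

Tests skipped by the threshold are not discarded for good; they are the ones to fix or quarantine rather than read as signal on this PR.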

Where the history comes from

You already produce it — every CI run emits JUnit (or similar) results. The trick is collecting them over time and attributing each pass/fail to the change that triggered it. With that record you can compute per-test failure rates and surface them where decisions happen: the pull request.
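
A minimal sketch of the collection step, assuming the common JUnit XML layout in which testcase elements carry classname and name attributes and hold a failure, error, or skipped child when they did not pass. The parse_junit name and the commit_sha argument are hypothetical; real reports vary by test runner, so treat this as a starting point.

```python
import xml.etree.ElementTree as ET


def parse_junit(xml_text, commit_sha):
    """Yield (commit_sha, test_id, outcome) for every <testcase> in a report."""
    root = ET.fromstring(xml_text)
    # The root may be <testsuites> or a single <testsuite>; iter() covers both.
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            outcome = "failed"
        elif case.find("skipped") is not None:
            outcome = "skipped"
        else:
            outcome = "passed"
        # Tagging each row with the commit is what lets a later step
        # attribute the result to the change that triggered the run.
        yield commit_sha, test_id, outcome
```

Appending these rows to a store on every CI run gives you the per-test record that the failure rates are computed from.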

How Testward uses it

Testward ingests your CI test results through a GitHub Action that uploads JUnit output, authenticated with GitHub's OIDC token — no API keys to manage. It builds per-test failure history and, on each PR, ranks the tests most likely to fail on that change:

Run these tests first (ranked by failure history):
- checkout.spec.ts › applies coupon — 72% likely to fail
- auth.spec.ts › SSO redirect — 38% likely to fail

Combine that with impact analysis — which tests this PR structurally touches — and you get both signals at once: what the change is wired to break, ranked by what actually breaks in practice.
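
One simple way to combine the two signals, sketched under the assumption that impact analysis yields a set of test ids and failure history yields (test id, likelihood) pairs. The prioritize function is hypothetical and is not a description of Testward's actual scoring.

```python
def prioritize(ranked, impacted):
    """Order tests: structurally impacted first, then by failure likelihood.

    ranked:   list of (test_id, likelihood) pairs from failure history
    impacted: set of test ids the change structurally touches
    """
    # False sorts before True, so impacted tests come first; within each
    # group, a higher likelihood sorts earlier.
    return sorted(ranked, key=lambda item: (item[0] not in impacted, -item[1]))
```

Putting impacted tests first is a design choice; a team that trusts its history more than its dependency graph could swap the two sort keys.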

Fix the flaky ones too

Ranking surfaces chronically flaky tests so you can fix or quarantine them. A large share trace back to brittle selectors; see selectors that don't break.

Rank your tests by what actually fails.

Install Testward, add the upload Action, and get failure-ranked tests on every PR.

Install free on GitHub