# Unreliable end-to-end smoke tests in CI/CD
> Source report: https://painfinder.app/reports/unreliable-end-to-end-smoke-tests-in-ci-cd

## 1. What we're building
Build a CI-first “End-to-End Smoke Gate” system that runs reliable smoke checks as PR checks and blocks promotion when critical validations fail. It should implement the must-have asks: an adaptive in-browser AI agent that executes natural-language step lists and can tolerate selector/layout drift, and self-hosted/local LLM execution so sensitive data never leaves your infrastructure. Include an explicit smoke vs full regression strategy to reduce deployment slowdown: on PRs, mock third-party dependencies and run a small number of real smoke validations; in nightly/full runs, increase real coverage.

To address CI-only and environment-driven flakiness, add must-have infrastructure features for stable runs: database isolation mechanisms for parallel CI builds (e.g., safe isolation patterns so concurrent runs don’t share state), and deterministic “local or locally-ish” execution paths so engineers can reproduce failures without waiting minutes for CI. Finally, add an on-device validation stage that can run on real hardware (as a blocking gate before staging/release), with safeguards for practical hardware issues like thermal/long-run instability (e.g., capturing failure context and supporting retries/diagnostics). Optionally, provide CI pipeline ergonomics such as auto-guided triage and actionable outputs when smoke gates fail, aligning with the desire for “auto-suggested fixes” when CI/CD breaks.

**Working name:** SmokeGate CI
**Tagline:** Blocking end-to-end smoke checks with adaptive agent steps, isolation, and evidence.
**Main goal:** Make PR-time smoke checks deterministic and blocking, with fast local repro and rich evidence to cut flake and triage time.
**Target users:** Engineering/QA teams shipping UI-heavy web apps who need CI/CD smoke gates that block promotion when critical validations fail.

**Main user result:** A developer configures a repo’s PR smoke gate once and gets a deterministic, blocking pass/fail with a signed evidence bundle when it fails.
**5-minute outcome:** Create a smoke flow (natural-language step list), run it against a PR, and confirm promotion is blocked with screenshots/network/HTML evidence on failure.
**What we solve first:** Reliable PR-time smoke gating with (1) adaptive natural-language browser steps and (2) dataset/run isolation plus deterministic local replay.
**Out of scope for MVP:**
- On-device real hardware validation stage
- Fleet-wide runner management automation for enterprise deployments
- Automatic quarantine policies based on advanced flake analytics

## 2. Why this is worth building
- Verdict: **MEDIUM** (63/100)
- Across the corpus, multiple posts confirm that end-to-end/smoke tests are unreliable (flaky in CI, noisy from third-party dependencies and shared state) and too slow or costly in development workflows. There is also a clear gap between CI results and real-world behavior, highlighted by bugs that only reproduce on physical hardware and motivate a blocking on-device stage. Feature requests repeatedly converge on deterministic/local debugging, smoke/full separation, dependency mocking, isolation for parallel CI, and adaptive in-browser execution with local/self-hosted LLMs.

**Current pain:** Smoke tests fail inconsistently in CI/CD, creating noise and forcing engineers to rerun manually. When failures happen, teams often need manual QA to confirm whether it’s a real bug and to figure out what actually broke.
**Current workaround:** Engineers re-run tests manually to see if failures reproduce, and save local copies/consistent setups to reduce drift. They also manually investigate whether failures are known behaviors or defects.
**Why existing tools fail:** Generic browser automation (e.g., Playwright/Selenium) doesn’t provide a blocking smoke gate that’s resilient to selector/layout drift, nor deterministic isolation + local-or-locally-ish replay with rich CI evidence. Without run-scoped data isolation and environment mirroring, parallel CI jobs amplify state-related flakiness.

## 3. Must-have capabilities
### 3.1 Two-tier test strategy: PR smoke gate vs nightly/full regression
**Why:** Running only the smoke subset on PRs reduces deployment slowdown; nightly/full runs provide the higher coverage.
**Evidence:** post #19248 — *"Separating a small subset of the tests as a smoke set is a good idea."*
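
A minimal sketch of how the two tiers might be selected per trigger, assuming smoke tests carry a `smoke` tag. The `Trigger` values, the `TestCase` shape, and the `selectTests` name are illustrative assumptions, not something the source report specifies.

```typescript
// Hypothetical trigger names; map them to whatever your CI exposes.
type Trigger = "pull_request" | "nightly";

interface TestCase {
  id: string;
  tags: string[]; // assumption: smoke tests are tagged "smoke"
}

interface SuitePlan {
  tier: "smoke" | "full";
  tests: TestCase[];
}

function selectTests(trigger: Trigger, all: TestCase[]): SuitePlan {
  if (trigger === "pull_request") {
    // PRs run only the small tagged subset to keep the gate fast.
    return { tier: "smoke", tests: all.filter((t) => t.tags.includes("smoke")) };
  }
  // Nightly runs take the full regression set.
  return { tier: "full", tests: all };
}
```

Keeping the selection in one pure function makes it easy to show the chosen tier in the UI and to unit-test the policy without starting a browser.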

### 3.2 Adaptive in-browser AI agent executing natural-language step lists (tolerant to selector/layout drift)
**Why:** Smoke steps must stay resilient as the UI changes, so the agent resolves each step against the live page instead of relying on fixed selectors.
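
One way to picture the agent loop is sketched below, assuming a planner (backed by a local LLM) and a browser driver are injected behind small interfaces. Every name here (`Planner`, `BrowserDriver`, `runSteps`, the `Action` shapes) is hypothetical; the sketch only shows the control flow of re-reading the page before each attempt, not a real agent.

```typescript
// All names below are hypothetical; a real planner would call a local LLM
// and a real driver would wrap a browser automation library.
interface PageSnapshot {
  url: string;
  visibleText: string[];
}

type Action =
  | { kind: "click"; target: string }
  | { kind: "type"; target: string; text: string }
  | { kind: "assertText"; text: string };

interface Planner {
  // Decides the next action from the step text and the current page state.
  nextAction(step: string, page: PageSnapshot): Promise<Action>;
}

interface BrowserDriver {
  snapshot(): Promise<PageSnapshot>;
  perform(action: Action): Promise<boolean>; // true when the action succeeded
}

interface StepOutcome {
  step: string;
  passed: boolean;
  attempts: number;
}

async function runSteps(
  steps: string[],
  planner: Planner,
  browser: BrowserDriver,
  maxAttempts = 2,
): Promise<StepOutcome[]> {
  const outcomes: StepOutcome[] = [];
  for (const step of steps) {
    let passed = false;
    let attempts = 0;
    while (!passed && attempts < maxAttempts) {
      attempts += 1;
      // Re-read the page on every attempt so the planner sees the current
      // layout instead of relying on a stored selector.
      const page = await browser.snapshot();
      const action = await planner.nextAction(step, page);
      passed = await browser.perform(action);
    }
    outcomes.push({ step, passed, attempts });
    if (!passed) break; // stop at the first failed step
  }
  return outcomes;
}
```

Because the planner and driver are interfaces, the loop can be exercised with fakes in unit tests, and the same loop can run against a real browser in CI.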

### 3.3 Self-hosted/local LLM execution for sensitive-data compliance
**Why:** Required so sensitive UI context and prompts never leave the user’s infrastructure.
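
A small sketch of an egress guard for the LLM endpoint, assuming the runner is configured with an explicit allowlist of in-infrastructure hosts. The function names and the allowlist are assumptions for illustration. The request builder targets Ollama's `POST /api/generate` endpoint (default port 11434), since Ollama is one of the runtimes named in the source quotes; other runtimes would need a different request shape.

```typescript
// Assumption: allowedHosts lists only hosts inside your own infrastructure.
function isAllowedLlmEndpoint(endpoint: string, allowedHosts: string[]): boolean {
  let url: URL;
  try {
    url = new URL(endpoint);
  } catch {
    return false; // not a valid absolute URL
  }
  return allowedHosts.includes(url.hostname);
}

// Builds a non-streaming request for Ollama's /api/generate endpoint.
function buildGenerateRequest(baseUrl: string, model: string, prompt: string) {
  return {
    url: new URL("/api/generate", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}
```

Checking the endpoint before any prompt is built keeps page content and prompts from being sent to a host that was configured by mistake.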

### 3.4 Database isolation for parallel CI runs (no shared state across builds)
**Why:** Shared state across parallel CI builds is a common cause of flakiness; isolation prevents tests from interfering with each other.

### 3.5 Unique test data / run-scoped datasets controlled by the test system
**Why:** Deterministic, test-owned data reduces “works on my machine / only in CI” failures.
**Evidence:** post #19278 — *"Use unique test data for each instance"*
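
As a sketch of run-scoped naming, the helpers below derive a per-run schema name and unique test-data values from the CI run ID. It assumes a PostgreSQL-style database (identifiers are truncated past 63 bytes) and assumes the CI run ID is already unique; the `smoke_` prefix and the function names are made up for illustration. Sanitizing can map two different raw IDs to the same name, so prefer IDs that are already alphanumeric.

```typescript
// PostgreSQL truncates identifiers longer than 63 bytes, so keep names short.
const MAX_IDENTIFIER_LENGTH = 63;

// Replace anything that is not a lowercase letter, digit, or underscore.
function sanitize(runId: string): string {
  return runId.toLowerCase().replace(/[^a-z0-9_]/g, "_");
}

// One schema per CI run, so parallel builds never read each other's rows.
function runScopedSchema(runId: string): string {
  return `smoke_${sanitize(runId)}`.slice(0, MAX_IDENTIFIER_LENGTH);
}

// Unique, recognizable test data; the .test TLD is reserved for testing.
function runScopedEmail(runId: string, label: string): string {
  return `${label}+${sanitize(runId)}@example.test`;
}
```

For example, run ID `PR-123/job#2` would yield the schema `smoke_pr_123_job_2`, which can be dropped wholesale when the run finishes.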

### 3.6 Mock third-party dependencies on PRs; run limited real integrations only in staging/nightly
**Why:** Reduces noise/flakiness from external systems while still catching integration breaks periodically.
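
The routing rule can be sketched as a single lookup, assuming each dependency has a mock URL and a real URL and that staging/nightly runs consult an allowlist of dependencies permitted to hit real services. The type names, tier names, and `resolveEndpoint` are illustrative assumptions.

```typescript
type RunTier = "pr" | "nightly" | "staging";

interface Dependency {
  name: string;
  realUrl: string;
  mockUrl: string;
}

function resolveEndpoint(
  dep: Dependency,
  tier: RunTier,
  realAllowlist: ReadonlySet<string>,
): string {
  // PR runs never touch real third parties, even if allowlisted.
  if (tier === "pr") return dep.mockUrl;
  // Nightly/staging runs use the real service only when explicitly allowed.
  return realAllowlist.has(dep.name) ? dep.realUrl : dep.mockUrl;
}
```

Defaulting to the mock whenever a dependency is not allowlisted means a misconfigured run degrades to less coverage rather than to unexpected external traffic.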

### 3.7 Deterministic local-or-locally-ish execution path using CI-mirroring containers/servers
**Why:** Engineers need quick reproduction without waiting minutes for CI.
**Evidence:** post #19274 — *"Make your dev environment run on a local server / container"*

### 3.8 Hardware-blocking stage: on-device validation on real hardware before staging/release (with retry/diagnostics)
**Why:** Emulation misses hardware-specific issues, so the on-device stage must block promotion and capture failure context.
**Evidence:** post #19245 — *"We've since added an on-device validation stage that runs on real hardware before anything reaches staging."*

### 3.9 CI artifacts evidence bundle: screenshots, HTML, logs, network logs to support triage
**Why:** Capturing rich evidence accelerates diagnosis of flaky UI/system failures.
**Evidence:** post #19275 — *"Add a listener to expand your logs."*

### 3.10 Blocking gate behavior: smoke gate must prevent promotion when critical validations fail
**Why:** Smoke results can’t be advisory; they must control pipeline progression.
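
A minimal sketch of the gate decision, assuming each smoke step is marked critical or non-critical. The `StepResult` shape and function names are hypothetical. The sketch treats an empty run as non-promotable, on the assumption that a gate with no evidence should not pass.

```typescript
interface StepResult {
  name: string;
  critical: boolean;
  passed: boolean;
}

interface GateDecision {
  promotable: boolean;
  failedCritical: string[];
}

function evaluateGate(results: StepResult[]): GateDecision {
  const failedCritical = results
    .filter((r) => r.critical && !r.passed)
    .map((r) => r.name);
  // An empty run must not count as a pass: no evidence means no promotion.
  const promotable = results.length > 0 && failedCritical.length === 0;
  return { promotable, failedCritical };
}

// Map the decision to a process exit status for the CI job.
function exitCode(decision: GateDecision): number {
  return decision.promotable ? 0 : 1;
}
```

In CI, the job would exit with `exitCode(...)` and the check would be marked as required, so a non-zero status stops the merge or promotion rather than producing an advisory comment.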

## 4. Use cases & user stories
SmokeGate CI provides a CI-first end-to-end smoke gate for PRs. It executes an adaptive in-browser agent from a natural-language step list, runs with run-scoped isolated datasets, captures rich artifacts, and offers a deterministic locally-mirrored replay path to reproduce failures quickly.

### Use cases
**4.1 PR merge safety for a UI-heavy app with frequent selector/layout changes**
A developer opens a PR. The pipeline runs the Smoke Gate with an adaptive natural-language agent that follows a step list (e.g., “log in, open settings, verify status label”), while the agent tolerates selector/layout drift. If a critical smoke step fails, the gate blocks merge/promotion and attaches an evidence bundle (screenshots + HTML + network + JS logs) so the team can diagnose quickly instead of waiting for nightly full regression.

**4.2 Stable parallel integration smoke despite shared services**
A team runs multiple PRs concurrently. The Smoke Gate provisions run-scoped database isolation and controlled datasets per build, ensuring parallel jobs don’t share state. On PRs, third-party integrations are mocked; only a small number of real integration validations run in nightly/staging. When failures occur, the team re-runs deterministically using the CI-mirroring local container/runner to reproduce the issue immediately.

### User stories
- **As a CI engineer at a mid-market team**, I want a PR check that runs only a fast, reliable smoke subset and blocks promotion on critical failures, *so that* we keep deployments moving without sacrificing safety.
- **As an SRE/QA engineer responsible for flake reduction**, I want deterministic local reproduction of CI smoke failures using a container that mirrors CI, *so that* we can triage and fix quickly instead of waiting on re-runs in CI.

## 5. Pages & form factor
**Form factor:** Web SaaS control plane with self-hosted agent runner (CI-first Smoke Gate)
**Why:** A web SaaS control plane centralizes smoke-gate configuration, run orchestration, artifact/evidence bundles, and reporting, while self-hosted runners handle deterministic local-or-locally-ish execution and self-contained LLM/compliance needs. This directly targets CI unreliability/flakiness by isolating environments per run and using blocking stages before promotion.

### Pages
**5.1 Dashboard**
At-a-glance status for smoke-gate health, latest promotion eligibility, and failing/flake trends.
Key elements:
- Smoke gate status (pass/block/skip)
- Latest runs list with links
- Flakiness indicators and top offenders
- Hardware validation stage status (if enabled)
- Integration coverage status (mock vs real)

**5.2 Project Settings**
Define how the smoke gate runs for a repo: suite selection, mocking rules, dataset/versioning, and environment mapping.
Key elements:
- Test suite policy (PR smoke vs nightly/full regression)
- Mock third-party dependencies toggle
- Staging/nightly real-integration allowlist
- Dataset isolation strategy (per-run schema/dataset version)
- Hardware validation enablement and retry policy

**5.3 Run Console**
Single run view: deterministic execution steps, logs/screenshots/network traces, and promotion eligibility computation.
Key elements:
- Run timeline (agent steps, network phases, DB setup, hardware stage)
- Artifacts section (screenshots, HTML, network logs, evidence bundle)
- Flake classification status (likely flaky vs likely defect)
- Promotion result (blocked vs promotable)
- Retry/diagnostics CTA

**5.4 Evidence Bundle**
Export and share a signed, audit-friendly bundle from a run (especially on hardware validation).
Key elements:
- Signed evidence bundle download
- Hardware run metadata
- Screenshots/video/HTML attachments
- Environment fingerprint (agent version, container/image hash)
- Integrity verification status
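
One possible shape for signing and integrity verification is sketched below: hash each artifact into a manifest, then sign the manifest. It uses HMAC-SHA256 from Node's `node:crypto` module with a shared secret, which is an assumption made for brevity; an audit-facing bundle may call for asymmetric signatures instead. The `Artifact` and `Manifest` shapes and the function names are hypothetical.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

interface Artifact {
  path: string;
  content: string;
}

interface Manifest {
  runId: string;
  files: { path: string; sha256: string }[];
}

function buildManifest(runId: string, artifacts: Artifact[]): Manifest {
  const files = artifacts
    .map((a) => ({
      path: a.path,
      sha256: createHash("sha256").update(a.content).digest("hex"),
    }))
    // Sort by path so the serialized manifest is stable across runs.
    .sort((x, y) => (x.path < y.path ? -1 : x.path > y.path ? 1 : 0));
  return { runId, files };
}

function signManifest(manifest: Manifest, key: string): string {
  return createHmac("sha256", key).update(JSON.stringify(manifest)).digest("hex");
}

function verifyManifest(manifest: Manifest, signature: string, key: string): boolean {
  const expected = Buffer.from(signManifest(manifest, key), "hex");
  const actual = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

Verification recomputes the signature from the manifest, so changing any artifact hash, path, or the run ID makes the check fail.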

**5.5 Dataset Manager**
Control run-scoped test data/dataset versions and restoration to guarantee determinism under parallel CI builds.
Key elements:
- Dataset snapshot selector (pinned by run id)
- Dataset restore status/verification
- Schema/per-tenant isolation configuration
- Retention policy for pinned datasets
- Preview of generated test data IDs

**5.6 Hardware Validation**
Run and review on-device validation stage results to catch hardware-specific issues before promotion.
Key elements:
- Device pool selection
- Inference/regression job config
- Pass/fail with diagnostic attachments
- Retry scheduling and escalation
- Promotion gating result

**5.7 Flake Triage & Analytics**
Classify failures using historical data and artifacts, minimize manual reruns, and drive quarantine policies.
Key elements:
- Flake vs defect probability
- Historical trend chart (pass/fail streaks)
- Quarantine controls (temporarily allow non-blocking)
- Agent-based triage explanation
- Link to evidence bundle and suggested next actions
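
As a rough illustration of classification from history, the sketch below uses a deliberately coarse heuristic in place of a probability model: a test that both passed and failed on the same commit is labeled likely flaky, and a test that failed at least twice with no pass on that commit is labeled a likely defect. The record shape, labels, and threshold are assumptions, and the sketch ignores artifacts entirely.

```typescript
interface RunRecord {
  testId: string;
  commitSha: string;
  passed: boolean;
}

type Verdict = "likely_flaky" | "likely_defect" | "insufficient_data";

function classifyFailure(
  history: RunRecord[],
  testId: string,
  commitSha: string,
): Verdict {
  const sameCommit = history.filter(
    (r) => r.testId === testId && r.commitSha === commitSha,
  );
  const passes = sameCommit.filter((r) => r.passed).length;
  const failures = sameCommit.length - passes;
  if (failures === 0) return "insufficient_data"; // nothing to classify
  if (passes > 0) return "likely_flaky"; // same code, different outcomes
  // Repeated failures with no pass on this commit point to a real defect.
  return failures >= 2 ? "likely_defect" : "insufficient_data";
}
```

A single failure with no other runs stays `insufficient_data`, which is the case where a retry with extra diagnostics is most useful.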

**5.8 Runner Management**
Register, monitor, and configure self-hosted agent runners used by CI jobs across environments.
Key elements:
- Runner registration status
- Capacity/queue depth indicators
- Environment mapping (CI labels to runner groups)
- Security/compliance mode (local LLM, no external egress)
- Version of agent executor and health checks

### Key functions
- **Configure smoke gate policy** *[on: Project Settings]*
  - Trigger: User opens Project Settings and saves smoke/full suite rules
  - Creates two-tier execution policy (PR smoke blocking, nightly/full regression) with deterministic suite selection.
- **Set mock third-party mode** *[on: Project Settings]*
  - Trigger: User toggles 'Mock dependencies on PRs' and saves
  - Routes PR runs through mocked third-party dependencies while preserving real integrations for staging/nightly allowlisted jobs.
- **Provision per-run isolated dataset** *[on: Dataset Manager]*
  - Trigger: Smoke gate run starts; system provisions run-scoped data
  - Allocates unique dataset/test data IDs or per-run schemas so parallel CI builds never share state.
- **Pin dataset snapshot for deterministic replay** *[on: Dataset Manager]*
  - Trigger: User selects a dataset version; smoke gate uses it for execution
  - Pins immutable dataset versions for each run (restore before execution) to make failures reproducible without manual data fiddling.
- **Execute natural-language browser steps** *[on: Run Console]*
  - Trigger: Smoke gate runner receives a step list for the targeted flow
  - Runs an adaptive in-browser agent that executes natural-language steps and tolerates selector/layout drift during smoke flows.
- **Run local-or-locally-ish replay in CI-mirroring container** *[on: Run Console]*
  - Trigger: User clicks 'Replay locally-ish' from a failed run
  - Reproduces the run using the same container/image and environment mapping as CI to speed up iteration and reduce “works on my machine” drift.
- **Capture expanded failure artifacts** *[on: Run Console]*
  - Trigger: On debug run or failure; listener expands logs
  - Automatically records evidence (screenshots/HTML/network logs) and attaches them to the run for fast diagnosis and evidence bundling.
- **Classify failure as flaky vs likely defect** *[on: Flake Triage & Analytics]*
  - Trigger: Run completes; analytics computes classification
  - Uses historical pass/fail trends plus artifacts to label likely flakes and reduce manual reruns.
- **Retry with diagnostics bundle** *[on: Run Console]*
  - Trigger: User clicks 'Retry with extra diagnostics' on blocked run
  - Re-executes the smoke flow with enhanced evidence capture and deterministic dataset replay, producing an updated evidence bundle.
- **Block promotion on smoke gate failure** *[on: Run Console]*
  - Trigger: CI job attempts to promote; gate policy evaluates result
  - Enforces a hard stop: smoke gate failure prevents staging/release promotion rather than providing advisory-only reporting.
- **Run hardware-blocking validation stage** *[on: Hardware Validation]*
  - Trigger: Gate policy requires on-device stage before promotion
  - Executes on-device regression/inference checks on a physical device pool and returns pass/fail as a PR check.
- **Generate signed evidence bundle** *[on: Evidence Bundle]*
  - Trigger: User clicks 'Generate signed bundle' after a hardware or debug run
  - Creates a signed, audit-ready evidence bundle that can be shared externally or retained for compliance.
- **Register and monitor self-hosted runner** *[on: Runner Management]*
  - Trigger: Admin clicks 'Add runner' and confirms runner label in CI
  - Creates a runner group and ties GitHub Actions jobs to that self-hosted executor pool.

### UX details
- **Promotion gating logic:** Treat smoke gate outcomes as blocking signals for promotion (never advisory-only).
- **Suite selection UX:** Show “Smoke (PR)” vs “Nightly (full)” as separate, explicit cards and prevent mixing policies in the UI.
- **Failure diagnosis:** On failure, default-expand the Evidence section to show logs first (not just pass/fail) to avoid manual reruns to confirm reproduction.
- **Local replay workflow:** Offer “Replay in CI-mirroring container” directly from the failed step, using the same container/image as CI for deterministic re-execution.
- **Dataset determinism:** Always display the run’s dataset pin (version/id) in the header so every failure is reproducible without copying local data.
- **Integration coverage visibility:** Render an explicit badge on the run timeline: “Mocks enabled (PR)” vs “Real integrations (staging/nightly)” to prevent false confidence.
- **Hardware validation:** Place hardware stage results at the top of the run console when configured, since it is the final blocking gate before staging/release.
- **Artifact triage:** Automatically attach evidence artifacts and make them the primary input to AI/agent triage for labeling flaky vs defects.

## 6. Monetization
**Model:** subscription

### Suggested pricing tiers
**Starter** — $39/month — *Mid-market QA engineer*
- PR smoke gate (small suite) with blocking failures
- Evidence bundles (screenshots + HTML + logs) as artifacts
- Basic LLM agent with configurable step lists
- Community runner + limited parallelism

**Pro** — $65/month — *Product engineering team*
- Adaptive drift-tolerant smoke agent (selector/layout tolerance)
- Self-hosted/local LLM mode
- Database isolation helpers + run-scoped dataset controls
- Nightly/full regression scheduling + smoke/full split policies

**Enterprise** — $149/month — *Enterprise platform org*
- On-device hardware validation stage (blocking gate) with diagnostics
- Signed evidence bundles for audit trails
- Advanced triage automation + optional PR fix suggestions
- Dedicated runners, higher parallelism, SSO

## 7. Competitors to beat
| Name | Why it fails | Price | Mentions |
|---|---|---|---|
| Playwright | Recommended in one thread for speed/scaling, but failures elsewhere center on flakiness/maintenance; the source threads do not claim Playwright fully resolves E2E reliability. | - | 5 |
| Selenium | Mentioned as an option; the source threads do not describe a solution that satisfies the local/self-hosted adaptive agent requirement or eliminates CI flakiness. | - | 3 |
| DVC / LakeFS / Delta Lake / Iceberg for versioning datasets | The source threads do not say these fail; they are suggested as a way to keep tests deterministic when data changes. | - | 3 |
| ReportPortal (reports/trends + analytics for flaky tests) | The source threads do not claim it fails; it is suggested as an effective setup for triage with historical data. | - | 3 |
| AI CI log healer GitHub Action that posts AI-generated fix suggestions as PR comments | No failure claim in the source threads; users are only asked about production value and what features are needed. | - | 2 |
| On-device validation stage running on real hardware (blocking before staging) | Not described as failing; instead it addresses the gap where emulation/cloud tests miss hardware-specific behavior. | - | 2 |
| TestComplete (desktop testing) / Ranorex (desktop testing) | No explicit failure is cited for these; they are suggested as alternatives to Power Automate for brittle selectors/object recognition. | - | 2 |
| TestComplete scripting flexibility in JavaScript + object handling | No failure described; includes a pitfall that JavaScript and object typing can make debugging harder. | - | 2 |

## 8. Distribution
- Reddit
- SEO
- X (Twitter)
- Cold email
- Top subreddits to launch in: r/devops, r/softwaretesting

## 9. Users & roles
**Primary persona:** CI engineer running unreliable E2E smoke gates
**Secondary personas:**
- SRE/QA engineer responsible for flake reduction
- DevOps engineer managing CI orchestration

**Roles:**
- **Repo Admin** — Create smoke gate policies, configure datasets/mocking rules, and control promotion gating for a repo.
- **CI Maintainer** — Trigger runs, view evidence bundles, and rerun deterministically in the locally-mirrored runner.
- **Viewer** — View dashboards, run consoles, and exported evidence bundles for failed runs.

## 10. Data model & integrations
- (no data model extracted)

## 11. States
**Empty state:** Dashboard shows no configured smoke gate yet and prompts to create a Project Settings policy.
**Error state:** Run console shows failed smoke step, and Evidence Bundle lists missing/collected artifacts for triage.

## 12. Analytics & metrics
- (not synthesized for this report)

## 13. Risks & open questions
- (no risks/questions extracted)

## 14. Post-launch
- See https://painfinder.app/reports/unreliable-end-to-end-smoke-tests-in-ci-cd for DM-able hot leads (workarounds × buying intent).
- See https://painfinder.app/reports/unreliable-end-to-end-smoke-tests-in-ci-cd for verified key quotes you can use as landing copy.

## 15. Suggested build order (3-week MVP cut)
- Week 1: §3 must-haves + §5 page 1.
- Week 2: §5 remaining pages + auth/persistence if needed.
- Week 3: §6 monetization wiring + analytics + launch checklist.

## 16. Setup hints (your stack overrides these)
- `pnpm create next-app . --typescript --tailwind --app`
- `npx shadcn@latest init`
- The agent SHOULD ask the user before committing to a stack.

## 17. How to use this file
You're an AI coding agent reading this in AGENTS.md. Your job:
1. Confirm the stack with the user (their preferences override this file).
2. Scaffold an MVP covering §3 + §5 page-1 first.
3. Defer §6 (monetization) and §14 (post-launch) until §3 ships and works.
4. Re-fetch the live PRD anytime via:
   curl "https://painfinder.app/api/public/reports/unreliable-end-to-end-smoke-tests-in-ci-cd/export.json?size=compact"

## 18. Verbatim key quotes (top 10)
> "Every week, our team has to manually click through a dozen or so test scenarios on our application across various client environments."  
> — E2E automation approach, post #19261

> "Write the test scenarios in plain natural language as a simple list of steps"  
> — E2E automation approach, post #19261

> "Feed these steps into an AI Agent that can interact with the browser, analyze the page in real-time, and execute the steps."  
> — AI agent browser testing, post #19261

> "Crucial requirement: The tool must not just generate Playwright/Selenium code."  
> — AI agent browser testing, post #19261

> "I need an agent that actively navigates the page and can adapt on the fly if the layout or CSS selectors change slightly, as long as the underlying logic remains the same."  
> — AI agent browser testing, post #19261

> "Privacy is a must: Because we deal with sensitive client data, I need to power this agent using a local, self-hosted LLM (running via Ollama, vLLM, etc.) so no data leaves our infrastructure."  
> — Compliance and privacy, post #19261

> "Has anyone built a similar pipeline?"  
> — Value, buy vs build, post #19261

> "What I want to achieve:"  
> — Uncategorized, post #19261

> "So many false positives."  
> — Flaky test reduction, post #19248

> "How are teams efficiently automating regression or E2E testing?"  
> — E2E automation approach, post #19248

## 19. Manual workarounds users cobble together (top 15)
1. **Failure triage / flaky-test detection tooling with historical analytics** — *Manually re-run failing tests to verify reproduction and distinguish flaky tests from real defects.*
   > "I end up re-running tests manually just to check if the failure reproduces."
2. **Workflow automation for triage classification** — *Manual confirmation loop with QA/product after a failure reproduces.*
   > "When it does, I still go back and forth with manual QA or product to confirm whether it's a known behavior or a real bug."
3. **Dataset versioning/controlled test data environments for changing inputs** — *Save local copies of changing data to keep test runs consistent.*
   > "Right now I’m saving local copies to keep things consistent, but that doesn’t scale once more people start touching the same data."
4. **N/A** — *Not a manual workaround; included here only as no DIY process is described. (No manual workaround extracted.)*
   > "i remember seeing kualitatem mentioned somewhere while reading similar discussions, made me think more about structuring tests better"

## 20. "I would pay for…" quotes (top 10)
1. **would_pay** — wants: A paid no-code (or low-code) E2E testing tool that works without developers spending weeks learning Selenium.
   > "Would rather pay for a tool than have developers spend weeks learning selenium if there's a faster option."
2. **wishing** — wants: A locally self-hosted adaptive browser automation tool for E2E smoke tests.
   > "I need an agent that actively navigates the page and can adapt on the fly"
3. **wishing** — wants: A solution that runs local/self-hosted LLMs and does not send sensitive data externally.
   > "Privacy is a must"
4. **wishing** — wants: Local deterministic testing runtime for cloud + LLM workflows without calling real services.
   > "I’d actually use this."
5. **wishing** — wants: AI action that auto-suggests fixes for failed CI/CD pipelines (value inquiry)
   > "Curious what the DevOps community thinks. Would this be valuable for your team?"

## 21. Hot leads summary
- 6 hot leads identified (users who BOTH built a workaround AND signaled buying intent)
- Tier breakdown: 1 hot / 2 warm / 3 cold
- DM-able usernames available at: https://painfinder.app/reports/unreliable-end-to-end-smoke-tests-in-ci-cd#hot-leads (kept off this file for privacy — see live report)

## 22. Full competitor list (top 10)
| Name | Why it fails | Price | Mentions |
|---|---|---|---|
| Playwright | Recommended in one thread for speed/scaling, but failures elsewhere center on flakiness/maintenance; the source threads do not claim Playwright fully resolves E2E reliability. | - | 5 |
| Selenium | Mentioned as an option; the source threads do not describe a solution that satisfies the local/self-hosted adaptive agent requirement or eliminates CI flakiness. | - | 3 |
| DVC / LakeFS / Delta Lake / Iceberg for versioning datasets | The source threads do not say these fail; they are suggested as a way to keep tests deterministic when data changes. | - | 3 |
| ReportPortal (reports/trends + analytics for flaky tests) | The source threads do not claim it fails; it is suggested as an effective setup for triage with historical data. | - | 3 |
| AI CI log healer GitHub Action that posts AI-generated fix suggestions as PR comments | No failure claim in the source threads; users are only asked about production value and what features are needed. | - | 2 |
| On-device validation stage running on real hardware (blocking before staging) | Not described as failing; instead it addresses the gap where emulation/cloud tests miss hardware-specific behavior. | - | 2 |
| TestComplete (desktop testing) / Ranorex (desktop testing) | No explicit failure is cited for these; they are suggested as alternatives to Power Automate for brittle selectors/object recognition. | - | 2 |
| TestComplete scripting flexibility in JavaScript + object handling | No failure described; includes a pitfall that JavaScript and object typing can make debugging harder. | - | 2 |
| Browser-Use | Referenced only as a framework the original poster had heard of; the source threads provide no evidence of local/self-hosted reliability for E2E smoke testing or of whether it avoids maintenance. | - | 1 |
| Containerized database for CI | Mentioned as an option to consider; no explicit tool is named beyond the approach. | - | 1 |

## 23. Where this conversation lives (top subreddits)
- r/devops (18 posts)
- r/softwaretesting (17 posts)
