The Safety Net: Why E2E Tests Matter More When AI Writes Your Code

Last week, I asked Claude Code to add a new field to the leave request form in Equestrian Venue Manager. “Add a notes field so staff can explain why they need time off.”

Claude Code updated the Pydantic model. Modified the SQLAlchemy schema. Added the field to the form. Generated an Alembic migration. Updated the API endpoint. Everything looked good.

I deployed it. The leave request form loaded. Staff could add notes. Perfect.

Except I broke the leave approval workflow. Completely. Managers couldn’t approve requests anymore - the approval button did nothing. The JavaScript event handler was looking for form data that no longer existed because Claude Code had refactored the form structure.

I discovered this three hours later when a manager texted: “I can’t approve Alice’s holiday request. She’s booked flights. Help.”

This is the problem with AI-assisted development in production: changes cascade in unexpected ways. And when people’s paychecks depend on your code, “oops, didn’t catch that” isn’t acceptable.

The Illusion of Correctness

Here’s what happened with that leave request change:

What I asked for: Add a notes field

What got changed:

Database model (added column)
API endpoint (new field validation)
Form template (new textarea)
Form submission (restructured data format)
JavaScript handler (broke existing event listener)

The last one was invisible. The code compiled. The tests I had written passed (they only tested the API, not the full workflow). The form submitted successfully. But the approval flow silently broke because the client-side JavaScript expected data in the old format.

Traditional development? I might have caught this because I’d be manually changing each file, thinking through dependencies. With AI-assisted development, Claude Code changed five files in seconds. I reviewed the diff, saw that each change made sense individually, and approved it.

The integration failure wasn’t visible in the diff. It only appeared when a user tried to approve a request.
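Here is that failure mode in miniature. This is a Python stand-in, not the actual EVM code: the function names and the exact payload shapes are hypothetical, and approve_handler only mimics the client-side JavaScript that still read the old flat format.

```python
"""Miniature of an integration break: each piece is fine alone, the pair is not.

All names and payload shapes are hypothetical stand-ins, not code from EVM.
"""


def render_form_data_old(request_id: int) -> dict:
    # Old format: flat keys, which the approval handler was written against.
    return {"request_id": request_id, "status": "pending"}


def render_form_data_new(request_id: int, notes: str) -> dict:
    # Refactored format: fields nested under "leave_request", plus the new notes.
    return {"leave_request": {"id": request_id, "status": "pending", "notes": notes}}


def approve_handler(form_data: dict) -> str:
    # Mimics the untouched client-side handler: it still reads the old flat key.
    request_id = form_data.get("request_id")
    if request_id is None:
        return "nothing happens"  # the button silently does nothing
    return f"approved {request_id}"


if __name__ == "__main__":
    new_data = render_form_data_new(7, "Family wedding")
    # A unit-style check on the producer alone passes...
    assert new_data["leave_request"]["notes"] == "Family wedding"
    # ...while the path through the handler only works with the old format.
    print(approve_handler(render_form_data_old(7)))
    print(approve_handler(new_data))
```

Neither function is wrong in isolation, which is why a per-file diff review and API-only tests both pass. The break only shows up when the new producer is paired with the old consumer.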

This is the AI development paradox: changes happen faster, but breaking changes are harder to spot.

The User’s Perspective

After the leave approval incident, I realized my testing strategy was wrong.

I had unit tests. They verified individual functions worked correctly. API endpoints returned the right status codes. Database queries executed. Alembic migrations applied cleanly. All good.

But I didn’t have tests that answered: “Can a manager approve a leave request?”

Not “does the API endpoint accept a POST request with the right fields.” Not “does the database update when you call the approval function.” But literally: Can a user open the leave approval page, click the approve button, and have it work?

That’s what E2E tests do. They test from the user’s perspective. Not “does this function work?” but “does this workflow work?”

When AI generates code that touches multiple files across backend, frontend, and database, unit tests aren’t enough. You need tests that verify the whole system still does what users expect.

Enter Playwright

I added Playwright for E2E testing. Not to replace unit tests - to complement them.

Here’s the test that would have caught the leave approval bug:

from playwright.sync_api import Page, expect


def test_manager_can_approve_leave_request(page: Page):
    # Relative URLs assume a base URL is configured (e.g. pytest --base-url)
    # Log in as a manager
    page.goto("/login")
    page.fill("#username", "[email protected]")
    page.fill("#password", "password")
    page.click("#login-button")

    # Navigate to leave requests
    page.goto("/leave/requests")

    # Open the first pending request and approve it
    page.locator(".leave-request").first.click()
    page.click("#approve-button")

    # Verify approval succeeded from the user's point of view
    expect(page.locator(".success-message")).to_be_visible()
    expect(page.locator("#leave-status")).to_have_text("Approved")

This test doesn’t care about API internals. It doesn’t care about database schemas. It does what a manager does: login, navigate, click approve, verify success.

If Claude Code changes the form structure and breaks the JavaScript handler, this test fails. Because it’s testing the behavior users depend on, not the implementation details.

The Testing Pyramid for AI-Assisted Development

Traditional testing pyramid: lots of unit tests, some integration tests, few E2E tests.

AI-assisted development testing pyramid: still lots of unit tests, but E2E tests become critical.

Why? Because AI changes are:

Fast - multiple files modified simultaneously
Comprehensive - touches frontend, backend, database
Subtle - can break integration points you didn’t think about

Unit tests verify pieces work. E2E tests verify pieces still work together after AI refactoring.

Here’s what I test with Playwright in EVM:

Critical User Workflows:

Staff can clock in/out for timesheets
Managers can approve leave requests
Admins can generate payroll exports
Livery owners can request services
Staff can view their holiday balance

Not Implementation Details:

API response formats
Database query structures
Form field names

If a user workflow breaks, Playwright catches it. Before deployment. Before real users see it.

The CI Pipeline That Caught Everything

The leave approval bug made me rethink the entire deployment pipeline. Now, GitHub Actions runs this sequence on every push:

1. Unit Tests (Parallel)

- name: Run Backend Tests
  run: pytest backend/tests --cov

- name: Run Frontend Tests
  run: pytest frontend/tests --cov

Fast feedback. Verify individual functions work. Takes 2 minutes.

2. Build Container Images

- name: Build Images
  run: |
    docker build -t evm-backend:test ./backend
    docker build -t evm-frontend:test ./frontend

Ensure the code actually builds. Catches dependency issues. Takes 3 minutes.

3. Start Full Stack

- name: Start Services
  run: docker compose up -d

- name: Wait for Health Checks
  run: ./scripts/wait-for-healthy.sh

Spin up the full application: database, cache, backend, frontend. Exactly like production. Takes 30 seconds.
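The health-check wait is the step that keeps E2E tests from racing a half-started stack. My script is a shell script, but the idea can be sketched in Python. Everything here is an assumption for illustration: the function names, the timeouts, and the idea that each service exposes an HTTP endpoint that returns 2xx once it is ready.

```python
"""Poll health endpoints until every service answers, or give up.

A hypothetical Python equivalent of a wait-for-healthy script; the names,
timeouts, and the 2xx-means-ready convention are illustrative assumptions.
"""
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_ok(url: str) -> bool:
    """Return True if the URL answers with a 2xx status within 2 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_healthy(
    urls: Iterable[str],
    timeout: float = 60.0,
    interval: float = 1.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Return True once every URL passes the probe, False if the timeout expires."""
    pending = set(urls)
    deadline = time.monotonic() + timeout
    while pending:
        # Keep only the services that still fail the probe.
        pending = {url for url in pending if not probe(url)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A CI step could call wait_until_healthy with the backend and frontend URLs and exit non-zero when it returns False, so the E2E stage never starts against a stack that isn't up. The probe parameter exists so the loop can be tested without a network.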

4. Run E2E Tests

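- name: Run Playwright Tests
  # The tests are Python, so they run through pytest (pytest-playwright);
  # -n 4 needs pytest-xdist, and the e2e/ path is illustrative.
  run: pytest e2e/ -n 4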

Test critical user workflows against the running application. This is where integration failures appear. Takes 5 minutes.

5. Push Images (if all tests pass)

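- name: Push to GHCR
  if: success()
  run: |
    # Retag the locally built :test images before pushing
    docker tag evm-backend:test ghcr.io/user/evm-backend:latest
    docker tag evm-frontend:test ghcr.io/user/evm-frontend:latest
    docker push ghcr.io/user/evm-backend:latest
    docker push ghcr.io/user/evm-frontend:latest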

Only push images if everything passes. Unit tests, builds, E2E tests.
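GitHub Actions skips the remaining steps once one fails, and that is what gates the push. The same fail-fast ordering can be reproduced locally before pushing. This is a rough sketch of a hypothetical helper, not something from the EVM repo; run_pipeline and the Stage type are names I made up for illustration.

```python
"""Run pipeline stages in order and stop at the first failure.

A hypothetical local helper that mirrors the gating in CI; the stage names
and commands are whatever the caller passes in.
"""
import subprocess
from typing import Optional, Sequence, Tuple

Stage = Tuple[str, Sequence[str]]


def run_pipeline(stages: Sequence[Stage]) -> Optional[str]:
    """Run each stage's command; return the first failing stage's name, or None."""
    for name, command in stages:
        print(f"==> {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            return name  # later stages (including any push) never run
    return None
```

For example, run_pipeline([("unit tests", ["pytest", "backend/tests", "--cov"]), ("build", ["docker", "build", "-t", "evm-backend:test", "./backend"])]) returns None only when both commands exit with status 0, and otherwise returns the name of the stage that failed.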

Total time: ~12 minutes. Cost of catching the leave approval bug before production: $0. Cost of breaking payroll because approvals didn’t work: priceless.

Alembic Migrations: The Unsung Hero

While we’re talking about safety nets, Alembic database migrations deserve mention. They’ve been game-changing for EVM.

Before Alembic: “I need to add a column. Let me write SQL. Hope I don’t typo. Hope I remember to run it in production.”

With Alembic: Claude Code generates migrations automatically. They’re versioned. Tested in CI. Applied automatically on deployment.

Example: adding the leave request notes field.

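# Claude Code generated this migration
from alembic import op
import sqlalchemy as sa

def upgrade():
    # Nullable, so existing leave requests stay valid without a backfill
    op.add_column('leave_requests',
        sa.Column('notes', sa.Text(), nullable=True))

def downgrade():
    # Dropping the column discards any notes already stored
    op.drop_column('leave_requests', 'notes')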

The migration is code. It’s in git. It’s reviewed. It’s tested. And critically: it’s reversible. If the deployment breaks, alembic downgrade -1 rolls back the most recent schema change (though any data already written to the dropped column goes with it).

When AI is changing your database schema, having versioned, tested, reversible migrations is essential. You can’t just ALTER TABLE manually and hope it works.
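"Tested" for a migration mostly means checking the round trip: upgrade, downgrade, and confirm the schema and existing rows come back intact. The sketch below shows that check using SQLite from the standard library as a stand-in. It does not use Alembic itself: upgrade() and downgrade() are hand-written equivalents of the migration's two functions, and the staff_name column is an assumption since the real leave_requests schema isn't shown here.

```python
"""Round-trip check for a reversible migration, using SQLite as a stand-in.

This does not call Alembic: upgrade() and downgrade() are hand-written
stand-ins, and every leave_requests column other than notes is an assumption.
"""
import sqlite3


def columns(conn: sqlite3.Connection, table: str) -> list:
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def upgrade(conn: sqlite3.Connection) -> None:
    conn.execute("ALTER TABLE leave_requests ADD COLUMN notes TEXT")


def downgrade(conn: sqlite3.Connection) -> None:
    # Rebuild the table without notes; this also works on SQLite versions
    # that lack ALTER TABLE ... DROP COLUMN.
    conn.execute("CREATE TABLE leave_requests_old (id INTEGER PRIMARY KEY, staff_name TEXT)")
    conn.execute("INSERT INTO leave_requests_old SELECT id, staff_name FROM leave_requests")
    conn.execute("DROP TABLE leave_requests")
    conn.execute("ALTER TABLE leave_requests_old RENAME TO leave_requests")


def round_trip() -> tuple:
    """Upgrade then downgrade an in-memory database and report what was seen."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE leave_requests (id INTEGER PRIMARY KEY, staff_name TEXT)")
    conn.execute("INSERT INTO leave_requests (staff_name) VALUES ('Alice')")
    before = columns(conn, "leave_requests")
    upgrade(conn)
    during = columns(conn, "leave_requests")
    downgrade(conn)
    after = columns(conn, "leave_requests")
    rows = conn.execute("SELECT staff_name FROM leave_requests").fetchall()
    conn.close()
    return before, during, after, rows
```

A test would assert that the columns after the downgrade match the columns before the upgrade and that the existing row survived. With Alembic in CI, the same idea would be to upgrade a scratch database to head, downgrade one revision, and upgrade again.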

What Production Taught Me

EVM has been in production for a month. Real users. Real timesheets. Real paychecks calculated from those timesheets. Here’s what that month taught me about AI-assisted development:

1. AI Code Gen is Fast, But Consequences Are Also Fast

Claude Code can refactor an entire feature in minutes. If that refactoring breaks something, users discover it immediately. E2E tests are your early warning system.

2. “It Compiles” Doesn’t Mean “It Works”

Type checking catches type errors. Unit tests catch logic errors. Only E2E tests catch integration errors - when pieces that individually work don’t work together.

3. The User’s Perspective is Truth

You can have 100% unit test coverage and still ship broken features. If the E2E test for “manager approves leave request” passes, the path it exercises works. If it fails, it’s broken. Simple.

4. CI Pipeline = Confidence

The 12-minute CI pipeline means I can deploy EVM changes without anxiety. If CI passes, the workflows I’ve covered work. If CI fails, something broke, and I know before users do.

5. Database Migrations Need the Same Rigor as Code

Alembic migrations are code. They need tests. They need review. They need CI. Treating schema changes as scripts you run manually is asking for production disasters.

The Practical Result

Since adding Playwright E2E tests and the full CI pipeline, I’ve:

Caught Before Production:

3 broken form submissions
2 authentication redirect loops
1 payroll calculation error
The leave approval bug that started this story
Countless integration failures

Deployed to Production:

15+ feature additions
Dozens of bug fixes
Multiple database schema changes
Zero user-reported breakages

The yard owners don’t know about Playwright. They don’t care about CI pipelines. They know that EVM works. Timesheets calculate correctly. Leave requests get approved. Payroll exports every month. That’s all that matters.

But I know why it works: because E2E tests catch what unit tests miss. Because the CI pipeline ensures every change is tested the same way, every time. Because Alembic makes database changes safe and reversible.

The Cost of Not Testing

That leave approval bug? Three hours of broken functionality before I noticed. One frustrated manager. One stressed staff member worried about booked flights. One emergency fix deployed outside normal hours.

Total impact: low, because I caught it fast and fixed it fast. But imagine if payroll had been broken instead. Or if timesheets hadn’t recorded hours correctly. Or if the entire leave system had crashed on Friday afternoon before a holiday weekend.

When people’s livelihoods depend on your code, “move fast and break things” isn’t an option. But “move slowly and test manually” kills productivity.

E2E tests + CI pipelines let you move fast and break nothing. Or at least, break nothing that reaches users.

What This Means for AI-Assisted Development

AI makes writing code faster. That’s transformative. But it also makes breaking code faster.

Traditional development: you change one file, you know what might break. AI development: Claude Code changes five files, and the breaking change might be in file six that you didn’t touch.

The solution isn’t to stop using AI. The solution is to test differently.

Unit tests verify pieces work individually. E2E tests verify pieces work together after AI refactoring. CI pipelines ensure every change gets both. Alembic migrations make database changes as safe as code changes.

Together, they form a safety net. When Claude Code generates a change that cascades unexpectedly, the tests catch it. Before production. Before users. Before paychecks are affected.

The Bottom Line

I asked Claude Code to add a notes field to a form. It broke the approval workflow. Playwright tests now catch that before deployment. That’s the story.

But the lesson is bigger: AI-assisted development is incredibly powerful, and that power comes with responsibility. When you’re moving 10x faster, you need safety nets that work at 10x speed too.

E2E tests aren’t optional anymore. They’re essential. Not just for peace of mind - for production confidence.

EVM processes real payroll. For real people. Who have rent and mortgages and bills. The stakes are real. The testing has to be too.

Playwright + CI + Alembic = the safety net that makes AI-assisted development safe in production. That’s not theory. That’s lived experience from the past month.

And when that manager texts “I can’t approve this leave request,” I can confidently reply: “That should work - let me check if it’s a browser issue” instead of panicking about what Claude Code might have accidentally broken.

That confidence is worth the 12-minute CI pipeline. Every single time.

Technical notes:

Playwright: E2E testing framework for web applications
Alembic: Database migration tool for SQLAlchemy
GitHub Actions: CI/CD platform for automated testing and deployment
Test coverage: Unit tests (~85%), E2E tests (critical workflows only)
CI time: ~12 minutes per push
Deployment confidence: High enough to ship daily

The code is tested. The workflows are verified. The users are happy. That’s what matters.