Workflows Load Test Plan
Overview
This plan defines functional and performance tests for the Alli Workflows API. It covers smoke, load, spike, soak, and stress tests across all endpoints that trigger workflow runs: webhook, API, scheduler, and on-demand (Run Now).
Test Scenarios
Workflow Triggers
Scenario | Endpoint(s) | Expected Status |
---|---|---|
Webhook | POST | 201 |
API | POST /orgs/{org_id}/projects/{project_id}/fleets/{fleet_id}/fleetruns | 200 |
Run Now | Browser “Run Now” Button | 200 |
Non-triggering APIs
Scenario | Endpoint | Expected Status |
---|---|---|
List Workflows | GET /api/orgs/{org_id}/projects/{project_id}/fleets | 200 |
List Workflow Runs | GET /api/orgs/{org_id}/projects/{project_id}/fleets/{workflow_id} | 200 |
List Workflow Output | GET | 200 |
List Template Library | GET | 200 |
List Workflow Versions | GET /api/orgs/{org_id}/projects/{project_id}/fleets/{workflow_id}/versions | 200 |
Upsert Workflow | PUT /api/orgs/{org_id}/projects/{project_id}/fleets | 200 |
Task Run Lead Time
Scenarios | Success |
---|---|
Task runs trigger in a timely manner | < 1 min |
Baselines (will have better numbers next week)
Triggering APIs
Endpoint | Rate (Past Week) | Latency p50 | Latency p95 |
---|---|---|---|
On Demand (UI) | 0.0012 req/s | 0.82s | 20.2s |
Webhook | 0.00030 req/s | 3.34s | 4.34s |
API | TBD | TBD | TBD |
Non-triggering APIs
Endpoint | Rate (Past Week) | Latency p50 | Latency p95 |
---|---|---|---|
List Workflows | 0.0022 req/s | 1.56s | 35.4s |
List Workflow Runs | 0.0075 req/s | 0.18s | 0.89s |
List Workflow Output | 0.00074 req/s | 0.2s | 1.91s |
List Workflow Versions | 0.00012 req/s | 0.67s | 6.6s |
Upsert Fleet | 0.0013 req/s | 0.64s | 2.68s |
List Template Library | 0.0022 req/s | 2.79s | 5.91s |
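The baseline rates above are small fractions of a request per second, which can be hard to reason about. As a rough aid, the sketch below converts a few of them into approximate weekly request counts. It assumes each rate is an average over a full seven-day window; the function and variable names are illustrative only.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def requests_per_week(rate_per_second: float) -> float:
    """Convert an average req/s rate into an approximate weekly request count."""
    return rate_per_second * SECONDS_PER_WEEK


# Rates copied from the baseline tables in this plan (req/s, past week).
baseline_rates = {
    "On Demand (UI)": 0.0012,
    "Webhook": 0.00030,
    "List Workflows": 0.0022,
    "List Workflow Runs": 0.0075,
}

# Rounded to whole requests; assumes the rate is a seven-day average.
weekly_counts = {
    name: round(requests_per_week(rate)) for name, rate in baseline_rates.items()
}

for name, count in weekly_counts.items():
    print(f"{name}: ~{count} requests/week")
```

Under that assumption, even the busiest endpoint listed here sees only a few thousand requests per week, so percentage-based load targets translate into very low absolute request rates.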
Task Run Lead Time
Rate | Latency p50 | Latency p95 |
---|---|---|
TBD | TBD | TBD |
Testing Success Thresholds
Load Testing
Triggering APIs
Endpoint | Rate | Latency p50 | Latency p95 |
---|---|---|---|
On Demand (UI) | TBD | TBD | TBD |
Webhook | TBD | TBD | TBD |
API | TBD | TBD | TBD |
Non-triggering APIs
Endpoint | Rate | Latency p50 | Latency p95 |
---|---|---|---|
List Workflows | TBD | TBD | TBD |
List Workflow Runs | TBD | TBD | TBD |
List Workflow Output | TBD | TBD | TBD |
List Workflow Versions | TBD | TBD | TBD |
Upsert Fleet | TBD | TBD | TBD |
List Template Library | TBD | TBD | TBD |
Task Run Lead Time
Rate | Latency p50 | Latency p95 |
---|---|---|
TBD | 5s | 10s |
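One way to evaluate the task run lead time thresholds above is sketched below. It is a minimal example, assuming lead time samples are collected in seconds, that a nearest-rank percentile is acceptable, and that a run passes when each percentile is at or below its target. The helper names and the sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_lead_time(samples_s, p50_target_s=5.0, p95_target_s=10.0):
    """Compare observed lead times (seconds) with the p50/p95 targets from the table."""
    p50 = percentile(samples_s, 50)
    p95 = percentile(samples_s, 95)
    passed = p50 <= p50_target_s and p95 <= p95_target_s
    return {"p50": p50, "p95": p95, "passed": passed}


# Made-up lead times in seconds, for illustration only.
lead_times = [1.2, 2.0, 2.5, 3.1, 3.8, 4.0, 4.4, 6.0, 7.5, 12.0]
result = check_lead_time(lead_times)
print(result)
```

With this made-up sample the p50 is within target, but the single 12-second outlier pushes the p95 over 10s, so the check fails.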
Spike Testing
Triggering APIs
Endpoint | Rate | Latency p50 | Latency p95 |
---|---|---|---|
On Demand (UI) | TBD | TBD | TBD |
Webhook | TBD | TBD | TBD |
API | TBD | TBD | TBD |
Non-triggering APIs
Endpoint | Rate | Latency p50 | Latency p95 |
---|---|---|---|
List Workflows | TBD | TBD | TBD |
List Workflow Runs | TBD | TBD | TBD |
List Workflow Output | TBD | TBD | TBD |
List Workflow Versions | TBD | TBD | TBD |
Upsert Fleet | TBD | TBD | TBD |
List Template Library | TBD | TBD | TBD |
Task Run Lead Time
Rate | Latency p50 | Latency p95 |
---|---|---|
TBD | TBD | TBD |
Soak Testing
Triggering APIs
Endpoint | Rate | Latency p50 | Latency p95 |
---|---|---|---|
On Demand (UI) | TBD | TBD | TBD |
Webhook | TBD | TBD | TBD |
API | TBD | TBD | TBD |
Non-triggering APIs
Endpoint | Rate | Latency p50 | Latency p95 |
---|---|---|---|
List Workflows | TBD | TBD | TBD |
List Workflow Runs | TBD | TBD | TBD |
List Workflow Output | TBD | TBD | TBD |
List Workflow Versions | TBD | TBD | TBD |
Upsert Fleet | TBD | TBD | TBD |
List Template Library | TBD | TBD | TBD |
Task Run Lead Time
Rate | Latency p50 | Latency p95 |
---|---|---|
TBD | 5s | 10s |
Test phases (run in this order)
Baseline & smoke (15–30 min)
Run at low RPS; verify that dashboards, logs, and tracing work and that the k6 scripts and fixtures run correctly.
Load test (steady & stepped)
Start at 120% of baseline and add 10% every 10m until reaching 200%. We assume expected load is 200% of baseline.
Spike test
Ramp from 0 to 125% of baseline in 1m, then hold for 5m.
Soak test (2 hours)
100% of baseline for 2 hours.
Stress to failure
Keep increasing until an SLO is violated; document the practical max and bottlenecks.
Chaos & failure drills (On Site)
Kill a node, make a single AZ unreachable, slow the DB, throttle SQS, stop a worker service; verify graceful degradation.
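The load, spike, and soak phases above can be written down as explicit stage schedules, which may help when translating them into k6 stages. The sketch below is one possible encoding, not the harness itself. It assumes the load test holds each step (including the final 200% step) for the full 10 minutes, and the `Stage` type and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    target_pct: int     # load as a percentage of baseline
    minutes: int        # stage duration in minutes
    ramp: bool = False  # True: ramp up to the target; False: hold at the target


def stepped_load_stages(start_pct=120, end_pct=200, step_pct=10, step_minutes=10):
    """Load test: start at start_pct and add step_pct every step_minutes up to end_pct."""
    return [Stage(pct, step_minutes) for pct in range(start_pct, end_pct + 1, step_pct)]


LOAD = stepped_load_stages()
SPIKE = [Stage(125, 1, ramp=True), Stage(125, 5)]  # 0 -> 125% in 1m, then hold 5m
SOAK = [Stage(100, 120)]                           # 100% of baseline for 2 hours


def total_minutes(stages):
    """Total wall-clock duration of a schedule in minutes."""
    return sum(stage.minutes for stage in stages)


def target_rps(baseline_rps, stage):
    """Absolute request rate for a stage, given a baseline in req/s."""
    return baseline_rps * stage.target_pct / 100


for name, stages in (("load", LOAD), ("spike", SPIKE), ("soak", SOAK)):
    print(name, total_minutes(stages), "min", [s.target_pct for s in stages])
```

Under these assumptions the stepped load test has nine steps and runs for 90 minutes, which is worth accounting for when scheduling the full sequence of phases.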
Infra SLOs
TBD
Task Run Lead Times
Run task runs during load and peak.
Include taskrunner lead time
Monitor ECS CPU
Combine with Jon’s UDA k6 testing
Execution Steps
Run baseline; confirm the harness is working.
Load run
Capture latency, errors, ECS CPU/memory, and queue metrics
Spike
Soak
Stress to Failure
Summarize vs targets; adjust scaling if needed
Dashboards & Monitors
Dashboard: requests, latency by type, queues, ECS, SLO tiles
Monitors:
Task run lead time
Oldest job >(x)m warn; >(y)m alert
ECS CPU p95 >x% warn; >y% alert
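The warn/alert pattern used by the monitors above can be illustrated with a small helper. This is only a sketch of the two-threshold logic, assuming strict "greater than" comparisons as written in the monitor definitions. The function name is hypothetical, and the example thresholds are arbitrary stand-ins because the plan's x and y values are not yet decided.

```python
def classify(value: float, warn_above: float, alert_above: float) -> str:
    """Return 'ok', 'warn', or 'alert' for a metric using two upper thresholds."""
    if warn_above > alert_above:
        raise ValueError("warn threshold must not exceed alert threshold")
    if value > alert_above:
        return "alert"
    if value > warn_above:
        return "warn"
    return "ok"


# Arbitrary illustration values only; the plan's real x and y are still undecided.
EXAMPLE_WARN, EXAMPLE_ALERT = 70.0, 85.0

states = [classify(v, EXAMPLE_WARN, EXAMPLE_ALERT) for v in (50.0, 75.0, 90.0)]
print(states)
```

The same helper applies to the oldest-job age monitor by passing minutes instead of CPU percentages.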