Workflows Load Test Plan

Overview

This plan defines functional and performance tests for the Alli Workflows API. It covers smoke, load, spike, soak, and stress tests for the endpoints that trigger workflow runs (webhook, API, scheduler, and on-demand Run Now) and for the non-triggering APIs listed below.

Test Scenarios

Workflow Triggers

| Scenario | Endpoint(s) | Expected |
| --- | --- | --- |
| Webhook | POST /webhookID/secret | 201 |
| API | POST /orgs/{org_id}/projects/{project_id}/fleets/{fleet_id}/fleetruns | 200 |
| Run Now | Browser “Run Now” button: POST /orgs/{org_id}/projects/{project_id}/fleets/{fleet_id}/fleetruns via k6 browser | 200 |
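A load script needs concrete URLs for these routes. The following is a minimal sketch of filling the path templates and checking the expected status. The helper names (`fillPath`, `isExpected`) and the example IDs are hypothetical and not part of the Workflows API.

```javascript
// Trigger scenarios copied from the table above. The webhook path is kept
// exactly as written in the plan (it has no {placeholders}).
const TRIGGERS = [
  { name: 'Webhook', method: 'POST', path: '/webhookID/secret', expected: 201 },
  {
    name: 'API',
    method: 'POST',
    path: '/orgs/{org_id}/projects/{project_id}/fleets/{fleet_id}/fleetruns',
    expected: 200,
  },
];

// Replace each {placeholder} with the matching value from params.
// Throws if a placeholder has no value, so a broken fixture fails early.
function fillPath(template, params) {
  return template.replace(/\{(\w+)\}/g, (_, key) => {
    if (!Object.prototype.hasOwnProperty.call(params, key)) {
      throw new Error(`missing path parameter: ${key}`);
    }
    return encodeURIComponent(String(params[key]));
  });
}

// True when the response status matches the scenario's expected status.
function isExpected(scenario, status) {
  return status === scenario.expected;
}

// Example with made-up IDs:
const examplePath = fillPath(TRIGGERS[1].path, {
  org_id: 'o1',
  project_id: 'p1',
  fleet_id: 'f1',
});
```

Keeping the templates in one table-like structure means the k6 script and this plan can be compared line by line.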


Non-triggering APIs

| Scenario | Endpoint | Expected |
| --- | --- | --- |
| List Workflows | GET /api/orgs/{org_id}/projects/{project_id}/fleets | 200 |
| List Workflow Runs | GET /api/orgs/{org_id}/projects/{project_id}/fleets/{workflow_id} | 200 |
| List Workflow Output | GET /api/orgs/{org_id}/projects/{project_id}/fleets/{workflow_id}/fleetruns | 200 |
| List Template Library | GET /api/orgs/{org_id}/blueprintLibrary | 200 |
| List Workflow Versions | GET /api/orgs/{org_id}/projects/{project_id}/fleets/{workflow_id}/versions | 200 |
| Upsert Workflow | PUT /api/orgs/{org_id}/projects/{workflows_id}/fleets | 200 |

Taskrun Lead Time

| Scenario | Success |
| --- | --- |
| Task runs trigger in a timely manner | < 1 min |
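One way to evaluate the lead-time criterion is to compute it from run timestamps. This is a sketch only: the field names `triggeredAt` and `startedAt` are hypothetical, and applying the 1-minute limit to every run (rather than to a percentile) is an assumption.

```javascript
// Nearest-rank percentile over a list of numbers (p between 0 and 100).
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Lead time in seconds: from the trigger to the start of the task run.
function leadTimeSeconds(run) {
  return (Date.parse(run.startedAt) - Date.parse(run.triggeredAt)) / 1000;
}

// Summarize lead times and check them against the limit (60 s by default).
function leadTimeReport(runs, limitSeconds = 60) {
  const times = runs.map(leadTimeSeconds);
  return {
    p50: percentile(times, 50),
    p95: percentile(times, 95),
    withinLimit: times.every((t) => t < limitSeconds),
  };
}
```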

Baselines (will have better numbers next week)

Triggering APIs

| Endpoint | Rate (Past Week) | Latency p50 | Latency p95 |
| --- | --- | --- | --- |
| On Demand (UI) | 0.0012 req/s | 0.82s | 20.2s |
| Webhook | 0.00030 req/s | 3.34s | 4.34s |
| API | TBD | TBD | TBD |
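The baseline rates are small enough that per-second figures are hard to read. The sketch below converts them to per-hour and per-week counts and scales them to a percentage of baseline. The function names are illustrative, and rounding targets up to whole requests is an assumption.

```javascript
const SECONDS_PER_HOUR = 3600;
const SECONDS_PER_WEEK = 7 * 24 * 3600;

// Convert a rate in requests per second to requests per hour.
function toPerHour(reqPerSec) {
  return reqPerSec * SECONDS_PER_HOUR;
}

// Convert a rate in requests per second to requests per week.
function toPerWeek(reqPerSec) {
  return reqPerSec * SECONDS_PER_WEEK;
}

// Scale a baseline rate to a percentage of baseline and round up to a
// whole number of requests per hour (assumes the load tool needs integers).
function targetPerHour(reqPerSec, percent) {
  return Math.ceil((toPerHour(reqPerSec) * percent) / 100);
}
```

For the On Demand (UI) baseline, 0.0012 req/s works out to roughly 4.3 requests per hour, or about 726 per week. At rates this low, a per-hour time unit is easier to work with than per-second when configuring an arrival-rate load generator.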

Non-Triggering APIs

| Endpoint | Rate (Past Week) | Latency p50 | Latency p95 |
| --- | --- | --- | --- |
| List Workflows | 0.0022 req/s | 1.56s | 35.4s |
| List Workflow Runs | 0.0075 req/s | 0.18s | 0.89s |
| List Workflow Output | 0.00074 req/s | 0.2s | 1.91s |
| List Workflow Versions | 0.00012 req/s | 0.67s | 6.6s |
| Upsert Fleet | 0.0013 req/s | 0.64s | 2.68s |
| List Template Library | 0.0022 req/s | 2.79s | 5.91s |

Task Run Lead Time

| Rate | Latency p50 | Latency p95 |
| --- | --- | --- |
| TBD | TBD | TBD |


Testing Success Thresholds

Load Testing

Triggering APIs

| Endpoint | Rate | Latency p50 | Latency p95 |
| --- | --- | --- | --- |
| On Demand (UI) | TBD | TBD | TBD |
| Webhook | TBD | TBD | TBD |
| API | TBD | TBD | TBD |

Non-triggering APIs

| Endpoint | Rate | Latency p50 | Latency p95 |
| --- | --- | --- | --- |
| List Workflows | TBD | TBD | TBD |
| List Workflow Runs | TBD | TBD | TBD |
| List Workflow Output | TBD | TBD | TBD |
| List Workflow Versions | TBD | TBD | TBD |
| Upsert Fleet | TBD | TBD | TBD |
| List Template Library | TBD | TBD | TBD |

Task Run Lead Time

| Rate | Latency p50 | Latency p95 |
| --- | --- | --- |
| TBD | 5s | 10s |
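Targets like these can be expressed as k6 threshold strings (for example `p(95)<10000`). The sketch below builds a thresholds object from a targets table and skips TBD entries. The metric names `taskrun_lead_time` and `list_workflows_duration` are hypothetical custom metrics, and recording them in milliseconds is an assumption.

```javascript
// Targets in seconds; null stands for TBD in the tables above.
const LOAD_TARGETS = {
  taskrun_lead_time: { p50Seconds: 5, p95Seconds: 10 },
  list_workflows_duration: { p50Seconds: null, p95Seconds: null }, // TBD
};

// Build a k6-style thresholds object: { metricName: ['p(50)<ms', 'p(95)<ms'] }.
// Metrics whose targets are all TBD are left out.
function toThresholds(targets) {
  const thresholds = {};
  for (const [metric, t] of Object.entries(targets)) {
    const rules = [];
    if (t.p50Seconds != null) rules.push(`p(50)<${t.p50Seconds * 1000}`);
    if (t.p95Seconds != null) rules.push(`p(95)<${t.p95Seconds * 1000}`);
    if (rules.length > 0) thresholds[metric] = rules;
  }
  return thresholds;
}

const loadThresholds = toThresholds(LOAD_TARGETS);
```

In a k6 script the result would be passed as `options.thresholds`, so a run fails when a target is missed.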


Spike Testing

Triggering APIs

| Endpoint | Rate | Latency p50 | Latency p95 |
| --- | --- | --- | --- |
| On Demand (UI) | TBD | TBD | TBD |
| Webhook | TBD | TBD | TBD |
| API | TBD | TBD | TBD |

Non-triggering APIs

| Endpoint | Rate | Latency p50 | Latency p95 |
| --- | --- | --- | --- |
| List Workflows | TBD | TBD | TBD |
| List Workflow Runs | TBD | TBD | TBD |
| List Workflow Output | TBD | TBD | TBD |
| List Workflow Versions | TBD | TBD | TBD |
| Upsert Fleet | TBD | TBD | TBD |
| List Template Library | TBD | TBD | TBD |

Task Run Lead Time

| Rate | Latency p50 | Latency p95 |
| --- | --- | --- |
| TBD | TBD | TBD |


Soak Testing

Triggering APIs

| Endpoint | Rate | Latency p50 | Latency p95 |
| --- | --- | --- | --- |
| On Demand (UI) | TBD | TBD | TBD |
| Webhook | TBD | TBD | TBD |
| API | TBD | TBD | TBD |

Non-triggering APIs

| Endpoint | Rate | Latency p50 | Latency p95 |
| --- | --- | --- | --- |
| List Workflows | TBD | TBD | TBD |
| List Workflow Runs | TBD | TBD | TBD |
| List Workflow Output | TBD | TBD | TBD |
| List Workflow Versions | TBD | TBD | TBD |
| Upsert Fleet | TBD | TBD | TBD |
| List Template Library | TBD | TBD | TBD |

Task Run Lead Time

| Rate | Latency p50 | Latency p95 |
| --- | --- | --- |
| TBD | 5s | 10s |


Test phases (run in this order)

  1. Baseline & smoke (15–30 min)

    • Low RPS; verify dashboards, logs, and tracing, and that k6 scripts & fixtures work.

  2. Load test (steady & stepped)

    • Start at 120% of baseline and add 10% every 10 minutes until reaching 200%. The assumed target load is 200% of baseline.

  3. Spike test

    • Ramp from 0 to 125% of baseline in 1 minute, then hold for 5 minutes.

  4. Soak test (2 hours)

    • 100% of baseline for 2 hours.

  5. Stress to failure

    • Keep increasing until an SLO is violated; document the practical max and bottlenecks.

  6. Chaos & failure drills (On Site)

    • Kill a node, make a single AZ unreachable, slow the DB, throttle SQS, stop a worker service; verify graceful degradation.
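The load, spike, and soak profiles above can be sketched as k6-style stage lists of `{ duration, target }`. This is an approximation under stated assumptions: targets are percentages of baseline that still need scaling to a real rate, the function names are illustrative, and k6 ramps linearly toward each stage's target, so the load profile rises gradually rather than in hard steps.

```javascript
// Load: 120% to 200% of baseline, +10% per 10-minute stage.
function loadStages(startPct = 120, endPct = 200, stepPct = 10, stepMinutes = 10) {
  const stages = [];
  for (let pct = startPct; pct <= endPct; pct += stepPct) {
    stages.push({ duration: `${stepMinutes}m`, target: pct });
  }
  return stages;
}

// Spike: ramp to the peak in 1 minute, then hold it for 5 minutes.
function spikeStages(peakPct = 125, rampMinutes = 1, holdMinutes = 5) {
  return [
    { duration: `${rampMinutes}m`, target: peakPct },
    { duration: `${holdMinutes}m`, target: peakPct },
  ];
}

// Soak: 100% of baseline for 2 hours. This is a flat hold only if the
// start rate is also set to the target.
function soakStages(pct = 100, hours = 2) {
  return [{ duration: `${hours}h`, target: pct }];
}

// Total length of a stage list in minutes (supports 'Nm' and 'Nh' only).
function totalMinutes(stages) {
  return stages.reduce((sum, stage) => {
    const match = /^(\d+)([mh])$/.exec(stage.duration);
    if (!match) throw new Error(`unsupported duration: ${stage.duration}`);
    return sum + Number(match[1]) * (match[2] === 'h' ? 60 : 1);
  }, 0);
}
```

With the defaults, the load profile has nine stages and runs for 90 minutes, the spike runs for 6 minutes, and the soak for 120 minutes.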

Infra SLOs

TBD


Task Run Lead Times

Trigger task runs during the load phases and at peak load.

  • Include taskrunner lead time

  • Monitor ECS CPU

  • Combine with Jon’s UDA K6 testing


Execution Steps

  1. Run the baseline and confirm the harness is working.

  2. Load run

  3. Capture latency, errors, ECS CPU/memory, and queue metrics

  4. Spike

  5. Soak

  6. Stress to Failure

  7. Summarize vs targets; adjust scaling if needed
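The final step, comparing results against targets, could be sketched as below. The result and target shapes (`{ p50, p95 }` in seconds keyed by scenario name) and the sample measurements are assumptions made for illustration, not output from a real run.

```javascript
// Compare measured latencies with targets, one row per target.
// status is 'pass', 'fail', or 'missing' (no measurement captured).
function summarize(measured, targets) {
  return Object.keys(targets).map((name) => {
    const m = measured[name];
    const t = targets[name];
    if (!m) return { name, status: 'missing' };
    const pass = m.p50 <= t.p50 && m.p95 <= t.p95;
    return {
      name,
      status: pass ? 'pass' : 'fail',
      p95OverBy: Math.max(0, m.p95 - t.p95), // seconds above the p95 target
    };
  });
}

// Example: the 5s / 10s lead-time target against a made-up measurement.
const exampleSummary = summarize(
  { 'Task Run Lead Time': { p50: 4, p95: 12 } },
  { 'Task Run Lead Time': { p50: 5, p95: 10 }, Webhook: { p50: 1, p95: 2 } },
);
```

Rows marked `fail` or `missing` are the candidates for scaling adjustments or a rerun.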


Dashboards & Monitors

Dashboard: requests, latency by type, queues, ECS, SLO tiles

Monitors:

  • Taskrun lead time

  • Oldest job >(x)m warn; >(y)m alert

  • ECS CPU p95 >x% warn; >y% alert
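The warn/alert pairs above share one pattern: a value is compared against two increasing thresholds. A minimal sketch follows; `monitorLevel` is an illustrative name, and the x and y values remain placeholders to be chosen.

```javascript
// Classify a metric value against warn and alert thresholds.
// Returns 'ok', 'warn', or 'alert'.
function monitorLevel(value, warnAt, alertAt) {
  if (warnAt > alertAt) throw new Error('warnAt must not exceed alertAt');
  if (value > alertAt) return 'alert';
  if (value > warnAt) return 'warn';
  return 'ok';
}
```

For example, with placeholder thresholds of 70 and 85 for ECS CPU p95, a reading of 78 classifies as `warn`.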
