Metric Definitions
What each metric measures, how it's collected, and what the thresholds mean.
API Endpoint
Data source: HTTP GET to https://api.hellocubly.com/api/v1/health invoked from an AWS Lambda function in us-east-2. Response time includes the Lambda-to-ALB network hop (~50–200 ms overhead).
Note: A direct VPS measurement reads 54–85 ms. Lambda overhead is intentional and kept consistent across all checks.
| Metric | Thresholds | Notes |
|---|---|---|
| Response Time | <1000 ms / 1000–3000 ms / >3000 ms | End-to-end latency from Lambda invocation to HTTP response. Includes ~50–200 ms Lambda overhead. |
| Avg Response (5 checks) | Same as Response Time | Rolling average of the last 5 Lambda invocations. Smooths transient spikes. |
| Error Rate | 0 errors / Any errors | Count of 4xx + 5xx responses from the ALB in the last 5 minutes, via CloudWatch. |
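A minimal sketch of how the Response Time thresholds and the 5-check rolling average could be applied. The status labels (`ok`, `warning`, `critical`), the function and class names, and the handling of the exact boundary values are assumptions for illustration, not the dashboard's actual code.

```python
from collections import deque
from statistics import mean

# Thresholds from the Response Time row, in milliseconds.
WARN_MS = 1000
CRIT_MS = 3000


def classify_response_time(ms):
    """Map a latency in ms to an assumed status label."""
    if ms < WARN_MS:
        return "ok"
    if ms <= CRIT_MS:
        return "warning"
    return "critical"


class RollingAverage:
    """Keeps the most recent `size` samples and reports their mean."""

    def __init__(self, size=5):
        # deque with maxlen drops the oldest sample automatically.
        self._samples = deque(maxlen=size)

    def add(self, ms):
        self._samples.append(ms)
        return mean(self._samples)
```

Because the average only spans five samples, a single slow check shifts it by one fifth of the spike, which is what smooths transient outliers.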
ECS Service (Cubly API)
Data source: AWS ECS DescribeServices, which reports the running vs desired task count from the AWS control plane. CPU/Memory from CloudWatch CPUUtilization / MemoryUtilization (average, last 5 minutes).
| Metric | Thresholds | Notes |
|---|---|---|
| Running Tasks | running ≥ desired / running > 0 but < desired / running = 0 | Number of ECS tasks currently running vs the service's desired count. |
| CPU Usage | <80% / 80–95% / >95% | CloudWatch ECSService/CPUUtilization, averaged over the last 5 minutes. |
| Memory Usage | <85% / 85–95% / >95% | CloudWatch ECSService/MemoryUtilization, averaged over the last 5 minutes. |
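The ECS thresholds can be sketched as two small functions. The names and status labels are hypothetical. Rules are checked in table order, so a service with a desired count of 0 and 0 running tasks is treated as `ok`; that tie-break is an assumption, since the table does not cover it.

```python
# CPU and memory limits from the table: (warning starts at, critical above).
CPU_LIMITS = (80.0, 95.0)
MEMORY_LIMITS = (85.0, 95.0)


def classify_running_tasks(running, desired):
    """Compare running vs desired task counts, in table order."""
    if running >= desired:
        return "ok"
    if running > 0:
        return "warning"
    return "critical"


def classify_utilization(percent, warn, crit):
    """Classify a 5-minute average utilization percentage."""
    if percent < warn:
        return "ok"
    if percent <= crit:
        return "warning"
    return "critical"
```

For example, `classify_utilization(82.0, *CPU_LIMITS)` falls in the warning band for CPU, while the same value is still below the memory warning limit.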
Load Balancer (ALB)
Data source: CloudWatch ApplicationELB metrics, 5-minute window.
| Metric | Thresholds | Notes |
|---|---|---|
| Healthy Hosts | ≥ 1 / 0 | Minimum healthy target count behind the ALB. Zero means no traffic is being served. |
| Requests (5m) | Informational | Total request count through the ALB in the last 5 minutes. No alert threshold; used for traffic visibility. |
| 4xx Errors | 0 / >0 (worth investigating) | Client errors (bad requests, auth failures, not found). Some 4xx is expected; elevated rates may indicate a client issue or misconfiguration. |
| 5xx Errors | 0 / >0 (immediate attention) | Server-side errors from ECS tasks or the ALB itself. Any 5xx requires investigation. |
Database (RDS Aurora PostgreSQL)
Data source: AWS RDS DescribeDBClusters for cluster health, plus CloudWatch RDS metrics (average or max, last 5 minutes).
| Metric | Thresholds | Notes |
|---|---|---|
| Cluster Status | available / other | AWS-reported cluster status from DescribeDBClusters. Any state other than available is flagged red. |
| DB Connections | <80 / 80–150 / >150 | Average active database connections over the last 5 minutes. High counts can indicate connection leaks or traffic surges. |
| ACU Utilization | <75 ACU / 75–90 ACU / >90 ACU | Aurora Serverless v2 capacity units in use. Write operations may pause when ACU reaches the configured maximum. Monitor closely above 75. |
| Replica Lag | <100 ms / 100–1000 ms / >1000 ms | Replication lag to the read replica, max over the last 5 minutes. Above 1 s, read replicas may return stale data. |
Cache (ElastiCache Redis)
Data source: AWS ElastiCache DescribeCacheClusters for node health, plus CloudWatch ElastiCache metrics over the last 5 minutes.
| Metric | Thresholds | Notes |
|---|---|---|
| Status | available / other | AWS-reported node health. Any status other than available is flagged red. |
| Cache Hit Rate | >80% / 60–80% / <60% | CacheHits / (CacheHits + CacheMisses), last 5 min. Shown as N/A when there's no traffic to measure. |
| Evictions | 0 / >0 (memory pressure) | Keys evicted due to memory pressure in the last 5 minutes. Evictions can cause silent session loss; investigate if nonzero. |
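The hit-rate formula and its N/A case can be illustrated with a short sketch. Function names and status labels are assumptions; only the formula and thresholds come from the table.

```python
def cache_hit_rate(hits, misses):
    """Return the hit rate as a percentage, or None when there is no traffic."""
    total = hits + misses
    if total == 0:
        # No lookups in the window, so the ratio is undefined.
        return None
    return 100.0 * hits / total


def format_hit_rate(rate):
    """Render the rate for display, using N/A for the no-traffic case."""
    if rate is None:
        return "N/A"
    return f"{rate:.1f}%"


def classify_hit_rate(rate):
    """Apply the table's thresholds; higher is better for this metric."""
    if rate is None:
        return "n/a"
    if rate > 80:
        return "ok"
    if rate >= 60:
        return "warning"
    return "critical"
```

Returning `None` rather than 0 keeps an idle cache from being reported as a cache with a 0% hit rate.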
Authentication (Cognito)
Data source: AWS Cognito DescribeUserPool, covering pool existence, configuration, and registered user count.
| Metric | Thresholds | Notes |
|---|---|---|
| Status | active / error | Confirms the user pool is reachable and correctly configured. Red if DescribeUserPool returns an error. |
| User Count | Informational | Total registered users in the pool. No alert threshold; used for growth tracking. |
CI/CD (GitHub Actions)
Data source: GitHub API. The last 5 workflow runs on the main branch, fetched via GET /repos/{owner}/{repo}/actions/runs.
| Metric | Thresholds | Notes |
|---|---|---|
| Workflow Runs | Informational | Each entry shows: workflow name, conclusion (success / failure / in_progress), branch, and timestamp. Failures are surfaced individually, with no aggregate threshold. Investigate any failure conclusion. |
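As a rough sketch, the request URL and the per-run entries could be built as follows. The helper names are hypothetical, and the HTTP call itself is left out. The sketch assumes the endpoint's `branch` and `per_page` query parameters are used to select the runs. Because GitHub reports `conclusion` as null until a run finishes, it falls back to the run's `status` (for example `in_progress`).

```python
from urllib.parse import urlencode


def runs_url(owner, repo, branch="main", count=5):
    """Build the workflow-runs URL for one repository."""
    query = urlencode({"branch": branch, "per_page": count})
    return f"https://api.github.com/repos/{owner}/{repo}/actions/runs?{query}"


def summarize_runs(payload):
    """Reduce the API response to the fields shown on the dashboard."""
    rows = []
    for run in payload.get("workflow_runs", []):
        rows.append({
            "workflow": run.get("name"),
            # Unfinished runs have no conclusion yet, so show their status.
            "result": run.get("conclusion") or run.get("status"),
            "branch": run.get("head_branch"),
            "timestamp": run.get("created_at"),
        })
    return rows


def failed_runs(rows):
    """Select the entries that need investigation."""
    return [row for row in rows if row["result"] == "failure"]
```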
Data Freshness
Update mechanism: Dashboard data is collected and written by an updater service on the VPS, which a systemd timer triggers every 30 seconds. The JSON payload is uploaded to S3 and served via CloudFront.
| Metric | Thresholds | Notes |
|---|---|---|
| Last Updated | <5 minutes old / >5 minutes old | Timestamp of when data was last fetched by the systemd timer. If data is more than 5 minutes old, the updater service may be down or stuck. |
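The freshness check can be sketched as a comparison between the payload timestamp and the current time. This assumes the timestamp is an ISO 8601 string with an explicit UTC offset; the actual field format is not specified here, and the function name is hypothetical. Data that is exactly 5 minutes old is treated as fresh.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=5)


def is_stale(last_updated, now=None):
    """Return True when the payload is more than 5 minutes old."""
    fetched = datetime.fromisoformat(last_updated)
    if fetched.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        fetched = fetched.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched > MAX_AGE
```

With a 30-second update interval, a stale result means roughly ten consecutive updates were missed, so it is unlikely to be triggered by a single slow run.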