Cubly | Infrastructure Definitions

Metric Definitions

What each metric measures, how it's collected, and what the thresholds mean.

🌐 API Endpoint
Data source: HTTP GET to https://api.hellocubly.com/api/v1/health invoked from an AWS Lambda function in us-east-2. Response time includes the Lambda-to-ALB network hop (~50–200 ms overhead).
Note: A direct VPS measurement reads 54–85 ms. Lambda overhead is intentional and kept consistent across all checks.
Metric | Thresholds | Notes
Response Time | <1000 ms / 1000–3000 ms / >3000 ms | End-to-end latency from Lambda invocation to HTTP response. Includes ~50–200 ms Lambda overhead.
Avg Response (5 checks) | Same as Response Time | Rolling average of the last 5 Lambda invocations. Smooths transient spikes.
Error Rate | 0 errors / Any errors | Count of 4xx + 5xx responses from the ALB in the last 5 minutes, via CloudWatch.
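The rolling average and the response-time bands can be sketched as follows. This is a minimal illustration and not the dashboard's actual code: the band labels ("ok", "warn", "critical") and all function and class names are assumptions; only the 1000 ms and 3000 ms boundaries and the window of 5 checks come from the table above.

```python
from collections import deque
from statistics import fmean

WARN_MS = 1000.0      # documented lower edge of the middle band
CRITICAL_MS = 3000.0  # documented upper edge of the middle band


def classify_latency(ms: float) -> str:
    """Map a latency in ms to a band. The label names are hypothetical."""
    if ms < WARN_MS:
        return "ok"
    if ms <= CRITICAL_MS:
        return "warn"
    return "critical"


class RollingLatency:
    """Keeps the most recent `size` samples and reports their mean."""

    def __init__(self, size: int = 5) -> None:
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._samples = deque(maxlen=size)

    def add(self, ms: float) -> float:
        self._samples.append(ms)
        return fmean(self._samples)


window = RollingLatency()
for sample in (120.0, 140.0, 3500.0, 130.0, 110.0):
    average = window.add(sample)

# A single 3500 ms spike is critical on its own, but the 5-check
# average (800 ms) stays in the lowest band.
print(classify_latency(3500.0), classify_latency(average))
```

The example shows why both metrics are reported: the single-check value exposes the spike, while the average indicates whether the slowdown is sustained.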
🐳 ECS Service (Cubly API)
Data source: AWS ECS DescribeServices — running vs desired task count pulled from the AWS control plane. CPU/Memory from CloudWatch CPUUtilization / MemoryUtilization (average, last 5 minutes).
Metric | Thresholds | Notes
Running Tasks | running ≥ desired / running > 0 but < desired / running = 0 | Number of ECS tasks currently running vs the service's desired count.
CPU Usage | <80% / 80–95% / >95% | CloudWatch ECSService/CPUUtilization, averaged over the last 5 minutes.
Memory Usage | <85% / 85–95% / >95% | CloudWatch ECSService/MemoryUtilization, averaged over the last 5 minutes.
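As a rough sketch of the Running Tasks rule, the function below takes one service entry shaped like an ECS DescribeServices response, which reports runningCount and desiredCount. The band labels and the function name are assumptions, and so is the choice to treat zero running tasks as the most severe state even when the desired count is also zero; the table does not say how that case is handled.

```python
def classify_running_tasks(service: dict) -> str:
    """Classify one service entry from an ECS DescribeServices response.

    Label names are hypothetical. Zero running tasks is checked first,
    which is an assumption for the desired == 0 edge case.
    """
    running = service["runningCount"]
    desired = service["desiredCount"]
    if running == 0:
        return "critical"
    if running < desired:
        return "warn"
    return "ok"


# Hypothetical service entries for illustration only.
healthy = {"runningCount": 3, "desiredCount": 3}
degraded = {"runningCount": 1, "desiredCount": 3}
down = {"runningCount": 0, "desiredCount": 3}

print([classify_running_tasks(s) for s in (healthy, degraded, down)])
```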
⚖️ Load Balancer (ALB)
Data source: CloudWatch ApplicationELB metrics, 5-minute window.
Metric | Thresholds | Notes
Healthy Hosts | ≥ 1 / 0 | Minimum healthy target count behind the ALB. Zero means no traffic is being served.
Requests (5m) | Informational | Total request count through the ALB in the last 5 minutes. No alert threshold; used for traffic visibility.
4xx Errors | 0 / >0 (worth investigating) | Client errors (bad requests, auth failures, not found). Some 4xx is expected; elevated rates may indicate a client issue or misconfiguration.
5xx Errors | 0 / >0 (immediate attention) | Server-side errors from ECS tasks or the ALB itself. Any 5xx requires investigation.
🗄️ Database (RDS Aurora PostgreSQL)
Data source: AWS RDS DescribeDBClusters for cluster health + CloudWatch RDS metrics (average or max, last 5 minutes).
Metric | Thresholds | Notes
Cluster Status | available / other | AWS-reported cluster status from DescribeDBClusters. Any state other than available is flagged red.
DB Connections | <80 / 80–150 / >150 | Average active database connections over the last 5 minutes. High counts can indicate connection leaks or traffic surges.
ACU Utilization | <75 ACU / 75–90 ACU / >90 ACU | Aurora Serverless v2 capacity units in use. Write operations may pause when ACU reaches the configured maximum. Monitor closely above 75.
Replica Lag | <100 ms / 100–1000 ms / >1000 ms | Replication lag to the read replica, max over the last 5 minutes. Above 1 s, the read replica may return stale data.
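Because the three numeric database metrics share the same three-band shape, one way to express them is a small lookup table. The sketch below is illustrative only: the metric keys, the Band type, and the "ok"/"warn" labels are assumptions. The numeric boundaries come from the table above, as does the rule that any cluster status other than available is treated as red.

```python
from typing import NamedTuple


class Band(NamedTuple):
    warn_from: float       # first value of the middle band
    critical_above: float  # last value of the middle band


# Keys are hypothetical names; boundaries are the documented ones.
DB_BANDS = {
    "db_connections": Band(80, 150),
    "acu": Band(75, 90),
    "replica_lag_ms": Band(100, 1000),
}


def classify_db_metric(name: str, value: float) -> str:
    band = DB_BANDS[name]
    if value < band.warn_from:
        return "ok"
    if value <= band.critical_above:
        return "warn"
    return "critical"


def classify_cluster_status(status: str) -> str:
    # Anything other than "available" is flagged red per the table.
    return "ok" if status == "available" else "critical"


print(classify_db_metric("replica_lag_ms", 250))
print(classify_cluster_status("modifying"))
```

Keeping the boundaries in one table means a threshold change touches a single line rather than each metric's branch logic.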
Cache (ElastiCache Redis)
Data source: AWS ElastiCache DescribeCacheClusters for node health + CloudWatch ElastiCache metrics, last 5 minutes.
Metric | Thresholds | Notes
Status | available / other | AWS-reported node health. Any status other than available is flagged red.
Cache Hit Rate | >80% / 60–80% / <60% | CacheHits / (CacheHits + CacheMisses), last 5 min. Shown as N/A when there's no traffic to measure.
Evictions | 0 / >0 (memory pressure) | Keys evicted due to memory pressure in the last 5 minutes. Evictions can cause silent session loss; investigate if nonzero.
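The hit-rate formula and its N/A case can be sketched like this. The function names and the use of None to represent "no traffic" are assumptions; the formula itself, CacheHits / (CacheHits + CacheMisses), is the one given in the table.

```python
from typing import Optional


def cache_hit_rate(hits: int, misses: int) -> Optional[float]:
    """Return the hit rate as a percentage, or None when there is no traffic."""
    total = hits + misses
    if total == 0:
        # Avoid dividing by zero; the dashboard shows N/A in this case.
        return None
    return 100.0 * hits / total


def format_hit_rate(rate: Optional[float]) -> str:
    return "N/A" if rate is None else f"{rate:.1f}%"


print(format_hit_rate(cache_hit_rate(90, 10)))
print(format_hit_rate(cache_hit_rate(0, 0)))
```

Returning None instead of 0 keeps an idle cache from being reported as a cache that misses every request.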
🔐 Authentication (Cognito)
Data source: AWS Cognito DescribeUserPool — pool existence, configuration, and registered user count.
Metric | Thresholds | Notes
Status | active / error | Confirms the user pool is reachable and correctly configured. Red if DescribeUserPool returns an error.
User Count | Informational | Total registered users in the pool. No alert threshold; used for growth tracking.
🔄 CI/CD (GitHub Actions)
Data source: GitHub API — last 5 workflow runs on the main branch, fetched via GET /repos/{owner}/{repo}/actions/runs.
Metric | Thresholds | Notes
Workflow Runs | Informational | Each entry shows the workflow name, outcome (a success or failure conclusion, or in_progress while the run has not finished), branch, and timestamp. Failures are surfaced individually; there is no aggregate threshold. Investigate any failure conclusion.
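A minimal sketch of how the workflow-run entries might be derived is shown below. It assumes the branch and per_page query parameters are used to limit the request to the last 5 runs on main, and it reads the workflow_runs, name, status, conclusion, head_branch, and created_at fields of the GitHub response. The function names, the owner and repository values, and the sample payload are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def runs_url(owner: str, repo: str, branch: str = "main", count: int = 5) -> str:
    """Build the list-workflow-runs URL for one branch."""
    query = urlencode({"branch": branch, "per_page": count})
    return f"{API_ROOT}/repos/{owner}/{repo}/actions/runs?{query}"


def summarize_runs(payload: dict) -> list:
    """Reduce an API response to the fields shown on the dashboard."""
    rows = []
    for run in payload.get("workflow_runs", []):
        rows.append({
            "workflow": run.get("name"),
            # conclusion is null until the run completes, so fall back to status.
            "outcome": run.get("conclusion") or run.get("status"),
            "branch": run.get("head_branch"),
            "timestamp": run.get("created_at"),
        })
    return rows


# Hypothetical payload for illustration only.
sample_payload = {
    "workflow_runs": [
        {"name": "deploy", "status": "completed", "conclusion": "success",
         "head_branch": "main", "created_at": "2024-01-01T10:00:00Z"},
        {"name": "tests", "status": "completed", "conclusion": "failure",
         "head_branch": "main", "created_at": "2024-01-01T09:30:00Z"},
        {"name": "deploy", "status": "in_progress", "conclusion": None,
         "head_branch": "main", "created_at": "2024-01-01T10:05:00Z"},
    ]
}

rows = summarize_runs(sample_payload)
failures = [row for row in rows if row["outcome"] == "failure"]
print(runs_url("example-org", "example-repo"))
print([row["workflow"] for row in failures])
```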
🕐 Data Freshness
Update mechanism: Dashboard data is collected and written by a systemd timer running on the VPS, triggered every 30 seconds. The JSON payload is uploaded to S3 and served via CloudFront.
Metric | Thresholds | Notes
Last Updated | <5 minutes old / >5 minutes old | Timestamp of when data was last fetched by the systemd timer. If data is more than 5 minutes old, the updater service may be down or stuck.
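The freshness rule amounts to comparing the payload timestamp with the current time. The sketch below assumes the timestamp is available as a timezone-aware datetime; the function names are hypothetical, and treating an age of exactly 5 minutes as still fresh is an assumption, since the table only defines the cases below and above 5 minutes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(minutes=5)  # documented staleness limit


def data_age(last_updated: datetime, now: Optional[datetime] = None) -> timedelta:
    """Return how long ago the data was fetched."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated


def is_stale(last_updated: datetime, now: Optional[datetime] = None) -> bool:
    # Exactly 5 minutes counts as fresh here; that boundary is an assumption.
    return data_age(last_updated, now) > MAX_AGE


checked_at = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = checked_at - timedelta(seconds=30)
stale = checked_at - timedelta(minutes=6)

print(is_stale(fresh, checked_at), is_stale(stale, checked_at))
```

With a 30-second timer, a healthy updater should keep the age well under a minute, so the 5-minute limit tolerates several missed runs before flagging the data.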