Cubly | Infrastructure Definitions

🌐 API Endpoint

Data source: HTTP GET to https://api.hellocubly.com/api/v1/health invoked from an AWS Lambda function in us-east-2. Response time includes the Lambda-to-ALB network hop (~50–200 ms overhead).
Note: A direct VPS measurement reads 54–85 ms. Lambda overhead is intentional and kept consistent across all checks.

Metric	Thresholds (best to worst)	Notes
Response Time	<1000 ms / 1000–3000 ms / >3000 ms	End-to-end latency from Lambda invocation to HTTP response. Includes ~50–200 ms Lambda overhead.
Avg Response (5 checks)	Same as Response Time	Rolling average of the last 5 Lambda invocations. Smooths transient spikes.
Error Rate	0 errors / any errors	Count of 4xx + 5xx responses from the ALB in the last 5 minutes, via CloudWatch.
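
The two latency rows above can be sketched in a few lines of Python. This is a minimal illustration rather than the dashboard's actual code: the names `classify_latency` and `RollingAverage` and the tier labels `ok` / `warn` / `critical` are assumptions. Only the 1000 ms and 3000 ms boundaries and the five-check window come from the table.

```python
from collections import deque


def classify_latency(ms: float) -> str:
    """Map a latency in milliseconds onto the three bands from the table."""
    if ms < 1000:
        return "ok"        # <1000 ms
    if ms <= 3000:
        return "warn"      # 1000-3000 ms
    return "critical"      # >3000 ms


class RollingAverage:
    """Average of the most recent `size` samples (5 in the table)."""

    def __init__(self, size: int = 5) -> None:
        # A bounded deque drops the oldest sample automatically.
        self._samples = deque(maxlen=size)

    def add(self, ms: float) -> float:
        self._samples.append(ms)
        return sum(self._samples) / len(self._samples)


# Hypothetical samples: one 4000 ms spike among otherwise fast checks.
avg = RollingAverage()
for sample in [120, 140, 4000, 130, 110, 125]:
    latest = avg.add(sample)
```

With these made-up samples the spike on its own lands in the worst band, while the five-check average stays in the best band, which is the smoothing effect the Avg Response row describes.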

🐳 ECS Service (Cubly API)

Data source: AWS ECS DescribeServices — running vs desired task count pulled from the AWS control plane. CPU/Memory from CloudWatch CPUUtilization / MemoryUtilization (average, last 5 minutes).

Metric	Thresholds (best to worst)	Notes
Running Tasks	running ≥ desired / 0 < running < desired / running = 0	Number of ECS tasks currently running vs the service's desired count.
CPU Usage	<80% / 80–95% / >95%	CloudWatch `CPUUtilization` for the ECS service (`AWS/ECS` namespace), averaged over the last 5 minutes.
Memory Usage	<85% / 85–95% / >95%	CloudWatch `MemoryUtilization` for the ECS service (`AWS/ECS` namespace), averaged over the last 5 minutes.
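
As a rough sketch of the Running Tasks row, the comparison can be written against the `runningCount` and `desiredCount` fields that `DescribeServices` returns for each service. The function names, the tier labels, and the sample service name `cubly-api` are assumptions for illustration. The table does not say how a service with a desired count of 0 should be treated; this sketch reports `critical` whenever nothing is running.

```python
def task_tier(running: int, desired: int) -> str:
    """Compare running vs desired task counts using the bands in the table."""
    if running == 0:
        return "critical"   # running = 0
    if running < desired:
        return "warn"       # 0 < running < desired
    return "ok"             # running >= desired


def tier_from_response(response: dict) -> str:
    """Read the first service out of a DescribeServices-shaped response."""
    service = response["services"][0]
    return task_tier(service["runningCount"], service["desiredCount"])


# Hand-written sample in the shape of a DescribeServices response.
sample = {
    "services": [
        {"serviceName": "cubly-api", "runningCount": 1, "desiredCount": 2}
    ]
}
tier = tier_from_response(sample)
```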

⚖️ Load Balancer (ALB)

Data source: CloudWatch ApplicationELB metrics, 5-minute window.

Metric	Thresholds (best to worst)	Notes
Healthy Hosts	≥ 1 / 0	Minimum healthy target count behind the ALB. Zero means no traffic is being served.
Requests (5m)	Informational	Total request count through the ALB in the last 5 minutes. No alert threshold; used for traffic visibility.
4xx Errors	0 / >0 (worth investigating)	Client errors (bad requests, auth failures, not found). Some 4xx responses are expected; elevated rates may indicate a client issue or misconfiguration.
5xx Errors	0 / >0 (immediate attention)	Server-side errors from ECS tasks or the ALB itself. Any 5xx requires investigation.

🗄️ Database (RDS Aurora PostgreSQL)

Data source: AWS RDS DescribeDBClusters for cluster health + CloudWatch RDS metrics (average or max, last 5 minutes).

Metric	Thresholds (best to worst)	Notes
Cluster Status	available / other	AWS-reported cluster status from `DescribeDBClusters`. Any state other than available is flagged red.
DB Connections	<80 / 80–150 / >150	Average active database connections over the last 5 minutes. High counts can indicate connection leaks or traffic surges.
ACU Utilization	<75 ACU / 75–90 ACU / >90 ACU	Aurora Serverless v2 capacity units in use. Write operations may pause when ACU reaches the configured maximum. Monitor closely above 75 ACU.
Replica Lag	<100 ms / 100–1000 ms / >1000 ms	Replication lag to the read replica, maximum over the last 5 minutes. Above 1 s, read replicas may return stale data.
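
The table applies different statistics to different metrics: DB Connections is averaged over the window, while Replica Lag uses the maximum. The sketch below shows why that choice matters. It is illustrative only: the dictionary names, the `evaluate` helper, the tier labels, and the sample datapoints are all assumptions, and only the band edges and the average/max choice come from the table.

```python
from statistics import mean
from typing import Sequence, Tuple

# Band edges from the table: (upper edge of best band, upper edge of middle band).
BANDS = {
    "db_connections": (80, 150),      # connection count
    "replica_lag_ms": (100, 1000),    # milliseconds
}

# Statistic applied to the 5-minute window for each metric.
REDUCERS = {
    "db_connections": mean,
    "replica_lag_ms": max,
}


def evaluate(metric: str, datapoints: Sequence[float]) -> Tuple[float, str]:
    """Reduce a window of datapoints to one value and pick its band."""
    value = REDUCERS[metric](datapoints)
    low, high = BANDS[metric]
    if value < low:
        tier = "ok"
    elif value <= high:
        tier = "warn"
    else:
        tier = "critical"
    return value, tier


# Hypothetical windows: one lag spike, and a steady rise in connections.
lag = evaluate("replica_lag_ms", [40, 1200, 60])
conns = evaluate("db_connections", [70, 90, 110])
```

With these made-up datapoints, the single 1200 ms lag spike puts Replica Lag in the worst band because the maximum is used; averaging the same window would have hidden it in the middle band.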

⚡ Cache (ElastiCache Redis)

Data source: AWS ElastiCache DescribeCacheClusters for node health + CloudWatch ElastiCache metrics, last 5 minutes.

Metric	Thresholds (best to worst)	Notes
Status	available / other	AWS-reported node health. Any status other than available is flagged red.
Cache Hit Rate	>80% / 60–80% / <60%	CacheHits / (CacheHits + CacheMisses), last 5 minutes. Shown as N/A when there is no traffic to measure.
Evictions	0 / >0 (memory pressure)	Keys evicted due to memory pressure in the last 5 minutes. Evictions can cause silent session loss, so investigate any nonzero value.
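
The hit-rate formula and its N/A case can be written out as follows. This is a sketch under assumptions: the function names and tier labels are illustrative, and representing "N/A" as `None` is one possible choice, not something the dashboard is known to do.

```python
from typing import Optional


def cache_hit_rate(hits: int, misses: int) -> Optional[float]:
    """Return CacheHits / (CacheHits + CacheMisses) as a percentage.

    Returns None when there was no traffic, so the caller can show N/A
    instead of dividing by zero.
    """
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total


def classify_hit_rate(rate: Optional[float]) -> str:
    """Map a hit-rate percentage onto the bands from the table."""
    if rate is None:
        return "n/a"
    if rate > 80:
        return "ok"         # >80%
    if rate >= 60:
        return "warn"       # 60-80%
    return "critical"       # <60%
```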

🔐 Authentication (Cognito)

Data source: AWS Cognito DescribeUserPool — pool existence, configuration, and registered user count.

Metric	Thresholds (best to worst)	Notes
Status	active / error	Confirms the user pool is reachable and correctly configured. Red if `DescribeUserPool` returns an error.
User Count	Informational	Total registered users in the pool. No alert threshold; used for growth tracking.

🔄 CI/CD (GitHub Actions)

Data source: GitHub API — last 5 workflow runs on the main branch, fetched via GET /repos/{owner}/{repo}/actions/runs.

Metric	Thresholds	Notes
Workflow Runs	Informational	Each entry shows: workflow name, outcome (a conclusion such as success or failure, or the in_progress status while a run has not finished), branch, and timestamp. Failures are surfaced individually; there is no aggregate threshold. Investigate any failure conclusion.
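
A sketch of how the run list might be reduced to the fields shown in each entry. The helper names and the sample payload (including `example-org` / `example-repo`) are hypothetical; the `workflow_runs` array and the `name`, `status`, `conclusion`, `head_branch`, and `created_at` fields follow the GitHub REST API response for this endpoint, where `conclusion` is null until a run completes.

```python
from urllib.parse import urlencode


def runs_url(owner: str, repo: str, branch: str = "main", per_page: int = 5) -> str:
    """Build the list-workflow-runs URL for one branch."""
    query = urlencode({"branch": branch, "per_page": per_page})
    return f"https://api.github.com/repos/{owner}/{repo}/actions/runs?{query}"


def summarize_runs(payload: dict) -> list:
    """Keep only the fields the dashboard entry displays."""
    rows = []
    for run in payload.get("workflow_runs", []):
        rows.append({
            "workflow": run["name"],
            # conclusion is None while a run is queued or in progress,
            # so fall back to the status field in that case.
            "outcome": run.get("conclusion") or run["status"],
            "branch": run["head_branch"],
            "timestamp": run["created_at"],
        })
    return rows


# Hand-written sample payload, not real data.
sample = {
    "workflow_runs": [
        {"name": "deploy", "status": "in_progress", "conclusion": None,
         "head_branch": "main", "created_at": "2024-01-01T12:05:00Z"},
        {"name": "test", "status": "completed", "conclusion": "failure",
         "head_branch": "main", "created_at": "2024-01-01T12:00:00Z"},
    ]
}
rows = summarize_runs(sample)
failures = [row for row in rows if row["outcome"] == "failure"]
```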

🕐 Data Freshness

Update mechanism: Dashboard data is collected and written by a systemd timer running on the VPS, triggered every 30 seconds. The JSON payload is uploaded to S3 and served via CloudFront.

Metric	Thresholds (best to worst)	Notes
Last Updated	<5 minutes old / >5 minutes old	Time at which the systemd timer last fetched data. If the data is more than 5 minutes old, the updater service may be down or stuck.
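
The staleness rule can be expressed as a single comparison. This is a minimal sketch assuming the payload carries a timezone-aware ISO 8601 timestamp; the function name, the `MAX_AGE` constant name, and the example timestamps are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(minutes=5)  # threshold from the table


def is_stale(last_updated: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when the data is more than 5 minutes old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated > MAX_AGE


# Example timestamps, parsed from ISO 8601 strings with an explicit offset.
fetched = datetime.fromisoformat("2024-01-01T12:00:00+00:00")
fresh_check = datetime.fromisoformat("2024-01-01T12:00:30+00:00")
late_check = datetime.fromisoformat("2024-01-01T12:06:00+00:00")
```

Since the timer fires every 30 seconds, a healthy updater should never come near the 5-minute limit, so a stale result points at the updater rather than at normal jitter.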
