The Three Pillars of Cloud Monitoring
Infrastructure Monitoring
This is the foundation: compute (CPU, memory, disk I/O), storage (throughput, IOPS, latency), and the underlying platform health (hypervisor, container runtime, serverless execution environment). Every cloud provider exposes these natively:
- AWS CloudWatch — metrics for EC2, RDS, EBS, Lambda, plus custom metrics via the CloudWatch agent or StatsD
- Azure Monitor — platform metrics auto-collected for all Azure resources, with Log Analytics workspace for deeper queries (KQL)
- GCP Cloud Monitoring (formerly Stackdriver) — auto-collects metrics for Compute Engine, GKE, Cloud SQL, and Cloud Functions
The trap here is watching averages. A CPU averaging 45% looks healthy, but if it spikes to 98% for 10 seconds every minute, your users are experiencing queuing delays that the average conceals. Always monitor percentiles (p95, p99) for latency and saturation-related metrics.
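The effect is easy to demonstrate. A quick sketch (Python, with illustrative per-second CPU samples; the workload shape is assumed, not measured) shows how a healthy-looking average coexists with a painful p99:

```python
import statistics

# Hypothetical per-second CPU samples over one minute:
# 50 quiet seconds at ~40%, plus a 10-second spike to 98%.
samples = [40.0] * 50 + [98.0] * 10

mean = statistics.mean(samples)                 # looks healthy
p95 = statistics.quantiles(samples, n=100)[94]  # 95th percentile cut point
p99 = statistics.quantiles(samples, n=100)[98]  # 99th percentile cut point

print(f"mean={mean:.1f}%  p95={p95:.1f}%  p99={p99:.1f}%")
# mean=49.7%  p95=98.0%  p99=98.0%
```

The average reports a half-idle machine while the p95 and p99 report a machine that is pegged for a sixth of every minute, which is exactly what your queued users experience.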
Application Performance Monitoring (APM)
APM instruments your code to trace requests end-to-end across microservices, databases, caches, and external APIs. The standard signals are the RED metrics: Request rate, Error rate, and Duration (latency distribution).
OpenTelemetry has become the de facto standard for instrumentation. It is vendor-neutral, supports auto-instrumentation in Java, Python, .NET, Go, Node.js, and more, and exports to any backend — Datadog, Dynatrace, Grafana Tempo, AWS X-Ray, Azure Application Insights, or GCP Cloud Trace. If you are starting fresh in 2026, instrument with OpenTelemetry SDKs and choose your backend separately. This avoids vendor lock-in on the instrumentation layer, which is the hardest part to rip out later.
What matters operationally: distributed traces that let you see that a checkout request spent 12ms in the API gateway, 45ms in the order service, 800ms waiting on a third-party payment API, and 3ms writing to DynamoDB. Without this breakdown, "the checkout is slow" is all you know.
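A trace reduces that question to arithmetic over per-span durations. A minimal sketch (Python), using the hypothetical checkout timings above:

```python
# Hypothetical span durations (ms) from a single distributed trace,
# mirroring the checkout example in the text.
spans = {
    "api-gateway": 12,
    "order-service": 45,
    "payment-api (3rd party)": 800,
    "dynamodb-write": 3,
}

total = sum(spans.values())
bottleneck, worst = max(spans.items(), key=lambda kv: kv[1])
share = worst / total * 100

print(f"total={total}ms, bottleneck={bottleneck} ({share:.0f}% of request time)")
# total=860ms, bottleneck=payment-api (3rd party) (93% of request time)
```

With this breakdown, "the checkout is slow" becomes "93% of checkout latency is the third-party payment API", which is an actionable finding.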
Network Monitoring
Network observability is where most cloud monitoring strategies have a blind spot. Inside a VPC, you rely on flow logs (VPC Flow Logs on both AWS and GCP, NSG Flow Logs on Azure) to see traffic patterns, dropped packets, and cross-AZ/cross-region data transfer volumes.
For hybrid setups — Direct Connect, ExpressRoute, Cloud Interconnect — monitoring the tunnel health, BGP session state, and jitter/packet loss across the link is critical. A degraded Direct Connect circuit won't show up in your application metrics until latency doubles and customers call. This is particularly relevant in India, where many large enterprises maintain private data centres in locations like Noida, Chennai, or Pune connected to ap-south-1 (Mumbai) or ap-south-2 (Hyderabad) via Direct Connect.
Tools like Kentik, ThousandEyes (now part of Cisco), and the native cloud network monitoring services handle this well. If your environment is single-cloud and simple, native tools suffice. Multi-cloud or hybrid? You need a dedicated network observability layer.
Metrics That Actually Matter in Production
Not all metrics deserve an alert. Here is what our NOC prioritises, ranked by operational value:
| Metric | Why It Matters | Alert Threshold Guidance |
|---|---|---|
| p95/p99 Latency | Represents the experience of your slowest (and often most valuable) users | >2× baseline for 5 minutes |
| Error Rate (5xx) | Direct indicator of broken functionality | >0.5% of total requests for 2 minutes |
| Saturation (CPU/Memory/Disk) | Predicts imminent failure before it happens | >85% sustained for 10 minutes |
| Request Rate (RPS) | Sudden drops signal upstream issues or misrouted traffic | >30% deviation from predicted baseline |
| Time to First Byte (TTFB) | User-facing performance proxy, especially for apps serving users across India and beyond | >500ms at CDN edge |
| DNS Resolution Time | Often overlooked; a slow DNS lookup adds latency to every request | >100ms average |
| Replication Lag | For databases (RDS, Cloud SQL, Cosmos DB) — data consistency risk | >5 seconds for transactional workloads |
| Container Restart Count | OOMKilled or CrashLoopBackOff patterns signal resource misconfiguration | >3 restarts in 15 minutes |
The USE method (Utilisation, Saturation, Errors) works well for infrastructure resources. The RED method (Rate, Errors, Duration) works well for services. Use both. They complement each other — USE tells you about the machine, RED tells you about the work the machine is doing.
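The "sustained for N minutes" conditions in the table above are simple to encode. A sketch (Python; the sample values are illustrative) of the evaluation an alerting rule performs:

```python
from collections import deque

def sustained_breach(samples, threshold, window):
    """Return True only if the last `window` samples ALL exceed `threshold`.

    Alerting on sustained breaches (e.g. CPU > 85% for 10 consecutive
    one-minute samples) avoids paging on momentary spikes.
    """
    recent = deque(samples, maxlen=window)
    return len(recent) == window and all(s > threshold for s in recent)

# One spike does not page; ten sustained minutes do.
spiky = [40, 97, 42, 41, 43, 40, 39, 44, 41, 42]
pegged = [88, 90, 91, 89, 92, 95, 93, 90, 88, 91]
print(sustained_breach(spiky, 85, 10))   # False
print(sustained_breach(pegged, 85, 10))  # True
```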
Tooling Comparison: Native vs. Third-Party
Native Cloud Monitoring Tools
| Feature | AWS CloudWatch | Azure Monitor | GCP Cloud Monitoring |
|---|---|---|---|
| Auto-collected metrics | Yes (basic) | Yes (platform metrics) | Yes (basic) |
| Custom metrics | Yes (CloudWatch API / embedded metric format) | Yes (custom metrics API) | Yes (custom metrics API) |
| Log aggregation | CloudWatch Logs / Logs Insights | Log Analytics (KQL) | Cloud Logging |
| Distributed tracing | X-Ray | Application Insights | Cloud Trace |
| Alerting | CloudWatch Alarms + SNS | Action Groups + Logic Apps | Alerting Policies + Pub/Sub |
| Dashboards | CloudWatch Dashboards | Azure Dashboards / Workbooks | Cloud Monitoring Dashboards |
| Cost at scale | Expensive (custom metrics, log ingestion) | Moderate (Log Analytics ingestion pricing) | Moderate |
Opsio's take: Native tools are the right starting point and remain essential for resource-specific metrics that third-party tools cannot collect (e.g., Lambda concurrent executions, Azure Service Bus dead-letter counts). But if you run workloads across two or more providers — which, according to Flexera's State of the Cloud, the vast majority of enterprises now do — you need a cross-cloud layer. Indian enterprises, in particular, often run a mix of AWS in ap-south-1 (Mumbai) and Azure in Central India for redundancy or regulatory reasons; a unified observability layer becomes essential.
Third-Party Observability Platforms
- Datadog — Strongest all-in-one: infrastructure, APM, logs, synthetic monitoring, security signals, and FinOps dashboards. Broad integration catalogue. Downside: cost scales aggressively with host count and custom metrics cardinality. At current exchange rates (approximately ₹85 per USD), per-host pricing can add up quickly for large Indian deployments.
- Dynatrace — AI-driven root-cause analysis (Davis AI) is genuinely useful for complex environments. Strong auto-instrumentation for Java/.NET. Downside: licensing model can be opaque.
- Grafana Cloud (LGTM stack) — Grafana + Loki (logs) + Tempo (traces) + Mimir (metrics). Open-source core with managed hosting option. Best for teams that want control and want to avoid vendor lock-in. Downside: requires more operational expertise to tune and maintain.
- New Relic — Generous free tier, consumption-based pricing. Good APM. Downside: network monitoring and infrastructure depth trail Datadog.
- Elastic Observability — Built on Elasticsearch. Strong if you already run ELK for logging. Downside: scaling Elasticsearch clusters for high-cardinality metrics is non-trivial.
For cost-sensitive teams — and optimising INR spend is a top priority for most Indian engineering organisations — the Grafana LGTM stack with OpenTelemetry instrumentation offers the best control-to-cost ratio. For teams that want managed everything and will pay for it, Datadog or Dynatrace are the pragmatic choices.
Best Practices From a 24/7 NOC
These are not theoretical recommendations. They come from patterns we see across hundreds of monitored workloads.
1. Define SLOs Before You Define Alerts
An alert without a Service Level Objective is noise. Start by defining what "healthy" means for each service — e.g., "99.9% of checkout requests complete within 800ms with <0.1% error rate." Then set alerts on the burn rate of that error budget. Google's SRE book formalised this approach, and it works. Multi-window, multi-burn-rate alerting (fast burn for pages, slow burn for tickets) reduces alert fatigue dramatically.
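The burn-rate arithmetic is straightforward. A sketch (Python), using the multi-window thresholds popularised by Google's SRE Workbook (14.4x over one hour for pages, 6x over six hours for tickets; your budgets and windows may differ):

```python
def burn_rate(observed_error_rate, slo):
    """How fast the error budget is being consumed: 1.0 = exactly on budget."""
    budget = 1.0 - slo       # a 99.9% SLO leaves a 0.1% error budget
    return observed_error_rate / budget

fast = burn_rate(0.0144, 0.999)  # 14.4x budget over the short window: page now
slow = burn_rate(0.002, 0.999)   # 2x budget over the long window: open a ticket

print(f"fast={fast:.1f}x, slow={slow:.1f}x")
# fast=14.4x, slow=2.0x
```

A 14.4x burn over one hour consumes roughly 2% of a 30-day budget, which is why it pages; a 2x burn is survivable for days and belongs in a ticket queue.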
2. Centralise Into a Single Observability Pipeline
In multi-cloud environments, the biggest waste of time is context-switching between three different consoles. Use OpenTelemetry Collector as a vendor-neutral telemetry router: it receives metrics, traces, and logs from any source and exports to your chosen backend(s). This decouples instrumentation from storage and keeps your options open.
3. Monitor the Monitoring
Your observability pipeline is itself infrastructure. If your Prometheus server runs out of disk, or your Datadog agent crashes silently, you have a blind spot during the exact moment you need visibility. Run a lightweight heartbeat/canary check that validates your monitoring stack is ingesting data. We run synthetic checks every 60 seconds that push a known metric and alert if it fails to arrive within 120 seconds.
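The heartbeat logic itself is a few lines. A sketch (Python), with the 120-second lag budget described above; timestamps here are simulated rather than read from a live backend:

```python
import time

def canary_ok(last_seen_ts, now=None, max_lag=120):
    """The pushed heartbeat metric must have arrived within `max_lag` seconds.

    `last_seen_ts` is the timestamp of the most recent heartbeat the
    monitoring backend actually ingested; if it is stale, the pipeline
    itself is blind and a separate out-of-band alert should fire.
    """
    now = now or time.time()
    return (now - last_seen_ts) <= max_lag

now = 1_700_000_000                                  # simulated clock
assert canary_ok(last_seen_ts=now - 60, now=now)     # ingesting normally
assert not canary_ok(last_seen_ts=now - 300, now=now)  # pipeline is blind
```

The alert on this check must route through a channel independent of the monitoring stack being checked, otherwise a dead pipeline silences its own alarm.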
4. Correlate Costs With Performance Metrics
This is where Cloud FinOps meets performance monitoring. An instance running at 8% CPU is not a performance problem — it is a cost problem. An instance running at 92% CPU is not a cost problem — it is a reliability risk. Surfacing both in the same dashboard lets teams make right-sizing decisions with full context. AWS Compute Optimizer, Azure Advisor, and GCP Recommender provide native right-sizing suggestions, but they lack application-level context. Overlay them with your APM data for useful recommendations.
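The classification logic behind that dashboard can be sketched in a few lines (Python; the cut-offs are illustrative and should come from your own SLOs and the table earlier in this article):

```python
def rightsizing_signal(cpu_p95):
    """Classify an instance by its p95 CPU utilisation.

    Thresholds are illustrative: under-utilised instances are a cost
    problem, saturated ones are a reliability risk."""
    if cpu_p95 < 20:
        return "cost problem: candidate for downsizing"
    if cpu_p95 > 85:
        return "reliability risk: candidate for upsizing or scale-out"
    return "right-sized"

print(rightsizing_signal(8))
print(rightsizing_signal(92))
```

Overlaying this with APM data matters because an instance at 8% CPU may still be memory-bound or latency-critical; utilisation alone never tells the whole story.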
5. Retain Logs Strategically
Storing every debug log from every container forever is a fast path to a multi-crore observability bill. Tier your retention: hot storage (7–14 days) for operational troubleshooting, warm storage (30–90 days) for trend analysis, and cold/archive storage for compliance. Under DPDPA 2023, personal data of Indian data principals in logs must be handled with the same rigour as data in databases — mask or pseudonymise PII at the collection layer, not after ingestion. RBI's guidelines for BFSI entities additionally require that audit logs be retained for prescribed periods and be available for inspection. Design retention policies that satisfy both operational and regulatory needs.
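Masking at the collection layer is typically a small transform applied before logs leave the source. A sketch (Python; the regex patterns are illustrative and far from exhaustive, so a real pipeline needs patterns tuned to its own data):

```python
import re

# Illustrative PII patterns; production pipelines need far more coverage.
MOBILE = re.compile(r"\b[6-9]\d{9}\b")           # Indian mobile numbers
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def mask_pii(line):
    """Mask PII in a log line at the collection layer, before shipping."""
    line = MOBILE.sub("[MOBILE]", line)
    line = EMAIL.sub("[EMAIL]", line)
    return line

print(mask_pii("login ok user=anita@example.com phone=9876543210"))
# login ok user=[EMAIL] phone=[MOBILE]
```

Running this in the collector (or a log-shipper processor) rather than the backend means unmasked PII never crosses the network, which is the property DPDPA-driven designs care about.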
6. Instrument for Regional Performance
If you serve users across India and globally, monitor from multiple regions. A service performing well measured from ap-south-1 (Mumbai) may have 400ms additional latency when accessed from eu-west-1 (Ireland) or vice versa. Even within India, users in the north-east or smaller towns may experience higher latencies. Synthetic monitoring with checkpoints in Mumbai, Hyderabad, Singapore, and at least one European or US location gives you the user-perspective truth. Opsio's NOC runs synthetic checks from multiple geographies precisely because regional degradation is invisible from a single vantage point.
Monitoring Across Hybrid and Multi-Cloud Environments
Most enterprises are not single-cloud, regardless of what their official strategy says. According to Flexera's State of the Cloud, multi-cloud adoption has remained the dominant pattern for several consecutive years. The monitoring challenge multiplies: metrics have different names, different granularities, and different APIs across providers.
Practical Multi-Cloud Monitoring Architecture
1. Collection layer: Deploy OpenTelemetry Collectors in each environment (AWS, Azure, GCP, on-premises). Configure them to normalise metric names and add cloud-provider tags.
2. Aggregation layer: Route all telemetry to a central backend — Datadog, Grafana Cloud, or a self-hosted Mimir/Loki/Tempo stack.
3. Correlation layer: Use service maps and dependency graphs that span providers. A request might start at an Azure Front Door CDN, hit an API running on AWS EKS in ap-south-1 (Mumbai), and query a database on GCP Cloud SQL. Without a cross-cloud trace, you will never find the bottleneck.
4. Alerting layer: Centralised alerting with PagerDuty, Opsgenie, or Grafana OnCall as the single routing point. Avoid cloud-native alerting silos — an Azure Action Group that pages one team while a CloudWatch Alarm pages another leads to duplicated effort and missed correlations.
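The normalisation in the collection layer is usually done with OpenTelemetry Collector processors (metricstransform, resource) in YAML; the idea can be stripped down to a few lines of Python. The name mappings below are real provider metric names, but the canonical schema and the mapping table are illustrative:

```python
# Map provider-specific metric names onto one canonical schema.
CANONICAL = {
    "CPUUtilization": "system.cpu.utilization",       # AWS CloudWatch
    "Percentage CPU": "system.cpu.utilization",       # Azure Monitor
    "compute.googleapis.com/instance/cpu/utilization":
        "system.cpu.utilization",                     # GCP Cloud Monitoring
}

def normalise(metric_name, provider, value):
    """Rename a metric to the canonical schema and tag it with its
    cloud provider, so cross-cloud queries work on one name."""
    return {
        "name": CANONICAL.get(metric_name, metric_name),
        "value": value,
        "attributes": {"cloud.provider": provider},
    }

point = normalise("Percentage CPU", "azure", 73.0)
print(point["name"])  # system.cpu.utilization
```

Once every provider's CPU metric lands under one name with a `cloud.provider` attribute, a single dashboard panel and a single alert rule cover all three clouds.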
Hybrid Cloud Specifics
For workloads spanning on-premises and cloud (common in Indian BFSI, manufacturing, government, and healthcare sectors), monitor the interconnect as a first-class citizen. Direct Connect, ExpressRoute, and Cloud Interconnect circuits have SLAs, but those SLAs do not cover your application's sensitivity to jitter. Implement bidirectional latency probes across the link and alert on degradation before it impacts real traffic. Many Indian banks and financial institutions maintain core banking systems on-premises while running customer-facing APIs on cloud — the interconnect monitoring between these layers is absolutely critical.
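The probe summarisation is simple to sketch (Python; RTT samples here are simulated, and "jitter" is taken as the standard deviation of RTT, though some tools use the mean delta between consecutive samples instead):

```python
import statistics

def link_health(rtt_samples_ms, loss_fraction):
    """Summarise one bidirectional probe run across the interconnect."""
    ordered = sorted(rtt_samples_ms)
    return {
        # crude p95: the value at the 95% position of the sorted samples
        "rtt_p95_ms": ordered[int(len(ordered) * 0.95) - 1],
        "jitter_ms": statistics.stdev(rtt_samples_ms),
        "loss_pct": loss_fraction * 100,
    }

# Simulated samples from a healthy circuit: tight RTTs, zero loss.
healthy = link_health([4.1, 4.3, 4.0, 4.2, 4.4, 4.1, 4.2, 4.3, 4.0, 4.1], 0.0)
print(healthy["jitter_ms"] < 1.0)  # True
```

Alert on a rise in jitter or loss relative to the circuit's own baseline, not on absolute RTT, since the baseline RTT of a Noida-to-Mumbai link and a Chennai-to-Hyderabad link will differ by design.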
Compliance and Data Residency Considerations
India: DPDPA 2023, RBI, SEBI, and MeitY Guidelines
The Digital Personal Data Protection Act (DPDPA) 2023 introduces consent-based data processing obligations. The Government of India retains the power to notify certain categories of personal data that must be processed within Indian borders. For monitoring architectures, this means: if your observability data contains personal identifiers (user IDs, IP addresses, mobile numbers, Aadhaar-linked tokens) of Indian data principals, you must ensure your log-masking pipeline strips or pseudonymises such data before it leaves the collection tier — and potentially before it leaves Indian soil.
BFSI-specific requirements: The RBI's circular on outsourcing of IT services and the subsequent guidance on cloud adoption require regulated entities to ensure that customer data remains within India unless explicitly permitted, that audit and inspection rights are preserved, and that robust monitoring and incident-management capabilities are in place. If you are a bank, NBFC, or payment aggregator, your monitoring backend — whether Datadog, Grafana Cloud, or self-hosted — must store Indian customer data within ap-south-1 (Mumbai), ap-south-2 (Hyderabad), Central India, South India, or equivalent domestic infrastructure. Datadog's dedicated infrastructure options require careful evaluation; self-hosted backends in Indian regions or Grafana Cloud with a dedicated Indian cluster may be more straightforward for compliance.
SEBI's cloud framework requires stock exchanges, depositories, and market intermediaries to maintain data within India and ensure that monitoring and audit trails are accessible to SEBI for inspection. The framework also mandates business continuity and disaster-recovery monitoring across regions.
MeitY guidelines for government workloads specify the use of empanelled cloud service providers and domestic data residency. Monitoring data for GI Cloud (MeghRaj) deployments must reside in India.
The practical takeaway: design your monitoring architecture with data residency as a first-class constraint, not an afterthought. Tag telemetry at the collection layer with data-classification labels, and route sensitive telemetry exclusively to domestic backends.
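The routing decision itself is a one-line conditional once classification labels exist. A sketch (Python; the backend endpoints and the `personal-in` label are hypothetical names for illustration):

```python
# Hypothetical backend endpoints; substitute your real exporters.
DOMESTIC_BACKEND = "https://obs.in.example.internal"
GLOBAL_BACKEND = "https://obs.global.example.internal"

def route(record):
    """Route a telemetry record by the data-classification label applied
    at the collection layer. Anything tagged as containing personal data
    of Indian data principals stays on domestic infrastructure."""
    if record.get("classification") == "personal-in":
        return DOMESTIC_BACKEND
    return GLOBAL_BACKEND

assert route({"classification": "personal-in"}) == DOMESTIC_BACKEND
assert route({"classification": "ops"}) == GLOBAL_BACKEND
```

The hard part is not the routing but ensuring the label is applied reliably at collection time; an unlabelled record should default to the most restrictive route, not the least.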
Global Compliance Considerations
For Indian enterprises serving European customers, GDPR constrains where monitoring data can be stored and processed. If your application logs contain IP addresses, user IDs, or session tokens of EU data subjects, those logs are personal data under GDPR. Sending them to an Indian-hosted backend without appropriate transfer mechanisms (SCCs, adequacy decisions) is a compliance risk. Similarly, NIS2 — enforceable since October 2024 in the EU — requires entities in essential and important sectors to implement monitoring and incident detection capabilities and to notify their CSIRT within 24 hours of becoming aware of a significant incident. If your Indian organisation has subsidiaries or customers in the EU, these obligations extend to your monitoring design.
Enabling Cloud Monitoring: A Practical Starting Path
For teams that are early in their monitoring maturity, here is a concrete sequence — not a boil-the-ocean exercise:
Week 1–2: Enable native monitoring for all cloud resources. Turn on CloudWatch detailed monitoring (1-minute intervals), Azure Monitor diagnostics, or GCP Cloud Monitoring. This is usually a Terraform/Bicep/Config Connector one-liner per resource.
Week 3–4: Instrument your three most critical services with OpenTelemetry. Deploy the OTel Collector as a sidecar (Kubernetes) or daemon (EC2/VM). Export traces and metrics to your chosen backend.
Month 2: Define SLOs for those three services. Implement error-budget-based alerting. Connect alerts to PagerDuty or Opsgenie with on-call rotations.
Month 3: Add synthetic monitoring from at least two geographic locations — ideally ap-south-1 (Mumbai) and one international checkpoint such as ap-southeast-1 (Singapore) or eu-west-1 (Ireland). Establish baseline dashboards. Begin log centralisation with retention tiers.
Ongoing: Expand instrumentation to remaining services. Add network monitoring. Integrate cost data. Review and tune alert thresholds quarterly — stale thresholds are worse than no thresholds because they train teams to ignore alerts.
Virtual Machine Monitors and Cloud Performance
A virtual machine monitor (VMM), also called a hypervisor, is the software layer that manages the allocation of physical resources — CPU, memory, storage, network — to virtual machines. In cloud computing, the hypervisor (AWS Nitro, Azure Hyper-V, GCP's custom KVM) is the foundation that makes multi-tenancy possible. From a monitoring perspective, you rarely interact with the hypervisor directly on public cloud — the provider abstracts it. But you do observe its effects: "steal time" on a Linux instance (the %steal metric in top or sar) indicates that the hypervisor is allocating CPU cycles to other tenants. If steal time consistently exceeds 5–10%, you are experiencing noisy-neighbour effects and should consider dedicated or metal instances.
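Steal time is computed from the `cpu` counters in /proc/stat, whose fields are, in order: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. A sketch (Python; the tick deltas below are illustrative) of the calculation tools like top and sar perform:

```python
def steal_percent(cpu_ticks):
    """Compute %steal from /proc/stat 'cpu' counters.

    Pass the DELTAS between two reads, since /proc/stat counters are
    cumulative since boot. Steal is field index 7."""
    total = sum(cpu_ticks)
    steal = cpu_ticks[7]
    return steal / total * 100

# Illustrative tick deltas between two samples:
# user nice system idle iowait irq softirq steal guest guest_nice
delta = [500, 0, 120, 3000, 40, 5, 10, 325, 0, 0]
print(f"{steal_percent(delta):.1f}% steal")  # 8.1% steal
```

At 8.1% sustained steal, this instance is in noisy-neighbour territory by the 5–10% rule of thumb above, and a dedicated or metal instance is worth evaluating.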
Cloud Logging vs. Cloud Monitoring: Clarifying the Relationship
Logging and monitoring are distinct but interdependent disciplines. Monitoring answers "is something wrong right now?" through real-time metrics and alerts. Logging answers "what exactly happened?" through discrete event records. Traces add the third dimension: "how did the request flow through the system?"
The modern term "observability" unifies all three — metrics, logs, and traces — into a cohesive practice. In tooling terms: CloudWatch Metrics + CloudWatch Logs + X-Ray; Azure Monitor Metrics + Log Analytics + Application Insights; GCP Cloud Monitoring + Cloud Logging + Cloud Trace. Or, with third-party stacks: Datadog Infrastructure + Logs + APM; Grafana Mimir + Loki + Tempo.
The practical advice: do not build logging and monitoring as separate projects with separate teams. They share infrastructure, share context, and are queried together during every incident.
Frequently Asked Questions
How do you measure cloud performance?
Measure cloud performance using the RED method (Rate, Errors, Duration) for services and the USE method (Utilisation, Saturation, Errors) for infrastructure. Instrument applications with distributed tracing (OpenTelemetry), collect infrastructure metrics via native cloud agents, and set baselines for p95 latency, error rate, and availability. Synthetic monitoring adds outside-in validation that real users can reach your endpoints within SLA thresholds.
What are the three parts of cloud monitoring?
The three parts are infrastructure monitoring (compute, storage, network health), application performance monitoring (transaction traces, error rates, response times), and log management/analytics (centralised log aggregation, search, and alerting). Some frameworks add a fourth — security monitoring — but in practice security signals feed into the same observability pipeline.
What is the 3-4-5 rule in cloud computing?
The 3-4-5 rule is a backup and disaster-recovery heuristic, best read as a stricter extension of the classic 3-2-1 rule: keep at least 3 copies of data, use 4 different types of storage media, and spread storage across 5 geographically distinct locations, including off-site or cross-region. Interpretations of the exact numbers vary between sources, but the principle is redundancy across copies, media, and locations. While originally a data-protection guideline, it directly affects monitoring design because you need to verify replication health, RPO compliance, and regional failover readiness continuously.
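The RPO-compliance part of that verification is a timestamp comparison. A sketch (Python; timestamps are simulated rather than read from a real replication API):

```python
def rpo_compliant(last_replica_ts, now, rpo_seconds):
    """An RPO is met only if the newest off-site copy is recent enough.

    `last_replica_ts` is the timestamp of the most recent successfully
    replicated copy; the gap to `now` is the data you would lose today."""
    return (now - last_replica_ts) <= rpo_seconds

now = 1_700_000_000                                         # simulated clock
assert rpo_compliant(now - 120, now, rpo_seconds=300)       # within a 5-min RPO
assert not rpo_compliant(now - 3600, now, rpo_seconds=300)  # replication stalled
```

Run this continuously per replica and per region; an RPO verified only during DR drills is an RPO you discover is broken during the actual disaster.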
What are the five types of monitoring?
The five commonly cited types are: infrastructure monitoring, application performance monitoring (APM), network monitoring, security monitoring (SIEM/SOC), and real-user/synthetic monitoring. In a cloud context these overlap heavily — a latency spike could be a network issue, an application bug, or a noisy-neighbour problem on shared infrastructure — which is why unified observability platforms are replacing siloed tools.
What is the difference between cloud logging and cloud monitoring?
Monitoring collects time-series metrics (CPU, latency, error counts) and triggers alerts when thresholds are breached. Logging captures discrete event records — application errors, access logs, audit trails — that you query after the fact. In practice the two are complementary: an alert fires from a monitoring metric, and engineers pivot to logs to diagnose root cause. Modern observability unifies metrics, logs, and traces into a single workflow.
