
Introduction: Navigating Multi-Cluster Complexity with Observability
In today's cloud-native world, applications no longer run in a single cluster. Organizations adopt multi-cluster architectures for scalability, resilience, and geographic distribution. But with this power comes complexity—and the need for robust monitoring.
Imagine managing Kubernetes clusters across multiple regions, each hosting critical microservices. Without a unified view, troubleshooting becomes a needle-in-a-haystack challenge. This is where multi-cluster monitoring proves essential.
Why Does Multi-Cluster Monitoring Matter?
“More Clusters, More Chaos—Here’s How to Stay Ahead”
Monitoring one cluster is hard—scaling to multiple is exponentially tougher. Each cluster generates logs, metrics, and traces, creating an overwhelming data flood. Without the right tools, you risk missing key insights and reacting too late.
Open-source solutions like Prometheus, Thanos, Grafana, and OpenTelemetry offer cost-effective, flexible observability, helping detect anomalies and ensure reliability in multi-cluster environments.
Our Solution
“From Costly to Cost-Effective: Our Multi-Cluster Monitoring Solution”
After evaluating our options, we decided to implement a high-availability, open-source monitoring and logging setup that could scale with our infrastructure while keeping costs in check. Here’s how we did it:
Infrastructure Details
- Cloud Provider: Google Cloud Platform (GCP)
- Container Orchestration: Kubernetes
- Number of Regions: 4
- Number of Clusters: 4
- Average Node Count: 100–150
The Challenge: Exploring Our Monitoring Options
Paid Alternatives: Are They Worth It?
“The Truth About Paid Monitoring: Are You Paying for a Black Box?”
Paid monitoring platforms like Datadog and New Relic provide seamless multi-cluster visibility but come with high costs:
- Licensing fees, per-node pricing, and additional charges add up quickly.
- These black-box solutions limit customization, making them less suited for dynamic environments.
Given our cluster size of 120+ nodes and high data intake, the cost of these platforms became prohibitive. Instead of paying a premium for a one-size-fits-all tool, a more cost-effective approach is to build a tailored observability stack using open-source solutions.
What About Open-Source Alternatives?
“Prometheus, Grafana & Loki: The Ultimate Observability Trio?”
Prometheus + Grafana + Loki form a powerful open-source solution, but they come with their own challenges:
- High maintenance requirements
- Lack of built-in clustering capabilities
Prometheus, designed as a monolith with local storage, is not ideal for scalable setups. However, integrations like Thanos or Cortex can enable scalability by:
- Adding remote storage support (e.g., S3, GCS)
- Implementing a global query layer
- Making Prometheus stateless by transferring metrics (older than two hours) to remote storage
While Thanos or Cortex enable sharding and unified querying, they also add complexity. As your infrastructure scales, resource consumption increases, leading to higher operational overhead and potential performance bottlenecks. Additionally, managing long-term storage and ensuring high availability requires extra configurations, making maintenance more demanding.
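For context on what that extra machinery looks like, the Thanos sidecar offloads completed Prometheus blocks to object storage via a small objstore file. The sketch below is illustrative only; the bucket name and paths are placeholders, not part of our setup.

```yaml
# objstore.yaml mounted into the Thanos sidecar (illustrative)
type: GCS
config:
  bucket: prometheus-long-term-blocks   # placeholder bucket name

# The sidecar runs next to each Prometheus and uploads completed 2h blocks, e.g.:
#   thanos sidecar \
#     --tsdb.path=/prometheus \
#     --prometheus.url=http://localhost:9090 \
#     --objstore.config-file=/etc/thanos/objstore.yaml
```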
So, what’s the alternative?
What Did We Choose?
“Our Winning Stack: VictoriaMetrics + Grafana + Loki!”
A Scalable and Performance-Oriented Monitoring Alternative: VictoriaMetrics + Grafana + Loki
VictoriaMetrics comes with built-in clustering and is compatible with the Prometheus querying API and remote-write protocol, so existing exporters and dashboards keep working. It emerged as the top performer in our resource and performance benchmark, which was crucial for scaling.
VictoriaMetrics Overview
VictoriaMetrics exposes a Prometheus-compatible API but, unlike Prometheus, offers built-in clustering, making it an ideal choice for our monitoring needs. It comprises:
- vmagent: Scrapes targets using Prometheus-compatible scrape configs and pushes the metrics to the cluster (see the wiring sketch after this list).
- vmcluster: Comprises three components:
  - vminsert: Handles data ingestion; exposed behind an internal load balancer.
  - vmselect: Manages query operations; Grafana connects here.
  - vmstorage: Stores data with replication and sharding for scalability.
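To make the write path concrete, here is a minimal sketch of how a vmagent in a workload cluster could be wired to the central vminsert load balancer. The hostname, label value, and scrape job are placeholders, not our actual configuration.

```yaml
# scrape.yaml handed to vmagent via -promscrape.config (illustrative)
global:
  external_labels:
    clustername: cluster2          # every metric is tagged with its source cluster
scrape_configs:
  - job_name: kubernetes-nodes     # placeholder job; real setups add kube-state-metrics, node-exporter, etc.
    kubernetes_sd_configs:
      - role: node

# vmagent then pushes everything to the central VictoriaMetrics cluster, e.g.:
#   vmagent -promscrape.config=scrape.yaml \
#     -remoteWrite.url=http://vminsert-ilb.example.internal:8480/insert/0/prometheus/
```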
Loki Overview
Loki supports three deployment modes, and for our setup we chose the simple scalable deployment mode to handle clustering efficiently. For storage, we use Google Cloud Storage (GCS) to ensure scalability and durability.
The Loki deployment consists of three main components (an illustrative Helm values sketch follows the list):
- Loki-Write – Handles incoming log ingestion. Exposed via an internal load balancer, allowing all clusters to connect seamlessly.
- Loki-Read – Manages log querying and retrieval.
- Loki-Backend – Handles indexing and storage operations with GCS.
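As an illustration, a trimmed-down values file for the grafana/loki Helm chart in simple scalable mode with GCS might look like the sketch below. Bucket names and replica counts are placeholders, not our production values.

```yaml
# Illustrative Loki Helm values (simple scalable mode, GCS object storage)
loki:
  storage:
    type: gcs
    bucketNames:
      chunks: loki-chunks-example   # placeholder bucket
      ruler: loki-ruler-example     # placeholder bucket
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: gcs
        schema: v13
        index:
          prefix: index_
          period: 24h
write:
  replicas: 3      # Loki-Write: ingestion, fronted by an internal load balancer
read:
  replicas: 2      # Loki-Read: query path used by Grafana
backend:
  replicas: 2      # Loki-Backend: index shipping and compaction against GCS
```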
Logging Agent: Fluent Bit
For log collection, we chose Fluent Bit as the logging agent. Each log line is enriched with custom labels (e.g., cluster=cluster1) before being pushed to the master cluster, ensuring efficient filtering and retrieval.
This setup provides a streamlined, scalable logging solution while leveraging Loki’s simplicity and GCS’s reliability.
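A minimal sketch of the Fluent Bit pipeline, assuming Fluent Bit's YAML configuration format and its built-in loki output plugin; the host and label values are placeholders.

```yaml
# Illustrative Fluent Bit pipeline: tail container logs, enrich, ship to Loki-Write
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: cri
  filters:
    - name: kubernetes             # adds pod/namespace metadata to each record
      match: kube.*
  outputs:
    - name: loki
      match: kube.*
      host: loki-write-ilb.example.internal    # internal LB in the master cluster (placeholder)
      port: 3100
      labels: job=fluentbit, cluster=cluster2  # cluster label used for filtering in Grafana
```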
Architecture Overview
“Building a High-Availability Monitoring Stack—Here’s How We Did It”
Our production monitoring setup consists of three Kubernetes clusters:
- Cluster 1: Master cluster (hosts the VictoriaMetrics cluster and centralized Grafana).
- Clusters 2 and 3: Handle workloads and other components.

Key Details:
- Global VPC Connectivity – All clusters communicate via a global VPC, spanning multiple subnets and regions.
- Metrics Collection with vmagent – Each cluster runs vmagent to collect metrics, tagging them with custom labels (e.g., clustername=cluster1) for segregation.
- Additional Metrics Sources – Metrics are gathered from:
  - kube-state-metrics (Kubernetes resource monitoring)
  - node-exporter (node-level metrics)
  - blackbox-exporter (endpoint probing; see the probe job sketch after this list)
- Logging with Fluent Bit – Deployed in each cluster, Fluent Bit appends custom labels similar to vmagent for structured log management.
- Log Ingestion via Load Balancer – Logs are sent to Loki-Write through an internal load balancer for centralized processing.
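For the endpoint-probing piece, a blackbox-exporter scrape job follows the usual Prometheus relabeling pattern (vmagent accepts the same syntax); the target URL and exporter address below are placeholders.

```yaml
# Illustrative blackbox-exporter probe job (Prometheus-compatible scrape config)
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]                 # probe module defined in blackbox.yml
    static_configs:
      - targets:
          - https://app.example.internal/healthz   # placeholder endpoint to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the original URL as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance           # keep the probed URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115   # scrape the exporter, not the target directly
```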
Implementation Highlights
Data Flow:
- Metrics collected by vmagent in Clusters 1, 2, and 3 are sent to the vminsert component in Cluster 1.
- Logs collected by Fluent Bit are sent to Loki-Write in Cluster 1.
- Grafana, hosted centrally in Cluster 1, connects to vmselect and Loki-Read for unified monitoring dashboards (a sample datasource provisioning sketch follows this list).
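To illustrate the Grafana side, datasources can be provisioned declaratively; the sketch below assumes in-cluster service names, which are placeholders for our actual endpoints.

```yaml
# Illustrative Grafana datasource provisioning file
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus               # vmselect exposes a Prometheus-compatible query API
    access: proxy
    url: http://vmselect.monitoring.svc:8481/select/0/prometheus
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-read.monitoring.svc:3100
```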
High Availability:
- Replication and sharding ensure data integrity and scalability within vmstorage.
- Loki is set up with log indexing based on the required Kubernetes labels; data is synced to GCS, and a compactor handles data retention (see the retention sketch after this list).
- Internal load balancing improves reliability and prevents bottlenecks.
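For the retention piece, the relevant Loki settings look roughly like the sketch below; the retention period shown is illustrative rather than our production value.

```yaml
# Illustrative Loki retention settings enforced by the compactor
compactor:
  working_directory: /var/loki/compactor
  retention_enabled: true
  delete_request_store: gcs        # required when retention is enabled on recent Loki versions
limits_config:
  retention_period: 720h           # e.g. keep logs for ~30 days
```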
Visualization:
- Centralized Grafana dashboards provide a comprehensive view of metrics and logs across clusters, enabling quick identification of issues and performance tuning.
Performance Comparison
“VictoriaMetrics vs. Prometheus + Thanos: The Ultimate Performance Face-Off”
VictoriaMetrics demonstrated:
- Improved Query Speeds: Optimized for large datasets and distributed setups.
- Reduced Maintenance Overhead: Clustering is natively supported, which Prometheus lacks out of the box.
- Cost Efficiency: Open-source, with lower infrastructure costs compared to paid alternatives.
Benchmark workload: ~1.5M active time series scraped at a 15s interval (~100K samples/s).

| Resource | Thanos | VictoriaMetrics |
|---|---|---|
| CPU | 4.01 cores | 0.86 cores |
| Memory | 21 GiB | 8.93 GiB |
| Bytes/sample* | 4.72 B | 0.91 B |
Conclusion
“Monitoring at Scale? Open-Source Is the Way Forward”
By leveraging VictoriaMetrics, Grafana, and Loki, we successfully implemented a high-performing, scalable monitoring solution tailored to our needs. This setup not only ensures optimal performance but also aligns with cost and scalability requirements, making it a sustainable choice for growing infrastructure demands.
In a world where multi-cluster architectures are becoming the norm, open-source tools like these empower organizations to achieve enterprise-grade observability without the enterprise-grade price tag. So, whether you’re a startup or a large enterprise, it’s time to embrace open-source monitoring and take control of your multi-cluster environment.
Let’s turn complexity into clarity—one cluster at a time.
For anyone looking to dive deeper into the specifics of implementing this setup, a detailed VictoriaMetrics tutorial can guide you through the setup process and best practices for scaling your monitoring system across multiple clusters. Remember that mastering monitoring best practices is key to ensuring the smooth operation of your infrastructure as you grow.
Sample Config for multi-cluster monitoring setup: [multi-cluster-monitoring].