Grafana Dashboard
Cloud service provider relevance: AWS Kubernetes, AKS, GKE
This topic describes the integration of Ocean with Grafana and Prometheus to help you identify and investigate real-life anomalies in Ocean-managed clusters.
Prometheus and Grafana
Prometheus and Grafana are among the most popular observability stacks in the market, especially within the DevOps community. Prometheus is a robust monitoring and alerting toolkit that scrapes and stores time series data, providing real-time insights into various metrics within your cluster.
Grafana enhances Prometheus by offering powerful data visualization capabilities so you can create informative and interactive dashboards.
In Kubernetes environments, where the dynamic nature of workloads and infrastructure demands continuous monitoring, the Prometheus and Grafana stack provides deep visibility into the performance, health, and cost-effectiveness of your Kubernetes clusters, enabling proactive management and optimization.
Spot Ocean Scaling and Cost Optimization Grafana Dashboard
The Ocean scaling and cost optimization dashboard provides real-time insights into the scaling, cost, usage, and right-sizing activities managed by Ocean within your Kubernetes cluster. It displays metrics for node provisioning, optimization, cost efficiency, and recovery operations.
Visualizations can help you understand how Ocean dynamically manages Kubernetes cluster resources to ensure optimal performance, cost savings, and high availability. Key actions such as scale-ups, scale-downs, node replacements, and manual interventions are highlighted to give a comprehensive view of your cluster's operational status and health.
Visualizations include data on compute, storage, and networking expenses, helping you monitor and optimize cloud spending. They also highlight the cost distribution across different resource types and track usage patterns over time.
The scaling and cost optimization dashboard exposes data that helps you make informed decisions about resource allocation, identify cost-saving opportunities, and ensure efficient utilization of cloud infrastructure. For Ocean right-sizing evaluations, the dashboard shows how efficient resource adjustments contribute to cost reduction while maintaining optimal cluster performance.
Integration Benefits
Ocean manages scaling of the Kubernetes data plane, and the data generated in the process can be valuable for monitoring your containerized environment.
Using well-defined Prometheus metrics to monitor Ocean provides insights into cluster scaling and debugging. You can also build alerts based on the metrics to address real-time issues and track trends on a dashboard with different Ocean metrics.
Ocean maintains an official set of metrics that Prometheus can scrape natively. These metrics help you build a 360-degree view of Ocean's actions as it provisions application-driven infrastructure.
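To collect these metrics, you would point a Prometheus scrape job at the exporter. The sketch below is illustrative only: the job name, service name, and port are placeholders, not values documented by Ocean; substitute the values used by your exporter deployment.

```yaml
# Illustrative scrape job -- all names below are placeholders;
# substitute the values used by your Ocean exporter deployment.
scrape_configs:
  - job_name: ocean-metrics            # placeholder job name
    kubernetes_sd_configs:
      - role: endpoints                # discover endpoints via the API server
    relabel_configs:
      # Keep only endpoints of the (hypothetical) exporter service.
      - source_labels: [__meta_kubernetes_service_name]
        regex: ocean-metrics-exporter
        action: keep
```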
Dashboard Visualizations
- Current Status, Scaling Overview, Nodes Managed by Ocean, and Pods Metrics
- Cost and Usage, and Compute and Storage Metrics
- Network Cost and Usage Metrics
- Scaling Activity Overview Metrics
- Right Sizing Metrics
Variables
- Datasource: Selects the data source to query; useful in Grafana installations where multiple data sources are available.
- Ocean Cluster ID: Filters the data to the selected cluster ID; useful when a data source contains metrics from several Ocean clusters.
- Aggregation Interval: Sets the relative time range used by panels that aggregate data. The selected interval is shown in the panel title.
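As an illustration of how these variables combine in a panel query, the PromQL below references a cluster-ID variable and the aggregation interval. The metric name, label name, and variable identifiers are hypothetical placeholders, not actual Ocean exporter names.

```promql
# Hypothetical metric, label, and variable names for illustration only.
sum(
  avg_over_time(ocean_nodes_count{ocean_id="$ocean_cluster_id"}[$aggregation_interval])
)
```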
Dashboard Metrics Breakdown
This section describes the metrics for the previously shown dashboard visualizations.
:::note
The Ocean metrics described in this section are provided by the Ocean Prometheus Exporter.
:::
Current Status
- Ocean controller status: This graph shows the current status of the Ocean controller within your Kubernetes cluster, providing real-time insights into controller health and operational status. Monitor it to ensure the controller functions correctly and effectively manages resources, which is crucial for maintaining optimal cluster performance.
- Kubernetes cluster nodes (source: Kubernetes API server): This graph shows the number of nodes within your Kubernetes cluster, helping you monitor whether your cluster is correctly scaled to handle workloads. You can use this to ensure that your cluster's node count aligns with your applications' demands, to maintain smooth operations, and to prevent resource bottlenecks.
- Nodes managed by Ocean: This graph shows the nodes managed by Spot Ocean, providing transparency into which nodes are optimized and scaled. This metric verifies that Spot Ocean effectively manages and optimizes your cluster resources, improving resource utilization and cost efficiency.
- Cluster cost during the selected aggregation interval: This graph shows the cost associated with the cluster during a specified time period, letting you track and manage your spending. By understanding the cost implications of running your cluster, you can make informed decisions to optimize resource usage and reduce overall costs.
- Top 5 workloads with maximum cost during the selected aggregation interval: This graph shows the contributions of the top 5 workloads to cluster costs, helping you identify high-cost areas. By focusing optimization efforts on these workloads, you can reduce unnecessary expenses and improve the cost-efficiency of your cluster operations.
- Cluster cost's potential savings suggested by the right-sizing feature during the selected aggregation interval: This graph indicates the cost savings that could be achieved through Spot Ocean's right-sizing recommendations. This metric gives you insight into potential cost reductions so you can optimize resource allocations and ensure efficient use of cloud resources, ultimately leading to significant savings.
Scaling Overview
- Cluster nodes’ allocatable resources (CPU, memory, GPU): This graph shows the allocatable resources (CPU, memory, GPU) available on cluster nodes. Insight into the resources available for scheduling new workloads supports optimal resource allocation, allowing for better planning and efficient utilization of cluster resources.
- Ocean cluster headroom allocatable resources (CPU, memory, GPU): This graph shows the headroom resources available for scaling within the Ocean-managed cluster. It helps ensure that there is sufficient headroom for scaling up applications without facing immediate resource constraints, so you can maintain smooth operations even during demand spikes.
- Ocean cluster resource limit (CPU, memory): This graph shows the maximum resource limits (CPU, memory) configured for the Ocean-managed cluster. Monitor these limits to avoid overprovisioning and ensure that resource usage stays within the defined capacity, which is crucial for maintaining cost efficiency and compliance with cluster settings.
- Ocean nodes breakdown by instance lifecycle and availability zone: This graph breaks down Ocean-managed nodes by instance lifecycle (e.g., Spot/On-Demand/RI/SP) and availability zone. Detailed visibility into node distribution and lifecycle can help your capacity planning and troubleshooting, ensuring your resources are effectively distributed and managed across the cluster.
- Cluster nodes’ allocatable resources breakdown by instance lifecycle and availability zone: This graph shows allocatable resources across nodes, categorized by instance lifecycle and availability zone. Understanding how resources are distributed and utilized across different nodes and zones can help optimize resource allocation and improve overall cluster efficiency.
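A breakdown panel like the ones above is typically a PromQL aggregation grouped by the relevant labels. The metric and label names in this sketch are hypothetical placeholders, not actual Ocean exporter names.

```promql
# Hypothetical metric and label names for illustration only.
sum by (lifecycle, availability_zone) (
  ocean_node_allocatable_cpu_cores{ocean_id="$ocean_cluster_id"}
)
```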
Nodes Managed by Ocean Metrics
- Ocean nodes count over time: This graph tracks the count of Ocean-managed nodes over time, providing insights into how the number of managed nodes changes. It can help you understand scaling trends and capacity adjustments, ensuring that resources are aligned with workload demands over time.
- Ocean nodes count by instance lifecycle and availability zone over time: This graph shows the count of Ocean-managed nodes categorized by instance lifecycle and availability zone over time. Historical insights into node lifecycle and zone distribution can help your long-term capacity planning and resource allocation across different zones.
- Cluster nodes’ allocatable resources count by instance lifecycle and availability zone over time: This graph shows the count of allocatable resources on cluster nodes, categorized by lifecycle and availability zone over time. It can help you track how resource availability evolves, providing valuable information for effective resource management and ensuring that resources are consistently aligned with operational needs.
Pods Metrics
- Average time for pod to become ready over time (source: Kubernetes API server): This graph tracks the average time required for a pod to transition to a ready state. It can help you measure the responsiveness of your Kubernetes cluster, identifying potential delays or issues in pod startup, which is crucial for maintaining efficient and reliable application deployments.
- Pods in Running state (source: Kubernetes API server): This graph shows the number of currently running pods, providing a snapshot of active pods. It can help you monitor your applications' health and activity level, ensuring that the necessary workloads are operational and performing as expected.
Scaling Activity Overview
- Scale-up and scale-down event summaries: This table summarizes events related to scaling up and down within the cluster. Visibility into scaling activities can help you understand how your cluster adapts to changing workloads and ensure that resources are being managed dynamically to meet demand.
- Failed scale-up event summaries: This table summarizes failed scale-up attempts within the cluster. It provides insight into scaling errors so you can identify root causes and quickly resolve issues.
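Because the same metrics back these tables, they can also drive alerts, as noted under Integration Benefits. The sketch below shows the general shape of a Prometheus alerting rule; the metric name is a hypothetical placeholder, not an actual Ocean exporter name.

```yaml
# Sketch of a Prometheus alerting rule; the metric name is hypothetical.
groups:
  - name: ocean-scaling
    rules:
      - alert: OceanScaleUpFailures
        expr: increase(ocean_scale_up_failures_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ocean reported failed scale-up attempts in the last 15 minutes"
```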