Explore your infrastructure with Kubernetes Monitoring

Kubernetes Monitoring offers visualization and analysis tools for you to:

Evaluate the health, efficiency, and cost of Kubernetes infrastructure components.
Analyze historical data as well as forecasts.
View predictions created with machine learning.
Manage alerts.

Navigate to Kubernetes Monitoring

Navigate to your Grafana Cloud portal.
In the menu, select the stack you want to work with.
Click the upper-left menu icon.
In the main menu, expand Infrastructure, then click Kubernetes.
Navigating to Kubernetes Monitoring

Top-level pages for Kubernetes objects allow you to drill into the hierarchy of Kubernetes objects in your fleet. Main pages include lists of Clusters, namespaces, workloads, and Nodes.

For example, the Cluster main page shows the list of your Clusters. When you click a Cluster in the list, the Cluster detail page opens. That page shows detailed information for the Cluster along with a list of Nodes within the Cluster.

You can continue to drill into a Node and see the list of Pods for that Node, all the way to the container level.

Navigating from a main list page to a detail page

There are also main pages for you to view alerts, configuration, and data for cost and efficiency. For additional navigation tips, refer to Navigation tips for Kubernetes Monitoring.

Start with a high-level snapshot

At the Kubernetes Overview home page, you can get a high-level look at your Clusters and alerts.

Refine counts of Kubernetes objects

Adjust the time range selector and filter by Cluster and namespace to view the counts for:

Clusters, Nodes, namespaces, workloads, Pods, and containers
Deployed container images

Adjusting time period and filtering by Cluster for object count

Find usage spikes

You can use the time range selector to focus on a time period while looking for any spikes in CPU and memory usage in your Clusters. When spikes occur:

Zoom in on the graph to narrow the time selection.
Zooming in on graph to change time range
Hover over and click the peak of the spike to see the percentage of use compared to capacity. In the following example, the spike shows 46.5% of CPU usage compared to capacity.
Click the link to view the Cluster. The Cluster page shows the time range you set when zooming in on the graph.
Jumping to Cluster detail page within time range set
You can continue by sorting the list of Nodes in this Cluster by highest CPU usage to investigate the issue causing the spike.
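The percentage shown at the peak of a spike is usage divided by capacity. A minimal sketch of that arithmetic follows; the function name and the core counts are hypothetical, chosen only so the result matches the 46.5% in the example above.

```python
def usage_percent(used: float, capacity: float) -> float:
    """Return usage as a percentage of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


# Hypothetical numbers: 3.72 CPU cores in use on a Cluster with 8 cores of capacity.
print(f"{usage_percent(3.72, 8.0):.1f}%")
```

A value well below 100% at the peak suggests the spike did not exhaust capacity, so sorting Nodes by CPU usage is a reasonable next step to find where the load concentrated.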

Review and drill into alerts

Sort the alerts for containers and Pods by the Firing Since column to focus on either the most recent or the oldest firing alerts.
Sorting the list of container alerts by oldest alert firing
Click the container or Pod name to jump directly to the detail page.
Jumping to the container detail page
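If you export alert data and want to reproduce the same ordering outside the UI, the sort amounts to ordering by the time each alert started firing. The sketch below assumes a hypothetical list of dictionaries with a `firing_since` timestamp; it is not an API of Kubernetes Monitoring.

```python
from datetime import datetime, timezone


def sort_by_firing_since(alerts, oldest_first=True):
    """Return alerts ordered by the time they started firing."""
    return sorted(alerts, key=lambda a: a["firing_since"], reverse=not oldest_first)


# Hypothetical alert records; the names and field layout are assumptions.
alerts = [
    {"name": "KubeContainerWaiting", "firing_since": datetime(2024, 5, 3, tzinfo=timezone.utc)},
    {"name": "KubePodCrashLooping", "firing_since": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"name": "KubePodNotReady", "firing_since": datetime(2024, 5, 2, tzinfo=timezone.utc)},
]

for alert in sort_by_firing_since(alerts):
    print(alert["firing_since"].date(), alert["name"])
```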

Manage alerts

View and respond to all Kubernetes-related alerts from the Alerts page and the Kubernetes Overview home page.

You can also manage preconfigured alerting rules.

Analyze costs

You can review costs at a high level and by Kubernetes object, from Cluster all the way to the container level. At the Cost page, use the Overview and Savings tabs to gain a high-level understanding of what Kubernetes is costing and how you can save. You can also see the cost of each item in a list view as well as on the detail pages.

Understand efficiency and resource use

Use Kubernetes Monitoring to optimize resource usage and efficiency by:

Correlating average and maximum resource usage to understand performance and troubleshoot stability issues.
Observing resource usage for each Kubernetes object.
Discovering any stranded resources in your fleet.

Throughout Kubernetes Monitoring, resource usage statistics appear for each list item so that you can filter and sort to make the best use of your time.
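One way to think about correlating average and maximum usage is as a peak-to-average ratio: a high ratio means the object is mostly idle but bursts occasionally. The sketch below is only an illustration under that assumption; the function name and sample values are hypothetical.

```python
from statistics import mean


def summarize_usage(samples):
    """Return the average, maximum, and peak-to-average ratio of usage samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    avg = mean(samples)
    peak = max(samples)
    ratio = peak / avg if avg else float("inf")
    return {"avg": avg, "max": peak, "peak_to_avg": ratio}


# Hypothetical CPU usage samples in millicores.
summary = summarize_usage([100, 100, 100, 500])
print(summary)
```

A request sized for the average would be exceeded during the burst, while a request sized for the maximum would sit mostly unused, which is the trade-off the average and maximum columns help you weigh.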

Discover energy usage

On any detail page, click the Energy tab to view the energy usage of:

Workloads and namespaces
Clusters
Nodes
Pods
Containers

Energy usage for workloads in a namespace for 24 hours

When you configure Kubernetes Monitoring to gather energy metrics, Kepler collects and exposes the metrics, and Alloy scrapes them.

Energy metrics are separated into these categories:

Package, including CPU cores
DRAM (memory)
GPU
Other
Total, the sum of all categories
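The relationship between the categories can be sketched as a small data structure in which the total is derived rather than stored. The class and field names below are hypothetical, and joules as the unit is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class EnergySample:
    """Energy use for one object, split by category (hypothetical field names)."""
    package: float  # includes CPU cores
    dram: float     # memory
    gpu: float
    other: float

    @property
    def total(self) -> float:
        # Total is the sum of all categories.
        return self.package + self.dram + self.gpu + self.other


sample = EnergySample(package=10.0, dram=4.0, gpu=0.0, other=1.0)
print(sample.total)
```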

Learn what’s predicted

CPU and memory prediction can help you ensure resources are available during spikes in resource usage and help you decrease the amount of unused resources due to overprovisioning. To use prediction tools, first enable the Machine Learning plugin.

The following buttons are available in various views. Click them to show a prediction for Clusters, namespaces, workloads, Nodes, Pods, and containers:

Predict Mem Usage: Shows a predictive graph for memory usage one week in the future. Calculations are based on metrics from the previous week.
Predict CPU: Shows a predictive graph for CPU usage one week in the future. Calculations are based on metrics from the previous week.
Predictions for Node CPU Usage
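The time windows involved can be sketched as follows: one week of history feeds the calculation, and the graph extends one week ahead. This only illustrates the windows described above and says nothing about how the Machine Learning plugin computes the forecast; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

WEEK = timedelta(days=7)


def prediction_windows(now: datetime) -> dict:
    """Return the (start, end) of the history window and of the forecast window."""
    return {
        "training": (now - WEEK, now),   # metrics from the previous week
        "forecast": (now, now + WEEK),   # prediction one week into the future
    }


windows = prediction_windows(datetime.now(timezone.utc))
for name, (start, end) in windows.items():
    print(name, start.isoformat(), "->", end.isoformat())
```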

Detect outlier Pod CPU usage

Within a workload detail page, click the Detect Outlier CPU Usage amongst Pods button to identify a Pod that has CPU usage different from the other Pods.

Outlier message with a link to explore the outlier detection query
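To build intuition for what "different from the other Pods" can mean, here is a minimal sketch that flags Pods by modified z-score (median absolute deviation). This is a swapped-in, generic technique for illustration and is not necessarily the algorithm the Machine Learning plugin uses; the Pod names, CPU values, and the 3.5 threshold are assumptions.

```python
from statistics import median


def find_outlier_pods(cpu_by_pod: dict, threshold: float = 3.5) -> list:
    """Return Pod names whose CPU usage deviates strongly from the median."""
    values = list(cpu_by_pod.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that one extreme Pod cannot skew.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Most Pods are identical, so any Pod that differs from the median stands out.
        return [name for name, v in cpu_by_pod.items() if v != med]
    return [
        name
        for name, v in cpu_by_pod.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical CPU usage in cores for the Pods of one workload.
usage = {"web-1": 0.10, "web-2": 0.12, "web-3": 0.11, "web-4": 0.95}
print(find_outlier_pods(usage))
```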

Use Explore for troubleshooting

Click Explore this query in the Machine Learning plugin to view the raw data and troubleshoot issues. Here you can adjust parameters and see a more detailed graph of the findings.

Raw data, query details, and graph for the outlier data

Analyze historical data

Select a time range to see your historical data for any time frame you choose. As you navigate from page to page, the time range you set stays the same until you change it again.

As an example, the Pod optimization section of the Pod detail page shows a time range over several hours. You can use this to understand the historical pattern of CPU usage and memory usage.

Pod optimization view on the Pod detail page, with graphs showing a Pod bursting above its CPU and memory requests

Zoom into an area of any graph on the detail pages to narrow the time range selector even further. The time range remains selected until you click Back to default.

Narrowing the time range by zooming in on a graph

Find deleted Kubernetes objects

You can find deleted Clusters, namespaces, workloads, Nodes, Pods, and containers to understand what occurred in the past. To do so, set the time range selector to a past time period.

The following example sets a time range of the previous 30 days, and then filters for Nodes with the condition of “No data”. The Node detail page shows a graph depicting when the Node expired.

Finding deleted nodes using the time range selector and node filter

Note
Grafana Cloud has a default 30-day limit for queries. If your Kubernetes object was deleted more than 30 days ago, use the time range selector to choose a specific 30-day time frame in the past.

Choosing a 30-day range with the time range selector
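When you know roughly when an object was deleted, picking the range is simple date arithmetic: choose a window no longer than 30 days that contains the deletion time. The sketch below assumes a hypothetical helper and a two-day margin after the event; only the 30-day limit comes from the note above.

```python
from datetime import datetime, timedelta, timezone

MAX_QUERY_RANGE = timedelta(days=30)  # default query limit described in the note


def window_containing(event_time: datetime, lead: timedelta = timedelta(days=2)):
    """Return a (start, end) range of exactly 30 days that contains event_time."""
    if not timedelta(0) <= lead <= MAX_QUERY_RANGE:
        raise ValueError("lead must be between 0 and 30 days")
    end = event_time + lead
    start = end - MAX_QUERY_RANGE
    return start, end


# Hypothetical deletion time for a Node.
deleted_at = datetime(2024, 1, 10, 14, 0, tzinfo=timezone.utc)
start, end = window_containing(deleted_at)
print("From:", start.isoformat())
print("To:  ", end.isoformat())
```

Enter the resulting start and end values as an absolute time range in the time range selector.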

Discover bare and unmanaged Pods

You can find static (unmanaged) Pods as well as bare Pods, which are created directly rather than by a workload controller.

Navigate to the Workloads main page and filter by the Pod type. For example, to locate unmanaged static Pods, filter for StaticPod.
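For intuition about how these Pod types differ in the Kubernetes API, the following sketch classifies a Pod manifest loaded as a dictionary. It relies on two Kubernetes conventions: the mirror Pod of a static Pod carries the `kubernetes.io/config.mirror` annotation, and a bare Pod has no owner references. The returned labels are hypothetical and are not the filter values used in the app, which may detect these types differently.

```python
def classify_pod(pod: dict) -> str:
    """Classify a Pod manifest as 'static', 'bare', or 'managed' (hypothetical labels)."""
    metadata = pod.get("metadata", {})
    annotations = metadata.get("annotations") or {}
    owners = metadata.get("ownerReferences") or []
    # The kubelet adds this annotation to the mirror Pod of a static Pod.
    if "kubernetes.io/config.mirror" in annotations:
        return "static"
    # No owner means nothing recreates the Pod if it is deleted.
    if not owners:
        return "bare"
    return "managed"


# Hypothetical manifests, trimmed to the fields the function reads.
static_pod = {"metadata": {"annotations": {"kubernetes.io/config.mirror": "abc123"}}}
bare_pod = {"metadata": {"name": "debug-shell"}}
managed_pod = {"metadata": {"ownerReferences": [{"kind": "ReplicaSet", "name": "web-5d9c"}]}}

for pod in (static_pod, bare_pod, managed_pod):
    print(classify_pod(pod))
```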

View network bandwidth and saturation

Use the network panels to understand when bandwidth limits are causing network saturation, which can lead to dropped packets. On the detail page for any Cluster, namespace, workload, Node, or Pod, click the Network tab to view:

Network Bandwidth Rx/Tx: Shows the rate of received and transmitted bytes
Network Saturation Rx/Tx dropped packets: Shows the rate of received and transmitted packets that are dropped
Network Bandwidth and Network Saturation by Node, workload, or Pod: Shows the bandwidth and saturation by object
Network bandwidth and saturation panels for a Cluster
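The rates in these panels are derived from counters that only increase, so a rate is the change in the counter divided by the elapsed time. The sketch below shows that arithmetic for two samples; the function name and values are hypothetical, and treating a decrease as a counter reset is an assumption borrowed from how Prometheus-style rate calculations commonly behave.

```python
def per_second_rate(v0: float, t0: float, v1: float, t1: float) -> float:
    """Return the per-second rate between two counter samples (value, unix time)."""
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        # Assumption: the counter restarted from zero between the two samples.
        delta = v1
    return delta / (t1 - t0)


# Hypothetical received-bytes counter sampled 60 seconds apart.
print(per_second_rate(1_000, 0, 7_000, 60), "bytes/s")
```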

View logs and events

From any detail page, click the Logs & Events tab to view the logs and events for that Kubernetes object.

Resolve issues with built-in tools

Navigate easily within the Kubernetes Monitoring app to other capabilities in Grafana Cloud to analyze, troubleshoot, and solve issues.

Start an automated diagnostic

From a Pod, Cluster, namespace, or workload detail view, you can begin an incident investigation by clicking Run Sift investigation. Sift performs a set of automated system checks and surfaces potential issues in your Kubernetes environment, and works to identify the root cause of an incident.

Opening a Sift investigation for a namespace

Access root cause analysis tool

Note
To access root cause analysis tools in Asserts, enable Asserts on your stack.

You can take troubleshooting deeper by understanding relationships between components and what is occurring between them.

Within Kubernetes Monitoring, access Asserts Workbench to perform root cause analysis. From any list of Clusters, Nodes, workloads, namespaces, or Pods, select the box to the left of each item you want to compare, and click the Compare in Asserts Workbench button. The RCA Workbench opens in a new tab.

Selecting Clusters to compare for root cause analysis

Within any details page where the Assertions button appears, click it to continue your investigation into issues.

Jumping to Asserts Workbench to troubleshoot

You can jump to the connections view in Asserts to view connections between entities.

Jump to the application layer

On the detail page for a Pod or workload, click Application Observability to navigate directly to more data, such as the service health.

Navigating directly to the Application Observability app

To return to Kubernetes Monitoring, click the browser back button.

View queries to troubleshoot with Explore

To further query data, use any of the Explore buttons available throughout the interface (such as Explore namespaces or Explore alerts). You see a view that provides additional query tools for troubleshooting.

Raw metrics query with options to add a query, view query history, and inspect the query

Navigate to traces

If you enabled traces when you configured Kubernetes Monitoring, you can view them in Explore:

Click the main menu icon.
Click Explore.
Choose the Tempo data source.
With the TraceQL tab selected, enter your search query.
Click Run query.
A table of traces appears.
Click a trace to see the detail.

Explore page showing the TraceQL query, table of traces, and trace graph
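As a starting point for the search query in the steps above, the sketch below assembles a TraceQL expression that filters traces by namespace and, optionally, by minimum duration. The helper name is hypothetical, and the `resource.k8s.namespace.name` attribute is an assumption: it is only present if your instrumentation attaches it to traces.

```python
def traceql_for_namespace(namespace: str, min_duration_ms: int = 0) -> str:
    """Build a TraceQL query for traces from one namespace (assumed attribute name)."""
    if '"' in namespace:
        raise ValueError("namespace must not contain double quotes")
    conditions = [f'resource.k8s.namespace.name = "{namespace}"']
    if min_duration_ms > 0:
        # Only match spans that took longer than the given duration.
        conditions.append(f"duration > {min_duration_ms}ms")
    return "{ " + " && ".join(conditions) + " }"


# Hypothetical namespace name; paste the printed query into the TraceQL tab.
print(traceql_for_namespace("checkout", min_duration_ms=500))
```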

Manage configuration

If you have the admin role, you can manage the configuration of Kubernetes Monitoring by working with:

Data source choices
Alerts
Integration installations
Optional custom log queries
Configuration instructions for deploying, configuring, and updating the Grafana Kubernetes Monitoring Helm chart

Navigation tips for Kubernetes Monitoring

Here are some tips and shortcuts for getting around in Kubernetes Monitoring.

Jump between main pages

From any main page, click the icon beside the page title to see the menu of all main pages. Then click the page you want to open.

To keep the main navigation open:

Click the main menu icon.
Click the menu docking icon to keep the main menu open.

Filter, sort, and set the time range

Use filters and sorting, along with the time range selector, to target the data you want.

Adjusting the time range and filtering for a type of workload

Jump to main lists

From the counts on the Kubernetes Overview home page, click All to see that component’s list of items in your Kubernetes fleet.

Clicking the All link from the Kubernetes Overview page to see a list of all workloads

Control app refresh

You can control the automatic refresh interval of the GUI as well as disable the auto refresh.

Menu for controlling automatic refresh and refresh interval

Use color cues

Throughout the views in Kubernetes Monitoring, you see color used as an additional means of indicating status or condition. For example, sometimes text is a different color for Pod status:

Color coding: list of workloads with the Running status shown in green

Text	Color	Comments
Running	Green	Healthy Pod
Running	Red	Pod failing to start
Failed	Red	Failed Pod
Unknown	White	Pod status unknown
Succeeded	Green	Job Pod successfully run

For more information on Pod status, refer to the Kubernetes documentation on Pod lifecycle.

The following table describes the color indicators for resource capacity and the state of resource usage:

Color	Usage	Comments
Green	60-90% of maximum	This is the ideal state of resource usage.
Yellow	Below 60%	Low usage percentages indicate that the item might be overprovisioned.
Red	90%+	Your resource usage is close to or above its configured capacity.
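The thresholds in the table can be expressed as a small function, which is handy if you want to apply the same bands to exported data. The function name is hypothetical, and because the table lists 90% under both the green and red bands, treating exactly 90% as red is an assumption.

```python
def usage_color(percent: float) -> str:
    """Map a usage percentage of maximum to the color band from the table."""
    if percent < 0:
        raise ValueError("percent must not be negative")
    if percent >= 90:
        return "red"     # close to or above configured capacity
    if percent >= 60:
        return "green"   # ideal state of resource usage
    return "yellow"      # low usage, possibly overprovisioned


for value in (35.0, 75.0, 96.5):
    print(value, usage_color(value))
```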
