
Explore your infrastructure with Kubernetes Monitoring

Kubernetes Monitoring offers visualization and analysis tools for you to:

  • Evaluate the health, efficiency, and cost of Kubernetes infrastructure components.
  • Analyze historical data as well as forecasts.
  • View predictions created with machine learning.
  • Manage alerts.

To open Kubernetes Monitoring:

  1. Navigate to your Grafana Cloud portal.
  2. In the menu, select the stack you want to work with.
  3. Click the upper-left menu icon.
  4. In the main menu, expand Infrastructure, then click Kubernetes.
    Navigating to Kubernetes Monitoring

Top-level pages for Kubernetes objects allow you to drill into the hierarchy of Kubernetes objects in your fleet. Main pages include lists of Clusters, namespaces, workloads, and Nodes.

For example, the Cluster main page shows the list of your Clusters. When you click a Cluster in the list, the Cluster detail page opens. That page shows detailed information for the Cluster along with a list of Nodes within the Cluster.

You can continue to drill into a Node and see the list of Pods for that Node, all the way to the container level.

Navigating from lists to detail pages

There are also main pages for you to view alerts, configuration, and data for cost and efficiency. For additional navigation tips, refer to Navigation tips for Kubernetes Monitoring.

Start with high-level snapshot

At the Kubernetes Overview home page, you can get a high-level look at your Clusters and alerts.

Refine counts of Kubernetes objects

Adjust the time range selector and filter by Cluster and namespace to view the counts for:

  • Clusters, Nodes, namespaces, workloads, Pods, and containers
  • Deployed container images
Adjusting time period and filtering by Cluster for object count

Find usage spikes

You can use the time range selector to focus on a time period while looking for any spikes in CPU and memory usage in your Clusters. When spikes occur:

  1. Zoom in on the graph to narrow the time selection.

    Zooming in on graph to change time range

  2. Hover over and click the peak of the spike to see the percentage of use compared to capacity. In the following example, the spike shows 46.5% of CPU usage compared to capacity.

  3. Click the link to view the Cluster. The Cluster page shows the time range you set when zooming in on the graph.

    Jumping to Cluster detail page within time range set
    You can continue by sorting the list of Nodes in this Cluster by highest CPU usage to investigate the issue causing the spike.
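
The percentage shown at the peak is usage relative to capacity. As a minimal sketch of that arithmetic, the following uses hypothetical values chosen to reproduce the 46.5% from the example. The function name and the core counts are illustrative assumptions, not values taken from Kubernetes Monitoring.

```python
def usage_percent(used: float, capacity: float) -> float:
    """Return usage as a percentage of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


# Hypothetical Cluster: 3.72 CPU cores in use out of 8 cores of capacity.
print(f"{usage_percent(3.72, 8.0):.1f}%")  # prints 46.5%
```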

Review and drill into alerts

  1. Sort the Firing Since column of alerts for containers and Pods to focus on either the most current or the oldest alerts that are firing.

    Sorting the list of container alerts by oldest alert firing

  2. Click the container or Pod name to jump directly to the detail page.

    Jumping to the container detail page

Manage alerts

View and respond to all Kubernetes-related alerts from the Alerts page and the Kubernetes Overview home page.

You can also manage preconfigured alerting rules.

Analyze costs

You can review costs at a high level and by Kubernetes object, from Cluster all the way to the container level. On the Cost page, use the Overview and Savings tabs to gain a high-level understanding of what Kubernetes is costing and how you can save. You can also see the cost of each item in a list view as well as on the detail pages.

Understand efficiency and resource use

Use Kubernetes Monitoring to optimize resource usage and efficiency by:

  • Correlating average and maximum resource usage to understand performance and troubleshoot stability issues.
  • Observing resource usage for each Kubernetes object.
  • Discovering any stranded resources in your fleet.

Throughout Kubernetes Monitoring, resource usage statistics are shown for each list item so that you can filter and sort to make the best use of your time.

Discover energy usage

On any detail page, click the Energy tab to view the energy usage of:

  • Workloads and namespaces
  • Clusters
  • Nodes
  • Pods
  • Containers
Energy usage for workloads in a namespace for 24 hours

When you configure Kubernetes Monitoring to gather energy metrics, Kepler gathers and exposes the metrics, and Alloy scrapes them.

Energy metrics are separated into these categories:

Learn what’s predicted

CPU and memory prediction can help you ensure resources are available during spikes in resource usage and help you reduce the amount of unused resources caused by overprovisioning. To use prediction tools, first enable the Machine Learning plugin.

The following buttons are available in various views. Click them to show a prediction for Clusters, namespaces, workloads, Nodes, Pods, and containers:

  • Predict Mem Usage: Shows a predictive graph for memory usage one week in the future. Calculations are based on metrics from the previous week.
  • Predict CPU: Shows a predictive graph for CPU usage one week in the future. Calculations are based on metrics from the previous week.
    Three graph lines showing the actual CPU usage, the lower predicted future usage, and upper predicted future usage
    Predictions for Node CPU Usage

Detect outlier Pod CPU usage

Within a workload detail page, click the Detect Outlier CPU Usage amongst Pods button to identify a Pod that has CPU usage different from the other Pods.

Link to explore outlier detection query
Outlier message and exploration link
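
To make the idea of an outlier Pod concrete, here is a minimal sketch that flags Pods whose CPU usage sits far from the median of their peers, using a modified z-score based on the median absolute deviation. This is a swapped-in illustration, not the algorithm the Machine Learning plugin uses. The function name, the 3.5 threshold, and the Pod names and values are all assumptions.

```python
from statistics import median


def find_outlier_pods(cpu_by_pod: dict[str, float], threshold: float = 3.5) -> list[str]:
    """Return names of Pods whose CPU usage deviates strongly from the median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = list(cpu_by_pod.values())
    med = median(values)
    # Median absolute deviation: the typical distance from the median.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the Pods sit exactly on the median, so any Pod
        # that differs from it is treated as an outlier.
        return [pod for pod, v in cpu_by_pod.items() if v != med]
    return [
        pod
        for pod, v in cpu_by_pod.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical CPU usage (in cores) for the Pods of one workload.
usage = {"web-1": 0.21, "web-2": 0.19, "web-3": 0.20, "web-4": 0.95}
print(find_outlier_pods(usage))  # prints ['web-4']
```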

Use Explore for troubleshooting

Click Explore this query in the Machine Learning plugin to view the raw data and troubleshoot issues. Here you can adjust parameters and see a more detailed graph of the findings.

Raw data, query details, and graph regarding outlier data
Outlier raw data

Analyze historical data

Select a time range to see your historical data for any time frame you choose. As you navigate from page to page, the time range you set stays in effect until you change it.

Time range selector options

As an example, the Pod optimization section of the Pod detail page shows a time range over several hours. You can use this to understand the historical pattern of CPU usage and memory usage.

Graphs showing Pod bursting over CPU request and bursting above memory requests
Pod optimization view on Pod detail page

Zoom into an area of any graph on the detail pages to narrow the time range selector even further. The time range remains selected until you click Back to default.

Narrowing the time range by zooming in on a graph

Find deleted Kubernetes objects

You can find deleted Clusters, namespaces, workloads, Nodes, Pods, and containers to understand what occurred in the past. To do so, set the time range selector to a past time period.

The following example sets the time range to the previous 30 days and then filters for Nodes with the condition “No data”. The Node detail page shows a graph depicting when the Node expired.

Finding deleted nodes using the time range selector and node filter

Note

Grafana Cloud has a default 30-day limit for queries. If your Kubernetes object was deleted more than 30 days before the current date, use the time range selector to choose a specific 30-day time frame in the past.
Choosing a 30-day range with the time range selector
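
To pick a window that respects the 30-day limit, you can work backward from an end date shortly after the object was deleted. The following is a minimal sketch of that date arithmetic. The function name and the dates are hypothetical, and the result is only meant to be entered in the time range selector.

```python
from datetime import datetime, timedelta, timezone

# Default query limit stated in the note above.
QUERY_LIMIT = timedelta(days=30)


def window_ending_at(end: datetime, span: timedelta = QUERY_LIMIT) -> tuple[datetime, datetime]:
    """Return (start, end) for a window of length `span` that ends at `end`."""
    if span > QUERY_LIMIT:
        raise ValueError("span exceeds the 30-day query limit")
    return end - span, end


# Hypothetical case: a Node was deleted around 2024-03-10,
# so choose a window that ends a few days later.
start, end = window_ending_at(datetime(2024, 3, 15, tzinfo=timezone.utc))
print(start.date(), "to", end.date())  # prints 2024-02-14 to 2024-03-15
```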

Discover bare and unmanaged Pods

You can find unmanaged (or static) and bare Pods that have been directly created.

Navigate to the Workloads main page and filter by the Pod type. For example, to locate unmanaged static Pods, filter for StaticPod.

Filtering for static Pod type

View network bandwidth and saturation

Use the network panels to understand when bandwidth limits are causing network saturation, which can lead to dropped packets. On the detail page for any Cluster, namespace, workload, Node, or Pod, click the Network tab to view:

  • Network Bandwidth Rx/Tx: Shows the rate of received and transmitted bytes
  • Network Saturation Rx/Tx dropped packets: Shows the rate of received and transmitted packets dropped
  • Network Bandwidth and Network Saturation by Node, workload, or Pod: Shows the bandwidth and saturation by object
    Network bandwidth and saturation panels for a Cluster
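
The panels above show rates rather than raw totals. As a rough sketch of what a rate means here, the following computes a per-second rate from two samples of an ever-increasing counter, such as total bytes received. The function name and sample values are hypothetical, and this simplified calculation is an assumption for illustration, not the query the panels run.

```python
def per_second_rate(earlier: tuple[float, float], later: tuple[float, float]) -> float:
    """Return the per-second increase between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("later sample must have a larger timestamp")
    if v1 < v0:
        # The counter restarted from zero between samples, so count
        # only what accumulated since the reset.
        return v1 / (t1 - t0)
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples of received bytes taken 60 seconds apart.
print(per_second_rate((0.0, 1_000_000.0), (60.0, 7_000_000.0)))  # prints 100000.0
```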

View logs and events

From any detail page, click the Logs & Events tab to view the logs and events for that Kubernetes object.

Viewing logs for a namespace

Resolve issues with built-in tools

From within the Kubernetes Monitoring app, you can navigate to other capabilities in Grafana Cloud to analyze, troubleshoot, and resolve issues.

Start an automated diagnostic

From a Pod, Cluster, namespace, or workload detail view, you can begin an incident investigation by clicking Run Sift investigation. Sift performs a set of automated system checks and surfaces potential issues in your Kubernetes environment, and works to identify the root cause of an incident.

Opening a Sift investigation for a namespace

Access root cause analysis tool

Note

To access root cause analysis tools in Asserts, enable Asserts on your stack.

You can take troubleshooting deeper by understanding relationships between components and what is occurring between them.

Within Kubernetes Monitoring, access Asserts Workbench to perform root cause analysis. From any list of Clusters, Nodes, workloads, namespaces, or Pods you choose, select the box to the left of the list item, and click the Compare in Asserts Workbench button. The RCA Workbench opens in a new tab.

Selecting Clusters to compare for root cause analysis

Within any details page where the Assertions button appears, click it to continue your investigation into issues.

Jumping to Asserts Workbench to troubleshoot

You can jump to the connections view in Asserts to view connections between entities.

Jumping to view connections in Asserts

Jump to the application layer

On the detail page for a Pod or workload, click Application Observability to navigate directly to more data, such as the service health.

Navigating directly to the Application Observability app

To return to Kubernetes Monitoring, click the browser back button.

View queries to troubleshoot with Explore

To query data further, use any of the Explore buttons available throughout the interface (such as Explore namespaces or Explore alerts). Explore opens with additional query tools for troubleshooting.

Raw query with options to add, view query history, and inspect query
Raw metrics

If you enabled traces when you configured Kubernetes Monitoring, you can view them in Explore:

  1. Click the main menu icon.

  2. Click Explore.

  3. Choose the Tempo data source.

  4. With the TraceQL tab selected, enter your search query.

  5. Click Run query.

    A table of traces appears.

  6. Click a trace to see the detail.

Explore detail page showing table of traces, TraceQL query, and trace graph
View traces

Manage configuration

If you have the admin role, you can manage the configuration of Kubernetes Monitoring by working with:

  • Data source choices
  • Alerts
  • Integration installations
  • Optional custom log queries
  • Configuration instructions for Grafana Kubernetes Monitoring Helm chart to deploy, configure, and keep it up to date

Navigation tips for Kubernetes Monitoring

Here are some tips and shortcuts for getting around in Kubernetes Monitoring.

Jump between main pages

From any main page, click the icon beside the page title to see the menu of all main pages. Then click the page you want to open.

Navigating between main pages

Dock the main menu

To keep the main navigation open:

  1. Click the main menu icon.
  2. Click the menu docking icon to keep the main menu open.
Docking the main menu to stay open

Filter, sort, and set the time range

Use filters and sorting, along with the time range selector, to target the data you want.

Adjusting the time range and filtering for a type of workload

Jump to main lists

From the counts on the Kubernetes Overview home page, click All to see that component’s list of items in your Kubernetes fleet.

Clicking the All link from the Kubernetes Overview page to see a list of all workloads

Control app refresh

You can control the automatic refresh interval of the GUI as well as disable the auto refresh.

Menu for controlling automatic refresh and refresh interval

Use color cues

Throughout the views in Kubernetes Monitoring, you see color used as an additional means of indicating status or condition. For example, sometimes text is a different color for Pod status:

List of workloads with the status of running showing in green
Color coding

Text       Color  Comments
Running    Green  Healthy Pod
Running    Red    Pod failing to start
Failed     Red    Failed Pod
Unknown    White  Pod status unknown
Succeeded  Green  Job Pod successfully run

For more information on Pod status, refer to the Kubernetes documentation on Pod lifecycle.

The following table describes the color indicators for resource capacity and the state of resource usage:

Usage Colors  Usage              Comments
Green         60-90% of maximum  This is the ideal state of resource usage.
Yellow        Below 60%          Low usage percentages indicate that the item might be overprovisioned.
Red           90%+               Your resource usage is close to or above its configured capacity.
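
The usage thresholds in the table above can be sketched as a small function. The function name is hypothetical. The table lists 90% under both the green and red ranges, so treating exactly 90% as red is an assumption made here.

```python
def usage_color(percent_of_max: float) -> str:
    """Map resource usage (percent of maximum) to its color indicator."""
    if percent_of_max < 0:
        raise ValueError("usage cannot be negative")
    if percent_of_max < 60:
        return "yellow"  # low usage: the item might be overprovisioned
    if percent_of_max < 90:
        return "green"   # ideal state of resource usage
    # Assumption: exactly 90% is treated as red.
    return "red"         # close to or above configured capacity


for pct in (35.0, 75.0, 96.0):
    print(pct, usage_color(pct))
```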