Sift investigations
Sift is a diagnostic assistant in Grafana Cloud that runs investigations on your infrastructure telemetry to help you find critical details during incidents. Each investigation runs a series of individual checks. Each check examines one specific aspect of your infrastructure, and together the results help guide your incident response.
Sift checks
Sift offers a range of checks to analyze your system’s telemetry during investigations. These checks include:
Error Pattern Logs: Analyzes error logs, groups similar log lines by their shared pattern, and highlights the groups whose log rate has increased significantly.
HTTP Error Series: Checks for series with elevated HTTP error rates within a specified cluster and namespace.
Kube Crashes: Detects recent container crashes by analyzing Kubernetes metrics and reports the cause of each crash, such as Error or OOMKill.
Log Query: Runs a configurable LogQL query against a Loki instance and displays the results in a configurable format. Useful for recurring queries that you want to run during investigations.
Metric Query: Runs a configurable PromQL query against a Prometheus instance and displays the results in a configurable format. Useful for recurring queries that you want to run during investigations.
Noisy Neighbors: Identifies oversaturated hosts, where load exceeds the number of CPU cores and can cause high latency, and examines the pods on those hosts to help pinpoint the underlying issue.
Recent Deployments: Identifies Kubernetes resources that changed recently, such as service updates or configuration changes.
Resource Contention: Identifies containers with significant CPU throttling caused by reaching their CPU limits, or with significant packet loss caused by networking issues. Unlike the Noisy Neighbors check, where other processes on the underlying infrastructure cause the contention, CPU throttling is caused by the container itself.
Slow Requests: Analyzes traces in Grafana Tempo, a distributed tracing system, to identify requests taking longer than a specified threshold (default: 3 seconds).
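The idea behind the Error Pattern Logs check can be illustrated with a minimal sketch. It assumes a simple pattern rule that masks numbers and hex values, and two windows of error logs of equal length (a baseline window and the investigation window), so raw counts are comparable as rates. The function names, the masking rule, and the `min_ratio` threshold of 2.0 are hypothetical and not taken from Sift.

```python
import re
from collections import Counter

# Hypothetical masking rule: replace hex values and numbers with a wildcard
# so that lines differing only in those values share a pattern.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)?)\b")


def pattern_of(line: str) -> str:
    """Reduce a log line to its pattern by masking variable tokens."""
    return _VARIABLE.sub("<*>", line)


def increased_patterns(baseline, current, min_ratio=2.0):
    """Return (pattern, ratio) pairs whose count grew by at least min_ratio.

    Assumes `baseline` and `current` are lists of error log lines collected
    over windows of equal length.
    """
    base_counts = Counter(pattern_of(line) for line in baseline)
    current_counts = Counter(pattern_of(line) for line in current)
    increased = []
    for pattern, count in current_counts.items():
        # Add-one smoothing keeps the ratio finite for patterns that are
        # absent from the baseline window.
        ratio = count / (base_counts[pattern] + 1)
        if ratio >= min_ratio:
            increased.append((pattern, ratio))
    # Largest increase first.
    return sorted(increased, key=lambda item: item[1], reverse=True)


baseline = [
    "timeout calling payments after 3000 ms",
    "cache miss for key 42",
    "cache miss for key 7",
]
current = [
    "timeout calling payments after 3120 ms",
    "timeout calling payments after 3050 ms",
    "timeout calling payments after 4010 ms",
    "timeout calling payments after 3900 ms",
    "cache miss for key 13",
]
for pattern, ratio in increased_patterns(baseline, current):
    print(f"{ratio:.1f}x  {pattern}")
```

Here the timeout pattern grows from one line to four and is flagged, while the cache-miss pattern does not grow and is not. Real pattern extraction is usually more sophisticated than a single regular expression.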
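The Kube Crashes check can be sketched as follows. The sketch assumes you already have, for each container, the increase in its restart count over the investigation window and the reason for its last termination. In a cluster running kube-state-metrics, one possible source is the `kube_pod_container_status_restarts_total` and `kube_pod_container_status_last_terminated_reason` metrics, but whether Sift uses these metrics is an assumption. The function name is hypothetical.

```python
def recent_crashes(restart_increase, last_terminated_reason):
    """Return (container, reason) pairs for containers that restarted.

    restart_increase: container name -> restart count increase in the window
    last_terminated_reason: container name -> reason string,
        for example "Error" or "OOMKilled"
    """
    crashes = []
    for container, increase in sorted(restart_increase.items()):
        if increase > 0:
            # Fall back to "Unknown" when no termination reason was recorded.
            reason = last_terminated_reason.get(container, "Unknown")
            crashes.append((container, reason))
    return crashes


restarts = {"api": 2, "db": 0, "worker": 1, "cron": 3}
reasons = {"api": "OOMKilled", "worker": "Error"}
print(recent_crashes(restarts, reasons))
```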
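The saturation rule described for the Noisy Neighbors check (load exceeding the CPU core count) can be sketched as follows. The sketch assumes each host reports a load value, such as a 1-minute load average, and its core count, and that you know which host each pod runs on. The `Host` and `Pod` types and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    load: float  # assumed input, e.g. a 1-minute load average
    cpu_cores: int


@dataclass
class Pod:
    name: str
    host: str  # name of the host the pod is scheduled on


def saturated_hosts(hosts):
    """Hosts whose load exceeds their CPU core count."""
    return [h for h in hosts if h.load > h.cpu_cores]


def pods_on_saturated_hosts(hosts, pods):
    """Map each saturated host to the pods scheduled on it."""
    by_host = {h.name: [] for h in saturated_hosts(hosts)}
    for pod in pods:
        if pod.host in by_host:
            by_host[pod.host].append(pod.name)
    return by_host


hosts = [Host("node-a", load=12.5, cpu_cores=8), Host("node-b", load=3.0, cpu_cores=8)]
pods = [Pod("api-1", "node-a"), Pod("db-1", "node-b"), Pod("worker-1", "node-a")]
print(pods_on_saturated_hosts(hosts, pods))
```

Only `node-a` is saturated (a load of 12.5 on 8 cores), so only its pods are returned for further examination.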
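For the CPU throttling part of the Resource Contention check, a common measure is the fraction of CFS scheduling periods in which a container was throttled. In cAdvisor metrics, this corresponds to the increase of `container_cpu_cfs_throttled_periods_total` divided by the increase of `container_cpu_cfs_periods_total`. The sketch below assumes those increases are already available per container. The 0.25 threshold and the function names are hypothetical, and Sift's actual criteria may differ.

```python
def throttling_ratio(throttled_periods: float, total_periods: float) -> float:
    """Fraction of CFS scheduling periods in which the container was throttled."""
    if total_periods <= 0:
        # No CFS periods elapsed (for example, no CPU limit is set),
        # so nothing was throttled.
        return 0.0
    return throttled_periods / total_periods


def throttled_containers(samples, threshold=0.25):
    """Names of containers whose throttling ratio exceeds `threshold`.

    samples: container name -> (throttled_periods, total_periods), both
    measured as increases over the same window.
    """
    return sorted(
        name
        for name, (throttled, total) in samples.items()
        if throttling_ratio(throttled, total) > threshold
    )


samples = {"api": (40, 100), "db": (5, 100), "sidecar": (0, 0)}
print(throttled_containers(samples))
```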
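The threshold filter in the Slow Requests check can be sketched as follows. The sketch assumes request durations have already been extracted from traces in Tempo and does not cover how to fetch them. The `Request` type and the function name are hypothetical. Only the 3-second default comes from the check description.

```python
from dataclasses import dataclass

# Default threshold from the check description: 3 seconds.
DEFAULT_THRESHOLD_S = 3.0


@dataclass
class Request:
    trace_id: str
    operation: str
    duration_s: float


def slow_requests(requests, threshold_s=DEFAULT_THRESHOLD_S):
    """Requests that took longer than threshold_s, slowest first."""
    slow = [r for r in requests if r.duration_s > threshold_s]
    return sorted(slow, key=lambda r: r.duration_s, reverse=True)


requests = [
    Request("a1", "GET /checkout", 4.2),
    Request("b2", "GET /health", 0.01),
    Request("c3", "POST /orders", 3.5),
]
for r in slow_requests(requests):
    print(r.trace_id, r.operation, r.duration_s)
```

Sorting slowest first puts the most severe outliers at the top, which is usually where an investigation starts.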