Alerts can be both the best and worst friends of software engineers. When accurate, an alert enables immediate intervention to salvage a troubled system. However, when tens of alerts fire at once, it becomes significantly harder to determine where to start investigating. At a certain point, even acknowledging alerts and communicating across teams can become a head-scratching burden. At New Relic, we correlate relevant alerts to help identify the root cause faster, thus shortening mean time to resolution (MTTR). New Relic also offers correlation as a product to our customers.

What is alert correlation?

Alert correlation is a strategy for grouping related alerts together. New Relic Decisions (also known as correlation rules) let you classify alerts into meaningful groups, with each group establishing a relationship among its alerts.

Key benefits of using correlation include:

  • Noise reduction: Fewer alert notifications and acknowledgements for a single outage.
  • Better situational awareness: Related information is gathered in one view, so real signals are not missed amid noisier alerts.
  • Improved cross-party communications: Visuals showing incident status timelines and impacted components are more efficient than wordy descriptions.
  • Faster root cause identification: Mapping correlation patterns to incident scenarios shortens the time needed to uncover the root problem.

New Relic correlates similar or related incidents into an issue. Click here to find out more about New Relic’s alerting lifecycle.
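As a rough mental model (not New Relic's actual data model; the class and field names below are made up for illustration), you can think of an issue as a container for correlated incidents:

```python
# Hypothetical sketch: several related incidents roll up into one issue.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Incident:
    entity_name: str
    title: str

@dataclass
class Issue:
    incidents: List[Incident] = field(default_factory=list)

issue = Issue(incidents=[
    Incident("checkout-service", "High error rate"),
    Incident("checkout-service", "Response time degraded"),
])
print(f"1 issue, {len(issue.incidents)} correlated incidents")
```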

Patterns of bursting alerts and how correlation helps

Common alert configurations and scenarios are listed below. New Relic's correlation capabilities improve visibility and reduce MTTR in these situations out of the box, because a set of global rules is configured by default.

  • Alerts for the same entity. Following engineering best practices, a single entity can have a collection of active alert conditions to detect various violation scenarios.

    For instance, during traffic spikes, abnormal throughput alerts may activate, resource usage could sharply increase, and the service's requests may quickly overwhelm its database, leading to timeouts and various errors. Response time would slow down, and additional alerts could be triggered from pub/sub lag, synthetics, Apdex score, rate limiting breaches, and more. In a short period, you may receive a dozen alerts for just a single service!

    While all alerts are useful to characterize the issue and understand its impact, some are closer to the root cause than others. If the initial service resource usage alerts are addressed while overlooking the fatal alerts, the on-call engineer might mistakenly increase service resources, which won’t solve the problem. It’s crucial that these alerts are grouped and examined together along the timeline, and not viewed as separate issues.

    In this situation, the default-enabled global rule Same New Relic Target Name (Non-NRQL) largely has you covered. It correlates incidents that share an entity name (a conceptual sketch follows this list).

  • Alerts across the same type of entity. Correlation is especially useful for relating datacenter-scale incidents. For instance, your synthetics monitor might fail simultaneously in both Atlanta and London. The default global rule Same Secure Credential, Public Location and Type captures this situation precisely.

The image below shows monitor failure incidents correlated across multiple locations:

Other global rules, such as Similar Issue Structure and Similar Alert Message, also fit this pattern while offering greater flexibility in correlation depth, since they are similarity-based.

  • Alerts across various entities impacted directly or indirectly by the same source. They often share similar symptoms and common attributes. For example, applications making requests to a failed common service may exhibit similar patterns of increased error rates. Global rules such as Similar Alert Message, Same New Relic Condition and Title, and Same New Relic Condition and Deep Link Url can be highly effective in detecting these correlations.

  • Alerts between entities with relationships. A typical scenario involves a host and its hosted applications. When an infrastructure host experiences downtime, all of its hosted applications may generate alerts; tracing what's wrong within the applications themselves will only delay resolution and recovery. We highly recommend constructing custom rules or leveraging topology correlation (discussed in more detail shortly). The image below shows a cross-party incident and its topology map:

Of course, sometimes your outage may be a mix of these patterns. Correlating and analyzing all the relevant alert incidents in a unified visual timeline helps identify the root cause much faster than independently tracking scattered alerts.
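To make the grouping idea concrete, here is a minimal, hypothetical sketch of how a same-entity rule conceptually rolls incidents into candidate issues. The incident shape and field names are illustrative assumptions, not New Relic's implementation:

```python
# Hypothetical sketch of same-entity correlation: incidents sharing an entity
# name are grouped into one candidate issue.
from collections import defaultdict

def group_by_entity(incidents):
    """Group incident dicts that share an entity name."""
    issues = defaultdict(list)
    for incident in incidents:
        issues[incident["entity_name"]].append(incident)
    return dict(issues)

incidents = [
    {"entity_name": "checkout-service", "title": "High error rate"},
    {"entity_name": "checkout-service", "title": "Apdex score below threshold"},
    {"entity_name": "inventory-service", "title": "High CPU usage"},
]

for entity, grouped in group_by_entity(incidents).items():
    print(entity, "->", [i["title"] for i in grouped])
```

The real rules add more dimensions (conditions, locations, similarity scores, topology), but the underlying idea is the same: a shared attribute turns scattered incidents into a single view.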

Real-world scenario

A system with many dependencies benefits greatly from correlation, because incidents in such systems tend to be the hardest to resolve. We use many of New Relic's correlation capabilities on our own pipelines. The diagram below shows a small segment of our correlation pipeline: two services (ingest-service and evaluator) make requests to another service, Depot, and each of these services owns its own data store.

Therefore, bottlenecks could originate from any of these services or their dependencies (whether Depot, upstream, or side dependencies). Even if a particular service fires alerts, that doesn't necessarily mean the service itself triggered the root problem. Although our team is familiar with the known failure patterns, that alone isn't enough to identify the specific issue by just reading a few alerts; we first need to match those patterns against the timeline and variety of the alerts.

Since each of our services has multiple active alert conditions (usage, response time, error rate, and so on), we also leverage the same-entity correlation pattern (the first pattern in the section above) to correlate incidents together and form a single view.

The procedure we use to digest alerts resembles a decision tree and is an essential part of our runbook. The following shows a simplified flow (colored boxes highlight a particular outage pattern for illustration):

Each step can be answered using the issue's visual timeline (shown below) and the map of impacted entities.

The image below shows an outage in which multiple components triggered resource and miscellaneous alerts. While both the Depot DB and the ingest-service DB fired alerts, neither showed “corrupted logging” alerts. Since the Depot service and DB connection violations occurred before the fatal alerts from the dependent ingest-service and its store, we quickly identified the Depot DB as the primary bottleneck and scoped the resolution (out of 49 incidents!). Despite early and interleaved alerts from various components, timeline pattern matching spared us a detour on the way to the root cause and recovery.

To understand the sequence of events, visualize the incident timeline.

The above covers just a single system. During a larger-scale, multi-party outage, communication is key, and crisp correlated facts and views speak louder and faster than wordy descriptions. Though it can be challenging to determine correlation patterns and define rules across teams, it is gradually achievable if you start with boundary services and their typical failure modes.

How to configure correlations to achieve sound coverage and accuracy

You can achieve this kind of system-level correlation coverage and strengthen systematic incident pattern matching by configuring correlations thoughtfully. Below are a few ways to do so.

Add filters to enforce isolation and custom tagging

Dependencies in your systems don't necessarily mean all related incidents have to be grouped together. For example, a system could be managed across teams with different notification preferences, or multi-point failures might need separate handling.

To avoid undesired correlations, you should almost always consider applying filters. Specific requirements vary greatly among users, and a certain level of isolation is often needed, for example between different teams, alert priorities, regions, subsystems, or clusters.

Users seeking finer granularity may want to define their own patterns: simply tag the entities of interest or add attributes to alert events. Other use cases include labeling and joining alerts generated by different New Relic capabilities; for example, you can tag synthetics and APM alerts for the same entity.
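As a rough sketch of this kind of isolation (the `team` tag and the incident shape are illustrative assumptions, not New Relic's data model), a correlation predicate might require incidents to share a team before any other matching applies:

```python
# Hypothetical sketch: a filter that enforces isolation between teams.
def should_correlate(incident_a, incident_b):
    """Example predicate: matching titles, but only within the same owning team."""
    same_team = incident_a["tags"].get("team") == incident_b["tags"].get("team")
    similar_title = incident_a["title"] == incident_b["title"]
    return same_team and similar_title

a = {"title": "High error rate", "tags": {"team": "pipeline"}}
b = {"title": "High error rate", "tags": {"team": "storage"}}
print(should_correlate(a, b))  # False: same symptom, but different teams stay separate
```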

Timing window

Without a guard on when to stop correlating, an issue might remain open unnecessarily long, absorbing multiple rounds of incidents. Especially for higher-priority incidents, the window helps avoid correlations that are pure coincidence.

It's often essential to incorporate your alerting boundaries into a rule's timing window. How fast do errors or downtime spread through your system? For instance, if an outage pattern usually spreads within 10 minutes (a simple way to estimate this is from your aggregated SLAs and alerting sensitivity), set the rule's timing window to no more than 20 minutes to avoid grouping coincidental alerts. The initially grouped alerts should suffice to track down the root cause.
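Conceptually, the timing window acts as a gate on top of any other matching logic. Here is a minimal sketch, assuming incident open timestamps are available and using the 20-minute figure from the example above:

```python
# Hypothetical sketch of a timing-window guard: incidents are only eligible for
# correlation if they opened within WINDOW of each other.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=20)

def within_window(opened_a: datetime, opened_b: datetime) -> bool:
    return abs(opened_a - opened_b) <= WINDOW

t1 = datetime(2024, 1, 1, 12, 0)
t2 = datetime(2024, 1, 1, 12, 35)
print(within_window(t1, t2))  # False: 35 minutes apart, likely coincidental
```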

Generalize rules to capture symptoms

If different parts of your system show similar symptoms, they probably have similar alert conditions set up. Generalization involves systematically grouping similar alerts to reveal particular failure symptoms. For example, to monitor error-related alerts across services, you can use regex matching to cover similar titles, conditions, or other key attributes. Here’s a simple rule expression to capture the “error” pattern:
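As a rough illustration of the same idea in ordinary code (this is not New Relic's actual rule-expression syntax), a generalized "error" pattern might look like this:

```python
# Hypothetical sketch: a regex that treats any error-flavored title as the same
# generalized symptom, regardless of which service emitted it.
import re

ERROR_PATTERN = re.compile(r"error|exception|timeout", re.IGNORECASE)

titles = [
    "Error rate > 5% on ingest-service",
    "Depot DB connection timeouts",
    "High CPU usage on evaluator",
]

matches = [t for t in titles if ERROR_PATTERN.search(t)]
print(matches)  # the first two titles share the generalized "error" symptom
```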

Finally, how do you validate your assumptions before applying these rules in production? Correlation provides a simulation capability that measures the correlation rate and presents example incidents. It's also what we consult when troubleshooting why a correlation did or didn't happen. Correlation accuracy is vital for balancing alerting granularity and response efficiency, so we suggest validating your rules to avoid surprises and frustration.

Conclusion

When encountering bursting alerts, it can be very helpful to have assistance in untangling complicated situations. Correlation is a great tool for establishing and demonstrating relationships between alerts, surfacing useful signals and patterns from the noise, and determining the root cause faster. Its features, combined with your experience, can significantly shorten the path to the root cause and reduce MTTR. We hope this article helps you blend your wisdom into fine-tuning correlations, and lets the correlation tool do the presenting for you. Isn't that what empowered observability is about?