Detection Pipeline Execution Flow

Background & Motivation

The current detection pipeline does not support business rules well. The user base of ThirdEye is growing constantly, and many anomaly detection use cases come with an additional set of business rules. For example, the growth team wants to filter out anomalies whose site-wide impact is less than a certain threshold, and needs to group anomalies across different dimensions. Some teams want to set up anomaly detection using threshold rules, or want to use a fixed list of sub-dimensions to monitor or ignore. In the current pipeline, satisfying one specific business logic requirement means updating the pipeline flow itself, because the anomaly pipeline is monolithic and does not allow customized business rule plugins. Moreover, the current pipeline lacks testing infrastructure: there is no way to test a pipeline change, which makes changes extremely hard.

Users also have no way to configure their business rules without reaching out to the pipeline maintainer to change the config JSON manually, which is not scalable. Some users, although they have specific inclusion/exclusion rules, still want to utilize the auto-tuned, algorithm-based detection pipeline. This is not achievable in the current pipeline.

Due to the limitations described above, we introduce the composite pipeline flow in the new detection pipeline framework to achieve the following goals:

  • More flexibility in adding user-defined business rules to the pipeline
  • User-friendly configuration of the detection rules
  • Robustness and testability

Design & Implementation

Composite pipeline flow

The pipeline flow is shown below.

https://user-images.githubusercontent.com/4448437/88265403-48f43480-cc82-11ea-9efe-1c30016a6669.png

Dimension Exploration:

Dimension drill-down for user-defined dimensions. For example, explore the country dimension where continent = Europe.

Dimension Filter:

Filter out dimensions based on business logic criteria. For example, only explore dimension values whose contribution to the overall metric is greater than 5%.
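
For illustration only, a minimal sketch of such a contribution-based filter (the class and method names here are assumptions, not the actual ThirdEye interface):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: keep only dimension values that contribute more than a minimum
// fraction of the overall metric, so that small slices are not explored further.
final class ContributionDimensionFilter {
  static Map<String, Double> filter(Map<String, Double> contributionByDimension,
                                    double minFraction) {
    double total = contributionByDimension.values().stream()
        .mapToDouble(Double::doubleValue).sum();
    Map<String, Double> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : contributionByDimension.entrySet()) {
      if (total > 0 && e.getValue() / total > minFraction) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}
```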

Rule Detection:

User-specified rules for anomaly detection. For example, if the week-over-week (WoW) percentage change is greater than 10%, fire an anomaly.
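
As a rough sketch of what such a rule could look like (the class name and method are assumptions, not the actual rule detector implementation):

```java
// Sketch of a percentage-change rule: compare the current value to the value
// one week earlier and flag an anomaly when the change exceeds a threshold.
public class PercentageChangeRuleDetector {
  private final double threshold; // e.g. 0.10 for a 10% WoW change

  public PercentageChangeRuleDetector(double threshold) {
    this.threshold = threshold;
  }

  /** Returns true if the week-over-week change exceeds the threshold. */
  public boolean isAnomaly(double current, double baselineLastWeek) {
    if (baselineLastWeek == 0) {
      return false; // cannot compute a percentage change against a zero baseline
    }
    double change = (current - baselineLastWeek) / baselineLastWeek;
    return change > threshold;
  }
}
```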

Algorithm Detection:

Existing anomaly detection algorithms, such as sign test, spline regression, etc.

Algorithm alert filter:

Existing auto-tune alert filters.

Merger:

For each dimension, merge anomalies based on time. See the more detailed discussion of the merging logic below.

Rule filter:

An exclusion filter defined by users to filter out the anomalies they don’t want to receive. For example, if the site-wide impact of this metric in this dimension is less than 5% within the anomaly time range, don’t classify it as an anomaly.
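
A minimal sketch of such an exclusion filter, with assumed names rather than the real filter interface:

```java
// Sketch: drop anomalies whose contribution to the site-wide metric over the
// anomaly window is below a configured fraction (e.g. 0.05 for 5%).
public class SiteWideImpactRuleFilter {
  private final double minImpact;

  public SiteWideImpactRuleFilter(double minImpact) {
    this.minImpact = minImpact;
  }

  /**
   * @param dimensionValue metric value for the anomalous dimension over the anomaly range
   * @param siteWideValue  overall metric value over the same range
   * @return true if the anomaly should be kept
   */
  public boolean isQualified(double dimensionValue, double siteWideValue) {
    if (siteWideValue == 0) {
      return false;
    }
    return Math.abs(dimensionValue / siteWideValue) >= minImpact;
  }
}
```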

Grouper:

Groups anomalies across different dimensions.

The algorithm detection and alert filter stages will provide backward compatibility with the existing anomaly function interface.

For each stage, we provide interfaces so that the stages are pluggable. Users can provide any kind of business logic to customize each stage. The details of the interfaces are listed on this page: Detection pipeline interfaces.
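
As a rough illustration of the shape these pluggable interfaces could take (the names and signatures below are assumptions; see the Detection pipeline interfaces page for the actual definitions):

```java
import java.util.List;

// Sketch of pluggable stage contracts; the real interfaces may differ in shape.
interface Anomaly {
  long getStartTime();
  long getEndTime();
}

/** Produces anomalies for a dimension slice, e.g. a rule or an algorithm detector. */
interface DetectionStage {
  List<Anomaly> detect(long windowStart, long windowEnd);
}

/** Decides whether a detected anomaly should be kept (inclusion/exclusion rules). */
interface FilterStage {
  boolean isQualified(Anomaly anomaly);
}

/** Combines anomalies across different dimensions, e.g. the grouper stage. */
interface GrouperStage {
  List<Anomaly> group(List<Anomaly> anomalies);
}
```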

Pros of this pipeline:

  • Users can define inclusion rules to detect anomalies.
  • Users won’t receive the anomalies they explicitly filtered out if they set up the exclusion rule-based filter.
  • For each dimension, users won’t see duplicated anomalies generated by the algorithm & rule pipelines for any time range, since they are merged based on time.

Alternative pipeline flow designs:

https://user-images.githubusercontent.com/4448437/88265408-4b568e80-cc82-11ea-83e7-a833663a68ed.png

Pros of this pipeline:

  • Users can define inclusion rules to detect anomalies.
  • Users won’t receive the anomalies they explicitly filtered out if they set up the exclusion rule-based filter.
  • Users won’t see duplicated anomalies generated by the algorithm & rule pipelines, since they are merged based on time.

Cons of this pipeline:

  • The algorithm alert filter might filter out the anomalies generated by user-specified rules, i.e. users could miss anomalies they want to see.

https://user-images.githubusercontent.com/4448437/88265411-4e517f00-cc82-11ea-947a-04bee30ca08c.png

Pros of this pipeline:

  • Users can define inclusion rules to detect anomalies.
  • Users won’t see duplicated anomalies generated by the algorithm & rule pipelines, since they are merged based on time.

Cons of this pipeline:

  • Users will still see the anomalies they set rules to explicitly filter out, because the anomalies generated by the algorithm detection pipeline are not filtered by the user’s exclusion rules.

As discussed above, we recommend using the first design as the default. The detection framework itself still has the flexibility to execute different types of flows if needed later.

Merging logic

Merging happens either when merging anomalies within a single rule/algorithm detection flow or when merging anomalies generated by different flows. The merger’s behavior is slightly different in these two cases.

Merging only rule-detected anomalies or only algorithm-detected anomalies

Do time-based merging only. Do not keep the pre-merge anomalies.
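
A minimal sketch of this time-based merging, assuming a simple time-range type (names are illustrative only):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: coalesce anomalies whose [start, end] ranges overlap or touch.
// In this mode the original (pre-merge) anomalies are discarded.
final class TimeRange {
  final long start;
  final long end;

  TimeRange(long start, long end) {
    this.start = start;
    this.end = end;
  }
}

final class TimeBasedMerger {
  static List<TimeRange> merge(List<TimeRange> anomalies) {
    List<TimeRange> sorted = new ArrayList<>(anomalies);
    sorted.sort(Comparator.comparingLong((TimeRange a) -> a.start));
    List<TimeRange> merged = new ArrayList<>();
    for (TimeRange current : sorted) {
      if (merged.isEmpty() || current.start > merged.get(merged.size() - 1).end) {
        merged.add(current); // no overlap with the previous range
      } else {
        TimeRange last = merged.remove(merged.size() - 1);
        merged.add(new TimeRange(last.start, Math.max(last.end, current.end)));
      }
    }
    return merged;
  }
}
```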

Merging both rule-detected anomalies and algorithm-detected anomalies

There will be 3 cases when merging two anomalies:

https://user-images.githubusercontent.com/4448437/88265414-501b4280-cc82-11ea-904e-83fd54e3a157.png

Possible solutions for case 2:

1. Merge all time intervals in both anomalies.

In this example, A-D will be sent as the anomaly.

Pros:

  • Users will not receive duplicated anomalies for any specific range.
  • Improves the recall.

Cons:

  • Users will receive an extended anomaly range, i.e. a longer period to investigate.

2. Only classify the overlapping interval as an anomaly.

In this example, C-B will be sent as the anomaly.

Pros:

  • Users will not receive duplicated anomalies for any specific range.
  • Improves the precision. The anomaly range is shortened, so users have a shorter period to investigate.

Cons:

  • Users could miss the anomaly period they explicitly set rules to detect, because the merger might chop off part of the anomaly period. This reduces the recall.

3. Don’t merge; send two anomalies.

In this example, A-B and C-D will be sent as two anomalies.

Pros:

  • Improves the recall.

Cons:

  • Users will receive duplicated anomalies for a specific time range, in this example C-B.
  • Users have more workload to investigate because there are more anomalies.

As discussed above, we set the merger to behave like solution 1 by default, i.e. the merger merges the time periods. The merger will keep the pre-merge anomalies as child anomalies. This allows tracing back to the anomalies generated by different algorithms/rules.
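
For illustration, a sketch of this default behavior (hypothetical names, consistent with the earlier snippets): the merged anomaly spans the union of both time ranges and keeps the originals as children, so the rule or algorithm that produced each piece can still be traced.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the default merge behavior (solution 1): the merged anomaly covers
// the union of the two time ranges and keeps the pre-merge anomalies as children.
final class MergedAnomaly {
  final long start;
  final long end;
  final List<MergedAnomaly> children = new ArrayList<>();

  MergedAnomaly(long start, long end) {
    this.start = start;
    this.end = end;
  }

  static MergedAnomaly merge(MergedAnomaly a, MergedAnomaly b) {
    MergedAnomaly merged =
        new MergedAnomaly(Math.min(a.start, b.start), Math.max(a.end, b.end));
    merged.children.add(a); // e.g. the rule-detected anomaly A-B
    merged.children.add(b); // e.g. the algorithm-detected anomaly C-D
    return merged;
  }
}
```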