Mirror of https://github.com/roostorg/coop
import binomialTest from '@stdlib/stats-binomial-test';
import _ from 'lodash';

import { RuleAlarmStatus } from '../moderationConfigService/index.js';

const { sum } = _;

/**
 * Determine whether we should flag the rule as being 'in alarm', meaning that
 * its most-recent pass rate looks anomalous.
 *
 * I'm not exactly sure what the math should be here, but it seems like there
 * are two ways to view a rule's pass rate.
 *
 * One is to imagine that there's some static probability of a rule passing, so
 * the number of passes you'd expect to observe would be given by a binomial
 * distribution, which you're sampling (historically and in the latest period).
 * Then, you can look at how many passes were in the latest period and ask (in
 * various ways) if that's consistent w/ the pass rate observed historically.
 *
 * The second view is to see a rule's pass rate as a variable that changes over
 * time, with an independent "seasonal" component (i.e., different cohorts of
 * users regularly come onto the platform at different times of day/days of the
 * week) and a "trend" component (e.g., the pass rate may be going down over
 * time as bad users get deleted). Then, you forecast the next period's pass
 * rate, with a confidence interval on the forecast, and use the error between
 * the forecast and the actual pass rate to detect anomalies.
 *
 * The second approach feels more accurate, but also obviously more complicated.
 * The statistics are beyond me, and I couldn't find any off-the-shelf packages
 * in JS for decomposing the historical data into these trend + seasonal +
 * residual components (they're all in Python or R).
 *
 * My hacky approach was gonna be to check if the difference between this hour's
 * pass rate and last hour's pass rate was abnormally large (by estimating a
 * normal distribution for the rule's hour-by-hour _pass rate change_, from the
 * samples), and then mark the rule as in alarm if the pass rate went up a lot.
 * The idea was that, by comparing the current hour to the last hour (when
 * finding the pass rate change), we'd be assuming that the best estimate for
 * this hour's pass rate -- if things aren't anomalous -- is last hour's pass
 * rate, which thereby accounts for "seasonal" effects that we might miss if we
 * used some longer-run average as the starting point for estimating what this
 * period's pass rate "should be" (a la the 'static pass rate' model).
 *
 * But one issue is that the last hour's pass rate could also be abnormally low
 * by chance, which we wouldn't have alerted on (since we only alert if it looks
 * like the pass rate is unusually _high_), so then simple regression to the
 * mean in the next period would result in a big difference that could trigger
 * an alert. Relatedly, some rules, especially at some hours, really don't run
 * that many times, so using the hour-by-hour pass rates is not very reliable or
 * accurate (doubly so if we start onboarding smaller users), even though
 * pass rates are very nice in their ability to abstract away changes in the
 * absolute amount of usage at each hour, which we don't care about (and which
 * does change a lot). Worse, many rules don't pass at all in a given hour,
 * which totally messes this approach up.
 *
 * There's also a bigger issue with looking at the pass rate as a time series,
 * which is that, if we go into alert mode when the pass rate jumps, we don't
 * want to then automatically go out of alert mode while the anomaly is still
 * occurring (i.e., we don't want to too-quickly 'learn' the anomalous value as
 * the 'correct' value, which could happen if we use the last value as the basis
 * for our prediction of the next value and the anomaly lasts for over an hour).
 *
 * So, for now, I'm sticking with the more naive option (1), but, to kinda-sorta
 * try to capture part of this seasonality idea, while avoiding the trap of an
 * anomalous most-recent hour(s), I'm very slightly weighting more recent data
 * more heavily when calculating the historical pass rate. This is kinda like
 * "exponential smoothing" in time series analysis.
 *
 * @param stats An array where each item represents one time period, with newer
 * time periods first. Within an item, the keys represent how many passes and
 * rule executions there were in that time period.
 * @param confidence A value between 0 and 1 that represents how confident we
 * have to be before we consider the rule in alarm.
 */
export default function getRuleAlarmStatus(
  stats: { passes: number; runs: number }[],
  confidence = 0.995,
): RuleAlarmStatus {
  const [lastPeriod, ...priorPeriods] = stats;

  // alpha is the exponential smoothing factor (i.e., the weights exponentially
  // decay such that each period gets (1 - alpha) times the weight of the next,
  // more-recent one). alpha is quite low because, again, we don't want to
  // learn an anomaly as the new 'real' value, and because any 'seasonal'
  // changes should be small (so we don't actually need to give much more
  // weight to recent data). A .02 smoothing factor gives each prior period 98%
  // as much weight as the next (i.e., more-recent) period.
  //
  // NB: We're throwing away the _number of weighted passes and runs_, and just
  // looking at the weighted _pass rate_. If we ever want to do a statistical
  // test that relies on the absolute number of (weighted) prior runs to derive
  // some uncertainty bounds, we'll need to scale the weights to sum to 1 first.
  const alpha = 0.02;
  const weights = priorPeriods.map((_, i) => (1 - alpha) ** i);
  const historicalWeightedPassRate =
    sum(priorPeriods.map((it, i) => it.passes * weights[i])) /
    sum(priorPeriods.map((it, i) => it.runs * weights[i]));
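A quick toy illustration of the weighting above (the period counts are made up, and the helpers are renamed so the sketch is self-contained): with alpha = 0.02, the most recent prior period gets weight 1, the next-oldest 0.98, then 0.9604, and so on.

```typescript
// Hypothetical toy data showing how the exponentially decayed weights
// combine into a single weighted pass rate, mirroring the computation above.
const toyAlpha = 0.02;
const toyPriorPeriods = [
  { passes: 3, runs: 1000 }, // most recent prior period (weight 1)
  { passes: 2, runs: 900 }, // one period older (weight 0.98)
  { passes: 4, runs: 1100 }, // two periods older (weight 0.9604)
];
const toyWeights = toyPriorPeriods.map((_, i) => (1 - toyAlpha) ** i);
const toySum = (xs: number[]) => xs.reduce((a, b) => a + b, 0);
const toyWeightedPassRate =
  toySum(toyPriorPeriods.map((it, i) => it.passes * toyWeights[i])) /
  toySum(toyPriorPeriods.map((it, i) => it.runs * toyWeights[i]));
// = (3*1 + 2*0.98 + 4*0.9604) / (1000*1 + 900*0.98 + 1100*0.9604)
// ≈ 8.8016 / 2938.44 ≈ 0.003
```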

  // We've got a few options for the kind of test we can do here:
  //
  // 1) we could just assume that `historicalWeightedPassRate` is the
  // underlying pass rate for the rule (even though it's actually just
  // an estimate, which might be off if we don't have much historical data),
  // and then do an exact binomial test against lastPeriod's sample.
  //
  // 2) we could find the difference in pass rates between lastPeriod and the
  // historical data, and assume that that difference is normally distributed.
  // this would account for the sample size of the historical data.
  // (See https://www.itl.nist.gov/div898/handbook/prc/section3/prc33.htm)
  // the issue is that our binomial distributions are hella skewed (because
  // the pass rate for some rules is very, very close to zero), so assuming
  // the pass rate difference is normally distributed isn't strictly right
  // iiuc. there is (luckily) an off-the-shelf heuristic for when it's good
  // enough, which is when there are at least 10 passes in both samples.
  //
  // 3) we could run one of the exact tests that work around the fact that our
  // historical proportion is just an estimate, using different approaches.
  // see https://stats.stackexchange.com/a/551617/277172. the issue here is
  // that these are complicated and/or slow.
  //
  // For now, I just stick with the simplest option (1), but return early if it
  // looks like we might not have enough historical data. The threshold for
  // "enough" is hard to find because, if we want our estimated pass rate to be
  // off by no more than 10%, 95% of the time, the number of raw runs we'd need
  // could vary _wildly_ based on the rule's true pass rate. Our rule pass rates
  // seem to be as low as 1 in 100,000, which would require 250,000 runs to have
  // a reliable estimate, which is gonna be too many for smaller users and
  // is unnecessary for rules that pass much more often (e.g., a rule that
  // passed 5% of the time, which is the highest pass rate I can imagine, would
  // only need ~8000 runs, and ~4000 if we only care about our estimate being
  // too high, since we're only alerting when the threshold is exceeded, not
  // when the pass rate is anomalously low). However, if we say that the rule
  // must have been run at least 4000 times, and passed more than twice or been
  // run 125,000+ times, then this covers both extremes, I think. So, we require
  // that + at least 24 periods of data, to hopefully capture some seasonality.
  const priorPasses = sum(priorPeriods.map((it) => it.passes));
  const priorRuns = sum(priorPeriods.map((it) => it.runs));
  const hasAdequateData =
    priorPeriods.length >= 24 &&
    priorRuns >= 4000 &&
    (priorPasses > 2 || priorRuns > 125000);

  if (!hasAdequateData) {
    return RuleAlarmStatus.INSUFFICIENT_DATA;
  }

  // Even with our heuristics above, there are some cases where we're
  // incorrectly flagging things as anomalous because, when the sample sizes are
  // big (as w/ our large users), then even very small errors in the
  // estimate used for the pass rate, or small seasonal effects, can lead us to
  // log an anomaly incorrectly.
  //
  // For example: a rule has run almost 2 million times and has a weighted pass
  // rate of 0.017752. Now, in the latest period, if the rule runs many times
  // (say 8,000+), that's a big enough sample size that we'd expect the pass
  // rate to be pretty close to .017752. Specifically, with 8,000 runs, the
  // expected value is 8000*.017752 = 142 successes, but 99.5% of the time,
  // we'll have 171 passes (pass rate: .021375) or fewer. In real life, we'll
  // semi-routinely get 172-182 passes, and this won't represent anything
  // anomalous. Instead, it just means that our pass rate estimate was still a
  // little off or some seasonal thing is going on in this hour. So, to prevent
  // these kinds of false positives, we make sure that the pass rate in this
  // period is at least 25% higher than the historical weighted pass rate.
  if (lastPeriod.passes / lastPeriod.runs < historicalWeightedPassRate * 1.25) {
    return RuleAlarmStatus.OK;
  }

  const testResults = binomialTest(lastPeriod.passes, lastPeriod.runs, {
    p: historicalWeightedPassRate,
    alternative: 'greater',
    alpha: 1 - confidence,
  });

  // It's an anomaly if we rejected the null hypothesis.
  return testResults.rejected ? RuleAlarmStatus.ALARM : RuleAlarmStatus.OK;
}
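As a rough sanity check of the worked example in the comments above (8,000 runs at a pass rate of 0.017752, with ~171 passes at the 99.5th percentile), the normal approximation to the binomial gets close to the same bound. This is only a back-of-the-envelope check, not the exact binomial test the function runs:

```typescript
// Normal approximation to Binomial(n = 8000, p = 0.017752): the one-sided
// 99.5% upper bound should land near the 171 passes quoted in the comments.
const n = 8000;
const p = 0.017752;
const mean = n * p; // ≈ 142 expected passes
const sd = Math.sqrt(n * p * (1 - p)); // ≈ 11.8
const z995 = 2.576; // one-sided 99.5% quantile of the standard normal
const upperBound = mean + z995 * sd; // ≈ 172, close to the exact 171
```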