Add Autoscaling Scheduler and refactor Events Worker Schedulers (#265)
# Events Worker Scheduler Refactor
Events worker schedulers have been refactored into the
`events/schedulers/{type}` packages.
The following schedulers are available:
- `sequential`: A standard single-worker scheduler that processes all
events in order
- `parallel`: A parallel event worker pool that runs a fixed number of
workers, each in its own goroutine. Jobs for the same repo are run by the
same worker when they arrive in closely grouped batches.
- `autoscaling`: An autoscaling event worker pool that ramps up and down
the number of goroutines executing in parallel proportionally to event
throughput and tuned by configurable parameters.
All three schedulers now make use of a common set of Prometheus metrics
from the `schedulers` package, so dashboards and alarms can be re-used
between the different scheduling strategies.
# Autoscaling Events Scheduler
An autoscaling scheduler will start a configured number of workers
(default=1) and then scale up to a maximum configured number of workers.
The scheduler uses a Throughput Manager which contains a circular buffer
to keep track of a rolling window of event throughput.
- By default, the manager tracks the number of events per second over
the past 60 seconds and can compute the average throughput for the past
minute on demand.
- The caller can configure how many buckets (default 60) are used for
computing the average throughput, as well as how frequently we rotate
those buckets (default to every second).
- The caller can also configure how frequently we poll the Manager for
the average throughput, allowing us to make scaling decisions with
higher reactivity, scaling up and down faster in response to large
changes in event throughput.
- By default, we poll the Throughput Manager once every 5 seconds to
make scaling decisions.
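
As a rough illustration of the rolling window described above, here is a minimal sketch of a circular-buffer throughput tracker. The type and method names (`ThroughputManager`, `Add`, `Rotate`, `AvgThroughput`) are hypothetical and not taken from the actual implementation, and rotation is called by hand here rather than from a timer. With one-second buckets, the per-bucket average is the average events per second.

```go
package main

import (
	"fmt"
	"sync"
)

// ThroughputManager is a hypothetical sketch of a rolling-window counter.
// Each bucket holds the number of events seen during one rotation interval.
type ThroughputManager struct {
	mu      sync.Mutex
	buckets []int // ring of per-interval event counts
	current int   // index of the bucket currently being filled
}

// NewThroughputManager allocates a ring with bucketCount buckets (at least 1).
func NewThroughputManager(bucketCount int) *ThroughputManager {
	if bucketCount < 1 {
		bucketCount = 1
	}
	return &ThroughputManager{buckets: make([]int, bucketCount)}
}

// Add records n events in the current bucket.
func (m *ThroughputManager) Add(n int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.buckets[m.current] += n
}

// Rotate advances to the next bucket and clears it, which discards the
// oldest sample in the window.
func (m *ThroughputManager) Rotate() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.current = (m.current + 1) % len(m.buckets)
	m.buckets[m.current] = 0
}

// AvgThroughput returns the mean number of events per bucket over the window.
func (m *ThroughputManager) AvgThroughput() float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	sum := 0
	for _, c := range m.buckets {
		sum += c
	}
	return float64(sum) / float64(len(m.buckets))
}

func main() {
	m := NewThroughputManager(4)
	for i := 0; i < 4; i++ {
		if i > 0 {
			m.Rotate() // move to a fresh bucket between samples
		}
		m.Add(10) // pretend 10 events arrived during this interval
	}
	fmt.Println("average events per bucket:", m.AvgThroughput())
}
```

In a long-running scheduler, something like a `time.Ticker` firing at the bucket interval would presumably call `Rotate`, while a second, slower ticker would read `AvgThroughput` to make scaling decisions.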
This allows for dynamic load-based scaling for PDS slurpers so we can
scale up to high throughput on connections that require it without
having to allocate tons of goroutines to every PDS connection.
We scale up if the average throughput (events/sec) is greater than the
current concurrency, and scale down if it is less than the current
concurrency minus 1.
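
The scaling rule can be sketched as a small pure function. This is an illustration under assumptions, not the actual implementation: the function name `nextConcurrency`, the one-worker step size, and the `minWorkers` floor are all hypothetical. Only the two threshold comparisons come from the description above.

```go
package main

import "fmt"

// nextConcurrency applies the thresholds from the description:
// scale up when avg > current, scale down when avg < current-1.
// The single-worker step and the minWorkers floor are assumptions.
func nextConcurrency(avg float64, current, minWorkers, maxWorkers int) int {
	switch {
	case avg > float64(current) && current < maxWorkers:
		return current + 1
	case avg < float64(current-1) && current > minWorkers:
		return current - 1
	default:
		// Throughput sits inside the [current-1, current] band, or a
		// bound has been reached, so the worker count stays put.
		return current
	}
}

func main() {
	for _, avg := range []float64{5.0, 2.5, 1.5} {
		fmt.Printf("avg=%.1f current=3 -> %d\n", avg, nextConcurrency(avg, 3, 1, 10))
	}
}
```

The gap between the two thresholds acts as a dead band: throughput hovering just around the current concurrency does not cause the pool to flap between two sizes on every poll.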
Locally tested with a bunch of debug logging for both scaling up and
down.

