Why?#
Spindle's job scheduling can be unfair: when a single repo has many jobs queued, jobs from other repos can starve. This document proposes an improvement to the current job orchestration functionality.
Goals#
- Fair job scheduling
- Support job cancellation
Design#
:::info Manager and workers run on the same node and communicate over channels :::
---
title: Job Orchestration
---
flowchart LR
appview -- pipeline_event--> spindle
subgraph spindle
manager
worker1
worker2
worker3
manager -- scheduleTask --> worker1
manager -- scheduleTask --> worker2
manager -- scheduleTask --> worker3
end
Manager#
The manager receives pipeline events from appview and schedules pipeline jobs to run on different workers. A pipeline job is composed of one or more tasks. A task is an abstraction over a workflow.
To enforce fairness, the manager will maintain a per-repo queue of pipeline events and schedule a job from the least-recently served queue. This can be done using a min-heap in which the queues are ordered by their last-served time.
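The least-recently-served policy can be sketched with Go's `container/heap`. This is a minimal sketch, not the actual implementation: the names `repoQueue`, `fairHeap`, `scheduler`, and `serveOrder` are hypothetical, events are plain strings, and a logical counter stands in for the last-served time.

```go
package main

import (
	"container/heap"
	"fmt"
)

// repoQueue holds the pending pipeline events of one repo (hypothetical type).
type repoQueue struct {
	repo       string
	events     []string // pending events, oldest first
	lastServed int64    // logical time at which this repo was last served
}

// fairHeap is a min-heap of repo queues ordered by lastServed.
// Ties are broken by repo name so the order is deterministic.
type fairHeap []*repoQueue

func (h fairHeap) Len() int      { return len(h) }
func (h fairHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h fairHeap) Less(i, j int) bool {
	if h[i].lastServed != h[j].lastServed {
		return h[i].lastServed < h[j].lastServed
	}
	return h[i].repo < h[j].repo
}
func (h *fairHeap) Push(x any) { *h = append(*h, x.(*repoQueue)) }
func (h *fairHeap) Pop() any {
	old := *h
	q := old[len(old)-1]
	*h = old[:len(old)-1]
	return q
}

// scheduler picks the next event from the least-recently served repo.
type scheduler struct {
	queues fairHeap
	clock  int64 // logical clock; an assumption standing in for wall-clock time
}

// next returns the oldest event of the least-recently served repo and
// re-inserts the repo if it still has pending events.
func (s *scheduler) next() (repo, event string, ok bool) {
	if s.queues.Len() == 0 {
		return "", "", false
	}
	q := heap.Pop(&s.queues).(*repoQueue)
	event, q.events = q.events[0], q.events[1:]
	s.clock++
	q.lastServed = s.clock
	if len(q.events) > 0 {
		heap.Push(&s.queues, q)
	}
	return q.repo, event, true
}

// serveOrder drains the given queues and reports the order of service.
func serveOrder(queues ...*repoQueue) []string {
	s := &scheduler{}
	for _, q := range queues {
		if len(q.events) > 0 {
			s.queues = append(s.queues, q)
		}
	}
	heap.Init(&s.queues)
	var order []string
	for {
		repo, event, ok := s.next()
		if !ok {
			return order
		}
		order = append(order, repo+"/"+event)
	}
}

func main() {
	busy := &repoQueue{repo: "a", events: []string{"j1", "j2", "j3"}}
	quiet := &repoQueue{repo: "b", events: []string{"k1"}}
	fmt.Println(serveOrder(busy, quiet)) // [a/j1 b/k1 a/j2 a/j3]
}
```

Repo `b` is served right after the first job of repo `a`, even though `a` has more jobs queued. The sketch drops a repo from the heap once its queue is empty; re-adding it when a new event arrives is left out.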
A worker for a task will be selected using a simple round-robin algorithm. We can improve the scheduling algorithm in the future to consider a worker's resource usage. To schedule a task, the manager sends a ScheduleTask message to the worker.
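Round-robin selection over worker channels could look like the following. This is a sketch under assumptions: the fields of `ScheduleTask` and the `roundRobin` type are hypothetical, and each worker is represented only by the channel the manager sends to.

```go
package main

import "fmt"

// ScheduleTask is the message the manager sends to a worker.
// The fields are hypothetical.
type ScheduleTask struct {
	JobID  string
	TaskID string
}

// roundRobin cycles through the worker channels in order.
type roundRobin struct {
	workers []chan ScheduleTask
	next    int // index of the worker that receives the next task
}

// schedule sends msg to the next worker and returns that worker's index.
func (r *roundRobin) schedule(msg ScheduleTask) int {
	i := r.next
	r.next = (r.next + 1) % len(r.workers)
	r.workers[i] <- msg
	return i
}

func main() {
	rr := &roundRobin{}
	for i := 0; i < 3; i++ {
		// Buffered so the demo does not need running workers.
		rr.workers = append(rr.workers, make(chan ScheduleTask, 4))
	}
	for _, id := range []string{"t1", "t2", "t3", "t4"} {
		w := rr.schedule(ScheduleTask{JobID: "job1", TaskID: id})
		fmt.Printf("%s -> worker%d\n", id, w+1)
	}
}
```

Because the manager remembers which index it picked, it can later look up the worker a task was sent to when it needs to cancel that task.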
The manager knows where each of a job's tasks is running, so it can send a CancelTask message to the relevant worker, which then stops the task.
Worker#
The worker maintains a tasks channel and spawns a known number of goroutines that read from the channel and either schedule or cancel a task. A worker keeps track of all its current tasks.
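The worker loop described above might be structured as follows. This is a sketch only: the `message` type, the `Send` and `Close` methods, and the use of a mutex-protected set for the current tasks are assumptions, and starting or stopping a real task is reduced to updating that set.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type msgKind int

const (
	scheduleTask msgKind = iota
	cancelTask
)

// message is what the manager puts on the worker's tasks channel (hypothetical).
type message struct {
	kind   msgKind
	taskID string
}

// Worker reads messages from a channel with a fixed number of goroutines.
type Worker struct {
	msgs  chan message
	wg    sync.WaitGroup
	mu    sync.Mutex
	tasks map[string]struct{} // IDs of the worker's current tasks
}

// NewWorker spawns n goroutines that read from the tasks channel.
func NewWorker(n int) *Worker {
	w := &Worker{msgs: make(chan message), tasks: make(map[string]struct{})}
	for i := 0; i < n; i++ {
		w.wg.Add(1)
		go w.loop()
	}
	return w
}

func (w *Worker) loop() {
	defer w.wg.Done()
	for m := range w.msgs {
		w.mu.Lock()
		switch m.kind {
		case scheduleTask:
			w.tasks[m.taskID] = struct{}{} // a real worker would start the task here
		case cancelTask:
			delete(w.tasks, m.taskID) // a real worker would stop the task here
		}
		w.mu.Unlock()
	}
}

// Send hands a message to one of the worker's goroutines.
func (w *Worker) Send(m message) { w.msgs <- m }

// Close stops the goroutines and returns the remaining task IDs, sorted.
func (w *Worker) Close() []string {
	close(w.msgs)
	w.wg.Wait()
	w.mu.Lock()
	defer w.mu.Unlock()
	ids := make([]string, 0, len(w.tasks))
	for id := range w.tasks {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

func main() {
	w := NewWorker(1)
	w.Send(message{kind: scheduleTask, taskID: "t1"})
	w.Send(message{kind: scheduleTask, taskID: "t2"})
	w.Send(message{kind: cancelTask, taskID: "t1"})
	fmt.Println(w.Close()) // [t2]
}
```

The demo uses a single goroutine so that messages are handled in order. With several goroutines, a cancel message could in principle be handled before the schedule message for the same task.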
Task#
A task is the unit of execution and abstracts a workflow. It exposes a start/stop interface. When a task is started, it returns its containerId, which we can later use to stop the task.
Job Cancellation#
Pending Jobs#
All pending jobs are contained in the manager's queue. The manager will have a tombstoneMap that holds cancelled jobs. Before scheduling a job's tasks, the manager checks whether the job has been cancelled.
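The tombstone check for pending jobs can be sketched as below. The `Job` and `Manager` types and the `Cancel` and `NextJob` methods are hypothetical, the queue is a plain slice rather than the fair per-repo queues, and the sketch assumes a single goroutine owns the manager state.

```go
package main

import "fmt"

// Job is a pending pipeline job (hypothetical type).
type Job struct {
	ID    string
	Tasks []string
}

// Manager keeps pending jobs and a set of cancelled job IDs.
type Manager struct {
	pending      []Job
	tombstoneMap map[string]struct{}
}

// Cancel marks a pending job as cancelled without touching the queue.
func (m *Manager) Cancel(jobID string) {
	m.tombstoneMap[jobID] = struct{}{}
}

// NextJob returns the next pending job that has not been cancelled.
// Cancelled jobs are dropped and their tombstones are cleared.
func (m *Manager) NextJob() (Job, bool) {
	for len(m.pending) > 0 {
		job := m.pending[0]
		m.pending = m.pending[1:]
		if _, cancelled := m.tombstoneMap[job.ID]; cancelled {
			delete(m.tombstoneMap, job.ID)
			continue
		}
		return job, true
	}
	return Job{}, false
}

func main() {
	m := &Manager{
		pending:      []Job{{ID: "j1"}, {ID: "j2"}, {ID: "j3"}},
		tombstoneMap: make(map[string]struct{}),
	}
	m.Cancel("j2")
	for {
		job, ok := m.NextJob()
		if !ok {
			break
		}
		fmt.Println("scheduling", job.ID)
	}
}
```

Recording a tombstone keeps cancellation cheap: the manager does not have to search the queue, it only pays for the lookup when the job reaches the front.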
Scheduled Tasks#
Like the manager, each worker will have a tombstoneMap and will check whether a task was cancelled before starting it.
Running Jobs#
For tasks that are already running, the worker will hold a reference to each task's cancel function, which it calls when it receives a cancel message. Calling cancel then stops the container the task is running in.
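```go
// Start launches the task and arranges for its container to be stopped
// once ctx is cancelled.
func (t *Task) Start(ctx context.Context) error {
	go func() {
		// Block until the worker calls this task's cancel function.
		<-ctx.Done()
		// Stop the container the task is running in.
		t.client.StopContainer(t.containerID)
	}()
	return nil
}
```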
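On the worker side, the cancel functions and the tombstoneMap could be combined roughly as follows. This is a sketch under assumptions: `startFunc` stands in for a task's `Start` method, the `Schedule` and `Cancel` methods and the `demo` helper are hypothetical, and a mutex guards the two maps.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// startFunc stands in for Task.Start (hypothetical).
type startFunc func(ctx context.Context) error

// Worker tracks cancel functions of running tasks and tombstones of
// tasks that were cancelled before they started.
type Worker struct {
	mu           sync.Mutex
	cancels      map[string]context.CancelFunc
	tombstoneMap map[string]struct{}
}

func NewWorker() *Worker {
	return &Worker{
		cancels:      make(map[string]context.CancelFunc),
		tombstoneMap: make(map[string]struct{}),
	}
}

// Schedule starts the task unless it was cancelled earlier.
// It reports whether the task was started.
func (w *Worker) Schedule(taskID string, start startFunc) (bool, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if _, cancelled := w.tombstoneMap[taskID]; cancelled {
		delete(w.tombstoneMap, taskID)
		return false, nil
	}
	ctx, cancel := context.WithCancel(context.Background())
	if err := start(ctx); err != nil {
		cancel() // release the context if the task failed to start
		return false, err
	}
	w.cancels[taskID] = cancel
	return true, nil
}

// Cancel stops a running task, or leaves a tombstone if the task
// has not been started yet.
func (w *Worker) Cancel(taskID string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if cancel, ok := w.cancels[taskID]; ok {
		cancel()
		delete(w.cancels, taskID)
		return
	}
	w.tombstoneMap[taskID] = struct{}{}
}

// demo cancels one running task and one task that has not started yet.
func demo() (t1Stopped, t2Started bool) {
	w := NewWorker()
	stopped := make(chan struct{})
	w.Schedule("t1", func(ctx context.Context) error {
		go func() {
			<-ctx.Done() // same pattern as Task.Start above
			close(stopped)
		}()
		return nil
	})
	w.Cancel("t1")
	<-stopped
	t1Stopped = true

	w.Cancel("t2") // arrives before the schedule message
	t2Started, _ = w.Schedule("t2", func(context.Context) error { return nil })
	return t1Stopped, t2Started
}

func main() {
	fmt.Println(demo()) // true false
}
```

The sketch holds the lock while calling `start`, which is only reasonable if starting returns quickly, as it does when the blocking work runs in its own goroutine.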