Monorepo for Tangled tangled.org

proposal: fair job scheduling and cancellation #210

open opened by jobala.tngl.sh

Why?#

Spindle's job scheduling can be unfair: when a single repo has many jobs queued, jobs from other repos starve. This document proposes an improvement to the current job orchestration.

Goals#

  • Fair job scheduling
  • Support job cancellation

Design#

:::info
Manager and workers run on the same node and communicate over channels.
:::

---
title: Job Orchestration
---

flowchart LR
    appview -- pipeline_event--> spindle
    subgraph spindle
    manager
    worker1
    worker2
    worker3
    
    manager -- scheduleTask --> worker1
    manager -- scheduleTask --> worker2
    manager -- scheduleTask --> worker3
    end

Manager#

The manager receives pipeline events from appview and schedules pipeline jobs to run on different workers. A pipeline job is composed of one or more tasks. A task is an abstraction over a workflow.

To enforce fairness, the manager will maintain a per-repo queue of pipeline events and schedule the next job from the least-recently served queue. This can be done with a min-heap in which the queues are ordered by their last-served time.
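A minimal sketch of the per-repo queues and the min-heap, using Go's `container/heap`. All names here (`FairQueue`, `repoQueue`, `Enqueue`, `Next`) are hypothetical, and the sketch assumes a logical counter stands in for the last-served time and that job IDs are plain strings.

```go
package main

import (
	"container/heap"
	"fmt"
)

// repoQueue holds one repo's pending jobs and the logical time it was last served.
type repoQueue struct {
	repo       string
	jobs       []string // pending job IDs in FIFO order
	lastServed int64    // 0 means "never served"
}

// repoHeap is a min-heap of non-empty repo queues ordered by lastServed.
type repoHeap []*repoQueue

func (h repoHeap) Len() int           { return len(h) }
func (h repoHeap) Less(i, j int) bool { return h[i].lastServed < h[j].lastServed }
func (h repoHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *repoHeap) Push(x any)        { *h = append(*h, x.(*repoQueue)) }
func (h *repoHeap) Pop() any {
	old := *h
	q := old[len(old)-1]
	*h = old[:len(old)-1]
	return q
}

// FairQueue hands out jobs from the least-recently served repo first.
type FairQueue struct {
	clock  int64 // logical clock, bumped on every served job
	ready  repoHeap
	byRepo map[string]*repoQueue
}

func NewFairQueue() *FairQueue {
	return &FairQueue{byRepo: make(map[string]*repoQueue)}
}

// Enqueue appends a job to its repo's queue, adding the repo to the heap
// if it had nothing pending.
func (f *FairQueue) Enqueue(repo, job string) {
	q, ok := f.byRepo[repo]
	if !ok {
		q = &repoQueue{repo: repo}
		f.byRepo[repo] = q
	}
	if len(q.jobs) == 0 {
		heap.Push(&f.ready, q)
	}
	q.jobs = append(q.jobs, job)
}

// Next pops the least-recently served repo, takes its oldest job, and
// re-inserts the repo with a fresh lastServed if it still has work.
func (f *FairQueue) Next() (repo, job string, ok bool) {
	if f.ready.Len() == 0 {
		return "", "", false
	}
	q := heap.Pop(&f.ready).(*repoQueue)
	job, q.jobs = q.jobs[0], q.jobs[1:]
	f.clock++
	q.lastServed = f.clock
	if len(q.jobs) > 0 {
		heap.Push(&f.ready, q)
	}
	return q.repo, job, true
}

func main() {
	f := NewFairQueue()
	for _, j := range []string{"a1", "a2", "a3"} {
		f.Enqueue("repo-a", j)
	}
	f.Enqueue("repo-b", "b1")
	for {
		repo, job, ok := f.Next()
		if !ok {
			break
		}
		fmt.Println(repo, job)
	}
}
```

Even though repo-a queued three jobs first, repo-b's single job is served no later than second, ahead of `a2` and `a3`. Keeping `lastServed` for repos with an empty queue means a repo cannot jump the line by submitting one job at a time.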

A worker for a task will be selected using a simple round-robin algorithm. We can improve the scheduling algorithm in the future to consider a worker's resource usage. To schedule a task, the manager sends a ScheduleTask message to the worker.
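Round-robin selection could look roughly like this. The `ScheduleTask` fields, the `Manager` layout, and the `placement` map are assumptions made for illustration; the proposal only names the message and the selection strategy.

```go
package main

import "fmt"

// ScheduleTask is the message sent to a worker; the fields are illustrative.
type ScheduleTask struct {
	JobID  string
	TaskID string
}

// Manager rotates through worker inboxes and remembers each task's placement.
type Manager struct {
	inboxes   []chan ScheduleTask // one channel per worker
	next      int                 // index of the worker to use next
	placement map[string]int      // task ID -> worker index, for later cancellation
}

func NewManager(workers, inboxSize int) *Manager {
	m := &Manager{placement: make(map[string]int)}
	for i := 0; i < workers; i++ {
		m.inboxes = append(m.inboxes, make(chan ScheduleTask, inboxSize))
	}
	return m
}

// Schedule sends the task to the next worker in rotation and returns its index.
// The send blocks if that worker's inbox is full.
func (m *Manager) Schedule(t ScheduleTask) int {
	w := m.next
	m.next = (m.next + 1) % len(m.inboxes)
	m.inboxes[w] <- t
	m.placement[t.TaskID] = w
	return w
}

func main() {
	m := NewManager(3, 8)
	for _, id := range []string{"t1", "t2", "t3", "t4"} {
		w := m.Schedule(ScheduleTask{JobID: "job-1", TaskID: id})
		fmt.Printf("%s -> worker%d\n", id, w+1)
	}
}
```

With three workers, the fourth task wraps around to the first worker again. Recording the placement at scheduling time is what lets the manager route a later cancel message to the right worker.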

Because the manager knows which worker each of a job's tasks is running on, it can send a CancelTask message to that worker, which then stops the task.

Worker#

The worker maintains a tasks channel and spawns a fixed number of goroutines that read from the channel and either start a task or cancel it. A worker keeps track of all its current tasks.
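A sketch of such a worker is below. The `Message` type and its `Kind` values are hypothetical, and starting or stopping a real task is reduced to bookkeeping. Because several goroutines read from one channel, a cancel can be handled before the matching schedule message; the sketch covers that case with the tombstone idea described under Job Cancellation.

```go
package main

import (
	"fmt"
	"sync"
)

type Kind int

const (
	Schedule Kind = iota
	Cancel
)

// Message is what the worker reads from its tasks channel (illustrative shape).
type Message struct {
	Kind   Kind
	TaskID string
}

type Worker struct {
	tasks      chan Message
	wg         sync.WaitGroup
	mu         sync.Mutex
	current    map[string]bool // tasks this worker is currently running
	tombstones map[string]bool // tasks cancelled before they were started
}

// NewWorker starts n goroutines that all read from the same tasks channel.
func NewWorker(n int) *Worker {
	w := &Worker{
		tasks:      make(chan Message),
		current:    make(map[string]bool),
		tombstones: make(map[string]bool),
	}
	w.wg.Add(n)
	for i := 0; i < n; i++ {
		go w.loop()
	}
	return w
}

func (w *Worker) loop() {
	defer w.wg.Done()
	for msg := range w.tasks {
		w.mu.Lock()
		switch msg.Kind {
		case Schedule:
			if w.tombstones[msg.TaskID] {
				delete(w.tombstones, msg.TaskID) // cancelled before start: skip it
			} else {
				w.current[msg.TaskID] = true // a real worker would start the task here
			}
		case Cancel:
			if w.current[msg.TaskID] {
				delete(w.current, msg.TaskID) // a real worker would stop the task here
			} else {
				w.tombstones[msg.TaskID] = true
			}
		}
		w.mu.Unlock()
	}
}

// Tracking reports whether the worker currently tracks the task.
func (w *Worker) Tracking(id string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.current[id]
}

// Close stops accepting messages and waits for the goroutines to finish.
func (w *Worker) Close() {
	close(w.tasks)
	w.wg.Wait()
}

func main() {
	w := NewWorker(2)
	w.tasks <- Message{Kind: Schedule, TaskID: "t1"}
	w.tasks <- Message{Kind: Schedule, TaskID: "t2"}
	w.tasks <- Message{Kind: Cancel, TaskID: "t1"}
	w.Close()
	fmt.Println(w.Tracking("t1"), w.Tracking("t2")) // prints "false true"
}
```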

Task#

A task is the unit of execution and abstracts a workflow. It exposes a start/stop interface. When a task is started, it returns its containerID, which can be used later to stop the task.

Job Cancellation#

Pending Jobs#

All pending jobs are contained in the manager's queue. The manager will have a tombstoneMap that holds cancelled jobs. Before scheduling a job's tasks, the manager checks whether the job has been cancelled.
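The tombstone check for pending jobs might be sketched as follows. `Job`, `PendingJobs`, and `NextRunnable` are hypothetical names, and a single FIFO slice stands in for the manager's queue.

```go
package main

import "fmt"

// Job is a pending pipeline job; the fields are illustrative.
type Job struct {
	ID    string
	Tasks []string
}

// PendingJobs pairs the manager's queue with its tombstoneMap.
type PendingJobs struct {
	queue        []Job
	tombstoneMap map[string]struct{} // IDs of jobs cancelled while still queued
}

func NewPendingJobs() *PendingJobs {
	return &PendingJobs{tombstoneMap: make(map[string]struct{})}
}

func (p *PendingJobs) Enqueue(j Job) { p.queue = append(p.queue, j) }

// Cancel marks a job as cancelled without searching the queue for it.
func (p *PendingJobs) Cancel(id string) { p.tombstoneMap[id] = struct{}{} }

// NextRunnable pops jobs until it finds one that has not been cancelled.
func (p *PendingJobs) NextRunnable() (Job, bool) {
	for len(p.queue) > 0 {
		job := p.queue[0]
		p.queue = p.queue[1:]
		if _, cancelled := p.tombstoneMap[job.ID]; cancelled {
			delete(p.tombstoneMap, job.ID) // the tombstone has done its work
			continue
		}
		return job, true
	}
	return Job{}, false
}

func main() {
	p := NewPendingJobs()
	for _, id := range []string{"j1", "j2", "j3"} {
		p.Enqueue(Job{ID: id})
	}
	p.Cancel("j2")
	for {
		job, ok := p.NextRunnable()
		if !ok {
			break
		}
		fmt.Println("scheduling", job.ID)
	}
}
```

Cancellation stays O(1) because the job is only dropped when it reaches the front of the queue. One open question this sketch does not answer is how to expire tombstones for job IDs that never show up in the queue.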

Scheduled Tasks#

Like the manager, each worker will have a tombstoneMap and will check whether a task was cancelled before starting it.

Running Jobs#

For tasks that are already running, the worker will hold a reference to each task's cancel function, which it calls when it receives a cancel message. Calling cancel then stops the container the task is running in.

func (t *Task) Start(ctx context.Context) error {
    // Wait in the background for cancellation, then stop the task's container.
    go func() {
        <-ctx.Done()
        t.client.StopContainer(t.containerID)
    }()

    return nil
}
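One way the worker could hold on to the cancel functions is sketched below, built around the `Start` snippet above. The `fakeClient` type, the `done` channel, and the `Run`/`Cancel` methods are assumptions added so the example runs on its own; the real client's `StopContainer` signature may differ.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// fakeClient stands in for the container client; it only records stops.
type fakeClient struct {
	mu      sync.Mutex
	stopped []string
}

func (c *fakeClient) StopContainer(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.stopped = append(c.stopped, id)
}

type Task struct {
	containerID string
	client      *fakeClient
	done        chan struct{} // closed once the container has been stopped
}

// Start mirrors the snippet above and additionally signals completion on done.
func (t *Task) Start(ctx context.Context) error {
	go func() {
		<-ctx.Done()
		t.client.StopContainer(t.containerID)
		close(t.done)
	}()
	return nil
}

// Worker maps each running task ID to the cancel function of its context.
type Worker struct {
	mu      sync.Mutex
	cancels map[string]context.CancelFunc
}

func NewWorker() *Worker {
	return &Worker{cancels: make(map[string]context.CancelFunc)}
}

// Run starts the task under a cancellable context and remembers its cancel function.
func (w *Worker) Run(taskID string, t *Task) error {
	ctx, cancel := context.WithCancel(context.Background())
	if err := t.Start(ctx); err != nil {
		cancel()
		return err
	}
	w.mu.Lock()
	w.cancels[taskID] = cancel
	w.mu.Unlock()
	return nil
}

// Cancel calls the stored cancel function; it reports false for unknown tasks.
func (w *Worker) Cancel(taskID string) bool {
	w.mu.Lock()
	cancel, ok := w.cancels[taskID]
	delete(w.cancels, taskID)
	w.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

func main() {
	client := &fakeClient{}
	task := &Task{containerID: "c1", client: client, done: make(chan struct{})}
	w := NewWorker()
	if err := w.Run("t1", task); err != nil {
		fmt.Println("run failed:", err)
		return
	}
	fmt.Println("cancelled:", w.Cancel("t1"))
	<-task.done // wait until the container has actually been stopped
	fmt.Println("stopped containers:", client.stopped)
}
```

Removing the entry from the map before calling cancel makes a repeated cancel message for the same task a harmless no-op.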
[deleted by author]

Thanks @anirudh.fi, what do you think of the concerns around a global queue and resource starving? Is the expectation that DestroyWorkflow should cancel the workflow?

AT URI
at://did:plc:qcqdzn5ohjxyp2ilrunon6kn/sh.tangled.repo.issue/3mitedqgaou22