Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

sched_ext: Document task ownership state machine

The task ownership state machine in sched_ext is quite hard to follow
from the code alone. The interaction of ownership states, memory
ordering rules and cross-CPU "lock dancing" makes the overall model
subtle.

Extend the documentation next to scx_ops_state to provide a more
structured and self-contained description of the state transitions and
their synchronization rules.

The new reference should make the code easier to reason about and
maintain and can help future contributors understand the overall
task-ownership workflow.

Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Authored by Andrea Righi and committed by Tejun Heo
70f54f61 0927780c

+98 -16
kernel/sched/ext_internal.h
 };

 /*
- * sched_ext_entity->ops_state
+ * Task Ownership State Machine (sched_ext_entity->ops_state)
  *
- * Used to track the task ownership between the SCX core and the BPF scheduler.
- * State transitions look as follows:
+ * The sched_ext core uses this state machine to track task ownership
+ * between the SCX core and the BPF scheduler. This allows the BPF
+ * scheduler to dispatch tasks without strict ordering requirements, while
+ * the SCX core safely rejects invalid dispatches.
  *
- * NONE -> QUEUEING -> QUEUED -> DISPATCHING
- *   ^              |                 |
- *   |              v                 v
- *   \-------------------------------/
+ * State Transitions
  *
- * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
- * sites for explanations on the conditions being waited upon and why they are
- * safe. Transitions out of them into NONE or QUEUED must store_release and the
- * waiters should load_acquire.
+ *    .------------> NONE (owned by SCX core)
+ *    |               |           ^
+ *    |       enqueue |           | direct dispatch
+ *    |               v           |
+ *    |           QUEUEING -------'
+ *    |               |
+ *    |       enqueue |
+ *    |     completes |
+ *    |               v
+ *    |            QUEUED (owned by BPF scheduler)
+ *    |               |
+ *    |      dispatch |
+ *    |               |
+ *    |               v
+ *    |          DISPATCHING
+ *    |               |
+ *    |      dispatch |
+ *    |     completes |
+ *    `---------------'
  *
- * Tracking scx_ops_state enables sched_ext core to reliably determine whether
- * any given task can be dispatched by the BPF scheduler at all times and thus
- * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler
- * to try to dispatch any task anytime regardless of its state as the SCX core
- * can safely reject invalid dispatches.
+ * State Descriptions
+ *
+ * - %SCX_OPSS_NONE:
+ *     Task is owned by the SCX core. It's either on a run queue, running,
+ *     or being manipulated by the core scheduler. The BPF scheduler has no
+ *     claim on this task.
+ *
+ * - %SCX_OPSS_QUEUEING:
+ *     Transitional state while transferring a task from the SCX core to
+ *     the BPF scheduler. The task's rq lock is held during this state.
+ *     Since QUEUEING is both entered and exited under the rq lock, dequeue
+ *     can never observe this state (it would be a BUG). When finishing a
+ *     dispatch, if the task is still in %SCX_OPSS_QUEUEING the completion
+ *     path busy-waits for it to leave this state (via wait_ops_state())
+ *     before retrying.
+ *
+ * - %SCX_OPSS_QUEUED:
+ *     Task is owned by the BPF scheduler. It's on a DSQ (dispatch queue)
+ *     and the BPF scheduler is responsible for dispatching it. A QSEQ
+ *     (queue sequence number) is embedded in this state to detect
+ *     dispatch/dequeue races: if a task is dequeued and re-enqueued, the
+ *     QSEQ changes and any in-flight dispatch operations targeting the old
+ *     QSEQ are safely ignored.
+ *
+ * - %SCX_OPSS_DISPATCHING:
+ *     Transitional state while transferring a task from the BPF scheduler
+ *     back to the SCX core. This state indicates the BPF scheduler has
+ *     selected the task for execution. When dequeue needs to take the task
+ *     off a DSQ and it is still in %SCX_OPSS_DISPATCHING, the dequeue path
+ *     busy-waits for it to leave this state (via wait_ops_state()) before
+ *     proceeding. Exits to %SCX_OPSS_NONE when dispatch completes.
+ *
+ * Memory Ordering
+ *
+ * Transitions out of %SCX_OPSS_QUEUEING and %SCX_OPSS_DISPATCHING into
+ * %SCX_OPSS_NONE or %SCX_OPSS_QUEUED must use atomic_long_set_release()
+ * and waiters must use atomic_long_read_acquire(). This ensures proper
+ * synchronization between concurrent operations.
+ *
+ * Cross-CPU Task Migration
+ *
+ * When moving a task in the %SCX_OPSS_DISPATCHING state, we can't simply
+ * grab the target CPU's rq lock because a concurrent dequeue might be
+ * waiting on %SCX_OPSS_DISPATCHING while holding the source rq lock
+ * (deadlock).
+ *
+ * The sched_ext core uses a "lock dancing" protocol coordinated by
+ * p->scx.holding_cpu. When moving a task to a different rq:
+ *
+ * 1. Verify task can be moved (CPU affinity, migration_disabled, etc.)
+ * 2. Set p->scx.holding_cpu to the current CPU
+ * 3. Set task state to %SCX_OPSS_NONE; dequeue waits while DISPATCHING
+ *    is set, so clearing DISPATCHING first prevents the circular wait
+ *    (safe to lock the rq we need)
+ * 4. Unlock the current CPU's rq
+ * 5. Lock src_rq (where the task currently lives)
+ * 6. Verify p->scx.holding_cpu == current CPU, if not, dequeue won the
+ *    race (dequeue clears holding_cpu to -1 when it takes the task), in
+ *    this case migration is aborted
+ * 7. If src_rq == dst_rq: clear holding_cpu and enqueue directly
+ *    into dst_rq's local DSQ (no lock swap needed)
+ * 8. Otherwise: call move_remote_task_to_local_dsq(), which releases
+ *    src_rq, locks dst_rq, and performs the deactivate/activate
+ *    migration cycle (dst_rq is held on return)
+ * 9. Unlock dst_rq and re-lock the current CPU's rq to restore
+ *    the lock state expected by the caller
+ *
+ * If any verification fails, abort the migration.
+ *
+ * This state tracking allows the BPF scheduler to try to dispatch any task
+ * at any time regardless of its state. The SCX core can safely
+ * reject/ignore invalid dispatches, simplifying the BPF scheduler
+ * implementation.
  */
 enum scx_ops_state {
 	SCX_OPSS_NONE,		/* owned by the SCX core */
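
The state descriptions and memory ordering rules in the comment above can be illustrated with a small userspace sketch. This is not the kernel implementation: it uses C11 atomics in place of atomic_long_set_release() and atomic_long_read_acquire(), and every name in it (toy_task, toy_enqueue, toy_dequeue, toy_finish_dispatch, toy_wait_ops_state) is hypothetical. The bit layout, with the state in the low two bits and the QSEQ above them, is also an assumption made for illustration. The sketch shows how a dispatch that captured an old QSEQ is rejected after the task has been dequeued and re-enqueued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding: low 2 bits hold the state, upper bits hold the QSEQ. */
enum { OPSS_NONE, OPSS_QUEUEING, OPSS_QUEUED, OPSS_DISPATCHING };
#define OPSS_STATE_BITS 2
#define OPSS_STATE_MASK ((1UL << OPSS_STATE_BITS) - 1)

struct toy_task {
	atomic_ulong ops_state;   /* packed state and QSEQ */
	unsigned long next_qseq;  /* toy sequence counter, not thread-safe */
};

static unsigned long pack(unsigned long state, unsigned long qseq)
{
	return state | (qseq << OPSS_STATE_BITS);
}

static unsigned long state_of(unsigned long v)
{
	return v & OPSS_STATE_MASK;
}

static void toy_task_init(struct toy_task *t)
{
	atomic_init(&t->ops_state, pack(OPSS_NONE, 0));
	t->next_qseq = 0;
}

/* NONE -> QUEUEING -> QUEUED. Returns the packed QUEUED value. */
static unsigned long toy_enqueue(struct toy_task *t)
{
	unsigned long cur, queued;

	cur = atomic_load_explicit(&t->ops_state, memory_order_relaxed);
	assert(state_of(cur) == OPSS_NONE);
	atomic_store_explicit(&t->ops_state, pack(OPSS_QUEUEING, 0),
			      memory_order_relaxed);
	queued = pack(OPSS_QUEUED, t->next_qseq++);
	/* Leaving QUEUEING: release store pairs with acquire loads in waiters. */
	atomic_store_explicit(&t->ops_state, queued, memory_order_release);
	return queued;
}

/* QUEUED -> NONE. Returns false if the task was not QUEUED. */
static bool toy_dequeue(struct toy_task *t)
{
	unsigned long v = atomic_load_explicit(&t->ops_state,
					       memory_order_acquire);

	if (state_of(v) != OPSS_QUEUED)
		return false;
	return atomic_compare_exchange_strong_explicit(&t->ops_state, &v,
			pack(OPSS_NONE, 0),
			memory_order_acq_rel, memory_order_acquire);
}

/* QUEUED(ticket) -> DISPATCHING -> NONE. A stale ticket is rejected. */
static bool toy_finish_dispatch(struct toy_task *t, unsigned long ticket)
{
	unsigned long expected = ticket;

	if (!atomic_compare_exchange_strong_explicit(&t->ops_state, &expected,
			pack(OPSS_DISPATCHING, 0),
			memory_order_acquire, memory_order_relaxed))
		return false;
	/* The task would be handed back to the core here. */
	/* Leaving DISPATCHING: release store, as the comment requires. */
	atomic_store_explicit(&t->ops_state, pack(OPSS_NONE, 0),
			      memory_order_release);
	return true;
}

/* Waiter side: spin with acquire loads until the task leaves `state`. */
static void toy_wait_ops_state(struct toy_task *t, unsigned long state)
{
	while (state_of(atomic_load_explicit(&t->ops_state,
					     memory_order_acquire)) == state)
		;
}
```

A caller would enqueue a task, keep the returned value as a ticket, and later pass it to toy_finish_dispatch(). If toy_dequeue() and a second toy_enqueue() ran in between, the ticket no longer matches the stored value and the dispatch is ignored, which mirrors the QSEQ behaviour described for %SCX_OPSS_QUEUED.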