Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

unwind deferred: Use bitmask to determine which callbacks to call

In order to know which registered callback requested a stacktrace for when
the task goes back to user space, add a bitmask to keep track of all
registered tracers. The bitmask is the size of long and bit zero is
reserved for the pending flag, which means that a 32-bit machine can have
at most 31 registered tracers and a 64-bit machine at most 63. This should
not be an issue as there should not be more than 10 (unless BPF can abuse
this?).

When a tracer registers with unwind_deferred_init() it will get a bit
number assigned to it. When a tracer requests a stacktrace, it will have
its bit set within the task_struct. When the task returns back to user
space, it will call the callbacks for all the registered tracers where
their bits are set in the task's mask.
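
As a rough user-space illustration of the bit assignment described above,
the sketch below keeps a global allocation mask with bit zero reserved and
hands out the lowest free bit. All sketch_* names are hypothetical, and
__builtin_ctzl() (a GCC/Clang builtin) stands in for the kernel's ffz();
this is not the kernel code itself.

```c
#include <assert.h>
#include <limits.h>

/* Illustrative names only; bit 0 is reserved for the pending flag. */
#define SKETCH_PENDING_BIT	0
#define SKETCH_RESERVED		(1UL << SKETCH_PENDING_BIT)
#define SKETCH_BITS_PER_LONG	((int)(sizeof(unsigned long) * CHAR_BIT))

/* Global allocation mask: a set bit is taken, a zero bit is free. */
static unsigned long sketch_alloc_mask = SKETCH_RESERVED;

/* Return the index of the lowest zero bit, similar to the kernel's ffz(). */
static int sketch_ffz(unsigned long word)
{
	assert(word != ~0UL);	/* the result is undefined without a zero bit */
	return __builtin_ctzl(~word);
}

/* Assign a free bit to a new tracer, or return -1 if every bit is taken. */
static int sketch_register(void)
{
	int bit;

	if (sketch_alloc_mask == ~0UL)
		return -1;

	bit = sketch_ffz(sketch_alloc_mask);
	sketch_alloc_mask |= 1UL << bit;
	return bit;
}

/* Release a tracer's bit so that a later registration can reuse it. */
static void sketch_unregister(int bit)
{
	assert(bit > SKETCH_PENDING_BIT && bit < SKETCH_BITS_PER_LONG);
	sketch_alloc_mask &= ~(1UL << bit);
}
```

Because a released bit is handed out again by the next registration, any
stale copy of that bit left in a task's mask would be attributed to the new
tracer.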

When a tracer is removed by unwind_deferred_cancel(), its bit is cleared
from every current task, just in case another tracer gets registered
immediately afterward, reuses the bit, and then gets its callback called
unexpectedly.
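
A minimal sketch of that cleanup, assuming a plain array of per-task masks
in place of the kernel's thread iteration. The function name and the array
are hypothetical, and C11 atomics stand in for the kernel's clear_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for walking every thread: each array element
 * plays the role of one task's unwind_mask.
 */
static void sketch_clear_bit_everywhere(_Atomic unsigned long *task_masks,
					size_t ntasks, int bit)
{
	size_t i;

	assert(bit > 0);	/* bit 0 is the reserved pending bit */

	/* Atomically drop the cancelled tracer's bit from each task. */
	for (i = 0; i < ntasks; i++)
		atomic_fetch_and(&task_masks[i], ~(1UL << bit));
}
```

Other bits, including the pending bit, are left untouched, so tracers that
are still registered are not affected.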

To prevent live locks when an event that fires between the task_work and
the task's return to user space triggers the deferred unwind again, have
the unwind_mask get cleared on exit to user space and not after the
callback is made.
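
The exit-to-user reset can be pictured with the following user-space
sketch. It uses C11 atomic_compare_exchange_weak() where the patch uses
try_cmpxchg(), and the function name and boolean return value are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_PENDING	1UL	/* bit 0: a task_work is still queued */

/*
 * Clear the whole mask on the way back to user space, unless a task_work
 * is still pending. Returns true only if a non-zero mask was cleared.
 */
static bool sketch_reset_on_exit(_Atomic unsigned long *mask)
{
	unsigned long bits;

	assert(mask);
	bits = atomic_load(mask);

	/* Nothing requested an unwind during this kernel entry. */
	if (!bits)
		return false;

	do {
		/* A task_work will run again before going back; keep the bits. */
		if (bits & SKETCH_PENDING)
			return false;
		/* On failure, bits is reloaded with the current mask value. */
	} while (!atomic_compare_exchange_weak(mask, &bits, 0UL));

	return true;
}
```

Since the work bits survive until this point, a request made after a
callback has already run finds its bit still set and does not queue another
task_work.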

Move the pending bit from a value on the task_struct to bit zero of the
unwind_mask (saves space on the task_struct). This will allow modifying
the pending bit along with the work bits atomically.

Instead of clearing a work's bit after its callback is called, the clearing
is delayed until exit to user space. If the work is requested again, the
task_work is not queued again and the requester is told that the callback
has already been called by a positive return value (the same as if it was
already pending).
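
A simplified sketch of the request path, assuming C11 atomic_fetch_or() in
place of atomic_long_fetch_or(). The function name and the queue_work out
parameter are hypothetical; the real function queues the task_work itself
and also hands back a cookie.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_PENDING	1UL	/* bit 0: a task_work is queued */

/*
 * Returns 1 if this tracer's callback is already pending or already ran
 * during this kernel entry, 0 otherwise. *queue_work is set to true only
 * when the caller is the one that has to queue the task_work.
 */
static int sketch_request(_Atomic unsigned long *mask, int bit,
			  bool *queue_work)
{
	unsigned long mybit = 1UL << bit;
	unsigned long old;

	assert(bit > 0);	/* bit 0 is reserved */
	*queue_work = false;

	/* Fast path: already queued or already executed. */
	if (atomic_load(mask) & mybit)
		return 1;

	/* Set the pending bit and this tracer's bit in one atomic step. */
	old = atomic_fetch_or(mask, SKETCH_PENDING | mybit);
	if (old & mybit)
		return 1;	/* another context set the bit first */

	/* If pending was already set, a task_work is on its way. */
	if (!(old & SKETCH_PENDING))
		*queue_work = true;

	return 0;
}
```

Setting both bits with a single fetch-or means there is no window in which
a tracer's bit is visible without the pending bit.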

The pending bit is cleared before calling the callback functions but the
current work bits remain. If one of the called works requests a deferred
unwind again, it will not trigger a task_work if its bit is still present
in the task's unwind_mask.
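
The task_work side might look roughly like the sketch below: the pending
bit is cleared with one atomic operation that also returns a snapshot of
the mask, and only the works whose bits appear in that snapshot are called.
The struct, the callback signature, and the array of works are simplified
assumptions; the kernel walks a list and passes a stack trace and a cookie.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_PENDING	1UL	/* bit 0: a task_work is queued */

/* Hypothetical, simplified stand-in for struct unwind_work. */
struct sketch_work {
	int bit;
	void (*func)(struct sketch_work *work);
	int calls;
};

/* Example callback that only counts how often it was invoked. */
static void sketch_count(struct sketch_work *work)
{
	work->calls++;
}

/* Returns the snapshot of the mask taken when pending was cleared. */
static unsigned long sketch_task_work(_Atomic unsigned long *mask,
				      struct sketch_work *works, size_t n)
{
	unsigned long bits;
	size_t i;

	assert(mask && works);

	/* Clear only the pending bit; the work bits stay set in the mask. */
	bits = atomic_fetch_and(mask, ~SKETCH_PENDING);

	/* Call back only the tracers that asked for a stack trace. */
	for (i = 0; i < n; i++) {
		if (bits & (1UL << works[i].bit))
			works[i].func(&works[i]);
	}

	return bits;
}
```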

If a new work requests a deferred unwind, then it will set both the
pending bit and its own bit. Note this will also cause any work that was
previously queued and had its callback already executed to be executed
again. Future work will remove these spurious callbacks.

The use of atomic_long bit operations was suggested by Peter Zijlstra:
Link: https://lore.kernel.org/all/20250715102912.GQ1613200@noisy.programming.kicks-ass.net/
The unwind_mask could not be converted to atomic_long_t due to atomic_long
not having all the bit operations needed by unwind_mask. Instead it
follows other use cases in the kernel and just typecasts the unwind_mask
to atomic_long_t when using the two atomic_long functions.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Indu Bhagat <indu.bhagat@oracle.com>
Cc: "Jose E. Marchesi" <jemarch@gnu.org>
Cc: Beau Belgrave <beaub@linux.microsoft.com>
Cc: Jens Remus <jremus@linux.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Sam James <sam@gentoo.org>
Link: https://lore.kernel.org/20250729182405.822789300@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

+92 -23
+23 -3
include/linux/unwind_deferred.h
···
 struct unwind_work {
 	struct list_head		list;
 	unwind_callback_t		func;
+	int				bit;
 };
 
 #ifdef CONFIG_UNWIND_USER
+
+enum {
+	UNWIND_PENDING_BIT = 0,
+};
+
+enum {
+	UNWIND_PENDING = BIT(UNWIND_PENDING_BIT),
+};
 
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
···
 
 static __always_inline void unwind_reset_info(void)
 {
-	if (unlikely(current->unwind_info.id.id))
+	struct unwind_task_info *info = &current->unwind_info;
+	unsigned long bits;
+
+	/* Was there any unwinding? */
+	if (unlikely(info->unwind_mask)) {
+		bits = info->unwind_mask;
+		do {
+			/* Is a task_work going to run again before going back */
+			if (bits & UNWIND_PENDING)
+				return;
+		} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
 		current->unwind_info.id.id = 0;
+	}
 	/*
 	 * As unwind_user_faultable() can be called directly and
 	 * depends on nr_entries being cleared on exit to user,
 	 * this needs to be a separate conditional.
 	 */
-	if (unlikely(current->unwind_info.cache))
-		current->unwind_info.cache->nr_entries = 0;
+	if (unlikely(info->cache))
+		info->cache->nr_entries = 0;
 }
 
 #else /* !CONFIG_UNWIND_USER */
+1 -1
include/linux/unwind_deferred_types.h
···
 };
 
 struct unwind_task_info {
+	unsigned long		unwind_mask;
 	struct unwind_cache	*cache;
 	struct callback_head	work;
 	union unwind_task_id	id;
-	int			pending;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */
+68 -19
kernel/unwind/deferred.c
···
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
 
+#define RESERVED_BITS	(UNWIND_PENDING)
+
+/* Zero'd bits are available for assigning callback users */
+static unsigned long unwind_mask = RESERVED_BITS;
+
+static inline bool unwind_pending(struct unwind_task_info *info)
+{
+	return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask);
+}
+
 /*
  * This is a unique percpu identifier for a given task entry context.
  * Conceptually, it's incremented every time the CPU enters the kernel from
···
 	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
+	unsigned long bits;
 	u64 cookie;
 
-	if (WARN_ON_ONCE(!info->pending))
+	if (WARN_ON_ONCE(!unwind_pending(info)))
 		return;
 
-	/* Allow work to come in again */
-	WRITE_ONCE(info->pending, 0);
-
+	/* Clear pending bit but make sure to have the current bits */
+	bits = atomic_long_fetch_andnot(UNWIND_PENDING,
+				  (atomic_long_t *)&info->unwind_mask);
 	/*
 	 * From here on out, the callback must always be called, even if it's
 	 * just an empty trace.
···
 
 	guard(mutex)(&callback_mutex);
 	list_for_each_entry(work, &callbacks, list) {
-		work->func(work, &trace, cookie);
+		if (test_bit(work->bit, &bits))
+			work->func(work, &trace, cookie);
 	}
 }
 
···
  * because it has already been previously called for the same entry context,
  * it will be called again with the same stack trace and cookie.
  *
- * Return: 1 if the the callback was already queued.
- *         0 if the callback successfully was queued.
+ * Return: 0 if the callback successfully was queued.
+ *         1 if the callback is pending or was already executed.
  *         Negative if there's an error.
  *         @cookie holds the cookie of the first request by any user
  */
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
-	long pending;
+	unsigned long old, bits;
+	unsigned long bit = BIT(work->bit);
 	int ret;
 
 	*cookie = 0;
···
 
 	*cookie = get_cookie(info);
 
-	/* callback already pending? */
-	pending = READ_ONCE(info->pending);
-	if (pending)
+	old = READ_ONCE(info->unwind_mask);
+
+	/* Is this already queued or executed */
+	if (old & bit)
 		return 1;
 
-	/* Claim the work unless an NMI just now swooped in to do so. */
-	if (!try_cmpxchg(&info->pending, &pending, 1))
-		return 1;
+	/*
+	 * This work's bit hasn't been set yet. Now set it with the PENDING
+	 * bit and fetch the current value of unwind_mask. If ether the
+	 * work's bit or PENDING was already set, then this is already queued
+	 * to have a callback.
+	 */
+	bits = UNWIND_PENDING | bit;
+	old = atomic_long_fetch_or(bits, (atomic_long_t *)&info->unwind_mask);
+	if (old & bits) {
+		/*
+		 * If the work's bit was set, whatever set it had better
+		 * have also set pending and queued a callback.
+		 */
+		WARN_ON_ONCE(!(old & UNWIND_PENDING));
+		return old & bit;
+	}
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret)) {
-		WRITE_ONCE(info->pending, 0);
-		return ret;
-	}
 
-	return 0;
+	if (WARN_ON_ONCE(ret))
+		WRITE_ONCE(info->unwind_mask, 0);
+
+	return ret;
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
 {
+	struct task_struct *g, *t;
+
 	if (!work)
+		return;
+
+	/* No work should be using a reserved bit */
+	if (WARN_ON_ONCE(BIT(work->bit) & RESERVED_BITS))
 		return;
 
 	guard(mutex)(&callback_mutex);
 	list_del(&work->list);
+
+	__clear_bit(work->bit, &unwind_mask);
+
+	guard(rcu)();
+	/* Clear this bit from all threads */
+	for_each_process_thread(g, t) {
+		clear_bit(work->bit, &t->unwind_info.unwind_mask);
+	}
 }
 
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
···
 	memset(work, 0, sizeof(*work));
 
 	guard(mutex)(&callback_mutex);
+
+	/* See if there's a bit in the mask available */
+	if (unwind_mask == ~0UL)
+		return -EBUSY;
+
+	work->bit = ffz(unwind_mask);
+	__set_bit(work->bit, &unwind_mask);
+
 	list_add(&work->list, &callbacks);
 	work->func = func;
 	return 0;
···
 
 	memset(info, 0, sizeof(*info));
 	init_task_work(&info->work, unwind_deferred_task_work);
+	info->unwind_mask = 0;
 }
 
 void unwind_task_free(struct task_struct *task)