rqspinlock: Enclose lock/unlock within lock entry acquisitions

Ritesh reported in [0] that timeouts occurred frequently for rqspinlock
due to reentrancy on the same lock on the same CPU. This patch closes
one of the races leading to this behavior, and reduces the frequency of
timeouts.

We currently have a tiny window between the fast-path cmpxchg and the
grabbing of the lock entry where an NMI could land, attempt the same
lock that was just acquired, and end up timing out. This is not ideal.
Instead, move the lock entry acquisition from the fast path to before
the cmpxchg, and remove the grabbing of the lock entry in the slow path,
assuming it was already taken by the fast path. The TAS fallback is
invoked directly, without being preceded by the typical fast path, so we
must continue to grab the deadlock detection entry in that case.

Case on lock leading to missed AA:

cmpxchg lock A
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
grab_held_lock_entry(A)
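
For context, "grabbing the lock entry" means publishing the lock pointer
into a per-CPU held-locks table that the deadlock detector scans. A
rough sketch of the helper, paraphrased from
include/asm-generic/rqspinlock.h (details may differ slightly):

static __always_inline void grab_held_lock_entry(void *lock)
{
	int cnt = this_cpu_inc_return(rqspinlock_held_locks.cnt);

	if (unlikely(cnt > RES_NR_HELD))
		return;	/* table overflow: only the count is kept */

	/* this_cpu_* ops imply a compiler barrier: inc stays before write */
	this_cpu_write(rqspinlock_held_locks.locks[cnt - 1], lock);
}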

There is a similar case when unlocking the lock. If an NMI lands between
the WRITE_ONCE and the smp_store_release, it may fail to diagnose the AA
condition, leading to a timeout.

Case on unlock leading to missed AA:

WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
<NMI>
... rqspinlock acquisition of A
... timeout
</NMI>
smp_store_release(A->locked, 0)

The patch changes the order on unlock: smp_store_release() is now
followed by the WRITE_ONCE() of NULL. This avoids the missed AA
detection described above, but may lead to a false positive if an NMI
lands between these two statements, which is acceptable (and preferred
over a timeout).
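
In code, the unlock path now performs the release store on the lock word
first and only then clears the table entry (fragment, simplified from
the res_spin_unlock hunk in the diff below):

	smp_store_release(&lock->locked, 0);	/* release the lock word */
	if (likely(rqh->cnt <= RES_NR_HELD))
		WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);	/* then clear entry */
	this_cpu_dec(rqspinlock_held_locks.cnt);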

The original intention of the reverse order on unlock was to prevent the
following possible misdiagnosis of an ABBA scenario:

CPU 0                                         CPU 1
grab entry A
lock A
grab entry B
lock B
unlock B
  smp_store_release(B->locked, 0)
                                              grab entry B
                                              lock B
                                              grab entry A
                                              lock A
                                              ! <detect ABBA>
  WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)

If the store release were ordered after the WRITE_ONCE, the other CPU
would not observe B in the table of the CPU unlocking lock B. However,
since the threads in this scenario are evidently participating in an
ABBA pattern anyway, the order above is no longer appealing: it may lead
to a 250 ms timeout due to missed AA detection.
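
To make the trade-off concrete: the AA check amounts to a waiter
scanning its own CPU's held-locks table before spinning, and a hit fails
fast with an error instead of burning through the 250 ms timeout. An
illustrative model (the real detector lives in kernel/bpf/rqspinlock.c;
the name check_aa and the exact loop bounds here are ours):

static bool check_aa(struct rqspinlock_held *rqh, void *lock)
{
	int cnt = min(rqh->cnt, RES_NR_HELD);

	/* Skip the newest entry: it is the attempt currently underway. */
	for (int i = 0; i < cnt - 1; i++) {
		if (rqh->locks[i] == lock)
			return true;	/* AA: report a deadlock, don't time out */
	}
	return false;
}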

[0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com

Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters")
Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Authored by Kumar Kartikeya Dwivedi, committed by Alexei Starovoitov
beb7021a bd5bdd20

2 files changed, 38 insertions(+), 37 deletions(-)

include/asm-generic/rqspinlock.h (+32 -28)
···
  * <error> for lock B
  * release_held_lock_entry
  *
- * try_cmpxchg_acquire for lock A
  * grab_held_lock_entry
+ * try_cmpxchg_acquire for lock A
  *
  * Lack of any ordering means reordering may occur such that dec, inc
  * are done before entry is overwritten. This permits a remote lock
···
  * CPU holds a lock it is attempting to acquire, leading to false ABBA
  * diagnosis).
  *
- * In case of unlock, we will always do a release on the lock word after
- * releasing the entry, ensuring that other CPUs cannot hold the lock
- * (and make conclusions about deadlocks) until the entry has been
- * cleared on the local CPU, preventing any anomalies. Reordering is
- * still possible there, but a remote CPU cannot observe a lock in our
- * table which it is already holding, since visibility entails our
- * release store for the said lock has not retired.
+ * The case of unlock is treated differently due to NMI reentrancy, see
+ * comments in res_spin_unlock.
  *
  * In theory we don't have a problem if the dec and WRITE_ONCE above get
  * reordered with each other, we either notice an empty NULL entry on
···
 {
 	int val = 0;
 
-	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL))) {
-		grab_held_lock_entry(lock);
+	/*
+	 * Grab the deadlock detection entry before doing the cmpxchg, so that
+	 * reentrancy due to NMIs between the succeeding cmpxchg and creation of
+	 * held lock entry can correctly detect an acquisition attempt in the
+	 * interrupted context.
+	 *
+	 * cmpxchg lock A
+	 * <NMI>
+	 * res_spin_lock(A) --> missed AA, leads to timeout
+	 * </NMI>
+	 * grab_held_lock_entry(A)
+	 */
+	grab_held_lock_entry(lock);
+
+	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
 		return 0;
-	}
 	return resilient_queued_spin_lock_slowpath(lock, val);
 }
···
 {
 	struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
 
-	if (unlikely(rqh->cnt > RES_NR_HELD))
-		goto unlock;
-	WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
-unlock:
 	/*
-	 * Release barrier, ensures correct ordering. See release_held_lock_entry
-	 * for details. Perform release store instead of queued_spin_unlock,
-	 * since we use this function for test-and-set fallback as well. When we
-	 * have CONFIG_QUEUED_SPINLOCKS=n, we clear the full 4-byte lockword.
+	 * Release barrier, ensures correct ordering. Perform release store
+	 * instead of queued_spin_unlock, since we use this function for the TAS
+	 * fallback as well. When we have CONFIG_QUEUED_SPINLOCKS=n, we clear
+	 * the full 4-byte lockword.
 	 *
-	 * Like release_held_lock_entry, we can do the release before the dec.
-	 * We simply care about not seeing the 'lock' in our table from a remote
-	 * CPU once the lock has been released, which doesn't rely on the dec.
+	 * Perform the smp_store_release before clearing the lock entry so that
+	 * NMIs landing in the unlock path can correctly detect AA issues. The
+	 * opposite order shown below may lead to missed AA checks:
 	 *
-	 * Unlike smp_wmb(), release is not a two way fence, hence it is
-	 * possible for a inc to move up and reorder with our clearing of the
-	 * entry. This isn't a problem however, as for a misdiagnosis of ABBA,
-	 * the remote CPU needs to hold this lock, which won't be released until
-	 * the store below is done, which would ensure the entry is overwritten
-	 * to NULL, etc.
+	 * WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL)
+	 * <NMI>
+	 * res_spin_lock(A) --> missed AA, leads to timeout
+	 * </NMI>
+	 * smp_store_release(A->locked, 0)
 	 */
 	smp_store_release(&lock->locked, 0);
+	if (likely(rqh->cnt <= RES_NR_HELD))
+		WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL);
 	this_cpu_dec(rqspinlock_held_locks.cnt);
 }
kernel/bpf/rqspinlock.c (+6 -9)
···
 	int val, ret = 0;
 
 	RES_INIT_TIMEOUT(ts);
+	/*
+	 * The fast path is not invoked for the TAS fallback, so we must grab
+	 * the deadlock detection entry here.
+	 */
 	grab_held_lock_entry(lock);
 
 	/*
···
 		goto queue;
 	}
 
-	/*
-	 * Grab an entry in the held locks array, to enable deadlock detection.
-	 */
-	grab_held_lock_entry(lock);
+	/* Deadlock detection entry already held after failing fast path. */
 
 	/*
 	 * We're pending, wait for the owner to go away.
···
 	 */
 queue:
 	lockevent_inc(lock_slowpath);
-	/*
-	 * Grab deadlock detection entry for the queue path.
-	 */
-	grab_held_lock_entry(lock);
-
+	/* Deadlock detection entry already held after failing fast path. */
 	node = this_cpu_ptr(&rqnodes[0].mcs);
 	idx = node->count++;
 	tail = encode_tail(smp_processor_id(), idx);