
Merge branch 'bpf-cpumap-devmap-fix-per-cpu-bulk-queue-races-on-preempt_rt'

Jiayuan Chen says:

====================
bpf: Fix per-CPU bulk queue races on PREEMPT_RT

On PREEMPT_RT kernels, local_bh_disable() only calls migrate_disable()
(when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption. This means CFS scheduling can preempt a task inside the
per-CPU bulk queue (bq) operations in cpumap and devmap, allowing
another task on the same CPU to concurrently access the same bq,
leading to use-after-free, list corruption, and kernel panics.
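
The window is easy to demonstrate outside the kernel. The following
standalone C program (illustrative only; every name in it is invented,
it is not kernel code) mimics the pattern: two threads pinned to one
CPU share an unlocked bulk queue, and the flush walks the array after
snapshotting the count, which is exactly the step a preempted task
resumes into:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

#define BULK 8

static struct bulk_queue {
        void *q[BULK];
        unsigned int count;
} bq;                           /* stands in for the per-CPU bq */

static void bq_flush(void)
{
        unsigned int cnt = bq.count;    /* snapshot, as in bq_xmit_all() */

        for (unsigned int i = 0; i < cnt; i++) {
                sched_yield();  /* widen the window, like the mdelay(100) */
                free(bq.q[i]);  /* double free once the peer flushes too */
                bq.q[i] = NULL;
        }
        bq.count = 0;
}

static void *worker(void *arg)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);       /* both workers on CPU 0 */
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (int i = 0; i < 1000000; i++) {
                if (bq.count < BULK)
                        bq.q[bq.count++] = malloc(64);
                else
                        bq_flush();
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

Built with gcc -pthread, this usually aborts in the allocator (double
free) within seconds; the kernel hits the same interleaving with freed
xdp_frames instead of heap chunks.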

Patch 1 fixes the cpumap race in bq_flush_to_queue(), originally
reported by syzbot [1].

Patch 2 fixes the same class of race in devmap's bq_xmit_all(),
identified by code inspection after Sebastian Andrzej Siewior pointed
out that devmap has the same per-CPU bulk queue pattern [2].

Both patches use local_lock_nested_bh() to serialize access to the
per-CPU bq. On non-RT this is a pure lockdep annotation with no
overhead; on PREEMPT_RT it provides a per-CPU sleeping lock.
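
Condensed into one place (struct trimmed to the relevant fields, flush
logic elided; this paraphrases the diffs below rather than quoting
them), the pattern both patches apply looks like this:

#include <linux/local_lock.h>
#include <linux/percpu.h>
#include <net/xdp.h>

#define CPU_MAP_BULK_SIZE 8             /* as defined in cpumap.c */

struct xdp_bulk_queue {
        struct xdp_frame *q[CPU_MAP_BULK_SIZE];
        unsigned int count;
        local_lock_t bq_lock;           /* guards this per-CPU instance */
};

/* once per possible CPU, at map-entry/netdev setup time */
static void bq_init(struct xdp_bulk_queue __percpu *bulkq)
{
        int cpu;

        for_each_possible_cpu(cpu)
                local_lock_init(&per_cpu_ptr(bulkq, cpu)->bq_lock);
}

static void bq_enqueue(struct xdp_bulk_queue __percpu *bulkq,
                       struct xdp_frame *xdpf)
{
        struct xdp_bulk_queue *bq;

        /* The caller already runs under local_bh_disable(), hence the
         * nested_bh flavour: a pure lockdep annotation on !PREEMPT_RT,
         * a per-CPU sleeping lock that excludes a preempting task on
         * the same CPU on PREEMPT_RT.
         */
        local_lock_nested_bh(&bulkq->bq_lock);
        bq = this_cpu_ptr(bulkq);
        if (bq->count < CPU_MAP_BULK_SIZE)
                bq->q[bq->count++] = xdpf;
        local_unlock_nested_bh(&bulkq->bq_lock);
}

The flush side takes the same lock around bq_flush_to_queue() /
bq_xmit_all(), and those functions assert it via lockdep_assert_held().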

To reproduce the devmap race, insert an mdelay(100) in bq_xmit_all()
after "cnt = bq->count" and before the actual transmit loop. Then pin
two threads to the same CPU, each running BPF_PROG_TEST_RUN with an XDP
program that redirects to a DEVMAP entry (e.g. a veth pair). CFS
timeslicing during the mdelay window causes the interleaving (a sketch
of such a driver follows the trace below). Without the
fix, KASAN reports null-ptr-deref due to operating on freed frames:

BUG: KASAN: null-ptr-deref in __build_skb_around+0x22d/0x340
Write of size 32 at addr 0000000000000d50 by task devmap_race_rep/449

CPU: 0 UID: 0 PID: 449 Comm: devmap_race_rep Not tainted 6.19.0+ #31 PREEMPT_RT
Call Trace:
<TASK>
__build_skb_around+0x22d/0x340
build_skb_around+0x25/0x260
__xdp_build_skb_from_frame+0x103/0x860
veth_xdp_rcv_bulk_skb.isra.0+0x162/0x320
veth_xdp_rcv.constprop.0+0x61e/0xbb0
veth_poll+0x280/0xb50
__napi_poll.constprop.0+0xa5/0x590
net_rx_action+0x4b0/0xea0
handle_softirqs.isra.0+0x1b3/0x780
__local_bh_enable_ip+0x12a/0x240
xdp_test_run_batch.constprop.0+0xedd/0x1f60
bpf_test_run_xdp_live+0x304/0x640
bpf_prog_test_run_xdp+0xd24/0x1b70
__sys_bpf+0x61c/0x3e00
</TASK>

Kernel panic - not syncing: Fatal exception in interrupt
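
A userspace driver for this reproducer could look like the sketch
below. The object file and program name (xdp_redirect.bpf.o,
redirect_devmap) are invented, error handling is elided, and it assumes
the mdelay(100) described above is already patched in and the XDP
program redirects into a populated DEVMAP:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <bpf/libbpf.h>
#include <linux/bpf.h>

static int prog_fd;

static void *run(void *arg)
{
        unsigned char pkt[64] = { 0 };  /* dummy frame, contents irrelevant */
        cpu_set_t set;
        LIBBPF_OPTS(bpf_test_run_opts, opts,
                .data_in = pkt,
                .data_size_in = sizeof(pkt),
                .repeat = 1 << 20,
                /* run against the live XDP path so redirected frames
                 * really pass through devmap's per-CPU bulk queue */
                .flags = BPF_F_TEST_XDP_LIVE_FRAMES,
        );

        /* pin both threads to CPU 0 so CFS must interleave them there */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        bpf_prog_test_run_opts(prog_fd, &opts);
        return NULL;
}

int main(void)
{
        struct bpf_object *obj;
        pthread_t t[2];

        obj = bpf_object__open_file("xdp_redirect.bpf.o", NULL);
        bpf_object__load(obj);
        prog_fd = bpf_program__fd(
                bpf_object__find_program_by_name(obj, "redirect_devmap"));

        for (int i = 0; i < 2; i++)
                pthread_create(&t[i], NULL, run, NULL);
        for (int i = 0; i < 2; i++)
                pthread_join(t[i], NULL);
        return 0;
}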

[1] https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
[2] https://lore.kernel.org/bpf/20260212023634.366343-1-jiayuan.chen@linux.dev/

v3 -> v4: https://lore.kernel.org/all/20260213034018.284146-1-jiayuan.chen@linux.dev/
- Move panic trace to cover letter. (Sebastian Andrzej Siewior)
- Add Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> to
both patches, picked up from the reply to the cover letter.

v2 -> v3: https://lore.kernel.org/bpf/20260212023634.366343-1-jiayuan.chen@linux.dev/
- Fix commit message: remove the incorrect "spin_lock() becomes rt_mutex"
claim; the per-CPU bq has no spin_lock at all. (Sebastian Andrzej Siewior)
- Fix commit message: accurately describe local_lock_nested_bh()
behavior instead of referencing local_lock(). (Sebastian Andrzej Siewior)
- Remove incomplete discussion of snapshot alternative.
(Sebastian Andrzej Siewior)
- Remove panic trace from commit message. (Sebastian Andrzej Siewior)
- Add patch 2/2 for devmap, same race pattern. (Sebastian Andrzej Siewior)

v1 -> v2: https://lore.kernel.org/bpf/20260211064417.196401-1-jiayuan.chen@linux.dev/
- Use local_lock_nested_bh()/local_unlock_nested_bh() instead of
local_lock()/local_unlock(), since these paths already run under
local_bh_disable(). (Sebastian Andrzej Siewior)
- Replace "Caller must hold bq->bq_lock" comment with
lockdep_assert_held() in bq_flush_to_queue(). (Sebastian Andrzej Siewior)
- Fix the Fixes tag to point at 3253cb49cbad ("softirq: Allow to drop the
softirq-BKL lock on PREEMPT_RT"), the commit that actually makes the
race possible. (Sebastian Andrzej Siewior)
====================

Link: https://patch.msgid.link/20260225121459.183121-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

2 files changed, 36 insertions(+), 6 deletions(-)

kernel/bpf/cpumap.c (+15 -2):

···
 #include <linux/sched.h>
 #include <linux/workqueue.h>
 #include <linux/kthread.h>
+#include <linux/local_lock.h>
 #include <linux/completion.h>
 #include <trace/events/xdp.h>
 #include <linux/btf_ids.h>
···
        struct list_head flush_node;
        struct bpf_cpu_map_entry *obj;
        unsigned int count;
+       local_lock_t bq_lock;
 };

 /* Struct for every remote "destination" CPU in map */
···
        for_each_possible_cpu(i) {
                bq = per_cpu_ptr(rcpu->bulkq, i);
                bq->obj = rcpu;
+               local_lock_init(&bq->bq_lock);
        }

        /* Alloc queue */
···
        struct ptr_ring *q;
        int i;

+       lockdep_assert_held(&bq->bq_lock);
+
        if (unlikely(!bq->count))
                return;
···
 }

 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
- * Thus, safe percpu variable access.
+ * Thus, safe percpu variable access. PREEMPT_RT relies on
+ * local_lock_nested_bh() to serialise access to the per-CPU bq.
  */
 static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
-       struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
+       struct xdp_bulk_queue *bq;
+
+       local_lock_nested_bh(&rcpu->bulkq->bq_lock);
+       bq = this_cpu_ptr(rcpu->bulkq);

        if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
                bq_flush_to_queue(bq);
···
                list_add(&bq->flush_node, flush_list);
        }
+
+       local_unlock_nested_bh(&rcpu->bulkq->bq_lock);
 }

 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf,
···
        struct xdp_bulk_queue *bq, *tmp;

        list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+               local_lock_nested_bh(&bq->obj->bulkq->bq_lock);
                bq_flush_to_queue(bq);
+               local_unlock_nested_bh(&bq->obj->bulkq->bq_lock);

                /* If already running, costs spin_lock_irqsave + smb_mb */
                wake_up_process(bq->obj->kthread);
kernel/bpf/devmap.c (+21 -4):

···
  * types of devmap; only the lookup and insertion is different.
  */
 #include <linux/bpf.h>
+#include <linux/local_lock.h>
 #include <net/xdp.h>
 #include <linux/filter.h>
 #include <trace/events/xdp.h>
···
        struct net_device *dev_rx;
        struct bpf_prog *xdp_prog;
        unsigned int count;
+       local_lock_t bq_lock;
 };

 struct bpf_dtab_netdev {
···
        int to_send = cnt;
        int i;

+       lockdep_assert_held(&bq->bq_lock);
+
        if (unlikely(!cnt))
                return;
···
        struct xdp_dev_bulk_queue *bq, *tmp;

        list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
+               local_lock_nested_bh(&bq->dev->xdp_bulkq->bq_lock);
                bq_xmit_all(bq, XDP_XMIT_FLUSH);
                bq->dev_rx = NULL;
                bq->xdp_prog = NULL;
                __list_del_clearprev(&bq->flush_node);
+               local_unlock_nested_bh(&bq->dev->xdp_bulkq->bq_lock);
        }
 }
···

 /* Runs in NAPI, i.e., softirq under local_bh_disable(). Thus, safe percpu
  * variable access, and map elements stick around. See comment above
- * xdp_do_flush() in filter.c.
+ * xdp_do_flush() in filter.c. PREEMPT_RT relies on local_lock_nested_bh()
+ * to serialise access to the per-CPU bq.
  */
 static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
                        struct net_device *dev_rx, struct bpf_prog *xdp_prog)
 {
-       struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
+       struct xdp_dev_bulk_queue *bq;
+
+       local_lock_nested_bh(&dev->xdp_bulkq->bq_lock);
+       bq = this_cpu_ptr(dev->xdp_bulkq);

        if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
                bq_xmit_all(bq, 0);
···
        }

        bq->q[bq->count++] = xdpf;
+
+       local_unlock_nested_bh(&dev->xdp_bulkq->bq_lock);
 }

 static inline int __xdp_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
···
                if (!netdev->xdp_bulkq)
                        return NOTIFY_BAD;

-               for_each_possible_cpu(cpu)
-                       per_cpu_ptr(netdev->xdp_bulkq, cpu)->dev = netdev;
+               for_each_possible_cpu(cpu) {
+                       struct xdp_dev_bulk_queue *bq;
+
+                       bq = per_cpu_ptr(netdev->xdp_bulkq, cpu);
+                       bq->dev = netdev;
+                       local_lock_init(&bq->bq_lock);
+               }
                break;
        case NETDEV_UNREGISTER:
                /* This rcu_read_lock/unlock pair is needed because