Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge branch 'bpf-avoid-locks-in-bpf_timer-and-bpf_wq'

Alexei Starovoitov says:

====================
bpf: Avoid locks in bpf_timer and bpf_wq

From: Alexei Starovoitov <ast@kernel.org>

This series reworks the implementation of the BPF timer and workqueue
APIs to make them usable from any context.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>

Changes in v9:
- Different approach for patches 1 and 3:
- s/EBUSY/ENOENT/ when refcnt==0 to match existing
- drop latch, use refcnt and kmalloc_nolock() instead
- address race between timer/wq_start and delete_elem, add a test
- Link to v8: https://lore.kernel.org/bpf/20260127-timer_nolock-v8-0-5a29a9571059@meta.com/

Changes in v8:
- Return -EBUSY in bpf_async_read_op() if last_seq fails to be set
- In bpf_async_cancel_and_free() drop bpf_async_cb ref after calling bpf_async_process()
- Link to v7: https://lore.kernel.org/r/20260122-timer_nolock-v7-0-04a45c55c2e2@meta.com

Changes in v7:
- Addressed Andrii's review points from the previous version - nothing
  very significant.
- Added NMI stress tests for bpf_timer - hit a few failing verifier checks
  and removed them.
- Addressed a sparse warning in bpf_async_update_prog_callback()
- Link to v6: https://lore.kernel.org/r/20260120-timer_nolock-v6-0-670ffdd787b4@meta.com

Changes in v6:
- Reworked destruction and refcnt use:
  - On cancel_and_free() set last_seq to BPF_ASYNC_DESTROY value, drop
    map's reference
  - In irq work callback, atomically switch DESTROY to DESTROYED, cancel
    timer/wq
  - Free bpf_async_cb on refcnt going to 0.
- Link to v5: https://lore.kernel.org/r/20260115-timer_nolock-v5-0-15e3aef2703d@meta.com

Changes in v5:
- Extracted lock-free algorithm for updating cb->prog and
  cb->callback_fn into a function bpf_async_update_prog_callback(),
  added a new commit that introduces this function and uses it in
  __bpf_async_set_callback(), bpf_timer_cancel() and
  bpf_async_cancel_and_free().
  This allows moving the change into a separate commit without breaking
  correctness.
- Handle NULL prog in bpf_async_update_prog_callback().
- Link to v4: https://lore.kernel.org/r/20260114-timer_nolock-v4-0-fa6355f51fa7@meta.com

Changes in v4:
- Handle irq_work_queue failures in both schedule and cancel_and_free
  paths: introduced bpf_async_refcnt_dec_cleanup() that decrements refcnt
  and makes sure that, if the last reference is put, there is at least one
  irq_work scheduled to execute final cleanup.
- Additional refcnt inc/dec in set_callback() + rcu lock to make sure
  cleanup is not running at the same time as set_callback().
- Added READ_ONCE where it was needed.
- Squash 'bpf: Refactor __bpf_async_set_callback()' commit into 'bpf:
  Add lock-free cell for NMI-safe async operations'
- Removed mpmc_cell, use seqcount_latch_t instead.
- Link to v3: https://lore.kernel.org/r/20260107-timer_nolock-v3-0-740d3ec3e5f9@meta.com

Changes in v3:
- Major rework
- Introduce mpmc_cell, allowing concurrent writes and reads
- Implement irq_work deferring
- Add selftests
- Introduce bpf_timer_cancel_async kfunc
- Link to v2: https://lore.kernel.org/r/20251105-timer_nolock-v2-0-32698db08bfa@meta.com

Changes in v2:
- Move refcnt initialization and put (from cancel_and_free())
  from patch 5 into patch 4, so that patch 4 has a clearer and more
  complete implementation and use of refcnt
- Link to v1: https://lore.kernel.org/r/20251031-timer_nolock-v1-0-b064ae403bfb@meta.com
====================

Link: https://patch.msgid.link/20260201025403.66625-1-alexei.starovoitov@gmail.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
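
The refcount rule described in the changelog (start operations fail with -ENOENT once refcnt has reached zero, and the object is released when the last reference is dropped) can be illustrated in userspace. This is a minimal sketch under stated assumptions, not the kernel code: C11 atomics stand in for refcount_t, and struct async_obj, obj_init(), obj_get_not_zero(), obj_put() and obj_start() are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct bpf_async_cb; only the refcount matters. */
struct async_obj {
	atomic_int refcnt;	/* starts at 1: the map's reference */
	bool freed;		/* set instead of really freeing, for the demo */
};

static void obj_init(struct async_obj *o)
{
	atomic_init(&o->refcnt, 1);
	o->freed = false;
}

/* Take a reference unless the count has already dropped to zero. */
static bool obj_get_not_zero(struct async_obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; whoever drops the last one releases the object. */
static void obj_put(struct async_obj *o)
{
	int prev = atomic_fetch_sub(&o->refcnt, 1);

	assert(prev > 0);	/* catch an unbalanced put */
	if (prev == 1)
		o->freed = true;
}

/* Models a start path: refuse to arm an object that is being torn down. */
static int obj_start(struct async_obj *o)
{
	if (!obj_get_not_zero(o))
		return -ENOENT;
	/* ... arm the timer or queue the work here ... */
	obj_put(o);
	return 0;
}
```

In this model, the teardown path is simply an obj_put() of the initial reference; any obj_start() that loses the race sees a zero count and backs off instead of touching an object that is about to be freed.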

+865 -356
+281 -191
kernel/bpf/helpers.c
···
 	return (void *)value - round_up(map->key_size, 8);
 }

+enum bpf_async_type {
+	BPF_ASYNC_TYPE_TIMER = 0,
+	BPF_ASYNC_TYPE_WQ,
+};
+
+enum bpf_async_op {
+	BPF_ASYNC_START,
+	BPF_ASYNC_CANCEL
+};
+
+struct bpf_async_cmd {
+	struct llist_node node;
+	u64 nsec;
+	u32 mode;
+	enum bpf_async_op op;
+};
+
 struct bpf_async_cb {
 	struct bpf_map *map;
 	struct bpf_prog *prog;
 	void __rcu *callback_fn;
 	void *value;
-	union {
-		struct rcu_head rcu;
-		struct work_struct delete_work;
-	};
+	struct rcu_head rcu;
 	u64 flags;
+	struct irq_work worker;
+	refcount_t refcnt;
+	enum bpf_async_type type;
+	struct llist_head async_cmds;
 };

 /* BPF map elements can contain 'struct bpf_timer'.
···
 struct bpf_work {
 	struct bpf_async_cb cb;
 	struct work_struct work;
-	struct work_struct delete_work;
 };

 /* the actual struct hidden inside uapi struct bpf_timer and bpf_wq */
···
 		struct bpf_hrtimer *timer;
 		struct bpf_work *work;
 	};
-	/* bpf_spin_lock is used here instead of spinlock_t to make
-	 * sure that it always fits into space reserved by struct bpf_timer
-	 * regardless of LOCKDEP and spinlock debug flags.
-	 */
-	struct bpf_spin_lock lock;
 } __attribute__((aligned(8)));

-enum bpf_async_type {
-	BPF_ASYNC_TYPE_TIMER = 0,
-	BPF_ASYNC_TYPE_WQ,
-};
-
 static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
+
+static void bpf_async_refcount_put(struct bpf_async_cb *cb);

 static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
 {
···
 {
 	struct bpf_async_cb *cb = container_of(rcu, struct bpf_async_cb, rcu);

+	/*
+	 * Drop the last reference to prog only after RCU GP, as set_callback()
+	 * may race with cancel_and_free()
+	 */
+	if (cb->prog)
+		bpf_prog_put(cb->prog);
+
 	kfree_nolock(cb);
 }

-static void bpf_wq_delete_work(struct work_struct *work)
+/* Callback from call_rcu_tasks_trace, chains to call_rcu for final free */
+static void bpf_async_cb_rcu_tasks_trace_free(struct rcu_head *rcu)
 {
-	struct bpf_work *w = container_of(work, struct bpf_work, delete_work);
+	struct bpf_async_cb *cb = container_of(rcu, struct bpf_async_cb, rcu);
+	struct bpf_hrtimer *t = container_of(cb, struct bpf_hrtimer, cb);
+	struct bpf_work *w = container_of(cb, struct bpf_work, cb);
+	bool retry = false;

-	cancel_work_sync(&w->work);
-
-	call_rcu(&w->cb.rcu, bpf_async_cb_rcu_free);
-}
-
-static void bpf_timer_delete_work(struct work_struct *work)
-{
-	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, cb.delete_work);
-
-	/* Cancel the timer and wait for callback to complete if it was running.
-	 * If hrtimer_cancel() can be safely called it's safe to call
-	 * call_rcu() right after for both preallocated and non-preallocated
-	 * maps. The async->cb = NULL was already done and no code path can see
-	 * address 't' anymore. Timer if armed for existing bpf_hrtimer before
-	 * bpf_timer_cancel_and_free will have been cancelled.
+	/*
+	 * bpf_async_cancel_and_free() tried to cancel timer/wq, but it
+	 * could have raced with timer/wq_start. Now refcnt is zero and
+	 * srcu/rcu GP completed. Cancel timer/wq again.
 	 */
-	hrtimer_cancel(&t->timer);
-	call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
+	switch (cb->type) {
+	case BPF_ASYNC_TYPE_TIMER:
+		if (hrtimer_try_to_cancel(&t->timer) < 0)
+			retry = true;
+		break;
+	case BPF_ASYNC_TYPE_WQ:
+		if (!cancel_work(&w->work))
+			retry = true;
+		break;
+	}
+	if (retry) {
+		/*
+		 * hrtimer or wq callback may still be running. It must be
+		 * in rcu_tasks_trace or rcu CS, so wait for GP again.
+		 * It won't retry forever, since refcnt zero prevents all
+		 * operations on timer/wq.
+		 */
+		call_rcu_tasks_trace(&cb->rcu, bpf_async_cb_rcu_tasks_trace_free);
+		return;
+	}
+
+	/* rcu_trace_implies_rcu_gp() is true and will remain so */
+	bpf_async_cb_rcu_free(rcu);
 }
+
+static void bpf_async_refcount_put(struct bpf_async_cb *cb)
+{
+	if (!refcount_dec_and_test(&cb->refcnt))
+		return;
+
+	call_rcu_tasks_trace(&cb->rcu, bpf_async_cb_rcu_tasks_trace_free);
+}
+
+static void bpf_async_cancel_and_free(struct bpf_async_kern *async);
+static void bpf_async_irq_worker(struct irq_work *work);

 static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
 			    enum bpf_async_type type)
 {
-	struct bpf_async_cb *cb;
+	struct bpf_async_cb *cb, *old_cb;
 	struct bpf_hrtimer *t;
 	struct bpf_work *w;
 	clockid_t clockid;
 	size_t size;
-	int ret = 0;
-
-	if (in_nmi())
-		return -EOPNOTSUPP;

 	switch (type) {
 	case BPF_ASYNC_TYPE_TIMER:
···
 		return -EINVAL;
 	}

-	__bpf_spin_lock_irqsave(&async->lock);
-	t = async->timer;
-	if (t) {
-		ret = -EBUSY;
-		goto out;
-	}
+	old_cb = READ_ONCE(async->cb);
+	if (old_cb)
+		return -EBUSY;

 	cb = bpf_map_kmalloc_nolock(map, size, 0, map->numa_node);
-	if (!cb) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!cb)
+		return -ENOMEM;

 	switch (type) {
 	case BPF_ASYNC_TYPE_TIMER:
···
 		t = (struct bpf_hrtimer *)cb;

 		atomic_set(&t->cancelling, 0);
-		INIT_WORK(&t->cb.delete_work, bpf_timer_delete_work);
 		hrtimer_setup(&t->timer, bpf_timer_cb, clockid, HRTIMER_MODE_REL_SOFT);
 		cb->value = (void *)async - map->record->timer_off;
 		break;
···
 		w = (struct bpf_work *)cb;

 		INIT_WORK(&w->work, bpf_wq_work);
-		INIT_WORK(&w->delete_work, bpf_wq_delete_work);
 		cb->value = (void *)async - map->record->wq_off;
 		break;
 	}
 	cb->map = map;
 	cb->prog = NULL;
 	cb->flags = flags;
+	cb->worker = IRQ_WORK_INIT(bpf_async_irq_worker);
+	init_llist_head(&cb->async_cmds);
+	refcount_set(&cb->refcnt, 1); /* map's reference */
+	cb->type = type;
 	rcu_assign_pointer(cb->callback_fn, NULL);

-	WRITE_ONCE(async->cb, cb);
+	old_cb = cmpxchg(&async->cb, NULL, cb);
+	if (old_cb) {
+		/* Lost the race to initialize this bpf_async_kern, drop the allocated object */
+		kfree_nolock(cb);
+		return -EBUSY;
+	}
 	/* Guarantee the order between async->cb and map->usercnt. So
 	 * when there are concurrent uref release and bpf timer init, either
 	 * bpf_timer_cancel_and_free() called by uref release reads a no-NULL
···
 		/* maps with timers must be either held by user space
 		 * or pinned in bpffs.
 		 */
-		WRITE_ONCE(async->cb, NULL);
-		kfree_nolock(cb);
-		ret = -EPERM;
+		bpf_async_cancel_and_free(async);
+		return -EPERM;
 	}
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return ret;
+
+	return 0;
 }

 BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map,
···
 	.arg3_type	= ARG_ANYTHING,
 };

-static int bpf_async_update_prog_callback(struct bpf_async_cb *cb, void *callback_fn,
-					  struct bpf_prog *prog)
+static int bpf_async_update_prog_callback(struct bpf_async_cb *cb,
+					  struct bpf_prog *prog,
+					  void *callback_fn)
 {
 	struct bpf_prog *prev;

···
 		if (prev)
 			bpf_prog_put(prev);

-	} while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
+	} while (READ_ONCE(cb->prog) != prog ||
+		 (void __force *)READ_ONCE(cb->callback_fn) != callback_fn);

 	if (prog)
 		bpf_prog_put(prog);
···
 	return 0;
 }

+static int bpf_async_schedule_op(struct bpf_async_cb *cb, enum bpf_async_op op,
+				 u64 nsec, u32 timer_mode)
+{
+	WARN_ON_ONCE(!in_hardirq());
+
+	struct bpf_async_cmd *cmd = kmalloc_nolock(sizeof(*cmd), 0, NUMA_NO_NODE);
+
+	if (!cmd) {
+		bpf_async_refcount_put(cb);
+		return -ENOMEM;
+	}
+	init_llist_node(&cmd->node);
+	cmd->nsec = nsec;
+	cmd->mode = timer_mode;
+	cmd->op = op;
+	if (llist_add(&cmd->node, &cb->async_cmds))
+		irq_work_queue(&cb->worker);
+	return 0;
+}
+
 static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
 				    struct bpf_prog *prog)
 {
 	struct bpf_async_cb *cb;
-	int ret = 0;

-	if (in_nmi())
-		return -EOPNOTSUPP;
-	__bpf_spin_lock_irqsave(&async->lock);
-	cb = async->cb;
-	if (!cb) {
-		ret = -EINVAL;
-		goto out;
-	}
-	if (!atomic64_read(&cb->map->usercnt)) {
-		/* maps with timers must be either held by user space
-		 * or pinned in bpffs. Otherwise timer might still be
-		 * running even when bpf prog is detached and user space
-		 * is gone, since map_release_uref won't ever be called.
-		 */
-		ret = -EPERM;
-		goto out;
-	}
-	ret = bpf_async_update_prog_callback(cb, callback_fn, prog);
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return ret;
+	cb = READ_ONCE(async->cb);
+	if (!cb)
+		return -EINVAL;
+
+	return bpf_async_update_prog_callback(cb, prog, callback_fn);
 }

 BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
···
 	.arg2_type	= ARG_PTR_TO_FUNC,
 };

-BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, flags)
+BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, async, u64, nsecs, u64, flags)
 {
 	struct bpf_hrtimer *t;
-	int ret = 0;
-	enum hrtimer_mode mode;
+	u32 mode;

-	if (in_nmi())
-		return -EOPNOTSUPP;
 	if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
 		return -EINVAL;
-	__bpf_spin_lock_irqsave(&timer->lock);
-	t = timer->timer;
-	if (!t || !t->cb.prog) {
-		ret = -EINVAL;
-		goto out;
-	}
+
+	t = READ_ONCE(async->timer);
+	if (!t || !READ_ONCE(t->cb.prog))
+		return -EINVAL;

 	if (flags & BPF_F_TIMER_ABS)
 		mode = HRTIMER_MODE_ABS_SOFT;
···
 	if (flags & BPF_F_TIMER_CPU_PIN)
 		mode |= HRTIMER_MODE_PINNED;

-	hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
-out:
-	__bpf_spin_unlock_irqrestore(&timer->lock);
-	return ret;
+	/*
+	 * bpf_async_cancel_and_free() could have dropped refcnt to zero. In
+	 * such case BPF progs are not allowed to arm the timer to prevent UAF.
+	 */
+	if (!refcount_inc_not_zero(&t->cb.refcnt))
+		return -ENOENT;
+
+	if (!in_hardirq()) {
+		hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
+		bpf_async_refcount_put(&t->cb);
+		return 0;
+	} else {
+		return bpf_async_schedule_op(&t->cb, BPF_ASYNC_START, nsecs, mode);
+	}
 }

 static const struct bpf_func_proto bpf_timer_start_proto = {
···
 	bool inc = false;
 	int ret = 0;

-	if (in_nmi())
+	if (in_hardirq())
 		return -EOPNOTSUPP;
-
-	guard(rcu)();

 	t = READ_ONCE(async->timer);
 	if (!t)
···
 	.arg1_type	= ARG_PTR_TO_TIMER,
 };

-static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
+static void bpf_async_process_op(struct bpf_async_cb *cb, u32 op,
+				 u64 timer_nsec, u32 timer_mode)
+{
+	switch (cb->type) {
+	case BPF_ASYNC_TYPE_TIMER: {
+		struct bpf_hrtimer *t = container_of(cb, struct bpf_hrtimer, cb);
+
+		switch (op) {
+		case BPF_ASYNC_START:
+			hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
+			break;
+		case BPF_ASYNC_CANCEL:
+			hrtimer_try_to_cancel(&t->timer);
+			break;
+		}
+		break;
+	}
+	case BPF_ASYNC_TYPE_WQ: {
+		struct bpf_work *w = container_of(cb, struct bpf_work, cb);
+
+		switch (op) {
+		case BPF_ASYNC_START:
+			schedule_work(&w->work);
+			break;
+		case BPF_ASYNC_CANCEL:
+			cancel_work(&w->work);
+			break;
+		}
+		break;
+	}
+	}
+	bpf_async_refcount_put(cb);
+}
+
+static void bpf_async_irq_worker(struct irq_work *work)
+{
+	struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
+	struct llist_node *pos, *n, *list;
+
+	list = llist_del_all(&cb->async_cmds);
+	if (!list)
+		return;
+
+	list = llist_reverse_order(list);
+	llist_for_each_safe(pos, n, list) {
+		struct bpf_async_cmd *cmd;
+
+		cmd = container_of(pos, struct bpf_async_cmd, node);
+		bpf_async_process_op(cb, cmd->op, cmd->nsec, cmd->mode);
+		kfree_nolock(cmd);
+	}
+}
+
+static void bpf_async_cancel_and_free(struct bpf_async_kern *async)
 {
 	struct bpf_async_cb *cb;

-	/* Performance optimization: read async->cb without lock first. */
 	if (!READ_ONCE(async->cb))
-		return NULL;
-
-	__bpf_spin_lock_irqsave(&async->lock);
-	/* re-read it under lock */
-	cb = async->cb;
-	if (!cb)
-		goto out;
-	bpf_async_update_prog_callback(cb, NULL, NULL);
-	/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
-	 * this timer, since it won't be initialized.
-	 */
-	WRITE_ONCE(async->cb, NULL);
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return cb;
-}
-
-static void bpf_timer_delete(struct bpf_hrtimer *t)
-{
-	/*
-	 * We check that bpf_map_delete/update_elem() was called from timer
-	 * callback_fn. In such case we don't call hrtimer_cancel() (since it
-	 * will deadlock) and don't call hrtimer_try_to_cancel() (since it will
-	 * just return -1). Though callback_fn is still running on this cpu it's
-	 * safe to do kfree(t) because bpf_timer_cb() read everything it needed
-	 * from 't'. The bpf subprog callback_fn won't be able to access 't',
-	 * since async->cb = NULL was already done. The timer will be
-	 * effectively cancelled because bpf_timer_cb() will return
-	 * HRTIMER_NORESTART.
-	 *
-	 * However, it is possible the timer callback_fn calling us armed the
-	 * timer _before_ calling us, such that failing to cancel it here will
-	 * cause it to possibly use struct hrtimer after freeing bpf_hrtimer.
-	 * Therefore, we _need_ to cancel any outstanding timers before we do
-	 * call_rcu, even though no more timers can be armed.
-	 *
-	 * Moreover, we need to schedule work even if timer does not belong to
-	 * the calling callback_fn, as on two different CPUs, we can end up in a
-	 * situation where both sides run in parallel, try to cancel one
-	 * another, and we end up waiting on both sides in hrtimer_cancel
-	 * without making forward progress, since timer1 depends on time2
-	 * callback to finish, and vice versa.
-	 *
-	 *  CPU 1 (timer1_cb)			CPU 2 (timer2_cb)
-	 *  bpf_timer_cancel_and_free(timer2)	bpf_timer_cancel_and_free(timer1)
-	 *
-	 * To avoid these issues, punt to workqueue context when we are in a
-	 * timer callback.
-	 */
-	if (this_cpu_read(hrtimer_running)) {
-		queue_work(system_dfl_wq, &t->cb.delete_work);
 		return;
-	}

-	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-		/* If the timer is running on other CPU, also use a kworker to
-		 * wait for the completion of the timer instead of trying to
-		 * acquire a sleepable lock in hrtimer_cancel() to wait for its
-		 * completion.
-		 */
-		if (hrtimer_try_to_cancel(&t->timer) >= 0)
-			call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
-		else
-			queue_work(system_dfl_wq, &t->cb.delete_work);
+	cb = xchg(&async->cb, NULL);
+	if (!cb)
+		return;
+
+	/*
+	 * No refcount_inc_not_zero(&cb->refcnt) here. Dropping the last
+	 * refcnt. Either synchronously or asynchronously in irq_work.
+	 */
+
+	if (!in_hardirq()) {
+		bpf_async_process_op(cb, BPF_ASYNC_CANCEL, 0, 0);
 	} else {
-		bpf_timer_delete_work(&t->cb.delete_work);
+		(void)bpf_async_schedule_op(cb, BPF_ASYNC_CANCEL, 0, 0);
+		/*
+		 * bpf_async_schedule_op() either enqueues allocated cmd into llist
+		 * or fails with ENOMEM and drop the last refcnt.
+		 * This is unlikely, but safe, since bpf_async_cb_rcu_tasks_trace_free()
+		 * callback will do additional timer/wq_cancel due to races anyway.
+		 */
 	}
 }

···
  */
 void bpf_timer_cancel_and_free(void *val)
 {
-	struct bpf_hrtimer *t;
-
-	t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
-	if (!t)
-		return;
-
-	bpf_timer_delete(t);
+	bpf_async_cancel_and_free(val);
 }

-/* This function is called by map_delete/update_elem for individual element and
+/*
+ * This function is called by map_delete/update_elem for individual element and
  * by ops->map_release_uref when the user space reference to a map reaches zero.
  */
 void bpf_wq_cancel_and_free(void *val)
 {
-	struct bpf_work *work;
-
-	BTF_TYPE_EMIT(struct bpf_wq);
-
-	work = (struct bpf_work *)__bpf_async_cancel_and_free(val);
-	if (!work)
-		return;
-	/* Trigger cancel of the sleepable work, but *do not* wait for
-	 * it to finish if it was running as we might not be in a
-	 * sleepable context.
-	 * kfree will be called once the work has finished.
-	 */
-	schedule_work(&work->delete_work);
+	bpf_async_cancel_and_free(val);
 }

 BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
···
 	struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
 	struct bpf_work *w;

-	if (in_nmi())
-		return -EOPNOTSUPP;
 	if (flags)
 		return -EINVAL;
+
 	w = READ_ONCE(async->work);
 	if (!w || !READ_ONCE(w->cb.prog))
 		return -EINVAL;

-	schedule_work(&w->work);
-	return 0;
+	if (!refcount_inc_not_zero(&w->cb.refcnt))
+		return -ENOENT;
+
+	if (!in_hardirq()) {
+		schedule_work(&w->work);
+		bpf_async_refcount_put(&w->cb);
+		return 0;
+	} else {
+		return bpf_async_schedule_op(&w->cb, BPF_ASYNC_START, 0, 0);
+	}
 }

 __bpf_kfunc int bpf_wq_set_callback(struct bpf_wq *wq,
···
 	return 0;
 }

+/**
+ * bpf_timer_cancel_async - try to deactivate a timer
+ * @timer: bpf_timer to stop
+ *
+ * Returns:
+ *
+ * * 0 when the timer was not active
+ * * 1 when the timer was active
+ * * -1 when the timer is currently executing the callback function and
+ *   cannot be stopped
+ * * -ECANCELED when the timer will be cancelled asynchronously
+ * * -ENOMEM when out of memory
+ * * -EINVAL when the timer was not initialized
+ * * -ENOENT when this kfunc is racing with timer deletion
+ */
+__bpf_kfunc int bpf_timer_cancel_async(struct bpf_timer *timer)
+{
+	struct bpf_async_kern *async = (void *)timer;
+	struct bpf_async_cb *cb;
+	int ret;
+
+	cb = READ_ONCE(async->cb);
+	if (!cb)
+		return -EINVAL;
+
+	/*
+	 * Unlike hrtimer_start() it's ok to synchronously call
+	 * hrtimer_try_to_cancel() when refcnt reached zero, but deferring to
+	 * irq_work is not, since irq callback may execute after RCU GP and
+	 * cb could be freed at that time. Check for refcnt zero for
+	 * consistency.
+	 */
+	if (!refcount_inc_not_zero(&cb->refcnt))
+		return -ENOENT;
+
+	if (!in_hardirq()) {
+		struct bpf_hrtimer *t = container_of(cb, struct bpf_hrtimer, cb);
+
+		ret = hrtimer_try_to_cancel(&t->timer);
+		bpf_async_refcount_put(cb);
+		return ret;
+	} else {
+		ret = bpf_async_schedule_op(cb, BPF_ASYNC_CANCEL, 0, 0);
+		return ret ? ret : -ECANCELED;
+	}
+}
+
 __bpf_kfunc_end_defs();

 static void bpf_task_work_cancel_scheduled(struct irq_work *irq_work)
···
 BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, bpf_dynptr_from_file)
 BTF_ID_FLAGS(func, bpf_dynptr_file_discard)
+BTF_ID_FLAGS(func, bpf_timer_cancel_async)
 BTF_KFUNCS_END(common_btf_ids)

 static const struct btf_kfunc_id_set common_kfunc_set = {
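
The deferral path in the helpers.c diff above (bpf_async_schedule_op() pushing a command onto a lock-free list and kicking the worker only when the list was empty, bpf_async_irq_worker() detaching the whole list and reversing it so commands run in submission order) can be modelled in userspace. The following is a rough sketch under stated assumptions, not the kernel implementation: C11 atomics replace the kernel's llist primitives, and struct cmd, struct cmd_list and the cmd_*() functions are hypothetical names used only here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command node, loosely modelled on struct bpf_async_cmd. */
struct cmd {
	struct cmd *next;
	int op;
};

/* Hypothetical lock-free LIFO list head, standing in for struct llist_head. */
struct cmd_list {
	_Atomic(struct cmd *) first;
};

static void cmd_list_init(struct cmd_list *l)
{
	atomic_init(&l->first, (struct cmd *)NULL);
}

/*
 * Push a command. Returns true if the list was empty before the push,
 * which is the moment the producer would kick the worker.
 */
static bool cmd_push(struct cmd_list *l, struct cmd *c)
{
	struct cmd *old = atomic_load(&l->first);

	assert(c != NULL);
	do {
		c->next = old;	/* 'old' is refreshed on a failed exchange */
	} while (!atomic_compare_exchange_weak(&l->first, &old, c));
	return old == NULL;
}

/* Detach every queued command in a single atomic step. */
static struct cmd *cmd_take_all(struct cmd_list *l)
{
	return atomic_exchange(&l->first, (struct cmd *)NULL);
}

/* The detached list is newest-first; reverse it to get submission order. */
static struct cmd *cmd_reverse(struct cmd *head)
{
	struct cmd *prev = NULL;

	while (head) {
		struct cmd *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}

/* Worker side: drain up to 'max' ops into out[], oldest first. */
static size_t cmd_drain(struct cmd_list *l, int *out, size_t max)
{
	struct cmd *c = cmd_reverse(cmd_take_all(l));
	size_t n = 0;

	for (; c && n < max; c = c->next)
		out[n++] = c->op;
	return n;
}
```

Kicking the worker only on the empty-to-non-empty transition keeps the number of wakeups low: later producers can rely on the already-pending worker run, since it detaches everything that is queued at that point.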
+37 -18
kernel/bpf/verifier.c
···
 }

 static int process_timer_func(struct bpf_verifier_env *env, int regno,
-			      struct bpf_call_arg_meta *meta)
+			      struct bpf_map_desc *map)
 {
 	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
 		verbose(env, "bpf_timer cannot be used for PREEMPT_RT.\n");
 		return -EOPNOTSUPP;
 	}
-	return check_map_field_pointer(env, regno, BPF_TIMER, &meta->map);
+	return check_map_field_pointer(env, regno, BPF_TIMER, map);
+}
+
+static int process_timer_helper(struct bpf_verifier_env *env, int regno,
+				struct bpf_call_arg_meta *meta)
+{
+	return process_timer_func(env, regno, &meta->map);
+}
+
+static int process_timer_kfunc(struct bpf_verifier_env *env, int regno,
+			       struct bpf_kfunc_call_arg_meta *meta)
+{
+	return process_timer_func(env, regno, &meta->map);
 }

 static int process_kptr_func(struct bpf_verifier_env *env, int regno,
···
 		}
 		break;
 	case ARG_PTR_TO_TIMER:
-		err = process_timer_func(env, regno, meta);
+		err = process_timer_helper(env, regno, meta);
 		if (err)
 			return err;
 		break;
···
 	KF_ARG_WORKQUEUE_ID,
 	KF_ARG_RES_SPIN_LOCK_ID,
 	KF_ARG_TASK_WORK_ID,
-	KF_ARG_PROG_AUX_ID
+	KF_ARG_PROG_AUX_ID,
+	KF_ARG_TIMER_ID
 };

 BTF_ID_LIST(kf_arg_btf_ids)
···
 BTF_ID(struct, bpf_res_spin_lock)
 BTF_ID(struct, bpf_task_work)
 BTF_ID(struct, bpf_prog_aux)
+BTF_ID(struct, bpf_timer)

 static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
 				    const struct btf_param *arg, int type)
···
 static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_param *arg)
 {
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
+}
+
+static bool is_kfunc_arg_timer(const struct btf *btf, const struct btf_param *arg)
+{
+	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TIMER_ID);
 }

 static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg)
···
 	KF_ARG_PTR_TO_NULL,
 	KF_ARG_PTR_TO_CONST_STR,
 	KF_ARG_PTR_TO_MAP,
+	KF_ARG_PTR_TO_TIMER,
 	KF_ARG_PTR_TO_WORKQUEUE,
 	KF_ARG_PTR_TO_IRQ_FLAG,
 	KF_ARG_PTR_TO_RES_SPIN_LOCK,
···

 	if (is_kfunc_arg_wq(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_WORKQUEUE;
+
+	if (is_kfunc_arg_timer(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_TIMER;

 	if (is_kfunc_arg_task_work(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_TASK_WORK;
···
 		case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
 		case KF_ARG_PTR_TO_CONST_STR:
 		case KF_ARG_PTR_TO_WORKQUEUE:
+		case KF_ARG_PTR_TO_TIMER:
 		case KF_ARG_PTR_TO_TASK_WORK:
 		case KF_ARG_PTR_TO_IRQ_FLAG:
 		case KF_ARG_PTR_TO_RES_SPIN_LOCK:
···
 				return -EINVAL;
 			}
 			ret = check_map_field_pointer(env, regno, BPF_WORKQUEUE, &meta->map);
+			if (ret < 0)
+				return ret;
+			break;
+		case KF_ARG_PTR_TO_TIMER:
+			if (reg->type != PTR_TO_MAP_VALUE) {
+				verbose(env, "arg#%d doesn't point to a map value\n", i);
+				return -EINVAL;
+			}
+			ret = process_timer_kfunc(env, regno, meta);
 			if (ret < 0)
 				return ret;
 			break;
···

 		if (is_tracing_prog_type(prog_type)) {
 			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
-			return -EINVAL;
-		}
-	}
-
-	if (btf_record_has_field(map->record, BPF_TIMER)) {
-		if (is_tracing_prog_type(prog_type)) {
-			verbose(env, "tracing progs cannot use bpf_timer yet\n");
-			return -EINVAL;
-		}
-	}
-
-	if (btf_record_has_field(map->record, BPF_WORKQUEUE)) {
-		if (is_tracing_prog_type(prog_type)) {
-			verbose(env, "tracing progs cannot use bpf_wq yet\n");
 			return -EINVAL;
 		}
 	}
+233 -17
tools/testing/selftests/bpf/prog_tests/timer.c
···
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2021 Facebook */
+#include <sched.h>
 #include <test_progs.h>
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
 #include "timer.skel.h"
 #include "timer_failure.skel.h"
 #include "timer_interrupt.skel.h"

 #define NUM_THR 8
+
+static int perf_event_open(__u32 type, __u64 config, int pid, int cpu)
+{
+	struct perf_event_attr attr = {
+		.type = type,
+		.config = config,
+		.size = sizeof(struct perf_event_attr),
+		.sample_period = 10000,
+	};
+
+	return syscall(__NR_perf_event_open, &attr, pid, cpu, -1, 0);
+}

 static void *spin_lock_thread(void *arg)
 {
···
 	pthread_exit(arg);
 }

-static int timer(struct timer *timer_skel)
+
+static int timer_stress_runner(struct timer *timer_skel, bool async_cancel)
 {
-	int i, err, prog_fd;
+	int i, err = 1, prog_fd;
 	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	pthread_t thread_id[NUM_THR];
 	void *ret;
+
+	timer_skel->bss->async_cancel = async_cancel;
+	prog_fd = bpf_program__fd(timer_skel->progs.race);
+	for (i = 0; i < NUM_THR; i++) {
+		err = pthread_create(&thread_id[i], NULL,
+				     &spin_lock_thread, &prog_fd);
+		if (!ASSERT_OK(err, "pthread_create"))
+			break;
+	}
+
+	while (i) {
+		err = pthread_join(thread_id[--i], &ret);
+		if (ASSERT_OK(err, "pthread_join"))
+			ASSERT_EQ(ret, (void *)&prog_fd, "pthread_join");
+	}
+	return err;
+}
+
+static int timer_stress(struct timer *timer_skel)
+{
+	return timer_stress_runner(timer_skel, false);
+}
+
+static int timer_stress_async_cancel(struct timer *timer_skel)
+{
+	return timer_stress_runner(timer_skel, true);
+}
+
+static void *nmi_cpu_worker(void *arg)
+{
+	volatile __u64 num = 1;
+	int i;
+
+	for (i = 0; i < 500000000; ++i)
+		num *= (i % 7) + 1;
+	(void)num;
+
+	return NULL;
+}
+
+static int run_nmi_test(struct timer *timer_skel, struct bpf_program *prog)
+{
+	struct bpf_link *link = NULL;
+	int pe_fd = -1, pipefd[2] = {-1, -1}, pid = 0, status;
+	char buf = 0;
+	int ret = -1;
+
+	if (!ASSERT_OK(pipe(pipefd), "pipe"))
+		goto cleanup;
+
+	pid = fork();
+	if (pid == 0) {
+		/* Child: spawn multiple threads to consume multiple CPUs */
+		pthread_t threads[NUM_THR];
+		int i;
+
+		close(pipefd[1]);
+		read(pipefd[0], &buf, 1);
+		close(pipefd[0]);
+
+		for (i = 0; i < NUM_THR; i++)
+			pthread_create(&threads[i], NULL, nmi_cpu_worker, NULL);
+		for (i = 0; i < NUM_THR; i++)
+			pthread_join(threads[i], NULL);
+		exit(0);
+	}
+
+	if (!ASSERT_GE(pid, 0, "fork"))
+		goto cleanup;
+
+	/* Open perf event for child process across all CPUs */
+	pe_fd = perf_event_open(PERF_TYPE_HARDWARE,
+				PERF_COUNT_HW_CPU_CYCLES,
+				pid, /* measure child process */
+				-1); /* on any CPU */
+	if (pe_fd < 0) {
+		if (errno == ENOENT || errno == EOPNOTSUPP) {
+			printf("SKIP:no PERF_COUNT_HW_CPU_CYCLES\n");
+			test__skip();
+			ret = EOPNOTSUPP;
+			goto cleanup;
+		}
+		ASSERT_GE(pe_fd, 0, "perf_event_open");
+		goto cleanup;
+	}
+
+	link = bpf_program__attach_perf_event(prog, pe_fd);
+	if (!ASSERT_OK_PTR(link, "attach_perf_event"))
+		goto cleanup;
+	pe_fd = -1; /* Ownership transferred to link */
+
+	/* Signal child to start CPU work */
+	close(pipefd[0]);
+	pipefd[0] = -1;
+	write(pipefd[1], &buf, 1);
+	close(pipefd[1]);
+	pipefd[1] = -1;
+
+	waitpid(pid, &status, 0);
+	pid = 0;
+
+	/* Verify NMI context was hit */
+	ASSERT_GT(timer_skel->bss->test_hits, 0, "test_hits");
+	ret = 0;
+
+cleanup:
+	bpf_link__destroy(link);
+	if (pe_fd >= 0)
+		close(pe_fd);
+	if (pid > 0) {
+		write(pipefd[1], &buf, 1);
+		waitpid(pid, &status, 0);
+	}
+	if (pipefd[0] >= 0)
+		close(pipefd[0]);
+	if (pipefd[1] >= 0)
+		close(pipefd[1]);
+	return ret;
+}
+
+static int timer_stress_nmi_race(struct timer *timer_skel)
+{
+	int err;
+
+	err = run_nmi_test(timer_skel, timer_skel->progs.nmi_race);
+	if (err == EOPNOTSUPP)
+		return 0;
+	return err;
+}
+
+static int timer_stress_nmi_update(struct timer *timer_skel)
+{
+	int err;
+
+	err = run_nmi_test(timer_skel, timer_skel->progs.nmi_update);
+	if (err == EOPNOTSUPP)
+		return 0;
+	if (err)
+		return err;
+	ASSERT_GT(timer_skel->bss->update_hits, 0, "update_hits");
+	return 0;
+}
+
+static int timer_stress_nmi_cancel(struct timer *timer_skel)
+{
+	int err;
+
+	err = run_nmi_test(timer_skel, timer_skel->progs.nmi_cancel);
+	if (err == EOPNOTSUPP)
+		return 0;
+	if (err)
+		return err;
+	ASSERT_GT(timer_skel->bss->cancel_hits, 0, "cancel_hits");
+	return 0;
+}
+
+static int timer(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	LIBBPF_OPTS(bpf_test_run_opts, topts);

 	err = timer__attach(timer_skel);
 	if (!ASSERT_OK(err, "timer_attach"))
···
 	/* check that code paths completed */
 	ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok");

-	prog_fd = bpf_program__fd(timer_skel->progs.race);
-	for (i = 0; i < NUM_THR; i++) {
-		err = pthread_create(&thread_id[i], NULL,
-				     &spin_lock_thread, &prog_fd);
-		if (!ASSERT_OK(err, "pthread_create"))
-			break;
-	}
+	return 0;
+}

-	while (i) {
-		err = pthread_join(thread_id[--i], &ret);
-		if (ASSERT_OK(err, "pthread_join"))
-			ASSERT_EQ(ret, (void *)&prog_fd, "pthread_join");
-	}
+static int timer_cancel_async(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+
prog_fd = bpf_program__fd(timer_skel->progs.test_async_cancel_succeed); 75 + err = bpf_prog_test_run_opts(prog_fd, &topts); 76 + ASSERT_OK(err, "test_run"); 77 + ASSERT_EQ(topts.retval, 0, "test_run"); 78 + 79 + usleep(500); 80 + /* check that there were no errors in timer execution */ 81 + ASSERT_EQ(timer_skel->bss->err, 0, "err"); 82 + 83 + /* check that code paths completed */ 84 + ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok"); 255 85 256 86 return 0; 257 87 } 258 88 259 - /* TODO: use pid filtering */ 260 - void serial_test_timer(void) 89 + static void test_timer(int (*timer_test_fn)(struct timer *timer_skel)) 261 90 { 262 91 struct timer *timer_skel = NULL; 263 92 int err; ··· 275 94 if (!ASSERT_OK_PTR(timer_skel, "timer_skel_load")) 276 95 return; 277 96 278 - err = timer(timer_skel); 97 + err = timer_test_fn(timer_skel); 279 98 ASSERT_OK(err, "timer"); 280 99 timer__destroy(timer_skel); 100 + } 101 + 102 + void serial_test_timer(void) 103 + { 104 + test_timer(timer); 281 105 282 106 RUN_TESTS(timer_failure); 107 + } 108 + 109 + void serial_test_timer_stress(void) 110 + { 111 + test_timer(timer_stress); 112 + } 113 + 114 + void serial_test_timer_stress_async_cancel(void) 115 + { 116 + test_timer(timer_stress_async_cancel); 117 + } 118 + 119 + void serial_test_timer_async_cancel(void) 120 + { 121 + test_timer(timer_cancel_async); 122 + } 123 + 124 + void serial_test_timer_stress_nmi_race(void) 125 + { 126 + test_timer(timer_stress_nmi_race); 127 + } 128 + 129 + void serial_test_timer_stress_nmi_update(void) 130 + { 131 + test_timer(timer_stress_nmi_update); 132 + } 133 + 134 + void serial_test_timer_stress_nmi_cancel(void) 135 + { 136 + test_timer(timer_stress_nmi_cancel); 283 137 } 284 138 285 139 void test_timer_interrupt(void)
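The NMI tests above park the forked child on a pipe read so the parent can attach the perf event to the child's pid before any CPU work starts. A minimal userspace sketch of that handshake follows; `run_gated_child()`, `sample_work()` and the `setup` hook are hypothetical names for illustration and are not part of the selftest.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical workload; the selftest spins CPU-bound threads instead. */
static int sample_work(void)
{
	return 7;
}

/*
 * Fork a child that blocks on a pipe until the parent sends one byte.
 * setup() (may be NULL) runs in the parent while the child is parked,
 * which is where the selftest opens and attaches the perf event.
 * Returns the child's exit status, or -1 on failure.
 */
static int run_gated_child(int (*work)(void), void (*setup)(pid_t child))
{
	int pipefd[2], status, ok;
	char buf = 0;
	pid_t pid;

	if (pipe(pipefd))
		return -1;

	pid = fork();
	if (pid < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: wait for the "go" byte; EOF means the parent gave up. */
		close(pipefd[1]);
		if (read(pipefd[0], &buf, 1) != 1)
			_exit(127);
		close(pipefd[0]);
		_exit(work());
	}

	/* Parent: the child exists but has not started its work yet. */
	close(pipefd[0]);
	if (setup)
		setup(pid);

	ok = write(pipefd[1], &buf, 1) == 1;
	close(pipefd[1]);	/* after a failed write, EOF unblocks the child */

	if (waitpid(pid, &status, 0) != pid || !ok || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

Calling `run_gated_child(sample_work, NULL)` returns the value `sample_work()` passed to `_exit()`, because the child only runs after the parent's write.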
+137
tools/testing/selftests/bpf/prog_tests/timer_start_delete_race.c
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+#define _GNU_SOURCE
+#include <sched.h>
+#include <pthread.h>
+#include <test_progs.h>
+#include "timer_start_delete_race.skel.h"
+
+/*
+ * Test for race between bpf_timer_start() and map element deletion.
+ *
+ * The race scenario:
+ * - CPU 1: bpf_timer_start() proceeds to bpf_async_process() and is about
+ *   to call hrtimer_start() but hasn't yet
+ * - CPU 2: map_delete_elem() calls __bpf_async_cancel_and_free(), since
+ *   timer is not scheduled yet hrtimer_try_to_cancel() is a nop,
+ *   then calls bpf_async_refcount_put() dropping refcnt to zero
+ *   and scheduling call_rcu_tasks_trace()
+ * - CPU 1: continues and calls hrtimer_start()
+ * - After RCU tasks trace grace period: memory is freed
+ * - Timer callback fires on freed memory: UAF!
+ *
+ * This test stresses this race by having two threads:
+ * - Thread 1: repeatedly starts timers
+ * - Thread 2: repeatedly deletes map elements
+ *
+ * KASAN should detect use-after-free.
+ */
+
+#define ITERATIONS 1000
+
+struct ctx {
+	struct timer_start_delete_race *skel;
+	volatile bool start;
+	volatile bool stop;
+	int errors;
+};
+
+static void *start_timer_thread(void *arg)
+{
+	struct ctx *ctx = arg;
+	cpu_set_t cpuset;
+	int fd, i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
+
+	while (!ctx->start && !ctx->stop)
+		usleep(1);
+	if (ctx->stop)
+		return NULL;
+
+	fd = bpf_program__fd(ctx->skel->progs.start_timer);
+
+	for (i = 0; i < ITERATIONS && !ctx->stop; i++) {
+		LIBBPF_OPTS(bpf_test_run_opts, opts);
+		int err;
+
+		err = bpf_prog_test_run_opts(fd, &opts);
+		if (err || opts.retval) {
+			ctx->errors++;
+			break;
+		}
+	}
+
+	return NULL;
+}
+
+static void *delete_elem_thread(void *arg)
+{
+	struct ctx *ctx = arg;
+	cpu_set_t cpuset;
+	int fd, i;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(1, &cpuset);
+	pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
+
+	while (!ctx->start && !ctx->stop)
+		usleep(1);
+	if (ctx->stop)
+		return NULL;
+
+	fd = bpf_program__fd(ctx->skel->progs.delete_elem);
+
+	for (i = 0; i < ITERATIONS && !ctx->stop; i++) {
+		LIBBPF_OPTS(bpf_test_run_opts, opts);
+		int err;
+
+		err = bpf_prog_test_run_opts(fd, &opts);
+		if (err || opts.retval) {
+			ctx->errors++;
+			break;
+		}
+	}
+
+	return NULL;
+}
+
+void test_timer_start_delete_race(void)
+{
+	struct timer_start_delete_race *skel;
+	pthread_t threads[2];
+	struct ctx ctx = {};
+	int err;
+
+	skel = timer_start_delete_race__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+		return;
+
+	ctx.skel = skel;
+
+	err = pthread_create(&threads[0], NULL, start_timer_thread, &ctx);
+	if (!ASSERT_OK(err, "create start_timer_thread")) {
+		ctx.stop = true;
+		goto cleanup;
+	}
+
+	err = pthread_create(&threads[1], NULL, delete_elem_thread, &ctx);
+	if (!ASSERT_OK(err, "create delete_elem_thread")) {
+		ctx.stop = true;
+		pthread_join(threads[0], NULL);
+		goto cleanup;
+	}
+
+	ctx.start = true;
+
+	pthread_join(threads[0], NULL);
+	pthread_join(threads[1], NULL);
+
+	ASSERT_EQ(ctx.errors, 0, "thread_errors");
+
+	/* Either KASAN will catch UAF or kernel will crash or nothing happens */
+cleanup:
+	timer_start_delete_race__destroy(skel);
+}
+111 -19
tools/testing/selftests/bpf/progs/timer.c
··· 1 1 // SPDX-License-Identifier: GPL-2.0 2 2 /* Copyright (c) 2021 Facebook */ 3 - #include <linux/bpf.h> 4 - #include <time.h> 3 + 4 + #include <vmlinux.h> 5 5 #include <stdbool.h> 6 6 #include <errno.h> 7 7 #include <bpf/bpf_helpers.h> 8 8 #include <bpf/bpf_tracing.h> 9 9 10 + #define CLOCK_MONOTONIC 1 11 + #define CLOCK_BOOTTIME 7 12 + 10 13 char _license[] SEC("license") = "GPL"; 14 + 11 15 struct hmap_elem { 12 16 int counter; 13 17 struct bpf_timer timer; ··· 63 59 __u64 abs_data; 64 60 __u64 err; 65 61 __u64 ok; 62 + __u64 test_hits; 63 + __u64 update_hits; 64 + __u64 cancel_hits; 66 65 __u64 callback_check = 52; 67 66 __u64 callback2_check = 52; 68 67 __u64 pinned_callback_check; 69 68 __s32 pinned_cpu; 69 + bool async_cancel = 0; 70 70 71 71 #define ARRAY 1 72 72 #define HTAB 2 ··· 169 161 if (!arr_timer) 170 162 return 0; 171 163 bpf_timer_init(arr_timer, &array, CLOCK_MONOTONIC); 164 + return 0; 165 + } 166 + 167 + static int timer_error(void *map, int *key, struct bpf_timer *timer) 168 + { 169 + err = 42; 170 + return 0; 171 + } 172 + 173 + SEC("syscall") 174 + int test_async_cancel_succeed(void *ctx) 175 + { 176 + struct bpf_timer *arr_timer; 177 + int array_key = ARRAY; 178 + 179 + arr_timer = bpf_map_lookup_elem(&array, &array_key); 180 + if (!arr_timer) 181 + return 0; 182 + bpf_timer_init(arr_timer, &array, CLOCK_MONOTONIC); 183 + bpf_timer_set_callback(arr_timer, timer_error); 184 + bpf_timer_start(arr_timer, 100000 /* 100us */, 0); 185 + bpf_timer_cancel_async(arr_timer); 186 + ok = 7; 172 187 return 0; 173 188 } 174 189 ··· 430 399 return 0; 431 400 } 432 401 402 + /* Callback that updates its own map element */ 403 + static int update_self_callback(void *map, int *key, struct bpf_timer *timer) 404 + { 405 + struct elem init = {}; 406 + 407 + bpf_map_update_elem(map, key, &init, BPF_ANY); 408 + __sync_fetch_and_add(&update_hits, 1); 409 + return 0; 410 + } 411 + 412 + /* Callback that cancels itself using async cancel */ 413 + static int 
cancel_self_callback(void *map, int *key, struct bpf_timer *timer) 414 + { 415 + bpf_timer_cancel_async(timer); 416 + __sync_fetch_and_add(&cancel_hits, 1); 417 + return 0; 418 + } 419 + 420 + enum test_mode { 421 + TEST_RACE_SYNC, 422 + TEST_RACE_ASYNC, 423 + TEST_UPDATE, 424 + TEST_CANCEL, 425 + }; 426 + 427 + static __always_inline int test_common(enum test_mode mode) 428 + { 429 + struct bpf_timer *timer; 430 + struct elem init; 431 + int ret, key = 0; 432 + 433 + __builtin_memset(&init, 0, sizeof(struct elem)); 434 + 435 + bpf_map_update_elem(&race_array, &key, &init, BPF_ANY); 436 + timer = bpf_map_lookup_elem(&race_array, &key); 437 + if (!timer) 438 + return 0; 439 + 440 + ret = bpf_timer_init(timer, &race_array, CLOCK_MONOTONIC); 441 + if (ret && ret != -EBUSY) 442 + return 0; 443 + 444 + if (mode == TEST_RACE_SYNC || mode == TEST_RACE_ASYNC) 445 + bpf_timer_set_callback(timer, race_timer_callback); 446 + else if (mode == TEST_UPDATE) 447 + bpf_timer_set_callback(timer, update_self_callback); 448 + else 449 + bpf_timer_set_callback(timer, cancel_self_callback); 450 + 451 + bpf_timer_start(timer, 0, 0); 452 + 453 + if (mode == TEST_RACE_ASYNC) 454 + bpf_timer_cancel_async(timer); 455 + else if (mode == TEST_RACE_SYNC) 456 + bpf_timer_cancel(timer); 457 + 458 + return 0; 459 + } 460 + 433 461 SEC("syscall") 434 462 int race(void *ctx) 435 463 { 436 - struct bpf_timer *timer; 437 - int err, race_key = 0; 438 - struct elem init; 464 + return test_common(async_cancel ? 
TEST_RACE_ASYNC : TEST_RACE_SYNC); 465 + } 439 466 440 - __builtin_memset(&init, 0, sizeof(struct elem)); 441 - bpf_map_update_elem(&race_array, &race_key, &init, BPF_ANY); 467 + SEC("perf_event") 468 + int nmi_race(void *ctx) 469 + { 470 + __sync_fetch_and_add(&test_hits, 1); 471 + return test_common(TEST_RACE_ASYNC); 472 + } 442 473 443 - timer = bpf_map_lookup_elem(&race_array, &race_key); 444 - if (!timer) 445 - return 1; 474 + SEC("perf_event") 475 + int nmi_update(void *ctx) 476 + { 477 + __sync_fetch_and_add(&test_hits, 1); 478 + return test_common(TEST_UPDATE); 479 + } 446 480 447 - err = bpf_timer_init(timer, &race_array, CLOCK_MONOTONIC); 448 - if (err && err != -EBUSY) 449 - return 1; 450 - 451 - bpf_timer_set_callback(timer, race_timer_callback); 452 - bpf_timer_start(timer, 0, 0); 453 - bpf_timer_cancel(timer); 454 - 455 - return 0; 481 + SEC("perf_event") 482 + int nmi_cancel(void *ctx) 483 + { 484 + __sync_fetch_and_add(&test_hits, 1); 485 + return test_common(TEST_CANCEL); 456 486 }
+66
tools/testing/selftests/bpf/progs/timer_start_delete_race.c
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+#include <linux/bpf.h>
+#include <time.h>
+#include <bpf/bpf_helpers.h>
+
+#define ITER_CNT 2000
+
+struct map_value {
+	struct bpf_timer timer;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__type(key, int);
+	__type(value, struct map_value);
+	__uint(max_entries, 1);
+} timer_map SEC(".maps");
+
+long cb_cnt;
+
+/*
+ * Timer callback that accesses the map value.
+ * If the race bug exists and this runs on freed memory,
+ * KASAN should detect it.
+ */
+static int timer_cb(void *map, int *key, struct map_value *value)
+{
+	__sync_fetch_and_add(&cb_cnt, 1);
+	return 0;
+}
+
+SEC("syscall")
+int start_timer(void *ctx)
+{
+	struct map_value *value;
+	int i;
+
+	for (i = 0; i < ITER_CNT; i++) {
+		int key = 0;
+
+		value = bpf_map_lookup_elem(&timer_map, &key);
+		if (!value)
+			return 0;
+
+		bpf_timer_init(&value->timer, &timer_map, CLOCK_MONOTONIC);
+		bpf_timer_set_callback(&value->timer, timer_cb);
+		bpf_timer_start(&value->timer, 100000000, 0);
+	}
+	return 0;
+}
+
+SEC("syscall")
+int delete_elem(void *ctx)
+{
+	int i;
+
+	for (i = 0; i < ITER_CNT; i++) {
+		int key = 0;
+
+		bpf_map_delete_elem(&timer_map, &key);
+	}
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
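The start/delete race exercised by this program is the kind of problem a "take a reference only while the count is non-zero" guard addresses; the cover letter mentions returning ENOENT when refcnt is 0. The following is a minimal userspace sketch of that idea using C11 atomics. It is an illustration under stated assumptions, not the kernel implementation: `struct async_cb`, `cb_get_not_zero()`, `cb_put()` and `cb_start()` are hypothetical names, and the `armed` flag merely stands in for arming an hrtimer.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the object shared by start and delete paths. */
struct async_cb {
	atomic_int refcnt;	/* 0 means the owner already dropped its reference */
	bool armed;		/* stands in for "timer has been started" */
};

/* Take a reference only if the count has not already reached zero. */
static bool cb_get_not_zero(struct async_cb *cb)
{
	int old = atomic_load(&cb->refcnt);

	while (old != 0) {
		/* On failure, old is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&cb->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool cb_put(struct async_cb *cb)
{
	return atomic_fetch_sub(&cb->refcnt, 1) == 1;
}

/*
 * Start path: pin the object first, and refuse with -ENOENT when the
 * delete path has already released it.
 */
static int cb_start(struct async_cb *cb)
{
	if (!cb_get_not_zero(cb))
		return -ENOENT;

	cb->armed = true;
	if (cb_put(cb))
		cb->armed = false;	/* last reference: undo before the object goes away */
	return 0;
}
```

In this sketch a start that loses the race fails cleanly instead of arming a timer on an object that is about to be freed; how the real series orders cancellation and freeing is not shown here.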
-111
tools/testing/selftests/bpf/progs/verifier_helper_restricted.c
··· 17 17 __type(value, struct val); 18 18 } map_spin_lock SEC(".maps"); 19 19 20 - struct timer { 21 - struct bpf_timer t; 22 - }; 23 - 24 - struct { 25 - __uint(type, BPF_MAP_TYPE_ARRAY); 26 - __uint(max_entries, 1); 27 - __type(key, int); 28 - __type(value, struct timer); 29 - } map_timer SEC(".maps"); 30 - 31 20 SEC("kprobe") 32 21 __description("bpf_ktime_get_coarse_ns is forbidden in BPF_PROG_TYPE_KPROBE") 33 22 __failure __msg("program of this type cannot use helper bpf_ktime_get_coarse_ns") ··· 70 81 exit; \ 71 82 " : 72 83 : __imm(bpf_ktime_get_coarse_ns) 73 - : __clobber_all); 74 - } 75 - 76 - SEC("kprobe") 77 - __description("bpf_timer_init isn restricted in BPF_PROG_TYPE_KPROBE") 78 - __failure __msg("tracing progs cannot use bpf_timer yet") 79 - __naked void in_bpf_prog_type_kprobe_2(void) 80 - { 81 - asm volatile (" \ 82 - r2 = r10; \ 83 - r2 += -8; \ 84 - r1 = 0; \ 85 - *(u64*)(r2 + 0) = r1; \ 86 - r1 = %[map_timer] ll; \ 87 - call %[bpf_map_lookup_elem]; \ 88 - if r0 == 0 goto l0_%=; \ 89 - r1 = r0; \ 90 - r2 = %[map_timer] ll; \ 91 - r3 = 1; \ 92 - l0_%=: call %[bpf_timer_init]; \ 93 - exit; \ 94 - " : 95 - : __imm(bpf_map_lookup_elem), 96 - __imm(bpf_timer_init), 97 - __imm_addr(map_timer) 98 - : __clobber_all); 99 - } 100 - 101 - SEC("perf_event") 102 - __description("bpf_timer_init is forbidden in BPF_PROG_TYPE_PERF_EVENT") 103 - __failure __msg("tracing progs cannot use bpf_timer yet") 104 - __naked void bpf_prog_type_perf_event_2(void) 105 - { 106 - asm volatile (" \ 107 - r2 = r10; \ 108 - r2 += -8; \ 109 - r1 = 0; \ 110 - *(u64*)(r2 + 0) = r1; \ 111 - r1 = %[map_timer] ll; \ 112 - call %[bpf_map_lookup_elem]; \ 113 - if r0 == 0 goto l0_%=; \ 114 - r1 = r0; \ 115 - r2 = %[map_timer] ll; \ 116 - r3 = 1; \ 117 - l0_%=: call %[bpf_timer_init]; \ 118 - exit; \ 119 - " : 120 - : __imm(bpf_map_lookup_elem), 121 - __imm(bpf_timer_init), 122 - __imm_addr(map_timer) 123 - : __clobber_all); 124 - } 125 - 126 - SEC("tracepoint") 127 - 
__description("bpf_timer_init is forbidden in BPF_PROG_TYPE_TRACEPOINT") 128 - __failure __msg("tracing progs cannot use bpf_timer yet") 129 - __naked void in_bpf_prog_type_tracepoint_2(void) 130 - { 131 - asm volatile (" \ 132 - r2 = r10; \ 133 - r2 += -8; \ 134 - r1 = 0; \ 135 - *(u64*)(r2 + 0) = r1; \ 136 - r1 = %[map_timer] ll; \ 137 - call %[bpf_map_lookup_elem]; \ 138 - if r0 == 0 goto l0_%=; \ 139 - r1 = r0; \ 140 - r2 = %[map_timer] ll; \ 141 - r3 = 1; \ 142 - l0_%=: call %[bpf_timer_init]; \ 143 - exit; \ 144 - " : 145 - : __imm(bpf_map_lookup_elem), 146 - __imm(bpf_timer_init), 147 - __imm_addr(map_timer) 148 - : __clobber_all); 149 - } 150 - 151 - SEC("raw_tracepoint") 152 - __description("bpf_timer_init is forbidden in BPF_PROG_TYPE_RAW_TRACEPOINT") 153 - __failure __msg("tracing progs cannot use bpf_timer yet") 154 - __naked void bpf_prog_type_raw_tracepoint_2(void) 155 - { 156 - asm volatile (" \ 157 - r2 = r10; \ 158 - r2 += -8; \ 159 - r1 = 0; \ 160 - *(u64*)(r2 + 0) = r1; \ 161 - r1 = %[map_timer] ll; \ 162 - call %[bpf_map_lookup_elem]; \ 163 - if r0 == 0 goto l0_%=; \ 164 - r1 = r0; \ 165 - r2 = %[map_timer] ll; \ 166 - r3 = 1; \ 167 - l0_%=: call %[bpf_timer_init]; \ 168 - exit; \ 169 - " : 170 - : __imm(bpf_map_lookup_elem), 171 - __imm(bpf_timer_init), 172 - __imm_addr(map_timer) 173 84 : __clobber_all); 174 85 } 175 86