Merge branch 'net-neighbour-notify-changes-atomically'

Petr Machata says:

====================
net: neighbour: Notify changes atomically

Andy Roulin and Francesco Ruggeri have apparently independently both hit an
issue with the current neighbor notification scheme. Francesco reported the
issue in [1]. In a response[2] to that report, Andy said:

neigh_update sends a rtnl notification if an update, e.g.,
nud_state change, was done but there is no guarantee of
ordering of the rtnl notifications. Consider the following
scenario:

userspace thread kernel thread
================ =============
neigh_update
write_lock_bh(n->lock)
n->nud_state = STALE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = STALE
read_unlock_bh(n->lock)
-------------------------->
neigh:update
write_lock_bh(n->lock)
n->nud_state = REACHABLE
write_unlock_bh(n->lock)
neigh_notify
neigh_fill_info
read_lock_bh(n->lock)
ndm->nud_state = REACHABLE
read_unlock_bh(n->lock)
rtnl_nofify
RTNL REACHABLE sent
<--------
rtnl_notify
RTNL STALE sent

In this scenario, the kernel neigh is updated first to STALE and
then REACHABLE but the netlink notifications are sent out of order,
first REACHABLE and then STALE.

The solution presented in [2] was to extend the critical region to include
both the call to neigh_fill_info(), as well as rtnl_notify(). Then we have
a guarantee that whatever state was captured by neigh_fill_info(), will be
sent right away. The above scenario can thus not happen.

This is how this patchset begins: patches #1 and #2 add helper duals to
neigh_fill_info() and __neigh_notify() such that the __-prefixed function
assumes the neighbor lock is held, and the unprefixed one is a thin wrapper
that manages locking. This extends locking further than Andy's patch, but
makes for a clear code and supports the following part.

At that point, the original race is gone. But what can happen is the
following race, where the notification does not reflect the change that was
made:

userspace thread kernel thread
================ =============
neigh_update
write_lock_bh(n->lock)
n->nud_state = STALE
write_unlock_bh(n->lock)
-------------------------->
neigh:update
write_lock_bh(n->lock)
n->nud_state = REACHABLE
write_unlock_bh(n->lock)
neigh_notify
read_lock_bh(n->lock)
__neigh_fill_info
ndm->nud_state = REACHABLE
rtnl_notify
read_unlock_bh(n->lock)
RTNL REACHABLE sent
<--------
neigh_notify
read_lock_bh(n->lock)
__neigh_fill_info
ndm->nud_state = REACHABLE
rtnl_notify
read_unlock_bh(n->lock)
RTNL REACHABLE sent again

Here, even though neigh_update() made a change to STALE, it later sends a
notification with a NUD of REACHABLE. The obvious solution to fix this race
is to move the notifier to the same critical section that actually makes
the change.

Sending a notification in fact involves two things: invoking the internal
notifier chain, and sending the netlink notification. The overall approach
in this patchset is to move the netlink notification to the critical
section of the change, while keeping the internal notifier intact. Since
the motion is not obviously correct, the patchset presents the change in
series of incremental steps with discussion in commit messages. Please see
details in the patches themselves.

Reproducer
==========

To consistently reproduce, I injected an mdelay before the rtnl_notify()
call. Since only one thread should delay, a bit of instrumentation was
needed to see where the call originates. The mdelay was then only issued on
the call stack rooted in the RTNL request.

Then the general idea is to issue an "ip neigh replace" to mark a neighbor
entry as failed. In parallel to that, inject an ARP burst that validates
the entry. This is all observed with an "ip monitor neigh", where one can
see either a REACHABLE->FAILED transition, or FAILED->REACHABLE, while the
actual state at the end of the sequence is always REACHABLE.

With the patchset, only FAILED->REACHABLE is ever observed in the monitor.

Alternatives
============

Another approach to solving the issue would be to have a per-neighbor queue
of notification digests, each with a set of fields necessary for formatting
a notification. In pseudocode, a neighbor update would look something like
this:

neighbor_update:
- lock
- do update
- allocate notification digest, fill partially, mark not-committed
- unlock
- critical-section-breaking stuff (probes, ARP Q, etc.)
- lock
- fill in missing details to the digest (notably neigh->probes)
- mark the digest as committed
- while (front of the digest queue is committed)
- pop it, convert to notifier, send the notification
- unlock

This adds more complexity and would imply more changes to the code, which
is why I think the approach presented in this patchset is better. But it
would allow us to retain the overall structure of the code while giving us
accurate notifications.

A third approach would be to consider the second race not very serious and
be OK with seeing a notification that does not reflect the change that
prompted it. Then a two-patch prefix of this patchset would be all that is
needed.

[1]: https://lore.kernel.org/20220606230107.D70B55EC0B30@us226.sjc.aristanetworks.com
[2]: https://lore.kernel.org/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com
====================

Link: https://patch.msgid.link/cover.1769012464.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Jakub Kicinski 4 months ago 6290f7e7 e4faaf65

+93 -57

1 changed file

expand all

net

core

neighbour.c

+93 -57

net/core/neighbour.c

··· 51 51 #define PNEIGH_HASHMASK 0xF 52 52 53 53 static void neigh_timer_handler(struct timer_list *t); 54 - static void __neigh_notify(struct neighbour *n, int type, int flags, 55 - u32 pid); 56 - static void neigh_update_notify(struct neighbour *neigh, u32 nlmsg_pid); 54 + static void neigh_notify(struct neighbour *n, int type, int flags, u32 pid); 55 + static void __neigh_notify(struct neighbour *n, int type, int flags, u32 pid); 57 56 static void pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev, 58 57 bool skip_perm); 59 58 ··· 116 117 static void neigh_cleanup_and_release(struct neighbour *neigh) 117 118 { 118 119 trace_neigh_cleanup_and_release(neigh, 0); 119 - __neigh_notify(neigh, RTM_DELNEIGH, 0, 0); 120 + neigh_notify(neigh, RTM_DELNEIGH, 0, 0); 120 121 call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); 121 122 neigh_release(neigh); 122 123 } ··· 1104 1105 { 1105 1106 unsigned long now, next; 1106 1107 struct neighbour *neigh = timer_container_of(neigh, t, timer); 1108 + bool skip_probe = false; 1107 1109 unsigned int state; 1108 1110 int notify = 0; 1109 1111 ··· 1172 1172 neigh_invalidate(neigh); 1173 1173 } 1174 1174 notify = 1; 1175 - goto out; 1175 + skip_probe = true; 1176 1176 } 1177 + 1178 + if (notify) 1179 + __neigh_notify(neigh, RTM_NEWNEIGH, 0, 0); 1180 + 1181 + if (skip_probe) 1182 + goto out; 1177 1183 1178 1184 if (neigh->nud_state & NUD_IN_TIMER) { 1179 1185 if (time_before(next, jiffies + HZ/100)) ··· 1195 1189 } 1196 1190 1197 1191 if (notify) 1198 - neigh_update_notify(neigh, 0); 1192 + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); 1199 1193 1200 1194 trace_neigh_timer_handler(neigh, 0); 1201 1195 ··· 1309 1303 } 1310 1304 } 1311 1305 1306 + static void neigh_update_process_arp_queue(struct neighbour *neigh) 1307 + __releases(neigh->lock) 1308 + __acquires(neigh->lock) 1309 + { 1310 + struct sk_buff *skb; 1311 + 1312 + /* Again: avoid deadlock if something went wrong. */ 1313 + while (neigh->nud_state & NUD_VALID && 1314 + (skb = __skb_dequeue(&neigh->arp_queue)) != NULL) { 1315 + struct dst_entry *dst = skb_dst(skb); 1316 + struct neighbour *n2, *n1 = neigh; 1317 + 1318 + write_unlock_bh(&neigh->lock); 1319 + 1320 + rcu_read_lock(); 1321 + 1322 + /* Why not just use 'neigh' as-is? The problem is that 1323 + * things such as shaper, eql, and sch_teql can end up 1324 + * using alternative, different, neigh objects to output 1325 + * the packet in the output path. So what we need to do 1326 + * here is re-lookup the top-level neigh in the path so 1327 + * we can reinject the packet there. 1328 + */ 1329 + n2 = NULL; 1330 + if (dst && 1331 + READ_ONCE(dst->obsolete) != DST_OBSOLETE_DEAD) { 1332 + n2 = dst_neigh_lookup_skb(dst, skb); 1333 + if (n2) 1334 + n1 = n2; 1335 + } 1336 + READ_ONCE(n1->output)(n1, skb); 1337 + if (n2) 1338 + neigh_release(n2); 1339 + rcu_read_unlock(); 1340 + 1341 + write_lock_bh(&neigh->lock); 1342 + } 1343 + __skb_queue_purge(&neigh->arp_queue); 1344 + neigh->arp_queue_len_bytes = 0; 1345 + } 1346 + 1312 1347 /* Generic update routine. 1313 1348 -- lladdr is new lladdr or NULL, if it is not supplied. 1314 1349 -- new is new state. ··· 1376 1329 struct netlink_ext_ack *extack) 1377 1330 { 1378 1331 bool gc_update = false, managed_update = false; 1332 + bool process_arp_queue = false; 1379 1333 int update_isrouter = 0; 1380 1334 struct net_device *dev; 1381 1335 int err, notify = 0; ··· 1510 1462 neigh_connect(neigh); 1511 1463 else 1512 1464 neigh_suspect(neigh); 1513 - if (!(old & NUD_VALID)) { 1514 - struct sk_buff *skb; 1515 1465 1516 - /* Again: avoid dead loop if something went wrong */ 1466 + if (!(old & NUD_VALID)) 1467 + process_arp_queue = true; 1517 1468 1518 - while (neigh->nud_state & NUD_VALID && 1519 - (skb = __skb_dequeue(&neigh->arp_queue)) != NULL) { 1520 - struct dst_entry *dst = skb_dst(skb); 1521 - struct neighbour *n2, *n1 = neigh; 1522 - write_unlock_bh(&neigh->lock); 1523 - 1524 - rcu_read_lock(); 1525 - 1526 - /* Why not just use 'neigh' as-is? The problem is that 1527 - * things such as shaper, eql, and sch_teql can end up 1528 - * using alternative, different, neigh objects to output 1529 - * the packet in the output path. So what we need to do 1530 - * here is re-lookup the top-level neigh in the path so 1531 - * we can reinject the packet there. 1532 - */ 1533 - n2 = NULL; 1534 - if (dst && 1535 - READ_ONCE(dst->obsolete) != DST_OBSOLETE_DEAD) { 1536 - n2 = dst_neigh_lookup_skb(dst, skb); 1537 - if (n2) 1538 - n1 = n2; 1539 - } 1540 - READ_ONCE(n1->output)(n1, skb); 1541 - if (n2) 1542 - neigh_release(n2); 1543 - rcu_read_unlock(); 1544 - 1545 - write_lock_bh(&neigh->lock); 1546 - } 1547 - __skb_queue_purge(&neigh->arp_queue); 1548 - neigh->arp_queue_len_bytes = 0; 1549 - } 1550 1469 out: 1551 1470 if (update_isrouter) 1552 1471 neigh_update_is_router(neigh, flags, &notify); 1472 + 1473 + if (notify) 1474 + __neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); 1475 + 1476 + if (process_arp_queue) 1477 + neigh_update_process_arp_queue(neigh); 1478 + 1553 1479 write_unlock_bh(&neigh->lock); 1480 + 1554 1481 if (((new ^ old) & NUD_PERMANENT) || gc_update) 1555 1482 neigh_update_gc_list(neigh); 1556 1483 if (managed_update) 1557 1484 neigh_update_managed_list(neigh); 1485 + 1558 1486 if (notify) 1559 - neigh_update_notify(neigh, nlmsg_pid); 1487 + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); 1488 + 1560 1489 trace_neigh_update_done(neigh, err); 1561 1490 return err; 1562 1491 } ··· 2647 2622 return skb->len; 2648 2623 } 2649 2624 2650 - static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, 2651 - u32 pid, u32 seq, int type, unsigned int flags) 2625 + static int __neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, 2626 + u32 pid, u32 seq, int type, unsigned int flags) 2652 2627 { 2653 2628 u32 neigh_flags, neigh_flags_ext; 2654 2629 unsigned long now = jiffies; ··· 2674 2649 if (nla_put(skb, NDA_DST, neigh->tbl->key_len, neigh->primary_key)) 2675 2650 goto nla_put_failure; 2676 2651 2677 - read_lock_bh(&neigh->lock); 2678 2652 ndm->ndm_state = neigh->nud_state; 2679 2653 if (neigh->nud_state & NUD_VALID) { 2680 2654 char haddr[MAX_ADDR_LEN]; 2681 2655 2682 2656 neigh_ha_snapshot(haddr, neigh, neigh->dev); 2683 - if (nla_put(skb, NDA_LLADDR, neigh->dev->addr_len, haddr) < 0) { 2684 - read_unlock_bh(&neigh->lock); 2657 + if (nla_put(skb, NDA_LLADDR, neigh->dev->addr_len, haddr) < 0) 2685 2658 goto nla_put_failure; 2686 - } 2687 2659 } 2688 2660 2689 2661 ci.ndm_used = jiffies_to_clock_t(now - neigh->used); 2690 2662 ci.ndm_confirmed = jiffies_to_clock_t(now - neigh->confirmed); 2691 2663 ci.ndm_updated = jiffies_to_clock_t(now - neigh->updated); 2692 2664 ci.ndm_refcnt = refcount_read(&neigh->refcnt) - 1; 2693 - read_unlock_bh(&neigh->lock); 2694 2665 2695 2666 if (nla_put_u32(skb, NDA_PROBES, atomic_read(&neigh->probes)) || 2696 2667 nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci)) ··· 2703 2682 nla_put_failure: 2704 2683 nlmsg_cancel(skb, nlh); 2705 2684 return -EMSGSIZE; 2685 + } 2686 + 2687 + static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, 2688 + u32 pid, u32 seq, int type, unsigned int flags) 2689 + __releases(neigh->lock) 2690 + __acquires(neigh->lock) 2691 + { 2692 + int err; 2693 + 2694 + read_lock_bh(&neigh->lock); 2695 + err = __neigh_fill_info(skb, neigh, pid, seq, type, flags); 2696 + read_unlock_bh(&neigh->lock); 2697 + 2698 + return err; 2706 2699 } 2707 2700 2708 2701 static int pneigh_fill_info(struct sk_buff *skb, struct pneigh_entry *pn, ··· 2760 2725 nla_put_failure: 2761 2726 nlmsg_cancel(skb, nlh); 2762 2727 return -EMSGSIZE; 2763 - } 2764 - 2765 - static void neigh_update_notify(struct neighbour *neigh, u32 nlmsg_pid) 2766 - { 2767 - call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); 2768 - __neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); 2769 2728 } 2770 2729 2771 2730 static bool neigh_master_filtered(struct net_device *dev, int master_idx) ··· 3574 3545 if (skb == NULL) 3575 3546 goto errout; 3576 3547 3577 - err = neigh_fill_info(skb, n, pid, 0, type, flags); 3548 + err = __neigh_fill_info(skb, n, pid, 0, type, flags); 3578 3549 if (err < 0) { 3579 3550 /* -EMSGSIZE implies BUG in neigh_nlmsg_size() */ 3580 3551 WARN_ON(err == -EMSGSIZE); ··· 3589 3560 rcu_read_unlock(); 3590 3561 } 3591 3562 3563 + static void neigh_notify(struct neighbour *neigh, int type, int flags, u32 pid) 3564 + { 3565 + read_lock_bh(&neigh->lock); 3566 + __neigh_notify(neigh, type, flags, pid); 3567 + read_unlock_bh(&neigh->lock); 3568 + } 3569 + 3592 3570 void neigh_app_ns(struct neighbour *n) 3593 3571 { 3594 - __neigh_notify(n, RTM_GETNEIGH, NLM_F_REQUEST, 0); 3572 + neigh_notify(n, RTM_GETNEIGH, NLM_F_REQUEST, 0); 3595 3573 } 3596 3574 EXPORT_SYMBOL(neigh_app_ns); 3597 3575

Configure Feed

Configure Feed