Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

inet: frags: flush pending skbs in fqdir_pre_exit()

We have been seeing occasional deadlocks on pernet_ops_rwsem since
September in NIPA. The stuck task was usually modprobe (often loading
a driver like ipvlan), trying to take the lock as a Writer.
lockdep does not track readers for rwsems, so the read-side holder
wasn't obvious from the reports.

On closer inspection the Reader holding the lock was conntrack looping
forever in nf_conntrack_cleanup_net_list(). Based on past experience
with occasional NIPA crashes I looked through the tests that run before
the crash and noticed that the crash follows ip_defrag.sh. An immediate
red flag. Scouring through the (de)fragmentation queues revealed skbs
sitting around, holding conntrack references.

The problem is that conntrack depends on nf_defrag_ipv6, so
nf_defrag_ipv6 loads and registers its pernet ops first. Since exit
hooks run in reverse registration order, nf_defrag_ipv6's netns exit
hooks run _after_ conntrack's netns exit hook.

Flush all fragment queue SKBs during fqdir_pre_exit() to release
conntrack references before conntrack cleanup runs. Also flush
the queues in timer expiry handlers when they discover fqdir->dead
is set, in case a packet sneaks in while we're running the pre_exit
flush.

The commit under Fixes is not exactly the culprit, but I think
previously the timer firing would eventually unblock the spinning
conntrack.

Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20251207010942.1672972-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

+50 -20
+1 -12
include/net/inet_frag.h
···
 int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net);

-static inline void fqdir_pre_exit(struct fqdir *fqdir)
-{
-	/* Prevent creation of new frags.
-	 * Pairs with READ_ONCE() in inet_frag_find().
-	 */
-	WRITE_ONCE(fqdir->high_thresh, 0);
-
-	/* Pairs with READ_ONCE() in inet_frag_kill(), ip_expire()
-	 * and ip6frag_expire_frag_queue().
-	 */
-	WRITE_ONCE(fqdir->dead, true);
-}
+void fqdir_pre_exit(struct fqdir *fqdir);
 void fqdir_exit(struct fqdir *fqdir);

 void inet_frag_kill(struct inet_frag_queue *q, int *refs);
+6 -3
include/net/ipv6_frag.h
···
 	int refs = 1;

 	rcu_read_lock();
-	/* Paired with the WRITE_ONCE() in fqdir_pre_exit(). */
-	if (READ_ONCE(fq->q.fqdir->dead))
-		goto out_rcu_unlock;
 	spin_lock(&fq->q.lock);

 	if (fq->q.flags & INET_FRAG_COMPLETE)
···
 	fq->q.flags |= INET_FRAG_DROP;
 	inet_frag_kill(&fq->q, &refs);
+
+	/* Paired with the WRITE_ONCE() in fqdir_pre_exit(). */
+	if (READ_ONCE(fq->q.fqdir->dead)) {
+		inet_frag_queue_flush(&fq->q, 0);
+		goto out;
+	}

 	dev = dev_get_by_index_rcu(net, fq->iif);
 	if (!dev)
+36
net/ipv4/inet_fragment.c
···

 pure_initcall(inet_frag_wq_init);

+void fqdir_pre_exit(struct fqdir *fqdir)
+{
+	struct inet_frag_queue *fq;
+	struct rhashtable_iter hti;
+
+	/* Prevent creation of new frags.
+	 * Pairs with READ_ONCE() in inet_frag_find().
+	 */
+	WRITE_ONCE(fqdir->high_thresh, 0);
+
+	/* Pairs with READ_ONCE() in inet_frag_kill(), ip_expire()
+	 * and ip6frag_expire_frag_queue().
+	 */
+	WRITE_ONCE(fqdir->dead, true);
+
+	rhashtable_walk_enter(&fqdir->rhashtable, &hti);
+	rhashtable_walk_start(&hti);
+
+	while ((fq = rhashtable_walk_next(&hti))) {
+		if (IS_ERR(fq)) {
+			if (PTR_ERR(fq) != -EAGAIN)
+				break;
+			continue;
+		}
+		spin_lock_bh(&fq->lock);
+		if (!(fq->flags & INET_FRAG_COMPLETE))
+			inet_frag_queue_flush(fq, 0);
+		spin_unlock_bh(&fq->lock);
+	}
+
+	rhashtable_walk_stop(&hti);
+	rhashtable_walk_exit(&hti);
+}
+EXPORT_SYMBOL(fqdir_pre_exit);
+
 void fqdir_exit(struct fqdir *fqdir)
 {
 	INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
···
 {
 	unsigned int sum;

+	reason = reason ?: SKB_DROP_REASON_FRAG_REASM_TIMEOUT;
 	sum = inet_frag_rbtree_purge(&q->rb_fragments, reason);
 	sub_frag_mem_limit(q->fqdir, sum);
 }
+7 -5
net/ipv4/ip_fragment.c
···
 	net = qp->q.fqdir->net;

 	rcu_read_lock();
-
-	/* Paired with WRITE_ONCE() in fqdir_pre_exit(). */
-	if (READ_ONCE(qp->q.fqdir->dead))
-		goto out_rcu_unlock;
-
 	spin_lock(&qp->q.lock);

 	if (qp->q.flags & INET_FRAG_COMPLETE)
···
 	qp->q.flags |= INET_FRAG_DROP;
 	inet_frag_kill(&qp->q, &refs);
+
+	/* Paired with WRITE_ONCE() in fqdir_pre_exit(). */
+	if (READ_ONCE(qp->q.fqdir->dead)) {
+		inet_frag_queue_flush(&qp->q, 0);
+		goto out;
+	}
+
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);