Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel os linux
1
fork

Configure Feed

Select the types of activity you want to include in your feed.

af_unix: Give up GC if MSG_PEEK intervened.

Igor Ushakov reported that GC purged the receive queue of
an alive socket due to a race with MSG_PEEK with a nice repro.

This is the exact same issue previously fixed by commit
cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").

After GC was replaced with the current algorithm, the cited
commit removed the locking dance in unix_peek_fds() and
reintroduced the same issue.

The problem is that MSG_PEEK bumps a file refcount without
interacting with GC.

Consider an SCC containing sk-A and sk-B, where sk-A is
close()d but can be recv()ed via sk-B.

The bad thing happens if sk-A is recv()ed with MSG_PEEK from
sk-B and sk-B is close()d while GC is checking unix_vertex_dead()
for sk-A and sk-B.

GC thread User thread
--------- -----------
unix_vertex_dead(sk-A)
-> true <------.
\
`------ recv(sk-B, MSG_PEEK)
invalidate !! -> sk-A's file refcount : 1 -> 2

close(sk-B)
-> sk-B's file refcount : 2 -> 1
unix_vertex_dead(sk-B)
-> true

Initially, sk-A's file refcount is 1 by the inflight fd in sk-B
recvq. GC thinks sk-A is dead because the file refcount is the
same as the number of its inflight fds.

However, sk-A's file refcount is bumped silently by MSG_PEEK,
which invalidates the previous evaluation.

At this moment, sk-B's file refcount is 2; one by the open fd,
and one by the inflight fd in sk-A. The subsequent close()
releases one refcount by the former.

Finally, GC incorrectly concludes that both sk-A and sk-B are dead.

One option is to restore the locking dance in unix_peek_fds(),
but we can resolve this more elegantly thanks to the new algorithm.

The point is that the issue does not occur without the subsequent
close() and we actually do not need to synchronise MSG_PEEK with
the dead SCC detection.

When the issue occurs, close() and GC touch the same file refcount.
If GC sees the refcount being decremented by close(), it can just
give up garbage-collecting the SCC.

Therefore, we only need to signal the race during MSG_PEEK with
a proper memory barrier to make it visible to the GC.

Let's use seqcount_t to notify GC when MSG_PEEK occurs and let
it defer the SCC to the next run.

This way no locking is needed on the MSG_PEEK side, and we can
avoid imposing a penalty on every MSG_PEEK unnecessarily.

Note that we can retry within unix_scc_dead() if MSG_PEEK is
detected, but we do not do so to avoid hung task splat from
abusive MSG_PEEK calls.

Fixes: 118f457da9ed ("af_unix: Remove lock dance in unix_peek_fds().")
Reported-by: Igor Ushakov <sysroot314@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260311054043.1231316-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

authored by

Kuniyuki Iwashima and committed by
Jakub Kicinski
e5b31d98 2c7e63d7

+54 -28
+2
net/unix/af_unix.c
··· 1958 1958 static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb) 1959 1959 { 1960 1960 scm->fp = scm_fp_dup(UNIXCB(skb).fp); 1961 + 1962 + unix_peek_fpl(scm->fp); 1961 1963 } 1962 1964 1963 1965 static void unix_destruct_scm(struct sk_buff *skb)
+1
net/unix/af_unix.h
··· 29 29 void unix_update_edges(struct unix_sock *receiver); 30 30 int unix_prepare_fpl(struct scm_fp_list *fpl); 31 31 void unix_destroy_fpl(struct scm_fp_list *fpl); 32 + void unix_peek_fpl(struct scm_fp_list *fpl); 32 33 void unix_schedule_gc(struct user_struct *user); 33 34 34 35 /* SOCK_DIAG */
+51 -28
net/unix/garbage.c
··· 318 318 unix_free_vertices(fpl); 319 319 } 320 320 321 + static bool gc_in_progress; 322 + static seqcount_t unix_peek_seq = SEQCNT_ZERO(unix_peek_seq); 323 + 324 + void unix_peek_fpl(struct scm_fp_list *fpl) 325 + { 326 + static DEFINE_SPINLOCK(unix_peek_lock); 327 + 328 + if (!fpl || !fpl->count_unix) 329 + return; 330 + 331 + if (!READ_ONCE(gc_in_progress)) 332 + return; 333 + 334 + /* Invalidate the final refcnt check in unix_vertex_dead(). */ 335 + spin_lock(&unix_peek_lock); 336 + raw_write_seqcount_barrier(&unix_peek_seq); 337 + spin_unlock(&unix_peek_lock); 338 + } 339 + 321 340 static bool unix_vertex_dead(struct unix_vertex *vertex) 322 341 { 323 342 struct unix_edge *edge; ··· 368 349 return false; 369 350 370 351 return true; 352 + } 353 + 354 + static LIST_HEAD(unix_visited_vertices); 355 + static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2; 356 + 357 + static bool unix_scc_dead(struct list_head *scc, bool fast) 358 + { 359 + struct unix_vertex *vertex; 360 + bool scc_dead = true; 361 + unsigned int seq; 362 + 363 + seq = read_seqcount_begin(&unix_peek_seq); 364 + 365 + list_for_each_entry_reverse(vertex, scc, scc_entry) { 366 + /* Don't restart DFS from this vertex. */ 367 + list_move_tail(&vertex->entry, &unix_visited_vertices); 368 + 369 + /* Mark vertex as off-stack for __unix_walk_scc(). */ 370 + if (!fast) 371 + vertex->index = unix_vertex_grouped_index; 372 + 373 + if (scc_dead) 374 + scc_dead = unix_vertex_dead(vertex); 375 + } 376 + 377 + /* If MSG_PEEK intervened, defer this SCC to the next round. */ 378 + if (read_seqcount_retry(&unix_peek_seq, seq)) 379 + return false; 380 + 381 + return scc_dead; 371 382 } 372 383 373 384 static void unix_collect_skb(struct list_head *scc, struct sk_buff_head *hitlist) ··· 452 403 453 404 return false; 454 405 } 455 - 456 - static LIST_HEAD(unix_visited_vertices); 457 - static unsigned long unix_vertex_grouped_index = UNIX_VERTEX_INDEX_MARK2; 458 406 459 407 static unsigned long __unix_walk_scc(struct unix_vertex *vertex, 460 408 unsigned long *last_index, ··· 520 474 } 521 475 522 476 if (vertex->index == vertex->scc_index) { 523 - struct unix_vertex *v; 524 477 struct list_head scc; 525 - bool scc_dead = true; 526 478 527 479 /* SCC finalised. 528 480 * ··· 529 485 */ 530 486 __list_cut_position(&scc, &vertex_stack, &vertex->scc_entry); 531 487 532 - list_for_each_entry_reverse(v, &scc, scc_entry) { 533 - /* Don't restart DFS from this vertex in unix_walk_scc(). */ 534 - list_move_tail(&v->entry, &unix_visited_vertices); 535 - 536 - /* Mark vertex as off-stack. */ 537 - v->index = unix_vertex_grouped_index; 538 - 539 - if (scc_dead) 540 - scc_dead = unix_vertex_dead(v); 541 - } 542 - 543 - if (scc_dead) { 488 + if (unix_scc_dead(&scc, false)) { 544 489 unix_collect_skb(&scc, hitlist); 545 490 } else { 546 491 if (unix_vertex_max_scc_index < vertex->scc_index) ··· 583 550 while (!list_empty(&unix_unvisited_vertices)) { 584 551 struct unix_vertex *vertex; 585 552 struct list_head scc; 586 - bool scc_dead = true; 587 553 588 554 vertex = list_first_entry(&unix_unvisited_vertices, typeof(*vertex), entry); 589 555 list_add(&scc, &vertex->scc_entry); 590 556 591 - list_for_each_entry_reverse(vertex, &scc, scc_entry) { 592 - list_move_tail(&vertex->entry, &unix_visited_vertices); 593 - 594 - if (scc_dead) 595 - scc_dead = unix_vertex_dead(vertex); 596 - } 597 - 598 - if (scc_dead) { 557 + if (unix_scc_dead(&scc, true)) { 599 558 cyclic_sccs--; 600 559 unix_collect_skb(&scc, hitlist); 601 560 } ··· 601 576 WRITE_ONCE(unix_graph_state, 602 577 cyclic_sccs ? UNIX_GRAPH_CYCLIC : UNIX_GRAPH_NOT_CYCLIC); 603 578 } 604 - 605 - static bool gc_in_progress; 606 579 607 580 static void unix_gc(struct work_struct *work) 608 581 {