Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

tcp: use skb->len instead of skb->truesize in tcp_can_ingest()

Some applications are stuck in the 20th century and still use
small SO_RCVBUF values.

After the blamed commit, we can drop packets, especially
when using an LRO/hw-gro enabled NIC and small MSS (1500) values.

LRO/hw-gro NICs pack multiple segments into pages, allowing
tp->scaling_ratio to be set to a high value.

Whenever the receive queue gets full, we can receive a small packet
filling RWIN, but with a high skb->truesize, because most NICs use a 4K page
plus sk_buff metadata even when receiving less than 1500 bytes of payload.

Even if we refine how tp->scaling_ratio is estimated,
we could still have an issue at the start of the flow, because
the first round of packets (IW10) will be sent based on
the initial tp->scaling_ratio (1/2).
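
To make the role of the scaling ratio concrete, here is a minimal userspace sketch of how an advertised window could be derived from buffer space. It assumes the ratio is an 8-bit fixed-point fraction (so 128 represents the initial 1/2 mentioned above); the names RMEM_TO_WIN_SCALE, DEFAULT_SCALING_RATIO and win_from_space() are hypothetical stand-ins for illustration, not the kernel's definitions.

```c
#include <stdint.h>

/* Assumption: the ratio is a fraction of 256 (8-bit fixed point). */
#define RMEM_TO_WIN_SCALE 8

/* Initial ratio of 1/2, i.e. 128/256, as stated in the commit message. */
#define DEFAULT_SCALING_RATIO (1u << (RMEM_TO_WIN_SCALE - 1))

/*
 * Hypothetical helper: number of window bytes that can be advertised
 * for `space` bytes of receive buffer memory at a given ratio.
 */
static int64_t win_from_space(uint8_t scaling_ratio, int64_t space)
{
	return (space * scaling_ratio) >> RMEM_TO_WIN_SCALE;
}
```

With an 8192-byte buffer, the initial ratio lets the peer send 4096 bytes, while a ratio of 230/256 (an illustrative value for a NIC that packs segments densely) raises that to 7360 bytes, so RWIN approaches the available space.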

Relax tcp_can_ingest() to use skb->len instead of skb->truesize,
allowing the peer to use final RWIN, assuming a 'perfect'
scaling_ratio of 1.

Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250927092827.2707901-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Authored by Eric Dumazet and committed by Jakub Kicinski
f017c1f7 d210ee58

+13 -2
net/ipv4/tcp_input.c
@@ -5086,12 +5086,23 @@
 
 /* Check if this incoming skb can be added to socket receive queues
  * while satisfying sk->sk_rcvbuf limit.
+ *
+ * In theory we should use skb->truesize, but this can cause problems
+ * when applications use too small SO_RCVBUF values.
+ * When LRO / hw gro is used, the socket might have a high tp->scaling_ratio,
+ * allowing RWIN to be close to available space.
+ * Whenever the receive queue gets full, we can receive a small packet
+ * filling RWIN, but with a high skb->truesize, because most NIC use 4K page
+ * plus sk_buff metadata even when receiving less than 1500 bytes of payload.
+ *
+ * Note that we use skb->len to decide to accept or drop this packet,
+ * but sk->sk_rmem_alloc is the sum of all skb->truesize.
  */
 static bool tcp_can_ingest(const struct sock *sk, const struct sk_buff *skb)
 {
-	unsigned int new_mem = atomic_read(&sk->sk_rmem_alloc) + skb->truesize;
+	unsigned int rmem = atomic_read(&sk->sk_rmem_alloc);
 
-	return new_mem <= sk->sk_rcvbuf;
+	return rmem + skb->len <= sk->sk_rcvbuf;
 }
 
 static int tcp_try_rmem_schedule(struct sock *sk, const struct sk_buff *skb,
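
The effect of the change can be illustrated outside the kernel with a small model of the two checks. This is only a sketch: struct sock_model, struct skb_model and both function names are hypothetical stand-ins for the few fields the real tcp_can_ingest() reads, and the byte counts in the note that follows are illustrative rather than measured.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel fields involved. */
struct sock_model {
	unsigned int rmem_alloc; /* sum of truesize of already queued skbs */
	unsigned int rcvbuf;     /* receive buffer limit (sk_rcvbuf) */
};

struct skb_model {
	unsigned int len;      /* payload bytes carried by the packet */
	unsigned int truesize; /* memory actually consumed by the packet */
};

/* Check as it was before this commit: charge the full truesize. */
static bool can_ingest_truesize(const struct sock_model *sk,
				const struct skb_model *skb)
{
	return sk->rmem_alloc + skb->truesize <= sk->rcvbuf;
}

/* Check as it is after this commit: charge only the payload length. */
static bool can_ingest_len(const struct sock_model *sk,
			   const struct skb_model *skb)
{
	return sk->rmem_alloc + skb->len <= sk->rcvbuf;
}
```

For a socket with rcvbuf = 16384 and rmem_alloc = 14000, a packet with len = 1448 and truesize = 4352 (one 4K page plus some metadata) fails the truesize check, since 14000 + 4352 exceeds 16384, but passes the len check, since 14000 + 1448 does not. Once accepted, the packet is still accounted in sk_rmem_alloc at its full truesize, as the comment in the patch notes.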