Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

net_sched: limit try_bulk_dequeue_skb() batches

After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
I started seeing many qdisc requeues on IDPF under high TX workload.

$ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
qdisc mq 1: root
Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
backlog 1056Kb 6675p requeues 3532840114
qdisc mq 1: root
Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
backlog 781164b 4822p requeues 3537737653

This is caused by try_bulk_dequeue_skb() being only limited by BQL budget.

perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
...
netperf 75332 [146] 2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200
netperf 75332 [146] 2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
netperf 75330 [144] 2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
netperf 75333 [147] 2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
netperf 75337 [151] 2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
netperf 75337 [151] 2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
netperf 75330 [144] 2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
...

This is bad because:

1) Large batches hold one victim cpu for a very long time.

2) Drivers often hit their own TX ring limit (all slots are used).

3) We call dev_requeue_skb()

4) Requeues are using a FIFO (q->gso_skb), breaking the qdisc's ability to
implement FQ or priority scheduling.

5) dequeue_skb() gets packets from q->gso_skb one skb at a time
with no xmit_more support. This is causing many spinlock games
between the qdisc and the device driver.

Requeues were supposed to be very rare, let's keep them this way.

Limit batch sizes to /proc/sys/net/core/dev_weight (default 64) as
__qdisc_run() was designed to use.

Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://patch.msgid.link/20251109161215.2574081-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

authored by Eric Dumazet and committed by Jakub Kicinski
0345552a 7a6fa4f8

+10 -7
net/sched/sch_generic.c
···
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 struct sk_buff *skb,
 				 const struct netdev_queue *txq,
-				 int *packets)
+				 int *packets, int budget)
 {
 	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
+	int cnt = 0;

 	while (bytelimit > 0) {
 		struct sk_buff *nskb = q->dequeue(q);
···
 		bytelimit -= nskb->len; /* covers GSO len */
 		skb->next = nskb;
 		skb = nskb;
-		(*packets)++; /* GSO counts as one pkt */
+		if (++cnt >= budget)
+			break;
 	}
+	(*packets) += cnt;
 	skb_mark_not_on_list(skb);
 }

···
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
 static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
-				   int *packets)
+				   int *packets, int budget)
 {
 	const struct netdev_queue *txq = q->dev_queue;
 	struct sk_buff *skb = NULL;
···
 	if (skb) {
 bulk:
 		if (qdisc_may_bulk(q))
-			try_bulk_dequeue_skb(q, skb, txq, packets);
+			try_bulk_dequeue_skb(q, skb, txq, packets, budget);
 		else
 			try_bulk_dequeue_skb_slow(q, skb, packets);
 	}
···
  *				>0 - queue is not empty.
  *
  */
-static inline bool qdisc_restart(struct Qdisc *q, int *packets)
+static inline bool qdisc_restart(struct Qdisc *q, int *packets, int budget)
 {
 	spinlock_t *root_lock = NULL;
 	struct netdev_queue *txq;
···
 	bool validate;

 	/* Dequeue packet */
-	skb = dequeue_skb(q, &validate, packets);
+	skb = dequeue_skb(q, &validate, packets, budget);
 	if (unlikely(!skb))
 		return false;

···
 	int quota = READ_ONCE(net_hotdata.dev_tx_weight);
 	int packets;

-	while (qdisc_restart(q, &packets)) {
+	while (qdisc_restart(q, &packets, quota)) {
 		quota -= packets;
 		if (quota <= 0) {
 			if (q->flags & TCQ_F_NOLOCK)