Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

vhost/net: Defer TX queue re-enable until after sendmsg

In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
batches up to 64 messages before calling sock->sendmsg.

Currently, when there are no more messages on the ring to dequeue,
handle_tx_copy re-enables kicks on the ring *before* firing off the
batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
especially if it needs to wake up a thread (e.g., another vhost worker).

If the guest submits additional messages immediately after that last
ring check, with kicks already re-enabled, it triggers an EPT_MISCONFIG
vmexit to kick the vhost worker. This may happen while the worker is
still processing the sendmsg, leading to wasteful exit(s).

This is particularly problematic for guests with a single submission
thread, as that thread must exit, wait for the exit to be processed
(potentially involving a TTWU), and then resume.

In scenarios like a constant stream of UDP messages, this results in a
sawtooth pattern where the submitter frequently vmexits, and the
vhost-net worker alternates between sleeping and waking.

A common solution is to configure vhost-net busy polling via userspace
(e.g., qemu poll-us). However, treating the sendmsg as the "busy"
period by keeping kicks disabled during the final sendmsg and
performing one additional ring check afterward provides a significant
performance improvement without any excess busy poll cycles.

If messages are found in the ring after the final sendmsg, requeue the
TX handler. This ensures fairness for the RX handler and allows
vhost_run_work_list to cond_resched() as needed.
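
The ordering change described above can be illustrated with a toy model.
This is only a sketch and not the vhost code: struct toy_ring and all
toy_* helpers are hypothetical names invented here, the guest posts at
most one message while the final sendmsg is in flight, and the race
re-check that the real enable path performs is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy ring: only the state needed to count guest kicks. */
struct toy_ring {
	int pending;         /* messages posted by the guest, not yet consumed */
	bool notify_enabled; /* true: a guest post must kick (vmexit) */
	int kicks;           /* kicks the guest had to issue */
	int requeues;        /* times the worker requeued itself instead */
};

/* Guest side: post one message, kicking only if notifications are on. */
static void toy_guest_post(struct toy_ring *r)
{
	r->pending++;
	if (r->notify_enabled)
		r->kicks++;
}

/* Stand-in for the batched sendmsg; the guest may post while it runs. */
static void toy_sendmsg(struct toy_ring *r, bool guest_posts)
{
	if (guest_posts)
		toy_guest_post(r);
}

/* Old ordering: re-enable kicks first, then send the final batch. */
static void toy_tx_old(struct toy_ring *r, bool guest_posts)
{
	assert(r->pending == 0); /* the ring was just found empty */
	r->notify_enabled = true;
	toy_sendmsg(r, guest_posts);
}

/* New ordering: send with kicks disabled, then check the ring once more. */
static void toy_tx_new(struct toy_ring *r, bool guest_posts)
{
	assert(r->pending == 0); /* the ring was just found empty */
	r->notify_enabled = false;
	toy_sendmsg(r, guest_posts);
	if (r->pending > 0)
		r->requeues++;            /* work found: requeue, kicks stay off */
	else
		r->notify_enabled = true; /* ring empty: re-enable kicks */
}
```

In this model, a post that lands during the send costs one kick under
toy_tx_old() and zero kicks plus one requeue under toy_tx_new(); with no
post, toy_tx_new() simply leaves kicks enabled.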

Test Case
TX VM: taskset -c 2 iperf3 -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
RX VM: taskset -c 2 iperf3 -s -p 5200 -D
Kernel 6.12.0; each worker is backed by a tun interface set up with
IFF_NAPI.
Note: the TCP side is largely unchanged, as it was copy-bound.

6.12.0 unpatched
EPT_MISCONFIG/second: 5411
Datagrams/second: ~382k
Interval        Transfer     Bitrate         Lost/Total Datagrams
0.00-30.00 sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender

6.12.0 patched
EPT_MISCONFIG/second: 58 (~93x reduction)
Datagrams/second: ~650k (~1.7x increase)
Interval        Transfer     Bitrate         Lost/Total Datagrams
0.00-30.00 sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender
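
The derived figures can be cross-checked from the raw numbers reported
above. The helpers below are a minimal sketch; per_second() and ratio()
are hypothetical names used only for this check.

```c
#include <assert.h>

/* Average rate over an interval, e.g. datagrams per second. */
static double per_second(double total, double seconds)
{
	assert(seconds > 0.0);
	return total / seconds;
}

/* How many times larger a is than b. */
static double ratio(double a, double b)
{
	assert(b != 0.0);
	return a / b;
}
```

per_second(11481630, 30) and per_second(19554720, 30) give roughly 382k
and 650k datagrams per second, their ratio is about 1.7, and
ratio(5411, 58) is about 93, matching the figures quoted in the results.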

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20250501020428.1889162-1-jon@nutanix.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Authored by Jon Kohler and committed by Jakub Kicinski
8c2e6b26 c2dbda07

+21 -9
drivers/vhost/net.c
@@ -755,10 +755,10 @@
 	int err;
 	int sent_pkts = 0;
 	bool sock_can_batch = (sock->sk->sk_sndbuf == INT_MAX);
+	bool busyloop_intr;
 
 	do {
-		bool busyloop_intr = false;
-
+		busyloop_intr = false;
 		if (nvq->done_idx == VHOST_NET_BATCH)
 			vhost_tx_batch(net, nvq, sock, &msg);
 
@@ -769,13 +769,10 @@
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
-			if (unlikely(busyloop_intr)) {
-				vhost_poll_queue(&vq->poll);
-			} else if (unlikely(vhost_enable_notify(&net->dev,
-								vq))) {
-				vhost_disable_notify(&net->dev, vq);
-				continue;
-			}
+			/* Kicks are disabled at this point, break loop and
+			 * process any remaining batched packets. Queue will
+			 * be re-enabled afterwards.
+			 */
 			break;
 		}
 
@@ -822,7 +825,22 @@
 		++nvq->done_idx;
 	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
 
+	/* Kicks are still disabled, dispatch any remaining batched msgs. */
 	vhost_tx_batch(net, nvq, sock, &msg);
+
+	if (unlikely(busyloop_intr))
+		/* If interrupted while doing busy polling, requeue the
+		 * handler to be fair handle_rx as well as other tasks
+		 * waiting on cpu.
+		 */
+		vhost_poll_queue(&vq->poll);
+	else
+		/* All of our work has been completed; however, before
+		 * leaving the TX handler, do one last check for work,
+		 * and requeue handler if necessary. If there is no work,
+		 * queue will be reenabled.
+		 */
+		vhost_net_busy_poll_try_queue(net, vq);
 }
 
 static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)