From: Jason Xing <kerneljasonxing@gmail.com>
To: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, bjorn@kernel.org, magnus.karlsson@intel.com,
maciej.fijalkowski@intel.com, jonathan.lemon@gmail.com,
sdf@fomichev.me, ast@kernel.org, daniel@iogearbox.net,
hawk@kernel.org, john.fastabend@gmail.com
Cc: bpf@vger.kernel.org, netdev@vger.kernel.org,
Jason Xing <kernelxing@tencent.com>
Subject: [PATCH RFC net-next v4 00/14] xsk: batch xmit in copy mode
Date: Wed, 15 Apr 2026 16:26:40 +0800 [thread overview]
Message-ID: <20260415082654.21026-1-kerneljasonxing@gmail.com> (raw)
From: Jason Xing <kernelxing@tencent.com>
Greetings, everyone. This is the batch feature series. Even though
net-next is closed, I would appreciate any feedback and suggestions
on this! Many thanks!
Bottom line up front: it stably improves copy-mode transmit performance by 88.2%.
# Background
This series focuses on improving transmit performance in copy mode. As
observed on physical servers, there is still much room to ramp up
copy-mode transmission compared to zerocopy mode.
Even though zerocopy achieves much better performance, some limitations
remain, especially for the virtio and veth cases, due to the
implementation on the host side. In the real world, hundreds of
thousands of hosts, such as those at Tencent, still don't support
zerocopy mode for VMs, so copy mode is the only option we can resort
to. Generality is its strong advantage.
The zerocopy path has a helpful function, xskq_cons_read_desc_batch(),
which reads descriptors in batches and then sends them out in one go,
rather than reading and sending descriptors one by one in a loop.
Similar batching ideas can be seen in classic mechanisms like GSO/GRO,
which also try to handle as many packets as possible at a time. The
motivation and idea for this series actually originated from them.
# AF_PACKET Comparison
Looking back at the initial design and implementation of AF_XDP, it's
not hard to see that the big difference it made was speeding up
transmission when zerocopy mode is enabled. So the conclusion is that
zerocopy mode of AF_XDP outperforms AF_PACKET, which only uses copy mode.
As for the copy-mode logic of the two, they look quite similar,
especially when an application using AF_PACKET sets the
PACKET_QDISC_BYPASS option. Digging into the details of AF_PACKET, its
implementation is comparatively heavy, which is also borne out by the
real test shown below: the AF_PACKET numbers are a little bit lower.
# Batch Mode
At the current moment, I consider copy mode of AF_XDP a half-bypass
mechanism, to some extent, in comparison with well-known bypass
mechanisms like DPDK. To avoid as much in-kernel overhead as possible,
batch xmit is proposed: aggregate descriptors into small groups and
then read/allocate/build/send them in separate loops.
Applications are allowed to use setsockopt() to enlarge the default
batch size. Please note that since memory allocation can be time
consuming and heavy when memory is short and complicated memory reclaim
kicks in, it might not be good to hold one descriptor for too long,
which would add latency for that skb.
# Experiments
Tested on ixgbe at 10Gb/sec with the following settings:
1. mitigations off
2. ethtool -G enp2s0f1 tx 512
3. sysctl -w net.core.skb_defer_max=0
4. sysctl -w net.core.wmem_max=21299200 and set sndbuf to the same value
5. XDP_MAX_TX_SKB_BUDGET 512
taskset -c 1 ./xdpsock -i enp2s0f1 -t -S -s 64
copy mode(before): 1,801,007 pps (baseline)
AF_PACKET: 1,375,808 pps (-23.6%)
zc mode: 13,333,593 pps (+640.3%)
batch mode(batch 1): 1,976,821 pps (+9.8%)
batch mode(batch 64): 3,389,704 pps (+88.2%)
batch mode(batch 256): 3,387,563 pps (+88.0%)
---
RFC v4
Link: https://lore.kernel.org/all/20251021131209.41491-1-kerneljasonxing@gmail.com/
1. fix a few bugs in v3
2. add a few optimizations
The series is built on top of commit 2ce8a41113ed ("net: hsr: emit
notification for PRP slave2 changed hw addr on port deletion"). Since
there are too many changes compared to v3, please review the series from
scratch.
Thanks!
v3
Link: https://lore.kernel.org/all/20250825135342.53110-1-kerneljasonxing@gmail.com/
1. I retested and got different numbers. The previous tests were not
quite right because my env has two NUMA nodes and only the first one
runs at a faster speed.
2. To achieve stable performance results, development and evaluation
were also done on physical servers, matching the numbers I share.
3. Don't use pool->tx_descs because sockets can share the same umem
pool.
4. Use an skb list to chain the allocated and built skbs to send.
5. Add AF_PACKET test numbers.
v2
Link: https://lore.kernel.org/all/20250811131236.56206-1-kerneljasonxing@gmail.com/
1. add xmit.more sub-feature (Jesper)
2. add kmem_cache_alloc_bulk (Jesper and Maciej)
Jason Xing (14):
xsk: introduce XDP_GENERIC_XMIT_BATCH setsockopt
xsk: extend xsk_build_skb() to support passing an already allocated
skb
xsk: add xsk_alloc_batch_skb() to build skbs in batch
xsk: cache data buffers to avoid frequently calling kmalloc_reserve
xsk: add direct xmit in batch function
xsk: support dynamic xmit.more control for batch xmit
xsk: try to skip validating skb list in xmit path
xsk: rename nb_pkts to nb_descs in xsk_tx_peek_release_desc_batch
xsk: extend xskq_cons_read_desc_batch to count nb_pkts
xsk: extend xsk_cq_reserve_locked() to reserve n slots
xsk: support batch xmit main logic
xsk: separate read-mostly and write-heavy fields in xsk_buff_pool
xsk: retire old xmit path in copy mode
xsk: optimize xsk_build_skb for batch copy-mode fast path
Documentation/networking/af_xdp.rst | 17 ++
include/net/xdp_sock.h | 17 ++
include/net/xsk_buff_pool.h | 10 +-
include/uapi/linux/if_xdp.h | 1 +
net/core/dev.c | 49 +++++
net/core/skbuff.c | 152 +++++++++++++++
net/xdp/xsk.c | 279 ++++++++++++++++++++--------
net/xdp/xsk_queue.h | 40 +++-
tools/include/uapi/linux/if_xdp.h | 1 +
9 files changed, 473 insertions(+), 93 deletions(-)
--
2.41.3