* [PATCH v1 0/3] net/af_packet: correctness fixes and improvements
@ 2026-01-27 18:13 scott.k.mitch1
2026-01-27 18:13 ` [PATCH v1 1/3] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
` (3 more replies)
0 siblings, 4 replies; 65+ messages in thread
From: scott.k.mitch1 @ 2026-01-27 18:13 UTC (permalink / raw)
To: dev; +Cc: Scott
From: Scott <scott.k.mitch1@gmail.com>
This series addresses correctness issues and adds performance
optimizations to the AF_PACKET driver, laying the groundwork for
upcoming advanced features.
The series is structured as:
Patch 1/3: Thread safety and frame calculation fixes
- Critical correctness fixes for multi-threaded environments
- Proper atomic operations and memory ordering for tp_status
- Fixes frame address calculation bugs
Patch 2/3: Performance optimizations
- Use rte_memcpy() for better performance
- Add prefetching for next frame/mbuf
- Use rte_pktmbuf_free_bulk() instead of individual frees
Patch 3/3: New features and device arguments
- Software checksum offload support
- TX poll behavior control (txpollnotrdy devarg)
- Improved devarg validation
These changes prepare the driver for planned follow-up patches that will
add significant new capabilities:
- io_uring SQPOLL support for TX send notify, which meaningfully improves
performance by eliminating syscall overhead and enabling kernel-side
polling
- GRO/GSO support via PACKET_VNET_HDR to aggregate packets and reduce
per-packet interface traversal overhead
- TPACKET_V3 protocol support for block-based RX/TX processing, providing
packet batching benefits and reducing cache pressure
The correctness fixes in patch 1/3 are particularly important for these
future features, as io_uring SQPOLL mode involves asynchronous kernel
updates to tp_status from independent CPU cores, requiring proper memory
ordering.
Scott Mitchell (3):
net/af_packet: fix thread safety and frame calculations
net/af_packet: RX/TX rte_memcpy, bulk free, prefetch
net/af_packet: software checksum and tx poll control
drivers/net/af_packet/rte_eth_af_packet.c | 389 +++++++++++++++-------
1 file changed, 270 insertions(+), 119 deletions(-)
--
2.39.5 (Apple Git-154)
^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH v1 1/3] net/af_packet: fix thread safety and frame calculations
  2026-01-27 18:13 [PATCH v1 0/3] net/af_packet: correctness fixes and improvements scott.k.mitch1
@ 2026-01-27 18:13 ` scott.k.mitch1
  2026-01-27 18:39   ` Stephen Hemminger
  2026-01-27 18:13 ` [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch scott.k.mitch1
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 65+ messages in thread
From: scott.k.mitch1 @ 2026-01-27 18:13 UTC (permalink / raw)
To: dev; +Cc: Scott Mitchell, linville, stable

From: Scott Mitchell <scott.k.mitch1@gmail.com>

The AF_PACKET driver had multiple correctness issues that could cause
data races and memory corruption in multi-threaded environments.

Thread Safety Issues:

1. Statistics counters (rx_pkts, tx_pkts, rx_bytes, tx_bytes, etc.)
   were declared as 'volatile unsigned long', which provides no
   atomicity guarantees and can cause torn reads/writes on 32-bit
   platforms or when the compiler uses multiple instructions.

2. The tp_status field was accessed without memory barriers, violating
   the kernel's synchronization protocol. The kernel uses READ_ONCE/
   WRITE_ONCE with smp_rmb() barriers (see __packet_get_status and
   __packet_set_status in net/packet/af_packet.c). Userspace must use
   equivalent atomic operations with acquire/release semantics.

3. Statistics are collected in datapath threads but consumed by
   management threads calling eth_stats_get(), creating unsynchronized
   cross-thread access.

These issues become more critical with upcoming features like io_uring
SQPOLL mode, where the kernel's submission queue polling thread operates
independently and asynchronously updates tp_status from a different CPU
core, making proper memory ordering essential.

Frame Calculation Issues:

4. Frame overhead was incorrectly calculated, failing to account for
   the TPACKET2_HDRLEN structure layout and sockaddr_ll offset.

5. Frame address calculation assumed a sequential frame layout
   (frame_base + i * frame_size), but the kernel's packet_lookup_frame()
   uses block-based addressing: block_start + (frame_in_block *
   frame_size). This causes memory corruption when block_size is not
   evenly divisible by frame_size.

Changes:
- Replace 'volatile unsigned long' counters with RTE_ATOMIC(uint64_t)
- Use rte_atomic_load_explicit() with rte_memory_order_acquire when
  reading tp_status (matching the kernel's smp_rmb() + READ_ONCE())
- Use rte_atomic_store_explicit() with rte_memory_order_release when
  writing tp_status (matching the kernel's WRITE_ONCE() protocol)
- Use rte_memory_order_relaxed for statistics updates (no ordering
  required between independent counters)
- Fix the ETH_AF_PACKET_FRAME_OVERHEAD calculation
- Fix the frame address calculation to match the kernel's
  packet_lookup_frame()
- Add validation warnings for kernel constraints (alignment,
  block/frame relationships)
- Merge the separate stat collection loops in eth_stats_get() for
  efficiency

Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices")
Cc: linville@tuxdriver.com
Cc: stable@dpdk.org

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 227 +++++++++++++++-------
 1 file changed, 158 insertions(+), 69 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index ef11b8fb6b..2ee52a402b 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -9,6 +9,8 @@
 #include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
+#include <rte_atomic.h>
+#include <rte_bitops.h>
 #include <ethdev_driver.h>
 #include <ethdev_vdev.h>
 #include <rte_malloc.h>
@@ -41,6 +43,10 @@
 #define DFLT_FRAME_SIZE		(1 << 11)
 #define DFLT_FRAME_COUNT	(1 << 9)
 
+static const uint16_t ETH_AF_PACKET_FRAME_SIZE_MAX = RTE_IPV4_MAX_PKT_LEN;
+#define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll))
+#define ETH_AF_PACKET_ETH_OVERHEAD (RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN)
+
 static uint64_t timestamp_dynflag;
 static int timestamp_dynfield_offset = -1;
@@ -57,10 +63,10 @@ struct __rte_cache_aligned pkt_rx_queue {
 	uint8_t vlan_strip;
 	uint8_t timestamp_offloading;
 
-	volatile unsigned long rx_pkts;
-	volatile unsigned long rx_bytes;
-	volatile unsigned long rx_nombuf;
-	volatile unsigned long rx_dropped_pkts;
+	RTE_ATOMIC(uint64_t) rx_pkts;
+	RTE_ATOMIC(uint64_t) rx_bytes;
+	RTE_ATOMIC(uint64_t) rx_nombuf;
+	RTE_ATOMIC(uint64_t) rx_dropped_pkts;
 };
 
 struct __rte_cache_aligned pkt_tx_queue {
@@ -72,9 +78,9 @@ struct __rte_cache_aligned pkt_tx_queue {
 	unsigned int framecount;
 	unsigned int framenum;
 
-	volatile unsigned long tx_pkts;
-	volatile unsigned long err_pkts;
-	volatile unsigned long tx_bytes;
+	RTE_ATOMIC(uint64_t) tx_pkts;
+	RTE_ATOMIC(uint64_t) err_pkts;
+	RTE_ATOMIC(uint64_t) tx_bytes;
 };
 
 struct pmd_internals {
@@ -129,7 +135,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint8_t *pbuf;
 	struct pkt_rx_queue *pkt_q = queue;
 	uint16_t num_rx = 0;
-	unsigned long num_rx_bytes = 0;
+	uint32_t num_rx_bytes = 0;
 	unsigned int framecount, framenum;
 
 	if (unlikely(nb_pkts == 0))
@@ -144,13 +150,16 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	for (i = 0; i < nb_pkts; i++) {
 		/* point at the next incoming frame */
 		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
-		if ((ppd->tp_status & TP_STATUS_USER) == 0)
+		uint32_t tp_status = rte_atomic_load_explicit(&ppd->tp_status,
+				rte_memory_order_acquire);
+		if ((tp_status & TP_STATUS_USER) == 0)
 			break;
 
 		/* allocate the next mbuf */
 		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
 		if (unlikely(mbuf == NULL)) {
-			pkt_q->rx_nombuf++;
+			rte_atomic_fetch_add_explicit(&pkt_q->rx_nombuf, 1,
+					rte_memory_order_relaxed);
 			break;
 		}
 
@@ -160,7 +169,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, rte_pktmbuf_data_len(mbuf));
 
 		/* check for vlan info */
-		if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
+		if (tp_status & TP_STATUS_VLAN_VALID) {
 			mbuf->vlan_tci = ppd->tp_vlan_tci;
 			mbuf->ol_flags |= (RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED);
@@ -179,7 +188,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		}
 
 		/* release incoming frame and advance ring buffer */
-		ppd->tp_status = TP_STATUS_KERNEL;
+		rte_atomic_store_explicit(&ppd->tp_status, TP_STATUS_KERNEL,
+				rte_memory_order_release);
 		if (++framenum >= framecount)
 			framenum = 0;
 		mbuf->port = pkt_q->in_port;
@@ -190,8 +200,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		num_rx_bytes += mbuf->pkt_len;
 	}
 	pkt_q->framenum = framenum;
-	pkt_q->rx_pkts += num_rx;
-	pkt_q->rx_bytes += num_rx_bytes;
+	rte_atomic_fetch_add_explicit(&pkt_q->rx_pkts, num_rx, rte_memory_order_relaxed);
+	rte_atomic_fetch_add_explicit(&pkt_q->rx_bytes, num_rx_bytes, rte_memory_order_relaxed);
 	return num_rx;
 }
@@ -228,8 +238,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	struct pollfd pfd;
 	struct pkt_tx_queue *pkt_q = queue;
 	uint16_t num_tx = 0;
-	unsigned long num_tx_bytes = 0;
-	int i;
+	uint32_t num_tx_bytes = 0;
+	uint16_t i;
 
 	if (unlikely(nb_pkts == 0))
 		return 0;
@@ -259,16 +269,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			}
 		}
 
-		/* point at the next incoming frame */
-		if (!tx_ring_status_available(ppd->tp_status)) {
-			if (poll(&pfd, 1, -1) < 0)
-				break;
-
-			/* poll() can return POLLERR if the interface is down */
-			if (pfd.revents & POLLERR)
-				break;
-		}
-
 		/*
 		 * poll() will almost always return POLLOUT, even if there
 		 * are no extra buffers available
@@ -283,26 +283,31 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		 *
 		 * This results in poll() returning POLLOUT.
 		 */
-		if (!tx_ring_status_available(ppd->tp_status))
+		if (unlikely(!tx_ring_status_available(rte_atomic_load_explicit(&ppd->tp_status,
+						rte_memory_order_acquire)) &&
+				(poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 ||
+				!tx_ring_status_available(rte_atomic_load_explicit(&ppd->tp_status,
+						rte_memory_order_acquire))))) {
+			/* Ring is full, stop here. Don't process bufs[i]. */
 			break;
+		}
 
-		/* copy the tx frame data */
-		pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN -
-			sizeof(struct sockaddr_ll);
+		pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD;
 
 		struct rte_mbuf *tmp_mbuf = mbuf;
-		while (tmp_mbuf) {
+		do {
 			uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf);
 			memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len);
 			pbuf += data_len;
 			tmp_mbuf = tmp_mbuf->next;
-		}
+		} while (tmp_mbuf);
 
 		ppd->tp_len = mbuf->pkt_len;
 		ppd->tp_snaplen = mbuf->pkt_len;
 
 		/* release incoming frame and advance ring buffer */
-		ppd->tp_status = TP_STATUS_SEND_REQUEST;
+		rte_atomic_store_explicit(&ppd->tp_status, TP_STATUS_SEND_REQUEST,
+				rte_memory_order_release);
 		if (++framenum >= framecount)
 			framenum = 0;
 		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
@@ -326,9 +331,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	}
 
 	pkt_q->framenum = framenum;
-	pkt_q->tx_pkts += num_tx;
-	pkt_q->err_pkts += i - num_tx;
-	pkt_q->tx_bytes += num_tx_bytes;
+	rte_atomic_fetch_add_explicit(&pkt_q->tx_pkts, num_tx, rte_memory_order_relaxed);
+	rte_atomic_fetch_add_explicit(&pkt_q->err_pkts, i - num_tx, rte_memory_order_relaxed);
+	rte_atomic_fetch_add_explicit(&pkt_q->tx_bytes, num_tx_bytes, rte_memory_order_relaxed);
 	return i;
 }
@@ -392,10 +397,12 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 	dev_info->if_index = internals->if_index;
 	dev_info->max_mac_addrs = 1;
-	dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN;
+	dev_info->max_rx_pktlen = (uint32_t)ETH_AF_PACKET_FRAME_SIZE_MAX +
+		ETH_AF_PACKET_ETH_OVERHEAD;
+	dev_info->max_mtu = ETH_AF_PACKET_FRAME_SIZE_MAX;
 	dev_info->max_rx_queues = (uint16_t)internals->nb_queues;
 	dev_info->max_tx_queues = (uint16_t)internals->nb_queues;
-	dev_info->min_rx_bufsize = 0;
+	dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD;
 	dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS |
 		RTE_ETH_TX_OFFLOAD_VLAN_INSERT;
 	dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP |
@@ -436,24 +443,42 @@ eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats, struct eth_q
 	for (i = 0; i < internal->nb_queues; i++) {
 		/* reading drop count clears the value, therefore keep total value */
-		internal->rx_queue[i].rx_dropped_pkts +=
-			packet_drop_count(internal->rx_queue[i].sockfd);
-
-		rx_total += internal->rx_queue[i].rx_pkts;
-		rx_bytes_total += internal->rx_queue[i].rx_bytes;
-		rx_dropped_total += internal->rx_queue[i].rx_dropped_pkts;
-		rx_nombuf_total += internal->rx_queue[i].rx_nombuf;
-
-		tx_total += internal->tx_queue[i].tx_pkts;
-		tx_err_total += internal->tx_queue[i].err_pkts;
-		tx_bytes_total += internal->tx_queue[i].tx_bytes;
+		uint64_t rx_curr_dropped_pkts = packet_drop_count(internal->rx_queue[i].sockfd);
+		uint64_t rx_prev_dropped_pkts =
+			rte_atomic_fetch_add_explicit(&internal->rx_queue[i].rx_dropped_pkts,
+					rx_curr_dropped_pkts,
+					rte_memory_order_relaxed);
+
+		uint64_t rx_pkts = rte_atomic_load_explicit(&internal->rx_queue[i].rx_pkts,
+				rte_memory_order_relaxed);
+		uint64_t rx_bytes = rte_atomic_load_explicit(&internal->rx_queue[i].rx_bytes,
+				rte_memory_order_relaxed);
+		uint64_t rx_nombuf = rte_atomic_load_explicit(&internal->rx_queue[i].rx_nombuf,
+				rte_memory_order_relaxed);
+
+		uint64_t tx_pkts = rte_atomic_load_explicit(&internal->tx_queue[i].tx_pkts,
+				rte_memory_order_relaxed);
+		uint64_t tx_bytes = rte_atomic_load_explicit(&internal->tx_queue[i].tx_bytes,
+				rte_memory_order_relaxed);
+		uint64_t err_pkts = rte_atomic_load_explicit(&internal->tx_queue[i].err_pkts,
+				rte_memory_order_relaxed);
+
+		rx_total += rx_pkts;
+		rx_bytes_total += rx_bytes;
+		rx_nombuf_total += rx_nombuf;
+		rx_dropped_total += (rx_curr_dropped_pkts + rx_prev_dropped_pkts);
+
+		tx_total += tx_pkts;
+		tx_err_total += err_pkts;
+		tx_bytes_total += tx_bytes;
 
 		if (qstats != NULL && i < RTE_ETHDEV_QUEUE_STAT_CNTRS) {
-			qstats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
-			qstats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
-			qstats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
-			qstats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
-			qstats->q_errors[i] = internal->rx_queue[i].rx_nombuf;
+			qstats->q_ipackets[i] = rx_pkts;
+			qstats->q_ibytes[i] = rx_bytes;
+			qstats->q_opackets[i] = tx_pkts;
+			qstats->q_obytes[i] = tx_bytes;
+			qstats->q_errors[i] = rx_nombuf;
 		}
 	}
@@ -477,14 +502,21 @@ eth_stats_reset(struct rte_eth_dev *dev)
 		/* clear socket counter */
 		packet_drop_count(internal->rx_queue[i].sockfd);
 
-		internal->rx_queue[i].rx_pkts = 0;
-		internal->rx_queue[i].rx_bytes = 0;
-		internal->rx_queue[i].rx_nombuf = 0;
-		internal->rx_queue[i].rx_dropped_pkts = 0;
-
-		internal->tx_queue[i].tx_pkts = 0;
-		internal->tx_queue[i].err_pkts = 0;
-		internal->tx_queue[i].tx_bytes = 0;
+		rte_atomic_store_explicit(&internal->rx_queue[i].rx_pkts, 0,
+				rte_memory_order_relaxed);
+		rte_atomic_store_explicit(&internal->rx_queue[i].rx_bytes, 0,
+				rte_memory_order_relaxed);
+		rte_atomic_store_explicit(&internal->rx_queue[i].rx_nombuf, 0,
+				rte_memory_order_relaxed);
+		rte_atomic_store_explicit(&internal->rx_queue[i].rx_dropped_pkts, 0,
+				rte_memory_order_relaxed);
+
+		rte_atomic_store_explicit(&internal->tx_queue[i].tx_pkts, 0,
+				rte_memory_order_relaxed);
+		rte_atomic_store_explicit(&internal->tx_queue[i].err_pkts, 0,
+				rte_memory_order_relaxed);
+		rte_atomic_store_explicit(&internal->tx_queue[i].tx_bytes, 0,
+				rte_memory_order_relaxed);
 	}
 
 	return 0;
@@ -572,8 +604,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 	/* Now get the space available for data in the mbuf */
 	buf_size = rte_pktmbuf_data_room_size(pkt_q->mb_pool) -
		RTE_PKTMBUF_HEADROOM;
-	data_size = internals->req.tp_frame_size;
-	data_size -= TPACKET2_HDRLEN - sizeof(struct sockaddr_ll);
+	data_size = internals->req.tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD;
 
 	if (data_size > buf_size) {
 		PMD_LOG(ERR,
@@ -612,7 +643,7 @@ eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
 	int ret;
 	int s;
 	unsigned int data_size = internals->req.tp_frame_size -
-				 TPACKET2_HDRLEN;
+				 ETH_AF_PACKET_FRAME_OVERHEAD;
 
 	if (mtu > data_size)
 		return -EINVAL;
@@ -977,8 +1008,18 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
 		if (rx_queue->rd == NULL)
 			goto error;
+		/* Frame addresses must match kernel's packet_lookup_frame():
+		 *   block_idx = position / frames_per_block
+		 *   frame_offset = position % frames_per_block
+		 *   address = block_start + (frame_offset * frame_size)
+		 */
+		const uint32_t frames_per_block = req->tp_block_size / req->tp_frame_size;
 		for (i = 0; i < req->tp_frame_nr; ++i) {
-			rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize);
+			const uint32_t block_idx = i / frames_per_block;
+			const uint32_t frame_in_block = i % frames_per_block;
+			rx_queue->rd[i].iov_base = rx_queue->map +
+				(block_idx * req->tp_block_size) +
+				(frame_in_block * req->tp_frame_size);
 			rx_queue->rd[i].iov_len = req->tp_frame_size;
 		}
 		rx_queue->sockfd = qsockfd;
@@ -994,8 +1035,13 @@ rte_pmd_init_internals(struct rte_vdev_device *dev,
 		tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node);
 		if (tx_queue->rd == NULL)
 			goto error;
+		/* See comment above rx_queue->rd initialization. */
 		for (i = 0; i < req->tp_frame_nr; ++i) {
-			tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize);
+			const uint32_t block_idx = i / frames_per_block;
+			const uint32_t frame_in_block = i % frames_per_block;
+			tx_queue->rd[i].iov_base = tx_queue->map +
+				(block_idx * req->tp_block_size) +
+				(frame_in_block * req->tp_frame_size);
 			tx_queue->rd[i].iov_len = req->tp_frame_size;
 		}
 		tx_queue->sockfd = qsockfd;
@@ -1092,7 +1138,8 @@ rte_eth_from_packet(struct rte_vdev_device *dev,
 	if (*sockfd < 0)
 		return -1;
 
-	blocksize = getpagesize();
+	const int pagesize = getpagesize();
+	blocksize = pagesize;
 
 	/*
	 * Walk arguments for configurable settings
@@ -1162,13 +1209,55 @@ rte_eth_from_packet(struct rte_vdev_device *dev,
 		return -1;
 	}
 
-	blockcount = framecount / (blocksize / framesize);
+	const unsigned int frames_per_block = blocksize / framesize;
+	blockcount = framecount / frames_per_block;
 	if (!blockcount) {
 		PMD_LOG(ERR,
			"%s: invalid AF_PACKET MMAP parameters", name);
 		return -1;
 	}
 
+	/*
+	 * https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
+	 * Check constraints that may be enforced by the kernel and cause failure
+	 * to initialize the rings but explicit error messages aren't provided.
+	 * See packet_set_ring in linux kernel for enforcement:
+	 * https://github.com/torvalds/linux/blob/master/net/packet/af_packet.c
+	 */
+	if (blocksize % pagesize != 0) {
+		/* tp_block_size must be a multiple of PAGE_SIZE */
+		PMD_LOG(WARNING, "%s: %s=%u must be a multiple of PAGE_SIZE=%d",
+			name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, pagesize);
+	}
+	if (framesize % TPACKET_ALIGNMENT != 0) {
+		/* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
+		PMD_LOG(WARNING, "%s: %s=%u must be a multiple of TPACKET_ALIGNMENT=%d",
+			name, ETH_AF_PACKET_FRAMESIZE_ARG, framesize, TPACKET_ALIGNMENT);
+	}
+	if (frames_per_block == 0 || frames_per_block > UINT_MAX / blockcount ||
+			framecount != frames_per_block * blockcount) {
+		/* tp_frame_nr must be exactly frames_per_block*tp_block_nr */
+		PMD_LOG(WARNING, "%s: %s=%u must be exactly "
+			"frames_per_block(%s/%s = %u/%u = %u) * blockcount(%u)",
+			name, ETH_AF_PACKET_FRAMECOUNT_ARG, framecount,
+			ETH_AF_PACKET_BLOCKSIZE_ARG, ETH_AF_PACKET_FRAMESIZE_ARG,
+			blocksize, framesize, frames_per_block, blockcount);
+	}
+
+	/* Below conditions may not cause errors but provide hints to improve */
+	if (blocksize % framesize != 0) {
+		PMD_LOG(WARNING, "%s: %s=%u not evenly divisible by %s=%u, "
+			"may waste memory", name,
+			ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize,
+			ETH_AF_PACKET_FRAMESIZE_ARG, framesize);
+	}
+	if (!rte_is_power_of_2(blocksize)) {
+		/* tp_block_size should be a power of two or there will be waste */
+		PMD_LOG(WARNING, "%s: %s=%u should be a power of two "
			"or there will be a waste of memory",
+			name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize);
+	}
+
 	PMD_LOG(DEBUG, "%s: AF_PACKET MMAP parameters:", name);
 	PMD_LOG(DEBUG, "%s:\tblock size %d", name, blocksize);
 	PMD_LOG(DEBUG, "%s:\tblock count %d", name, blockcount);
-- 
2.39.5 (Apple Git-154)
* Re: [PATCH v1 1/3] net/af_packet: fix thread safety and frame calculations
  2026-01-27 18:13 ` [PATCH v1 1/3] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
@ 2026-01-27 18:39   ` Stephen Hemminger
  2026-01-28  1:35     ` Scott Mitchell
  0 siblings, 1 reply; 65+ messages in thread
From: Stephen Hemminger @ 2026-01-27 18:39 UTC (permalink / raw)
To: scott.k.mitch1; +Cc: dev, linville, stable

On Tue, 27 Jan 2026 10:13:53 -0800
scott.k.mitch1@gmail.com wrote:

> From: Scott Mitchell <scott.k.mitch1@gmail.com>
>
> The AF_PACKET driver had multiple correctness issues that could
> cause data races and memory corruption in multi-threaded environments.
>
> Thread Safety Issues:
>
> 1. Statistics counters (rx_pkts, tx_pkts, rx_bytes, tx_bytes, etc.)
>    were declared as 'volatile unsigned long' which provides no
>    atomicity guarantees and can cause torn reads/writes on 32-bit
>    platforms or when the compiler uses multiple instructions.

This is a bad idea.

Atomics are even more expensive, and only one thread should be updating
at a time. If you want to handle 32-bit platforms then something like
the Linux kernel mechanism for stats is needed. It does:
  - on 32-bit platforms, a multiple-read loop (like seqlock)
  - on 64-bit platforms, just regular operations.
* Re: [PATCH v1 1/3] net/af_packet: fix thread safety and frame calculations
  2026-01-27 18:39   ` Stephen Hemminger
@ 2026-01-28  1:35     ` Scott Mitchell
  0 siblings, 0 replies; 65+ messages in thread
From: Scott Mitchell @ 2026-01-28  1:35 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev, linville, stable

> This is a bad idea.
>
> Atomics are even more expensive, and only one thread should be updating
> at a time. If you want to handle 32-bit platforms then something like
> the Linux kernel mechanism for stats is needed. It does:
>   - on 32-bit platforms, a multiple-read loop (like seqlock)
>   - on 64-bit platforms, just regular operations.

ack. I will revert the atomic stats and keep the existing approach
(consistent with the other eth drivers).
* [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch
  2026-01-27 18:13 [PATCH v1 0/3] net/af_packet: correctness fixes and improvements scott.k.mitch1
  2026-01-27 18:13 ` [PATCH v1 1/3] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
@ 2026-01-27 18:13 ` scott.k.mitch1
  2026-01-27 18:54   ` Stephen Hemminger
  2026-01-27 18:13 ` [PATCH v1 3/3] net/af_packet: software checksum and tx poll control scott.k.mitch1
  2026-01-28  9:36 ` [PATCH v2 0/4] af_packet correctness, performance, cksum scott.k.mitch1
  3 siblings, 1 reply; 65+ messages in thread
From: scott.k.mitch1 @ 2026-01-27 18:13 UTC (permalink / raw)
To: dev; +Cc: Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

- Add rte_prefetch0() to prefetch the next frame/mbuf while processing
  the current packet, reducing cache miss latency
- Replace memcpy() with rte_memcpy() for optimized copy operations
- Use rte_pktmbuf_free_bulk() in the TX path instead of individual
  rte_pktmbuf_free() calls for better batch efficiency
- Add unlikely() hints for error paths (oversized packets, VLAN
  insertion failures, sendto errors) to optimize branch prediction
- Remove the unnecessary early nb_pkts == 0 check, since the loop
  handles this case and the app may never call with 0 frames.

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 drivers/net/af_packet/rte_eth_af_packet.c | 70 ++++++++++++-----------
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 2ee52a402b..2d152a2e2f 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -9,6 +9,7 @@
 #include <rte_common.h>
 #include <rte_string_fns.h>
 #include <rte_mbuf.h>
+#include <rte_memcpy.h>
 #include <rte_atomic.h>
 #include <rte_bitops.h>
 #include <ethdev_driver.h>
@@ -138,9 +139,6 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint32_t num_rx_bytes = 0;
 	unsigned int framecount, framenum;
 
-	if (unlikely(nb_pkts == 0))
-		return 0;
-
 	/*
 	 * Reads the given number of packets from the AF_PACKET socket one by
 	 * one and copies the packet data into a newly allocated mbuf.
@@ -155,6 +153,14 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		if ((tp_status & TP_STATUS_USER) == 0)
 			break;
 
+		unsigned int next_framenum = framenum + 1;
+		if (next_framenum >= framecount)
+			next_framenum = 0;
+
+		/* prefetch the next frame for the next loop iteration */
+		if (likely(i + 1 < nb_pkts))
+			rte_prefetch0(pkt_q->rd[next_framenum].iov_base);
+
 		/* allocate the next mbuf */
 		mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool);
 		if (unlikely(mbuf == NULL)) {
@@ -166,7 +172,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		/* packet will fit in the mbuf, go ahead and receive it */
 		rte_pktmbuf_pkt_len(mbuf) = rte_pktmbuf_data_len(mbuf) = ppd->tp_snaplen;
 		pbuf = (uint8_t *) ppd + ppd->tp_mac;
-		memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, rte_pktmbuf_data_len(mbuf));
+		rte_memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, rte_pktmbuf_data_len(mbuf));
 
 		/* check for vlan info */
 		if (tp_status & TP_STATUS_VLAN_VALID) {
@@ -190,8 +196,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		/* release incoming frame and advance ring buffer */
 		rte_atomic_store_explicit(&ppd->tp_status, TP_STATUS_KERNEL,
				rte_memory_order_release);
-		if (++framenum >= framecount)
-			framenum = 0;
+		framenum = next_framenum;
 		mbuf->port = pkt_q->in_port;
 
 		/* account for the receive frame */
@@ -241,9 +246,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint32_t num_tx_bytes = 0;
 	uint16_t i;
 
-	if (unlikely(nb_pkts == 0))
-		return 0;
-
 	memset(&pfd, 0, sizeof(pfd));
 	pfd.fd = pkt_q->sockfd;
 	pfd.events = POLLOUT;
@@ -251,22 +253,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	framecount = pkt_q->framecount;
 	framenum = pkt_q->framenum;
-	ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
 
 	for (i = 0; i < nb_pkts; i++) {
-		mbuf = *bufs++;
-
-		/* drop oversized packets */
-		if (mbuf->pkt_len > pkt_q->frame_data_size) {
-			rte_pktmbuf_free(mbuf);
-			continue;
+		unsigned int next_framenum = framenum + 1;
+		if (next_framenum >= framecount)
+			next_framenum = 0;
+
+		/* prefetch the next source mbuf and destination TPACKET */
+		if (likely(i + 1 < nb_pkts)) {
+			rte_prefetch0(bufs[i + 1]);
+			rte_prefetch0(pkt_q->rd[next_framenum].iov_base);
 		}
 
-		/* insert vlan info if necessary */
-		if (mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) {
-			if (rte_vlan_insert(&mbuf)) {
-				rte_pktmbuf_free(mbuf);
-				continue;
-			}
+		mbuf = bufs[i];
+		ppd = (struct tpacket2_hdr *)pkt_q->rd[framenum].iov_base;
+
+		/* Drop oversized packets. Insert VLAN if necessary */
+		if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size ||
+				((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 &&
+				rte_vlan_insert(&mbuf) != 0))) {
+			continue;
 		}
 
 		/*
@@ -294,32 +299,31 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 
 		pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD;
 
+		ppd->tp_len = mbuf->pkt_len;
+		ppd->tp_snaplen = mbuf->pkt_len;
+
 		struct rte_mbuf *tmp_mbuf = mbuf;
 		do {
 			uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf);
-			memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len);
+			rte_memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len);
 			pbuf += data_len;
 			tmp_mbuf = tmp_mbuf->next;
 		} while (tmp_mbuf);
 
-		ppd->tp_len = mbuf->pkt_len;
-		ppd->tp_snaplen = mbuf->pkt_len;
-
 		/* release incoming frame and advance ring buffer */
 		rte_atomic_store_explicit(&ppd->tp_status, TP_STATUS_SEND_REQUEST,
				rte_memory_order_release);
-		if (++framenum >= framecount)
-			framenum = 0;
-		ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base;
-
+		framenum = next_framenum;
 		num_tx++;
 		num_tx_bytes += mbuf->pkt_len;
-		rte_pktmbuf_free(mbuf);
 	}
 
+	rte_pktmbuf_free_bulk(&bufs[0], i);
+
 	/* kick-off transmits */
-	if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
-	    errno != ENOBUFS && errno != EAGAIN) {
+	if (unlikely(num_tx > 0 &&
+			sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 &&
+			errno != ENOBUFS && errno != EAGAIN)) {
 		/*
 		 * In case of a ENOBUFS/EAGAIN error all of the enqueued
 		 * packets will be considered successful even though only some
-- 
2.39.5 (Apple Git-154)
* Re: [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch
  2026-01-27 18:13 ` [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch scott.k.mitch1
@ 2026-01-27 18:54   ` Stephen Hemminger
  2026-01-28  1:23     ` Scott Mitchell
  0 siblings, 1 reply; 65+ messages in thread
From: Stephen Hemminger @ 2026-01-27 18:54 UTC (permalink / raw)
To: scott.k.mitch1; +Cc: dev

On Tue, 27 Jan 2026 10:13:54 -0800
scott.k.mitch1@gmail.com wrote:

> From: Scott Mitchell <scott.k.mitch1@gmail.com>
>
> - Add rte_prefetch0() to prefetch next frame/mbuf while processing
>   current packet, reducing cache miss latency

Makes sense. If you really want to dive deeper, there are more unrolled
loop patterns possible; there was a multi-step unrolled loop pattern
that fd.io does. The reason is that the first prefetch is usually
useless and doesn't help, but skipping ahead farther helps.

> - Replace memcpy() with rte_memcpy() for optimized copy operations

There is no good reason that rte_memcpy() should be faster than memcpy().
There were some cases observed with virtio, but my hunch is that this is
because the two routines are making different alignment assumptions.

> - Use rte_pktmbuf_free_bulk() in TX path instead of individual
>   rte_pktmbuf_free() calls for better batch efficiency

Makes sense.

> - Add unlikely() hints for error paths (oversized packets, VLAN
>   insertion failures, sendto errors) to optimize branch prediction

Also makes sense.

> - Remove unnecessary early nb_pkts == 0 when loop handles this
>   and app may never call with 0 frames.

Yes, calling with nb_pkts == 0 on tx/rx burst only needs to work; it
does not need a short circuit.

> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
* Re: [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch
  2026-01-27 18:54   ` Stephen Hemminger
@ 2026-01-28  1:23     ` Scott Mitchell
  2026-01-28  9:49       ` Morten Brørup
  0 siblings, 1 reply; 65+ messages in thread
From: Scott Mitchell @ 2026-01-28  1:23 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev

> > - Add rte_prefetch0() to prefetch next frame/mbuf while processing
> >   current packet, reducing cache miss latency
>
> Makes sense. If you really want to dive deeper, there are more unrolled
> loop patterns possible; there was a multi-step unrolled loop pattern
> that fd.io does. The reason is that the first prefetch is usually
> useless and doesn't help, but skipping ahead farther helps.

I didn't want to go too overboard, and there are trade-offs (fetching
too much may evict entries you need). The upcoming GRO support (in a
follow-up series) enables ~64k+ payloads, which increases the memory
footprint per packet. Would you prefer I remove the prefetch+1, or is
it OK to keep?

> > - Replace memcpy() with rte_memcpy() for optimized copy operations
>
> There is no good reason that rte_memcpy() should be faster than memcpy().
> There were some cases observed with virtio, but my hunch is that this is
> because the two routines are making different alignment assumptions.

ack. I will drop rte_memcpy. Under what scenarios is rte_memcpy
preferred/beneficial?
* RE: [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch
  2026-01-28 1:23 ` Scott Mitchell
@ 2026-01-28 9:49   ` Morten Brørup
  2026-01-28 15:37     ` Scott Mitchell
  0 siblings, 1 reply; 65+ messages in thread
From: Morten Brørup @ 2026-01-28 9:49 UTC (permalink / raw)
To: Scott Mitchell, Stephen Hemminger; +Cc: dev

> > > - Replace memcpy() with rte_memcpy() for optimized copy operations
> > There is no good reason that rte_memcpy() should be faster than
> > memcpy(). There were some cases observed with virtio but my hunch
> > is that this is because the two routines are making different
> > alignment assumptions.
>
> ack. I will drop rte_memcpy.

The community is increasingly skeptical about using rte_memcpy()
instead of memcpy(). I'm not sure all DPDK documentation has been
updated to reflect this change, and some of it might still recommend
rte_memcpy(). So, simply replacing memcpy() with rte_memcpy() is no
longer acceptable. However, if you back up the replacement with
performance data, it is more likely to get accepted.

> Under what scenarios is rte_memcpy preferred/beneficial?

I wish someone had an answer to that question! The best I can come up
with is: when using an ancient compiler or C library, where memcpy()
isn't properly optimized. With modern compilers catching up,
rte_memcpy() is becoming increasingly obsolete.

Here's some background information about rte_memcpy() from 2017:
https://www.intel.com/content/www/us/en/developer/articles/technical/performance-optimization-of-memcpy-in-dpdk.html

IIRC, the concept of a specialized memcpy() originates from some video
streaming or gaming code, where huge memory areas were being copied
around.

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch 2026-01-28 9:49 ` Morten Brørup @ 2026-01-28 15:37 ` Scott Mitchell 2026-01-28 16:57 ` Stephen Hemminger 0 siblings, 1 reply; 65+ messages in thread From: Scott Mitchell @ 2026-01-28 15:37 UTC (permalink / raw) To: Morten Brørup; +Cc: Stephen Hemminger, dev Thanks for the context! That makes sense. I dropped rte_memcpy and can re-evaluate once all my upcoming changes are merged. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch
  2026-01-28 15:37 ` Scott Mitchell
@ 2026-01-28 16:57   ` Stephen Hemminger
  0 siblings, 0 replies; 65+ messages in thread
From: Stephen Hemminger @ 2026-01-28 16:57 UTC (permalink / raw)
To: Scott Mitchell; +Cc: Morten Brørup, dev

On Wed, 28 Jan 2026 07:37:13 -0800
Scott Mitchell <scott.k.mitch1@gmail.com> wrote:

> Thanks for the context! That makes sense. I dropped rte_memcpy and can
> re-evaluate once all my upcoming changes are merged.

The other thing worth noting is that compilers and tools like fortify
know what memcpy is and do bounds checking. But even with all the
annotations on the x86 rte_memcpy(), it still doesn't engage all the
checks. Plus there is the case (found in examples/fips_validation)
where rte_memcpy would read past the end of an array on the stack.

On many platforms rte_memcpy() is just an alias for memcpy(). The only
ones with special code are x86 and ARM64.

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH v1 3/3] net/af_packet: software checksum and tx poll control 2026-01-27 18:13 [PATCH v1 0/3] net/af_packet: correctness fixes and improvements scott.k.mitch1 2026-01-27 18:13 ` [PATCH v1 1/3] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 2026-01-27 18:13 ` [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch scott.k.mitch1 @ 2026-01-27 18:13 ` scott.k.mitch1 2026-01-27 18:57 ` Stephen Hemminger 2026-01-27 20:45 ` [REVIEW] " Stephen Hemminger 2026-01-28 9:36 ` [PATCH v2 0/4] af_packet correctness, performance, cksum scott.k.mitch1 3 siblings, 2 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-27 18:13 UTC (permalink / raw) To: dev; +Cc: Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add software checksum offload support and configurable TX poll behavior to improve flexibility and performance. Implement af_packet_sw_cksum() helper to compute IPv4/UDP/TCP checksums in software when hardware offload is not available. This enables checksum offload on interfaces without HW support. Add txpollnotrdy devarg (default=true) to control whether poll() is called when the TX ring is not ready. This allows users to avoid blocking behavior if application threads are in asynchronous poll mode where blocking the thread has negative side effects and backpressure is applied via different means. 
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 82 +++++++++++++++++++++-- 1 file changed, 76 insertions(+), 6 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 2d152a2e2f..2654e7feed 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -11,6 +11,7 @@ #include <rte_mbuf.h> #include <rte_memcpy.h> #include <rte_atomic.h> +#include <rte_ip.h> #include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> @@ -19,6 +20,7 @@ #include <bus_vdev_driver.h> #include <errno.h> +#include <stdbool.h> #include <linux/if_ether.h> #include <linux/if_packet.h> #include <arpa/inet.h> @@ -40,9 +42,11 @@ #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" #define ETH_AF_PACKET_FANOUT_MODE_ARG "fanout_mode" +#define ETH_AF_PACKET_TX_POLL_NOT_READY_ARG "txpollnotrdy" #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +#define DFLT_TX_POLL_NOT_RDY true static const uint16_t ETH_AF_PACKET_FRAME_SIZE_MAX = RTE_IPV4_MAX_PKT_LEN; #define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) @@ -79,6 +83,9 @@ struct __rte_cache_aligned pkt_tx_queue { unsigned int framecount; unsigned int framenum; + bool txpollnotrdy; + bool sw_cksum; + RTE_ATOMIC(uint64_t) tx_pkts; RTE_ATOMIC(uint64_t) err_pkts; RTE_ATOMIC(uint64_t) tx_bytes; @@ -97,6 +104,7 @@ struct pmd_internals { struct pkt_tx_queue *tx_queue; uint8_t vlan_strip; uint8_t timestamp_offloading; + bool tx_sw_cksum; }; static const char *valid_arguments[] = { @@ -107,6 +115,7 @@ static const char *valid_arguments[] = { ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, ETH_AF_PACKET_FANOUT_MODE_ARG, + ETH_AF_PACKET_TX_POLL_NOT_READY_ARG, NULL }; @@ -127,6 +136,45 @@ RTE_LOG_REGISTER_DEFAULT(af_packet_logtype, NOTICE); RTE_LOG_LINE(level, AFPACKET, "%s(): 
" fmt ":%s", __func__, \ ## __VA_ARGS__, strerror(errno)) +/* + * Compute and set the IPv4 or IPv6 UDP/TCP checksum on a packet. + */ +static inline void +af_packet_sw_cksum(struct rte_mbuf *mbuf) +{ + const uint64_t l4_offset = mbuf->l2_len + mbuf->l3_len; + const uint64_t mbuf_len = rte_pktmbuf_data_len(mbuf); + if (unlikely(mbuf_len < l4_offset)) + return; + + void *l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); + const uint64_t ol_flags = mbuf->ol_flags; + if (ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { + struct rte_ipv4_hdr *iph = l3_hdr; + iph->hdr_checksum = 0; + iph->hdr_checksum = rte_ipv4_cksum(iph); + } + + uint64_t l4_ol_flags = mbuf->ol_flags & RTE_MBUF_F_TX_L4_MASK; + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM && + likely(mbuf_len >= l4_offset + sizeof(struct rte_udp_hdr))) { + struct rte_udp_hdr *udp_hdr = + rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, l4_offset); + udp_hdr->dgram_cksum = 0; + udp_hdr->dgram_cksum = (ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, l4_offset); + } else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM && + likely(mbuf_len >= l4_offset + sizeof(struct rte_tcp_hdr))) { + struct rte_tcp_hdr *tcp_hdr = + rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, l4_offset); + tcp_hdr->cksum = 0; + tcp_hdr->cksum = (ol_flags & RTE_MBUF_F_TX_IPV4) ? 
+ rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, l4_offset); + } +} + static uint16_t eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { @@ -246,10 +294,12 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - memset(&pfd, 0, sizeof(pfd)); - pfd.fd = pkt_q->sockfd; - pfd.events = POLLOUT; - pfd.revents = 0; + if (pkt_q->txpollnotrdy) { + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = pkt_q->sockfd; + pfd.events = POLLOUT; + pfd.revents = 0; + } framecount = pkt_q->framecount; framenum = pkt_q->framenum; @@ -290,7 +340,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) */ if (unlikely(!tx_ring_status_available(rte_atomic_load_explicit(&ppd->tp_status, rte_memory_order_acquire)) && - (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || + (!pkt_q->txpollnotrdy || poll(&pfd, 1, -1) < 0 || + (pfd.revents & POLLERR) != 0 || !tx_ring_status_available(rte_atomic_load_explicit(&ppd->tp_status, rte_memory_order_acquire))))) { /* Ring is full, stop here. Don't process bufs[i]. 
*/ @@ -302,6 +353,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; + if (pkt_q->sw_cksum) + af_packet_sw_cksum(mbuf); + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -387,10 +441,14 @@ eth_dev_configure(struct rte_eth_dev *dev __rte_unused) { struct rte_eth_conf *dev_conf = &dev->data->dev_conf; const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + const struct rte_eth_txmode *txmode = &dev_conf->txmode; struct pmd_internals *internals = dev->data->dev_private; internals->vlan_strip = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP); internals->timestamp_offloading = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP); + internals->tx_sw_cksum = !!(txmode->offloads & (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM)); + return 0; } @@ -408,7 +466,10 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->max_tx_queues = (uint16_t)internals->nb_queues; dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | - RTE_ETH_TX_OFFLOAD_VLAN_INSERT; + RTE_ETH_TX_OFFLOAD_VLAN_INSERT | + RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | + RTE_ETH_TX_OFFLOAD_TCP_CKSUM; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | RTE_ETH_RX_OFFLOAD_TIMESTAMP; @@ -634,6 +695,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, { struct pmd_internals *internals = dev->data->dev_private; + internals->tx_queue[tx_queue_id].sw_cksum = internals->tx_sw_cksum; dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id]; return 0; @@ -829,6 +891,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framecnt, unsigned int qdisc_bypass, const char *fanout_mode, + bool txpollnotrdy, struct pmd_internals **internals, struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist) @@ 
-1049,6 +1112,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; + tx_queue->txpollnotrdy = txpollnotrdy; rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr)); if (rc == -1) { @@ -1137,6 +1201,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev, unsigned int qpairs = 1; unsigned int qdisc_bypass = 1; const char *fanout_mode = NULL; + bool txpollnotrdy = DFLT_TX_POLL_NOT_RDY; /* do some parameter checking */ if (*sockfd < 0) @@ -1204,6 +1269,10 @@ rte_eth_from_packet(struct rte_vdev_device *dev, fanout_mode = pair->value; continue; } + if (strstr(pair->key, ETH_AF_PACKET_TX_POLL_NOT_READY_ARG) != NULL) { + txpollnotrdy = atoi(pair->value) != 0; + continue; + } } if (framesize > blocksize) { @@ -1278,6 +1347,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev, framesize, framecount, qdisc_bypass, fanout_mode, + txpollnotrdy, &internals, ð_dev, kvlist) < 0) return -1; -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v1 3/3] net/af_packet: software checksum and tx poll control 2026-01-27 18:13 ` [PATCH v1 3/3] net/af_packet: software checksum and tx poll control scott.k.mitch1 @ 2026-01-27 18:57 ` Stephen Hemminger 2026-01-28 7:05 ` Scott Mitchell 2026-01-27 20:45 ` [REVIEW] " Stephen Hemminger 1 sibling, 1 reply; 65+ messages in thread From: Stephen Hemminger @ 2026-01-27 18:57 UTC (permalink / raw) To: scott.k.mitch1; +Cc: dev On Tue, 27 Jan 2026 10:13:55 -0800 scott.k.mitch1@gmail.com wrote: > From: Scott Mitchell <scott.k.mitch1@gmail.com> > > Add software checksum offload support and configurable TX poll > behavior to improve flexibility and performance. > > Implement af_packet_sw_cksum() helper to compute IPv4/UDP/TCP > checksums in software when hardware offload is not available. > This enables checksum offload on interfaces without HW support. I don't think each driver should be doing its own checksum helper. It should be done at application or through libraries. All modern hardware does checksum offload, so if it doesn't probably a driver bug. > > Add txpollnotrdy devarg (default=true) to control whether poll() > is called when the TX ring is not ready. This allows users to > avoid blocking behavior if application threads are in asynchronous > poll mode where blocking the thread has negative side effects and > backpressure is applied via different means. > Needs to be a separate patch. Don't do two things in one patch. Not sure if some variant of the existing configure thresholds could be used for this. > Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v1 3/3] net/af_packet: software checksum and tx poll control
  2026-01-27 18:57 ` Stephen Hemminger
@ 2026-01-28 7:05   ` Scott Mitchell
  2026-01-28 17:36     ` Stephen Hemminger
  0 siblings, 1 reply; 65+ messages in thread
From: Scott Mitchell @ 2026-01-28 7:05 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev

> I don't think each driver should be doing its own checksum helper.
> It should be done at application or through libraries.
> All modern hardware does checksum offload, so if it doesn't probably
> a driver bug.

The goal from the app perspective is to set rte_eth_txmode.offloads
consistently across PMDs. Hardware devices are good but software
devices like af_packet don't provide a consistent experience. The
approach in this patch is similar to existing software eth devices tap
[1] and vhost [2]. In v2 I will share the checksum code with tap to
avoid duplication and ensure they remain consistent.

[1] https://github.com/DPDK/dpdk/blob/v25.11/drivers/net/tap/rte_eth_tap.c#L559-L624
[2] https://github.com/DPDK/dpdk/blob/v25.11/drivers/net/vhost/rte_eth_vhost.c#L317-L357

> Needs to be a separate patch. Don't do two things in one patch.
> Not sure if some variant of the existing configure thresholds
> could be used for this.

ack, will split out the txpoll control. I'm not aware of any other way
to prevent poll() if DPDK TX is faster than kernel RX.

^ permalink raw reply	[flat|nested] 65+ messages in thread
* Re: [PATCH v1 3/3] net/af_packet: software checksum and tx poll control 2026-01-28 7:05 ` Scott Mitchell @ 2026-01-28 17:36 ` Stephen Hemminger 2026-01-28 18:59 ` Scott Mitchell 0 siblings, 1 reply; 65+ messages in thread From: Stephen Hemminger @ 2026-01-28 17:36 UTC (permalink / raw) To: Scott Mitchell; +Cc: dev On Tue, 27 Jan 2026 23:05:54 -0800 Scott Mitchell <scott.k.mitch1@gmail.com> wrote: > > I don't think each driver should be doing its own checksum helper. > > It should be done at application or through libraries. > > All modern hardware does checksum offload, so if it doesn't probably > > a driver bug. > > The goal from the app perspective is to set rte_eth_txmode.offloads > consistently across PMDs. Hardware devices are good but software > devices like af_packet don't provide a consistent experience. The > approach in this patch is similar to existing software eth devices tap > [1] and vhost [2]. In v2 I will share checksum code with tap to avoid > duplication and ensure they remain consistent. > > [1] https://github.com/DPDK/dpdk/blob/v25.11/drivers/net/tap/rte_eth_tap.c#L559-L624 > [2] https://github.com/DPDK/dpdk/blob/v25.11/drivers/net/vhost/rte_eth_vhost.c#L317-L357 > > > Needs to be a separate patch. Don't do two things in one patch. > > Not sure if some variant of the existing configure thresholds > > could be used for this. > > ack, will split txpoll control. I'm not aware of any other way to > prevent poll if dpdk tx is faster than kernel rx. Well tap should NOT be doing software GSO and LRO in DPDK. The kernel driver has ability to do that, and is better done there. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v1 3/3] net/af_packet: software checksum and tx poll control
  2026-01-28 17:36 ` Stephen Hemminger
@ 2026-01-28 18:59   ` Scott Mitchell
  0 siblings, 0 replies; 65+ messages in thread
From: Scott Mitchell @ 2026-01-28 18:59 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev

> Well tap should NOT be doing software GSO and LRO in DPDK.
> The kernel driver has ability to do that, and is better done there.

Agreed, if we can propagate context to the other end of the device it
is best to let it handle it. Some context/constraints:
- The current mechanism to propagate context to the kernel
  (virtio_net_hdr) doesn't support propagating "please do ipv4
  checksum for me", and that part is done in software for virtio
  too [1].
- I have a not-yet-submitted patch to add virtio_net_hdr support to
  the af_packet PMD (for TSO/LRO). When enabled, the kernel requires
  that every rx/tx packet carries the virtio_net_hdr header, and I
  currently made this opt-in as a vdev arg.

This leaves some options:

1. cksum support independent from virtio_net_hdr (my current
   approach). If virtio_net_hdr is enabled and supported - use it. If
   not, the software fallback path provides cksum (what is in this
   patch).
   - Pros: cksum is supported regardless of virtio_net_hdr
     support/errors; the overhead of virtio_net_hdr is not required
     for cksum.
   - Cons: 2 paths for checksum, more code (even if shared with TAP).

2. cksum support ONLY IF virtio_net_hdr is enabled/supported.
   - Pros: single code path for checksum, L4 cksum "offloaded" (L3
     still done in software).
   - Cons: must always enable virtio_net_hdr to get cksum/TSO/LRO;
     overhead of virtio_net_hdr on each packet.

Do you prefer option (2)? fwiw, I didn't observe meaningful overhead
from virtio_net_hdr, and it seems viable.

[1] https://github.com/DPDK/dpdk/blob/v25.11/lib/vhost/virtio_net.c#L690-L697

^ permalink raw reply	[flat|nested] 65+ messages in thread
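For context on option (2), the header fields involved in asking the kernel to finish the L4 checksum look like the sketch below. The struct is a minimal mirror of the kernel's `struct virtio_net_hdr` (layout per linux/virtio_net.h); the abbreviated names and the helper are illustrative, not driver code:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal mirror of the kernel's struct virtio_net_hdr (layout per
 * linux/virtio_net.h); only the checksum-related fields matter here. */
#define VNET_HDR_F_NEEDS_CSUM 1    /* VIRTIO_NET_HDR_F_NEEDS_CSUM */

struct vnet_hdr {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;    /* where checksum coverage begins */
    uint16_t csum_offset;   /* where, within that area, to store it */
};

/* Ask the kernel to finish the L4 checksum: coverage starts at the L4
 * header, and the result lands at the checksum field's offset within
 * it (6 bytes into a UDP header, 16 into a TCP header). Note there is
 * no flag meaning "also compute the IPv4 header checksum" -- that part
 * stays in software either way, which is the constraint noted above. */
static void
vnet_hdr_l4_csum(struct vnet_hdr *h, uint16_t l2_len, uint16_t l3_len,
                 int is_tcp)
{
    h->flags = VNET_HDR_F_NEEDS_CSUM;
    h->csum_start = l2_len + l3_len;
    h->csum_offset = is_tcp ? 16 : 6;
}
```

This is why option (2) only half-offloads: the kernel fills in the L4 checksum from csum_start/csum_offset, but the IPv4 header checksum still has to be computed in software before handing the frame over.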
* [REVIEW] net/af_packet: software checksum and tx poll control 2026-01-27 18:13 ` [PATCH v1 3/3] net/af_packet: software checksum and tx poll control scott.k.mitch1 2026-01-27 18:57 ` Stephen Hemminger @ 2026-01-27 20:45 ` Stephen Hemminger 1 sibling, 0 replies; 65+ messages in thread From: Stephen Hemminger @ 2026-01-27 20:45 UTC (permalink / raw) To: dev; +Cc: Stephen Hemminger AI-generated review of bundle-1702-af-packet.mbox Reviewed using Claude (claude-opus-4-5-20251101) This is an automated review. Please verify all suggestions. --- # DPDK Patch Review: AF_PACKET Thread Safety and Frame Calculations ## Patch 1/3: net/af_packet: fix thread safety and frame calculations ### Commit Message Issues **Error: Subject line exceeds 60 characters** - Current: `net/af_packet: fix thread safety and frame calculations` (59 characters) - This is actually within the limit, my mistake on initial count. **Info: Subject is acceptable** - Format is correct: `net/af_packet:` prefix with lowercase description - No trailing period, imperative mood **Info: Body is well-structured** - Good explanation of the problems and solutions - Proper wrapping within 75 characters for most lines - Required tags present (Fixes, Cc: stable@dpdk.org, Signed-off-by) ### Code Review **Warning: Missing space after comma in void* cast** ```c rte_memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); ``` Should be: ```c rte_memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void *), data_len); ``` This appears in multiple locations. **Warning: Variable declaration in middle of block (C99 style)** ```c for (i = 0; i < nb_pkts; i++) { /* point at the next incoming frame */ ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; uint32_t tp_status = rte_atomic_load_explicit(&ppd->tp_status, ``` While C99 style is acceptable per AGENTS.md, consistency within the function is preferred. The original code declares variables at the start of the block. 
**Warning: Double blank line in eth_stats_get()** ```c uint64_t rx_nombuf = rte_atomic_load_explicit(&internal->rx_queue[i].rx_nombuf, rte_memory_order_relaxed); uint64_t tx_pkts = rte_atomic_load_explicit(&internal->tx_queue[i].tx_pkts, ``` Remove one blank line. **Info: Atomic operations usage is correct** - Proper use of `rte_atomic_load_explicit()` with `rte_memory_order_acquire` for tp_status reads - Proper use of `rte_atomic_store_explicit()` with `rte_memory_order_release` for tp_status writes - Relaxed ordering for statistics is appropriate **Warning: Static const at file scope should use uppercase naming** ```c static const uint16_t ETH_AF_PACKET_FRAME_SIZE_MAX = RTE_IPV4_MAX_PKT_LEN; ``` This is correct - uppercase for constants. However, the type is `uint16_t` but `RTE_IPV4_MAX_PKT_LEN` is typically larger. Verify this doesn't truncate. **Info: Frame calculation fix appears correct** The block-based addressing fix aligns with kernel's `packet_lookup_frame()` behavior. --- ## Patch 2/3: net/af_packet: RX/TX rte_memcpy, bulk free, prefetch ### Commit Message Issues **Warning: Subject uses terms that should be expanded** - `RX/TX` should be `Rx/Tx` per DPDK naming conventions in words-case.txt **Error: Missing Cc: stable@dpdk.org if this is a bug fix** If this patch contains performance fixes that should be backported, add `Cc: stable@dpdk.org`. ### Code Review **Warning: rte_memcpy() may not be appropriate here** Per AGENTS.md: "rte_memcpy() is optimized for bulk data transfer in the fast path. For general use, standard memcpy() is preferred." The packet data copy is indeed in the fast path, so `rte_memcpy()` usage is justified here. However, the commit message should clarify this is intentional for fast-path optimization. **Warning: rte_pktmbuf_free_bulk() called on potentially modified array** ```c rte_pktmbuf_free_bulk(&bufs[0], i); ``` The VLAN insertion path calls `rte_vlan_insert(&mbuf)` which may reallocate the mbuf. 
If the original mbuf pointer in `bufs[i]` is not updated, this could free the wrong mbuf or double-free. The current code uses a local `mbuf` variable but doesn't update `bufs[i]`. **Error: Potential memory leak on skipped packets** ```c if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size || ((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 && rte_vlan_insert(&mbuf) != 0))) { continue; } ``` When a packet is skipped (oversized or VLAN insert fails), it's not freed. The bulk free at the end uses `i` which includes skipped packets, but those packets were never processed. This needs careful review - the original code freed packets individually including on error paths. **Warning: Missing space after comma** ```c rte_memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); ``` **Info: Prefetch optimization looks reasonable** The prefetch of next frame/mbuf while processing current packet is a valid optimization. --- ## Patch 3/3: net/af_packet: software checksum and tx poll control ### Commit Message Issues **Warning: Subject could be more specific** Consider: `net/af_packet: add software checksum and tx poll control` ### Code Review **Error: Missing header include** The code uses `struct rte_udp_hdr` and `struct rte_tcp_hdr` but only includes `<rte_ip.h>`. Need: ```c #include <rte_udp.h> #include <rte_tcp.h> ``` **Warning: Including `<stdbool.h>` is unnecessary** DPDK's `rte_common.h` (included via other headers) provides `bool` type. However, explicit inclusion isn't harmful. **Warning: New devarg not documented** The new `txpollnotrdy` argument should be documented in the driver documentation (`doc/guides/nics/af_packet.rst`). **Warning: Inline function in .c file should use `static inline`** ```c static inline void af_packet_sw_cksum(struct rte_mbuf *mbuf) ``` This is correct. **Warning: Magic check for minimum header sizes** ```c if (unlikely(mbuf_len < l4_offset)) return; ``` This silently returns without computing checksum. 
Consider logging a debug message for diagnosability. **Info: Software checksum implementation looks correct** - Properly handles IPv4/IPv6 - Properly handles UDP/TCP - Uses mbuf-aware checksum functions for multi-segment packets **Warning: Boolean parameter parsing could be clearer** ```c txpollnotrdy = atoi(pair->value) != 0; ``` Consider using a more explicit boolean parsing that accepts "true"/"false" or "yes"/"no" in addition to 0/1. **Warning: New capability advertised but may not match hardware** Adding `RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM` to capabilities is correct for software implementation, but documentation should clarify this is software-based. --- ## Summary ### Errors (Must Fix) 1. Patch 2: Potential memory leak on skipped packets - packets that fail size check or VLAN insert are not freed 2. Patch 3: Missing includes for `rte_udp.h` and `rte_tcp.h` ### Warnings (Should Fix) 1. All patches: Missing space after comma in `void*` casts 2. Patch 1: Double blank line in `eth_stats_get()` 3. Patch 2: Subject should use `Rx/Tx` not `RX/TX` 4. Patch 2: Verify `rte_pktmbuf_free_bulk()` handles VLAN-modified mbufs correctly 5. Patch 3: New devarg `txpollnotrdy` needs documentation update ### Info (Consider) 1. The atomic operations and memory ordering appear correct for the kernel synchronization protocol 2. Frame calculation fix aligns with kernel behavior 3. Prefetch optimizations are reasonable 4. Software checksum implementation is correct ^ permalink raw reply [flat|nested] 65+ messages in thread
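A sketch of the more permissive boolean parsing the review suggests. `parse_bool_arg` is a hypothetical helper name; the driver's kvargs handling for the `txpollnotrdy` devarg would call something like it instead of `atoi()`:

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>

/* Boolean devarg parsing along the lines the review suggests:
 * accept 0/1 as well as true/false, yes/no, and on/off,
 * case-insensitively. Returns 0 on success with *out set,
 * -1 on unrecognized input so the caller can reject bad devargs
 * instead of silently treating them as false. */
static int
parse_bool_arg(const char *value, int *out)
{
    static const char *truthy[] = { "1", "true", "yes", "on" };
    static const char *falsy[]  = { "0", "false", "no", "off" };
    size_t i;

    if (value == NULL)
        return -1;
    for (i = 0; i < sizeof(truthy) / sizeof(truthy[0]); i++) {
        if (strcasecmp(value, truthy[i]) == 0) {
            *out = 1;
            return 0;
        }
    }
    for (i = 0; i < sizeof(falsy) / sizeof(falsy[0]); i++) {
        if (strcasecmp(value, falsy[i]) == 0) {
            *out = 0;
            return 0;
        }
    }
    return -1;
}
```

Returning an error for unrecognized input also addresses the `atoi()` pitfall: `atoi("true")` is 0, so a user passing `txpollnotrdy=true` would silently get the opposite of what they asked for.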
* [PATCH v2 0/4] af_packet correctness, performance, cksum 2026-01-27 18:13 [PATCH v1 0/3] net/af_packet: correctness fixes and improvements scott.k.mitch1 ` (2 preceding siblings ...) 2026-01-27 18:13 ` [PATCH v1 3/3] net/af_packet: software checksum and tx poll control scott.k.mitch1 @ 2026-01-28 9:36 ` scott.k.mitch1 2026-01-28 9:36 ` [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 ` (4 more replies) 3 siblings, 5 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 9:36 UTC (permalink / raw) To: dev; +Cc: stephen, Scott From: Scott <scott.k.mitch1@gmail.com> This series fixes critical thread safety bugs in the af_packet PMD and adds performance optimizations. Patch 1 fixes two major correctness issues: - Thread safety: tp_status was accessed without memory barriers, violating the kernel's PACKET_MMAP protocol. On aarch64 and other weakly-ordered architectures, this causes packet corruption due to missing memory ordering. The fix matches the kernel's memory model: volatile unaligned reads/writes with explicit rte_smp_rmb/wmb barriers and __may_alias__ protection. - Frame calculations: Fixed incorrect frame overhead and address calculations that caused memory corruption when frames don't evenly divide blocks. Patches 2-4 add performance improvements: - Patch 2: Bulk mbuf freeing, unlikely annotations, and prefetching - Patch 3: TX poll control to reduce syscall overhead - Patch 4: Software checksum offload support with shared rte_net utility v2 changes: - Patch 1: Rewrote to use volatile + barriers instead of C11 atomics to match kernel's memory model. Added dependency on patch-160274 for __rte_may_alias attribute. - Patch 4: Refactored to use shared rte_net_ip_udptcp_cksum_mbuf() utility function, eliminating code duplication with tap driver. 
Scott Mitchell (4): net/af_packet: fix thread safety and frame calculations net/af_packet: RX/TX unlikely, bulk free, prefetch net/af_packet: tx poll control net/af_packet: software checksum doc/guides/nics/af_packet.rst | 6 +- drivers/net/af_packet/rte_eth_af_packet.c | 257 ++++++++++++++++------ drivers/net/tap/rte_eth_tap.c | 61 +---- lib/net/rte_net.h | 90 ++++++++ 4 files changed, 283 insertions(+), 131 deletions(-) -- 2.39.5 (Apple Git-154) ^ permalink raw reply [flat|nested] 65+ messages in thread
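For background on the shared checksum utility mentioned in the v2 changes: the arithmetic underneath it is the RFC 1071 Internet checksum. A flat-buffer sketch follows; the real DPDK helpers (`rte_ipv4_cksum()`, the `*_cksum_mbuf()` variants) compute the same sum but additionally walk chained mbuf segments and fold in the L4 pseudo-header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over a flat buffer: 16-bit ones'-
 * complement sum with carries folded back into the low 16 bits,
 * result complemented. Verifying a buffer that already contains its
 * checksum yields 0. */
static uint16_t
ip_cksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];  /* network byte order */
        p += 2;
        len -= 2;
    }
    if (len == 1)                 /* odd trailing byte, zero-padded */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)             /* fold carries into low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

The fold-and-complement structure is why the sum can be computed incrementally across mbuf segments: ones'-complement addition is associative, so a segment-walking helper just keeps accumulating into the same 32-bit sum before the final fold.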
* [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations 2026-01-28 9:36 ` [PATCH v2 0/4] af_packet correctness, performance, cksum scott.k.mitch1 @ 2026-01-28 9:36 ` scott.k.mitch1 2026-01-28 16:59 ` Stephen Hemminger 2026-01-28 9:36 ` [PATCH v2 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch scott.k.mitch1 ` (3 subsequent siblings) 4 siblings, 1 reply; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 9:36 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell, linville, stable From: Scott Mitchell <scott.k.mitch1@gmail.com> Thread Safety: The tp_status field was accessed without proper memory barriers, violating the kernel's PACKET_MMAP synchronization protocol. The kernel implements this protocol in net/packet/af_packet.c: - __packet_get_status: smp_rmb() then READ_ONCE() (volatile read) - __packet_set_status: WRITE_ONCE() (volatile write) then smp_wmb() READ_ONCE/WRITE_ONCE use __may_alias__ attribute via __uXX_alias_t types to prevent compiler optimizations that assume type-based aliasing rules, which is critical for tp_status access that may be misaligned within the ring buffer. Userspace must use equivalent semantics: volatile unaligned_uint32_t (with __rte_may_alias) reads/writes with explicit memory barriers (rte_smp_rmb/rte_smp_wmb). On aarch64 and other weakly-ordered architectures, missing barriers cause packet corruption because: - RX: CPU may read stale packet data before seeing tp_status update - TX: CPU may reorder stores, causing kernel to see tp_status before packet data is fully written This becomes critical with io_uring SQPOLL mode where the kernel polling thread on a different CPU core asynchronously updates tp_status, making proper memory ordering essential. Note: Uses rte_smp_[r/w]mb which triggers checkpatch warnings, but C11 atomics cannot be used because tp_status is not declared _Atomic in the kernel's tpacket2_hdr structure. 
We must match the kernel's volatile + barrier memory model with __may_alias__ protection. Frame Calculation Issues: 1. Frame overhead incorrectly calculated as TPACKET_ALIGN(TPACKET2_HDRLEN) instead of TPACKET2_HDRLEN - sizeof(struct sockaddr_ll), causing incorrect usable frame data size. 2. Frame address calculation assumed sequential layout (frame_base + i * frame_size), but the kernel's packet_lookup_frame() uses block-based addressing: block_idx = position / frames_per_block frame_offset = position % frames_per_block address = block_start[block_idx] + (frame_offset * frame_size) This caused memory corruption when frames don't evenly divide blocks. Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") Cc: linville@tuxdriver.com Cc: stable@dpdk.org Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- Depends-on: patch-160274 ("eal: add __rte_may_alias to unaligned typedefs") drivers/net/af_packet/rte_eth_af_packet.c | 149 +++++++++++++++++----- 1 file changed, 114 insertions(+), 35 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index ef11b8fb6b..6c276bb7fc 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -9,6 +9,8 @@ #include <rte_common.h> #include <rte_string_fns.h> #include <rte_mbuf.h> +#include <rte_atomic.h> +#include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> #include <rte_malloc.h> @@ -41,6 +43,10 @@ #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +static const uint16_t ETH_AF_PACKET_FRAME_SIZE_MAX = RTE_IPV4_MAX_PKT_LEN; +#define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) +#define ETH_AF_PACKET_ETH_OVERHEAD (RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN) + static uint64_t timestamp_dynflag; static int timestamp_dynfield_offset = -1; @@ -120,6 +126,28 @@ RTE_LOG_REGISTER_DEFAULT(af_packet_logtype, NOTICE); RTE_LOG_LINE(level, AFPACKET, 
"%s(): " fmt ":%s", __func__, \ ## __VA_ARGS__, strerror(errno)) +/** + * Read tp_status from packet mmap ring. Matches kernel's READ_ONCE() with smp_rmb() + * ordering in af_packet.c __packet_get_status. + */ +static inline uint32_t +tpacket_read_status(const volatile void *tp_status) +{ + rte_smp_rmb(); + return *((const volatile unaligned_uint32_t *)tp_status); +} + +/** + * Write tp_status to packet mmap ring. Matches kernel's WRITE_ONCE() with smp_wmb() + * ordering in af_packet.c __packet_set_status. + */ +static inline void +tpacket_write_status(volatile void *tp_status, uint32_t status) +{ + *((volatile unaligned_uint32_t *)tp_status) = status; + rte_smp_wmb(); +} + static uint16_t eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { @@ -129,7 +157,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint8_t *pbuf; struct pkt_rx_queue *pkt_q = queue; uint16_t num_rx = 0; - unsigned long num_rx_bytes = 0; + uint32_t num_rx_bytes = 0; + uint32_t tp_status; unsigned int framecount, framenum; if (unlikely(nb_pkts == 0)) @@ -144,7 +173,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) for (i = 0; i < nb_pkts; i++) { /* point at the next incoming frame */ ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - if ((ppd->tp_status & TP_STATUS_USER) == 0) + tp_status = tpacket_read_status(&ppd->tp_status); + if ((tp_status & TP_STATUS_USER) == 0) break; /* allocate the next mbuf */ @@ -160,7 +190,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, rte_pktmbuf_data_len(mbuf)); /* check for vlan info */ - if (ppd->tp_status & TP_STATUS_VLAN_VALID) { + if (tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED); @@ -179,7 +209,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* release incoming frame and 
advance ring buffer */ - ppd->tp_status = TP_STATUS_KERNEL; + tpacket_write_status(&ppd->tp_status, TP_STATUS_KERNEL); if (++framenum >= framecount) framenum = 0; mbuf->port = pkt_q->in_port; @@ -228,8 +258,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) struct pollfd pfd; struct pkt_tx_queue *pkt_q = queue; uint16_t num_tx = 0; - unsigned long num_tx_bytes = 0; - int i; + uint32_t num_tx_bytes = 0; + uint16_t i; if (unlikely(nb_pkts == 0)) return 0; @@ -259,16 +289,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } } - /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status)) { - if (poll(&pfd, 1, -1) < 0) - break; - - /* poll() can return POLLERR if the interface is down */ - if (pfd.revents & POLLERR) - break; - } - /* * poll() will almost always return POLLOUT, even if there * are no extra buffers available @@ -283,26 +303,28 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * * This results in poll() returning POLLOUT. */ - if (!tx_ring_status_available(ppd->tp_status)) + if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && + (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + /* Ring is full, stop here. Don't process bufs[i]. 
*/ break; + } - /* copy the tx frame data */ - pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; struct rte_mbuf *tmp_mbuf = mbuf; - while (tmp_mbuf) { + do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); pbuf += data_len; tmp_mbuf = tmp_mbuf->next; - } + } while (tmp_mbuf); ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_SEND_REQUEST; + tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; @@ -392,10 +414,12 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->if_index = internals->if_index; dev_info->max_mac_addrs = 1; - dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN; + dev_info->max_rx_pktlen = (uint32_t)ETH_AF_PACKET_FRAME_SIZE_MAX + + ETH_AF_PACKET_ETH_OVERHEAD; + dev_info->max_mtu = ETH_AF_PACKET_FRAME_SIZE_MAX; dev_info->max_rx_queues = (uint16_t)internals->nb_queues; dev_info->max_tx_queues = (uint16_t)internals->nb_queues; - dev_info->min_rx_bufsize = 0; + dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | RTE_ETH_TX_OFFLOAD_VLAN_INSERT; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | @@ -572,8 +596,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, /* Now get the space available for data in the mbuf */ buf_size = rte_pktmbuf_data_room_size(pkt_q->mb_pool) - RTE_PKTMBUF_HEADROOM; - data_size = internals->req.tp_frame_size; - data_size -= TPACKET2_HDRLEN - sizeof(struct sockaddr_ll); + data_size = internals->req.tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; if (data_size > buf_size) { PMD_LOG(ERR, @@ -612,7 +635,7 @@ eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) int ret; int s; unsigned int data_size 
= internals->req.tp_frame_size - - TPACKET2_HDRLEN; + ETH_AF_PACKET_FRAME_OVERHEAD; if (mtu > data_size) return -EINVAL; @@ -977,25 +1000,38 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (rx_queue->rd == NULL) goto error; + /* Frame addresses must match kernel's packet_lookup_frame(): + * block_idx = position / frames_per_block + * frame_offset = position % frames_per_block + * address = block_start + (frame_offset * frame_size) + */ + const uint32_t frames_per_block = req->tp_block_size / req->tp_frame_size; for (i = 0; i < req->tp_frame_nr; ++i) { - rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + rx_queue->rd[i].iov_base = rx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); rx_queue->rd[i].iov_len = req->tp_frame_size; } rx_queue->sockfd = qsockfd; tx_queue = &((*internals)->tx_queue[q]); tx_queue->framecount = req->tp_frame_nr; - tx_queue->frame_data_size = req->tp_frame_size; - tx_queue->frame_data_size -= TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + tx_queue->frame_data_size = req->tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr; tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (tx_queue->rd == NULL) goto error; + /* See comment above rx_queue->rd initialization. 
*/ for (i = 0; i < req->tp_frame_nr; ++i) { - tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + tx_queue->rd[i].iov_base = tx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; @@ -1092,7 +1128,8 @@ rte_eth_from_packet(struct rte_vdev_device *dev, if (*sockfd < 0) return -1; - blocksize = getpagesize(); + const int pagesize = getpagesize(); + blocksize = pagesize; /* * Walk arguments for configurable settings @@ -1162,13 +1199,55 @@ rte_eth_from_packet(struct rte_vdev_device *dev, return -1; } - blockcount = framecount / (blocksize / framesize); + const unsigned int frames_per_block = blocksize / framesize; + blockcount = framecount / frames_per_block; if (!blockcount) { PMD_LOG(ERR, "%s: invalid AF_PACKET MMAP parameters", name); return -1; } + /* + * https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt + * Check constraints that may be enforced by the kernel and cause failure + * to initialize the rings but explicit error messages aren't provided. 
+ * See packet_set_ring in linux kernel for enforcement: + * https://github.com/torvalds/linux/blob/master/net/packet/af_packet.c + */ + if (blocksize % pagesize != 0) { + /* tp_block_size must be a multiple of PAGE_SIZE */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of PAGE_SIZE=%d", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, pagesize); + } + if (framesize % TPACKET_ALIGNMENT != 0) { + /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of TPACKET_ALIGNMENT=%d", + name, ETH_AF_PACKET_FRAMESIZE_ARG, framesize, TPACKET_ALIGNMENT); + } + if (frames_per_block == 0 || frames_per_block > UINT_MAX / blockcount || + framecount != frames_per_block * blockcount) { + /* tp_frame_nr must be exactly frames_per_block*tp_block_nr */ + PMD_LOG(WARNING, "%s: %s=%u must be exactly " + "frames_per_block(%s/%s = %u/%u = %u) * blockcount(%u)", + name, ETH_AF_PACKET_FRAMECOUNT_ARG, framecount, + ETH_AF_PACKET_BLOCKSIZE_ARG, ETH_AF_PACKET_FRAMESIZE_ARG, + blocksize, framesize, frames_per_block, blockcount); + } + + /* Below conditions may not cause errors but provide hints to improve */ + if (blocksize % framesize != 0) { + PMD_LOG(WARNING, "%s: %s=%u not evenly divisible by %s=%u, " + "may waste memory", name, + ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, + ETH_AF_PACKET_FRAMESIZE_ARG, framesize); + } + if (!rte_is_power_of_2(blocksize)) { + /* tp_block_size should be a power of two or there will be waste */ + PMD_LOG(WARNING, "%s: %s=%u should be a power of two " + "or there will be a waste of memory", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize); + } + PMD_LOG(DEBUG, "%s: AF_PACKET MMAP parameters:", name); PMD_LOG(DEBUG, "%s:\tblock size %d", name, blocksize); PMD_LOG(DEBUG, "%s:\tblock count %d", name, blockcount); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations 2026-01-28 9:36 ` [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 @ 2026-01-28 16:59 ` Stephen Hemminger 2026-01-28 18:00 ` Scott Mitchell 0 siblings, 1 reply; 65+ messages in thread From: Stephen Hemminger @ 2026-01-28 16:59 UTC (permalink / raw) To: scott.k.mitch1; +Cc: dev, linville, stable On Wed, 28 Jan 2026 01:36:04 -0800 scott.k.mitch1@gmail.com wrote: > +/** > + * Read tp_status from packet mmap ring. Matches kernel's READ_ONCE() with smp_rmb() > + * ordering in af_packet.c __packet_get_status. > + */ > +static inline uint32_t > +tpacket_read_status(const volatile void *tp_status) > +{ > + rte_smp_rmb(); > + return *((const volatile unaligned_uint32_t *)tp_status); > +} Wouldn't rte_compiler_barrier() be a better choice here? You are really only trying to keep the compiler from optimizing the access. And tp_status is aligned in the ring, isn't it?
* Re: [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations 2026-01-28 16:59 ` Stephen Hemminger @ 2026-01-28 18:00 ` Scott Mitchell 2026-01-28 18:28 ` Stephen Hemminger 0 siblings, 1 reply; 65+ messages in thread From: Scott Mitchell @ 2026-01-28 18:00 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev, linville, stable On Wed, Jan 28, 2026 at 8:59 AM Stephen Hemminger <stephen@networkplumber.org> wrote: > > On Wed, 28 Jan 2026 01:36:04 -0800 > scott.k.mitch1@gmail.com wrote: > > > +/** > > + * Read tp_status from packet mmap ring. Matches kernel's READ_ONCE() with smp_rmb() > > + * ordering in af_packet.c __packet_get_status. > > + */ > > +static inline uint32_t > > +tpacket_read_status(const volatile void *tp_status) > > +{ > > + rte_smp_rmb(); > > + return *((const volatile unaligned_uint32_t *)tp_status); > > +} > > Wouldn't rte_compiler_barrier() be better choice here. > You are really only trying to keep compiler from optimzing the access. > > And tp_status is aligned in ring isn't it? The current approach replicates __packet_set_status and __packet_get_status [1] in the kernel which use the same barriers (WRITE_ONCE calls __write_once_size [2] which does a volatile cast). dpdk's rte_smp_rmb and rte_smp_wmb on x86 are just rte_compiler_barrier [3] but on arm it's different [4]. [1] https://github.com/torvalds/linux/blob/v6.18/net/packet/af_packet.c#L399-L451 [2] https://github.com/torvalds/linux/blob/v6.18/tools/include/linux/compiler.h#L194 [3] https://github.com/DPDK/dpdk/blob/v25.11/lib/eal/x86/include/rte_atomic.h#L26-L28 [4] https://github.com/DPDK/dpdk/blob/v25.11/lib/eal/arm/include/rte_atomic_64.h#L29-L31 ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations 2026-01-28 18:00 ` Scott Mitchell @ 2026-01-28 18:28 ` Stephen Hemminger 0 siblings, 0 replies; 65+ messages in thread From: Stephen Hemminger @ 2026-01-28 18:28 UTC (permalink / raw) To: Scott Mitchell; +Cc: dev, linville, stable On Wed, 28 Jan 2026 10:00:16 -0800 Scott Mitchell <scott.k.mitch1@gmail.com> wrote: > On Wed, Jan 28, 2026 at 8:59 AM Stephen Hemminger > <stephen@networkplumber.org> wrote: > > > > On Wed, 28 Jan 2026 01:36:04 -0800 > > scott.k.mitch1@gmail.com wrote: > > > > > +/** > > > + * Read tp_status from packet mmap ring. Matches kernel's READ_ONCE() with smp_rmb() > > > + * ordering in af_packet.c __packet_get_status. > > > + */ > > > +static inline uint32_t > > > +tpacket_read_status(const volatile void *tp_status) > > > +{ > > > + rte_smp_rmb(); > > > + return *((const volatile unaligned_uint32_t *)tp_status); > > > +} > > > > Wouldn't rte_compiler_barrier() be better choice here. > > You are really only trying to keep compiler from optimzing the access. > > > > And tp_status is aligned in ring isn't it? > > The current approach replicates __packet_set_status and > __packet_get_status [1] in the kernel which use the same barriers > (WRITE_ONCE calls __write_once_size [2] which does a volatile cast). > dpdk's rte_smp_rmb and rte_smp_wmb on x86 are just > rte_compiler_barrier [3] but on arm it's different [4]. > > [1] https://github.com/torvalds/linux/blob/v6.18/net/packet/af_packet.c#L399-L451 > [2] https://github.com/torvalds/linux/blob/v6.18/tools/include/linux/compiler.h#L194 > [3] https://github.com/DPDK/dpdk/blob/v25.11/lib/eal/x86/include/rte_atomic.h#L26-L28 > [4] https://github.com/DPDK/dpdk/blob/v25.11/lib/eal/arm/include/rte_atomic_64.h#L29-L31 Agree that whatever primitive is used should match the kernel. Surprised the kernel uapi doesn't export that somehow. FreeBSD explicitly states what atomic is needed in their similar ring API.
* [PATCH v2 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch 2026-01-28 9:36 ` [PATCH v2 0/4] af_packet correctness, performance, cksum scott.k.mitch1 2026-01-28 9:36 ` [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 @ 2026-01-28 9:36 ` scott.k.mitch1 2026-01-28 9:36 ` [PATCH v2 3/4] net/af_packet: tx poll control scott.k.mitch1 ` (2 subsequent siblings) 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 9:36 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> - Add rte_prefetch0() to prefetch next frame/mbuf while processing current packet, reducing cache miss latency - Use rte_pktmbuf_free_bulk() in TX path instead of individual rte_pktmbuf_free() calls for better batch efficiency - Add unlikely() hints for error paths (oversized packets, VLAN insertion failures, sendto errors) to optimize branch prediction - Remove the unnecessary early nb_pkts == 0 check, since the loop handles this case and the app may never call with 0 frames. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 65 ++++++++++++----------- 1 file changed, 34 insertions(+), 31 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 6c276bb7fc..e357ae168b 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -161,9 +161,6 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t tp_status; unsigned int framecount, framenum; - if (unlikely(nb_pkts == 0)) - return 0; - /* * Reads the given number of packets from the AF_PACKET socket one by * one and copies the packet data into a newly allocated mbuf.
@@ -177,6 +174,14 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) if ((tp_status & TP_STATUS_USER) == 0) break; + unsigned int next_framenum = framenum + 1; + if (next_framenum >= framecount) + next_framenum = 0; + + /* prefetch the next frame for the next loop iteration */ + if (likely(i + 1 < nb_pkts)) + rte_prefetch0(pkt_q->rd[next_framenum].iov_base); + /* allocate the next mbuf */ mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool); if (unlikely(mbuf == NULL)) { @@ -210,8 +215,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_KERNEL); - if (++framenum >= framecount) - framenum = 0; + framenum = next_framenum; mbuf->port = pkt_q->in_port; /* account for the receive frame */ @@ -261,9 +265,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - if (unlikely(nb_pkts == 0)) - return 0; - memset(&pfd, 0, sizeof(pfd)); pfd.fd = pkt_q->sockfd; pfd.events = POLLOUT; @@ -271,22 +272,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) framecount = pkt_q->framecount; framenum = pkt_q->framenum; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; for (i = 0; i < nb_pkts; i++) { - mbuf = *bufs++; - - /* drop oversized packets */ - if (mbuf->pkt_len > pkt_q->frame_data_size) { - rte_pktmbuf_free(mbuf); - continue; + unsigned int next_framenum = framenum + 1; + if (next_framenum >= framecount) + next_framenum = 0; + + /* prefetch the next source mbuf and destination TPACKET */ + if (likely(i + 1 < nb_pkts)) { + rte_prefetch0(bufs[i + 1]); + rte_prefetch0(pkt_q->rd[next_framenum].iov_base); } - /* insert vlan info if necessary */ - if (mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) { - if (rte_vlan_insert(&mbuf)) { - rte_pktmbuf_free(mbuf); - continue; - } + mbuf = bufs[i]; + ppd = (struct tpacket2_hdr *)pkt_q->rd[framenum].iov_base; + + /* Drop 
oversized packets. Insert VLAN if necessary */ + if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size || + ((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 && + rte_vlan_insert(&mbuf) != 0))) { + continue; } /* @@ -312,6 +316,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; + ppd->tp_len = mbuf->pkt_len; + ppd->tp_snaplen = mbuf->pkt_len; + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -320,23 +327,19 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) tmp_mbuf = tmp_mbuf->next; } while (tmp_mbuf); - ppd->tp_len = mbuf->pkt_len; - ppd->tp_snaplen = mbuf->pkt_len; - /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); - if (++framenum >= framecount) - framenum = 0; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - + framenum = next_framenum; num_tx++; num_tx_bytes += mbuf->pkt_len; - rte_pktmbuf_free(mbuf); } + rte_pktmbuf_free_bulk(&bufs[0], i); + /* kick-off transmits */ - if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && - errno != ENOBUFS && errno != EAGAIN) { + if (unlikely(num_tx > 0 && + sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && + errno != ENOBUFS && errno != EAGAIN)) { /* * In case of a ENOBUFS/EAGAIN error all of the enqueued * packets will be considered successful even though only some -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v2 3/4] net/af_packet: tx poll control 2026-01-28 9:36 ` [PATCH v2 0/4] af_packet correctness, performance, cksum scott.k.mitch1 2026-01-28 9:36 ` [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 2026-01-28 9:36 ` [PATCH v2 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch scott.k.mitch1 @ 2026-01-28 9:36 ` scott.k.mitch1 2026-01-28 9:36 ` [PATCH v2 4/4] net/af_packet: software checksum scott.k.mitch1 2026-01-28 19:10 ` [PATCH v3 0/4] af_packet correctness, performance, cksum scott.k.mitch1 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 9:36 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add txpollnotrdy devarg (default=true) to control whether poll() is called when the TX ring is not ready. This allows users to avoid blocking behavior if application threads are in asynchronous poll mode where blocking the thread has negative side effects and backpressure is applied via different means. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- doc/guides/nics/af_packet.rst | 6 +++- drivers/net/af_packet/rte_eth_af_packet.c | 34 ++++++++++++++++++----- 2 files changed, 32 insertions(+), 8 deletions(-) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index 1505b98ff7..782a962c3f 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -29,6 +29,10 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings. * ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple of 16B); * ``framecnt`` - PACKET_MMAP frame count (optional, default 512). +* ``txpollnotrdy`` - Control behavior if tx is attempted but there is no + space available to write to the kernel. If 1, call poll() and block until + space is available to tx. If 0, don't call poll() and return from tx (optional, + default 1). 
For details regarding ``fanout_mode`` argument, you can consult the `PACKET_FANOUT documentation <https://www.man7.org/linux/man-pages/man7/packet.7.html>`_. @@ -75,7 +79,7 @@ framecnt=512): .. code-block:: console - --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash + --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash,txpollnotrdy=0 Features and Limitations ------------------------ diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index e357ae168b..be8e3260aa 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -18,6 +18,7 @@ #include <bus_vdev_driver.h> #include <errno.h> +#include <stdbool.h> #include <linux/if_ether.h> #include <linux/if_packet.h> #include <arpa/inet.h> @@ -39,9 +40,11 @@ #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" #define ETH_AF_PACKET_FANOUT_MODE_ARG "fanout_mode" +#define ETH_AF_PACKET_TX_POLL_NOT_READY_ARG "txpollnotrdy" #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +#define DFLT_TX_POLL_NOT_RDY true static const uint16_t ETH_AF_PACKET_FRAME_SIZE_MAX = RTE_IPV4_MAX_PKT_LEN; #define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) @@ -78,6 +81,9 @@ struct __rte_cache_aligned pkt_tx_queue { unsigned int framecount; unsigned int framenum; + bool txpollnotrdy; + bool sw_cksum; + volatile unsigned long tx_pkts; volatile unsigned long err_pkts; volatile unsigned long tx_bytes; @@ -106,6 +112,7 @@ static const char *valid_arguments[] = { ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, ETH_AF_PACKET_FANOUT_MODE_ARG, + ETH_AF_PACKET_TX_POLL_NOT_READY_ARG, NULL }; @@ -265,10 +272,12 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - memset(&pfd, 
0, sizeof(pfd)); - pfd.fd = pkt_q->sockfd; - pfd.events = POLLOUT; - pfd.revents = 0; + if (pkt_q->txpollnotrdy) { + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = pkt_q->sockfd; + pfd.events = POLLOUT; + pfd.revents = 0; + } framecount = pkt_q->framecount; framenum = pkt_q->framenum; @@ -308,8 +317,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * This results in poll() returning POLLOUT. */ if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && - (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || - !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + (!pkt_q->txpollnotrdy || poll(&pfd, 1, -1) < 0 || + (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { /* Ring is full, stop here. Don't process bufs[i]. */ break; } @@ -820,6 +830,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framecnt, unsigned int qdisc_bypass, const char *fanout_mode, + bool txpollnotrdy, struct pmd_internals **internals, struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist) @@ -1038,6 +1049,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; + tx_queue->txpollnotrdy = txpollnotrdy; rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr)); if (rc == -1) { @@ -1126,6 +1138,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev, unsigned int qpairs = 1; unsigned int qdisc_bypass = 1; const char *fanout_mode = NULL; + bool txpollnotrdy = DFLT_TX_POLL_NOT_RDY; /* do some parameter checking */ if (*sockfd < 0) @@ -1193,6 +1206,10 @@ rte_eth_from_packet(struct rte_vdev_device *dev, fanout_mode = pair->value; continue; } + if (strstr(pair->key, ETH_AF_PACKET_TX_POLL_NOT_READY_ARG) != NULL) { + txpollnotrdy = atoi(pair->value) != 0; + continue; + } } if (framesize > blocksize) { @@ -1261,12 +1278,14 @@ rte_eth_from_packet(struct rte_vdev_device *dev, 
PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, fanout_mode); else PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, "default PACKET_FANOUT_HASH"); + PMD_LOG(INFO, "%s:\ttxpollnotrdy %d", name, txpollnotrdy ? 1 : 0); if (rte_pmd_init_internals(dev, *sockfd, qpairs, blocksize, blockcount, framesize, framecount, qdisc_bypass, fanout_mode, + txpollnotrdy, &internals, ð_dev, kvlist) < 0) return -1; @@ -1364,4 +1383,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_packet, "framesz=<int> " "framecnt=<int> " "qdisc_bypass=<0|1> " - "fanout_mode=<hash|lb|cpu|rollover|rnd|qm>"); + "fanout_mode=<hash|lb|cpu|rollover|rnd|qm> " + "txpollnotrdy=<0|1>"); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v2 4/4] net/af_packet: software checksum 2026-01-28 9:36 ` [PATCH v2 0/4] af_packet correctness, performance, cksum scott.k.mitch1 ` (2 preceding siblings ...) 2026-01-28 9:36 ` [PATCH v2 3/4] net/af_packet: tx poll control scott.k.mitch1 @ 2026-01-28 9:36 ` scott.k.mitch1 2026-01-28 18:27 ` Stephen Hemminger 2026-01-28 19:10 ` [PATCH v3 0/4] af_packet correctness, performance, cksum scott.k.mitch1 4 siblings, 1 reply; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 9:36 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add software checksum offload support and configurable TX poll behavior to improve flexibility and performance. Add rte_net_ip_udptcp_cksum_mbuf in rte_net.h which is shared between rte_eth_tap and rte_eth_af_packet that supports IPv4/UDP/TCP checksums in software due to hardware offload and context propagation not being supported. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 15 +++- drivers/net/tap/rte_eth_tap.c | 61 +-------------- lib/net/rte_net.h | 90 +++++++++++++++++++++++ 3 files changed, 106 insertions(+), 60 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index be8e3260aa..19bafc99a6 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -10,6 +10,8 @@ #include <rte_string_fns.h> #include <rte_mbuf.h> #include <rte_atomic.h> +#include <rte_ip.h> +#include <rte_net.h> #include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> @@ -102,6 +104,7 @@ struct pmd_internals { struct pkt_tx_queue *tx_queue; uint8_t vlan_strip; uint8_t timestamp_offloading; + bool tx_sw_cksum; }; static const char *valid_arguments[] = { @@ -329,6 +332,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; + if 
(pkt_q->sw_cksum && !rte_net_ip_udptcp_cksum_mbuf(mbuf, false)) + continue; + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -413,10 +419,13 @@ eth_dev_configure(struct rte_eth_dev *dev __rte_unused) { struct rte_eth_conf *dev_conf = &dev->data->dev_conf; const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + const struct rte_eth_txmode *txmode = &dev_conf->txmode; struct pmd_internals *internals = dev->data->dev_private; internals->vlan_strip = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP); internals->timestamp_offloading = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP); + internals->tx_sw_cksum = !!(txmode->offloads & (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM)); return 0; } @@ -434,7 +443,10 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->max_tx_queues = (uint16_t)internals->nb_queues; dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | - RTE_ETH_TX_OFFLOAD_VLAN_INSERT; + RTE_ETH_TX_OFFLOAD_VLAN_INSERT | + RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | + RTE_ETH_TX_OFFLOAD_TCP_CKSUM; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | RTE_ETH_RX_OFFLOAD_TIMESTAMP; @@ -635,6 +647,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, { struct pmd_internals *internals = dev->data->dev_private; + internals->tx_queue[tx_queue_id].sw_cksum = internals->tx_sw_cksum; dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id]; return 0; diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c index 730f1859bd..55f496babe 100644 --- a/drivers/net/tap/rte_eth_tap.c +++ b/drivers/net/tap/rte_eth_tap.c @@ -560,70 +560,13 @@ tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs, if (txq->csum && (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM || l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM || l4_ol_flags == 
RTE_MBUF_F_TX_TCP_CKSUM)) { - unsigned int hdrlens = mbuf->l2_len + mbuf->l3_len; - uint16_t *l4_cksum; - void *l3_hdr; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) - hdrlens += sizeof(struct rte_udp_hdr); - else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) - hdrlens += sizeof(struct rte_tcp_hdr); - else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) - return -1; - - /* Support only packets with at least layer 4 - * header included in the first segment - */ - if (rte_pktmbuf_data_len(mbuf) < hdrlens) - return -1; - - /* To change checksums (considering that a mbuf can be - * indirect, for example), copy l2, l3 and l4 headers - * in a new segment and chain it to existing data - */ - seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); + /* Compute checksums in software, copying headers if needed */ + seg = rte_net_ip_udptcp_cksum_mbuf(mbuf, true); if (seg == NULL) return -1; - rte_pktmbuf_adj(mbuf, hdrlens); - rte_pktmbuf_chain(seg, mbuf); pmbufs[i] = mbuf = seg; - - l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); - if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { - struct rte_ipv4_hdr *iph = l3_hdr; - - iph->hdr_checksum = 0; - iph->hdr_checksum = rte_ipv4_cksum(iph); - } - - if (l4_ol_flags == RTE_MBUF_F_TX_L4_NO_CKSUM) - goto skip_l4_cksum; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) { - struct rte_udp_hdr *udp_hdr; - - udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &udp_hdr->dgram_cksum; - } else { - struct rte_tcp_hdr *tcp_hdr; - - tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &tcp_hdr->cksum; - } - - *l4_cksum = 0; - if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) { - *l4_cksum = rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } else { - *l4_cksum = rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } } -skip_l4_cksum: for (j = 0; j < mbuf->nb_segs; j++) { iovecs[k].iov_len = 
rte_pktmbuf_data_len(seg); iovecs[k].iov_base = rte_pktmbuf_mtod(seg, void *); diff --git a/lib/net/rte_net.h b/lib/net/rte_net.h index 65d724b84b..36c1c34481 100644 --- a/lib/net/rte_net.h +++ b/lib/net/rte_net.h @@ -246,6 +246,96 @@ rte_net_intel_cksum_prepare(struct rte_mbuf *m) return rte_net_intel_cksum_flags_prepare(m, m->ol_flags); } +/** + * Compute IPv4 header and UDP/TCP checksums in software. + * + * Computes checksums based on mbuf offload flags: + * - RTE_MBUF_F_TX_IP_CKSUM: Compute IPv4 header checksum + * - RTE_MBUF_F_TX_UDP_CKSUM: Compute UDP checksum (IPv4 or IPv6) + * - RTE_MBUF_F_TX_TCP_CKSUM: Compute TCP checksum (IPv4 or IPv6) + * + * @param mbuf + * The packet mbuf. Must have l2_len and l3_len set correctly. + * @param copy + * If true, copy L2/L3/L4 headers to a new segment before computing + * checksums. This is safe for indirect mbufs but has overhead. + * If false, compute checksums in place. This is only safe if the + * mbuf will be copied afterward (e.g., to a device ring buffer). 
+ * @return + * - On success: Returns mbuf (new segment if copy=true, original if copy=false) + * - On error: Returns NULL (allocation failed or malformed packet) + */ +static inline struct rte_mbuf * +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf, bool copy) +{ + const uint64_t l4_ol_flags = mbuf->ol_flags & RTE_MBUF_F_TX_L4_MASK; + const uint64_t l4_offset = mbuf->l2_len + mbuf->l3_len; + uint32_t hdrlens = l4_offset; + + /* Determine total header length needed */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) + hdrlens += sizeof(struct rte_udp_hdr); + else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) + hdrlens += sizeof(struct rte_tcp_hdr); + else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) + return NULL; /* Unsupported L4 checksum type */ + else if (!(mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM)) + return mbuf; /* Nothing to do */ + + /* Validate we at least have L2+L3 headers before doing any work */ + if (unlikely(rte_pktmbuf_data_len(mbuf) < l4_offset)) + return NULL; + + if (copy) { + /* + * Copy headers to new segment to handle indirect mbufs. + * This ensures we can safely modify checksums without + * corrupting shared/read-only data. 
+ */ + struct rte_mbuf *seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); + if (!seg) + return NULL; + + rte_pktmbuf_adj(mbuf, hdrlens); + rte_pktmbuf_chain(seg, mbuf); + mbuf = seg; + } else if (unlikely(!RTE_MBUF_DIRECT(mbuf) || rte_mbuf_refcnt_read(mbuf) > 1)) + return NULL; + + void *l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); + + /* IPv4 header checksum */ + if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { + struct rte_ipv4_hdr *iph = l3_hdr; + iph->hdr_checksum = 0; + iph->hdr_checksum = rte_ipv4_cksum(iph); + } + + /* L4 checksum (UDP or TCP) - skip if headers not in first segment */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM && rte_pktmbuf_data_len(mbuf) >= hdrlens) { + struct rte_udp_hdr *udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, + l4_offset); + udp_hdr->dgram_cksum = 0; + udp_hdr->dgram_cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv4_hdr *)l3_hdr, + l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv6_hdr *)l3_hdr, + l4_offset); + } else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM && + rte_pktmbuf_data_len(mbuf) >= hdrlens) { + struct rte_tcp_hdr *tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, + l4_offset); + tcp_hdr->cksum = 0; + tcp_hdr->cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv4_hdr *)l3_hdr, + l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv6_hdr *)l3_hdr, + l4_offset); + } + + return mbuf; +} + #ifdef __cplusplus } #endif -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v2 4/4] net/af_packet: software checksum 2026-01-28 9:36 ` [PATCH v2 4/4] net/af_packet: software checksum scott.k.mitch1 @ 2026-01-28 18:27 ` Stephen Hemminger 2026-01-28 19:08 ` Scott Mitchell 0 siblings, 1 reply; 65+ messages in thread From: Stephen Hemminger @ 2026-01-28 18:27 UTC (permalink / raw) To: scott.k.mitch1; +Cc: dev On Wed, 28 Jan 2026 01:36:07 -0800 scott.k.mitch1@gmail.com wrote: > From: Scott Mitchell <scott.k.mitch1@gmail.com> > > Add software checksum offload support and configurable TX poll > behavior to improve flexibility and performance. > > Add rte_net_ip_udptcp_cksum_mbuf in rte_net.h which is shared > between rte_eth_tap and rte_eth_af_packet that supports > IPv4/UDP/TCP checksums in software due to hardware offload > and context propagation not being supported. > > Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> This is failing CI build, fix and resubmit ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v2 4/4] net/af_packet: software checksum 2026-01-28 18:27 ` Stephen Hemminger @ 2026-01-28 19:08 ` Scott Mitchell 0 siblings, 0 replies; 65+ messages in thread From: Scott Mitchell @ 2026-01-28 19:08 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev > This is failing CI build, fix and resubmit v3 coming soon. ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v3 0/4] af_packet correctness, performance, cksum
2026-01-28 9:36 ` [PATCH v2 0/4] af_packet correctness, performance, cksum scott.k.mitch1
` (3 preceding siblings ...)
2026-01-28 9:36 ` [PATCH v2 4/4] net/af_packet: software checksum scott.k.mitch1
@ 2026-01-28 19:10 ` scott.k.mitch1
2026-01-28 19:10 ` [PATCH v3 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
` (4 more replies)
4 siblings, 5 replies; 65+ messages in thread
From: scott.k.mitch1 @ 2026-01-28 19:10 UTC (permalink / raw)
To: dev; +Cc: stephen, Scott
From: Scott <scott.k.mitch1@gmail.com>
This series fixes critical thread safety bugs in the af_packet PMD and
adds performance optimizations.
Patch 1 fixes two major correctness issues:
- Thread safety: tp_status was accessed without memory barriers,
violating the kernel's PACKET_MMAP protocol. On aarch64 and other
weakly-ordered architectures, this causes packet corruption due to
missing memory ordering. The fix matches the kernel's memory model:
volatile unaligned reads/writes with explicit rte_smp_rmb/wmb barriers
and __may_alias__ protection.
- Frame calculations: Fixed incorrect frame overhead and address
calculations that caused memory corruption when frames don't evenly
divide blocks.
Patches 2-4 add performance improvements:
- Patch 2: Bulk mbuf freeing, unlikely annotations, and prefetching
- Patch 3: TX poll control to reduce syscall overhead
- Patch 4: Software checksum offload support with shared rte_net utility
v3 changes:
- Patch 4: Fix compile error due to implicit cast with C++ compiler
v2 changes:
- Patch 1: Rewrote to use volatile + barriers instead of C11 atomics
to match kernel's memory model. Added dependency on patch-160274 for
__rte_may_alias attribute.
- Patch 4: Refactored to use shared rte_net_ip_udptcp_cksum_mbuf()
utility function, eliminating code duplication with tap driver.
Scott Mitchell (4): net/af_packet: fix thread safety and frame calculations net/af_packet: RX/TX unlikely, bulk free, prefetch net/af_packet: tx poll control net/af_packet: software checksum doc/guides/nics/af_packet.rst | 6 +- drivers/net/af_packet/rte_eth_af_packet.c | 257 ++++++++++++++++------ drivers/net/tap/rte_eth_tap.c | 61 +---- lib/net/rte_net.h | 90 ++++++++ 4 files changed, 283 insertions(+), 131 deletions(-) -- 2.39.5 (Apple Git-154) ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v3 1/4] net/af_packet: fix thread safety and frame calculations 2026-01-28 19:10 ` [PATCH v3 0/4] af_packet correctness, performance, cksum scott.k.mitch1 @ 2026-01-28 19:10 ` scott.k.mitch1 2026-01-28 19:10 ` [PATCH v3 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch scott.k.mitch1 ` (3 subsequent siblings) 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 19:10 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell, linville, stable From: Scott Mitchell <scott.k.mitch1@gmail.com> Thread Safety: The tp_status field was accessed without proper memory barriers, violating the kernel's PACKET_MMAP synchronization protocol. The kernel implements this protocol in net/packet/af_packet.c: - __packet_get_status: smp_rmb() then READ_ONCE() (volatile read) - __packet_set_status: WRITE_ONCE() (volatile write) then smp_wmb() READ_ONCE/WRITE_ONCE use __may_alias__ attribute via __uXX_alias_t types to prevent compiler optimizations that assume type-based aliasing rules, which is critical for tp_status access that may be misaligned within the ring buffer. Userspace must use equivalent semantics: volatile unaligned_uint32_t (with __rte_may_alias) reads/writes with explicit memory barriers (rte_smp_rmb/rte_smp_wmb). On aarch64 and other weakly-ordered architectures, missing barriers cause packet corruption because: - RX: CPU may read stale packet data before seeing tp_status update - TX: CPU may reorder stores, causing kernel to see tp_status before packet data is fully written This becomes critical with io_uring SQPOLL mode where the kernel polling thread on a different CPU core asynchronously updates tp_status, making proper memory ordering essential. Note: Uses rte_smp_[r/w]mb which triggers checkpatch warnings, but C11 atomics cannot be used because tp_status is not declared _Atomic in the kernel's tpacket2_hdr structure. We must match the kernel's volatile + barrier memory model with __may_alias__ protection. 
Frame Calculation Issues: 1. Frame overhead incorrectly calculated as TPACKET_ALIGN(TPACKET2_HDRLEN) instead of TPACKET2_HDRLEN - sizeof(struct sockaddr_ll), causing incorrect usable frame data size. 2. Frame address calculation assumed sequential layout (frame_base + i * frame_size), but the kernel's packet_lookup_frame() uses block-based addressing: block_idx = position / frames_per_block frame_offset = position % frames_per_block address = block_start[block_idx] + (frame_offset * frame_size) This caused memory corruption when frames don't evenly divide blocks. Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") Cc: linville@tuxdriver.com Cc: stable@dpdk.org Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- Depends-on: patch-160468 ("eal: add __rte_may_alias and __rte_aligned to unaligned typedefs") drivers/net/af_packet/rte_eth_af_packet.c | 149 +++++++++++++++++----- 1 file changed, 114 insertions(+), 35 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index ef11b8fb6b..6c276bb7fc 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -9,6 +9,8 @@ #include <rte_common.h> #include <rte_string_fns.h> #include <rte_mbuf.h> +#include <rte_atomic.h> +#include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> #include <rte_malloc.h> @@ -41,6 +43,10 @@ #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +static const uint16_t ETH_AF_PACKET_FRAME_SIZE_MAX = RTE_IPV4_MAX_PKT_LEN; +#define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) +#define ETH_AF_PACKET_ETH_OVERHEAD (RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN) + static uint64_t timestamp_dynflag; static int timestamp_dynfield_offset = -1; @@ -120,6 +126,28 @@ RTE_LOG_REGISTER_DEFAULT(af_packet_logtype, NOTICE); RTE_LOG_LINE(level, AFPACKET, "%s(): " fmt ":%s", __func__, \ ## __VA_ARGS__, strerror(errno)) +/** + 
* Read tp_status from packet mmap ring. Matches kernel's READ_ONCE() with smp_rmb() + * ordering in af_packet.c __packet_get_status. + */ +static inline uint32_t +tpacket_read_status(const volatile void *tp_status) +{ + rte_smp_rmb(); + return *((const volatile unaligned_uint32_t *)tp_status); +} + +/** + * Write tp_status to packet mmap ring. Matches kernel's WRITE_ONCE() with smp_wmb() + * ordering in af_packet.c __packet_set_status. + */ +static inline void +tpacket_write_status(volatile void *tp_status, uint32_t status) +{ + *((volatile unaligned_uint32_t *)tp_status) = status; + rte_smp_wmb(); +} + static uint16_t eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { @@ -129,7 +157,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint8_t *pbuf; struct pkt_rx_queue *pkt_q = queue; uint16_t num_rx = 0; - unsigned long num_rx_bytes = 0; + uint32_t num_rx_bytes = 0; + uint32_t tp_status; unsigned int framecount, framenum; if (unlikely(nb_pkts == 0)) @@ -144,7 +173,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) for (i = 0; i < nb_pkts; i++) { /* point at the next incoming frame */ ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - if ((ppd->tp_status & TP_STATUS_USER) == 0) + tp_status = tpacket_read_status(&ppd->tp_status); + if ((tp_status & TP_STATUS_USER) == 0) break; /* allocate the next mbuf */ @@ -160,7 +190,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, rte_pktmbuf_data_len(mbuf)); /* check for vlan info */ - if (ppd->tp_status & TP_STATUS_VLAN_VALID) { + if (tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED); @@ -179,7 +209,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_KERNEL; + 
tpacket_write_status(&ppd->tp_status, TP_STATUS_KERNEL); if (++framenum >= framecount) framenum = 0; mbuf->port = pkt_q->in_port; @@ -228,8 +258,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) struct pollfd pfd; struct pkt_tx_queue *pkt_q = queue; uint16_t num_tx = 0; - unsigned long num_tx_bytes = 0; - int i; + uint32_t num_tx_bytes = 0; + uint16_t i; if (unlikely(nb_pkts == 0)) return 0; @@ -259,16 +289,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } } - /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status)) { - if (poll(&pfd, 1, -1) < 0) - break; - - /* poll() can return POLLERR if the interface is down */ - if (pfd.revents & POLLERR) - break; - } - /* * poll() will almost always return POLLOUT, even if there * are no extra buffers available @@ -283,26 +303,28 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * * This results in poll() returning POLLOUT. */ - if (!tx_ring_status_available(ppd->tp_status)) + if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && + (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + /* Ring is full, stop here. Don't process bufs[i]. 
*/ break; + } - /* copy the tx frame data */ - pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; struct rte_mbuf *tmp_mbuf = mbuf; - while (tmp_mbuf) { + do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); pbuf += data_len; tmp_mbuf = tmp_mbuf->next; - } + } while (tmp_mbuf); ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_SEND_REQUEST; + tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; @@ -392,10 +414,12 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->if_index = internals->if_index; dev_info->max_mac_addrs = 1; - dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN; + dev_info->max_rx_pktlen = (uint32_t)ETH_AF_PACKET_FRAME_SIZE_MAX + + ETH_AF_PACKET_ETH_OVERHEAD; + dev_info->max_mtu = ETH_AF_PACKET_FRAME_SIZE_MAX; dev_info->max_rx_queues = (uint16_t)internals->nb_queues; dev_info->max_tx_queues = (uint16_t)internals->nb_queues; - dev_info->min_rx_bufsize = 0; + dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | RTE_ETH_TX_OFFLOAD_VLAN_INSERT; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | @@ -572,8 +596,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, /* Now get the space available for data in the mbuf */ buf_size = rte_pktmbuf_data_room_size(pkt_q->mb_pool) - RTE_PKTMBUF_HEADROOM; - data_size = internals->req.tp_frame_size; - data_size -= TPACKET2_HDRLEN - sizeof(struct sockaddr_ll); + data_size = internals->req.tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; if (data_size > buf_size) { PMD_LOG(ERR, @@ -612,7 +635,7 @@ eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) int ret; int s; unsigned int data_size 
= internals->req.tp_frame_size - - TPACKET2_HDRLEN; + ETH_AF_PACKET_FRAME_OVERHEAD; if (mtu > data_size) return -EINVAL; @@ -977,25 +1000,38 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (rx_queue->rd == NULL) goto error; + /* Frame addresses must match kernel's packet_lookup_frame(): + * block_idx = position / frames_per_block + * frame_offset = position % frames_per_block + * address = block_start + (frame_offset * frame_size) + */ + const uint32_t frames_per_block = req->tp_block_size / req->tp_frame_size; for (i = 0; i < req->tp_frame_nr; ++i) { - rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + rx_queue->rd[i].iov_base = rx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); rx_queue->rd[i].iov_len = req->tp_frame_size; } rx_queue->sockfd = qsockfd; tx_queue = &((*internals)->tx_queue[q]); tx_queue->framecount = req->tp_frame_nr; - tx_queue->frame_data_size = req->tp_frame_size; - tx_queue->frame_data_size -= TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + tx_queue->frame_data_size = req->tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr; tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (tx_queue->rd == NULL) goto error; + /* See comment above rx_queue->rd initialization. 
*/ for (i = 0; i < req->tp_frame_nr; ++i) { - tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + tx_queue->rd[i].iov_base = tx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; @@ -1092,7 +1128,8 @@ rte_eth_from_packet(struct rte_vdev_device *dev, if (*sockfd < 0) return -1; - blocksize = getpagesize(); + const int pagesize = getpagesize(); + blocksize = pagesize; /* * Walk arguments for configurable settings @@ -1162,13 +1199,55 @@ rte_eth_from_packet(struct rte_vdev_device *dev, return -1; } - blockcount = framecount / (blocksize / framesize); + const unsigned int frames_per_block = blocksize / framesize; + blockcount = framecount / frames_per_block; if (!blockcount) { PMD_LOG(ERR, "%s: invalid AF_PACKET MMAP parameters", name); return -1; } + /* + * https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt + * Check constraints that may be enforced by the kernel and cause failure + * to initialize the rings but explicit error messages aren't provided. 
+ * See packet_set_ring in linux kernel for enforcement: + * https://github.com/torvalds/linux/blob/master/net/packet/af_packet.c + */ + if (blocksize % pagesize != 0) { + /* tp_block_size must be a multiple of PAGE_SIZE */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of PAGE_SIZE=%d", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, pagesize); + } + if (framesize % TPACKET_ALIGNMENT != 0) { + /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of TPACKET_ALIGNMENT=%d", + name, ETH_AF_PACKET_FRAMESIZE_ARG, framesize, TPACKET_ALIGNMENT); + } + if (frames_per_block == 0 || frames_per_block > UINT_MAX / blockcount || + framecount != frames_per_block * blockcount) { + /* tp_frame_nr must be exactly frames_per_block*tp_block_nr */ + PMD_LOG(WARNING, "%s: %s=%u must be exactly " + "frames_per_block(%s/%s = %u/%u = %u) * blockcount(%u)", + name, ETH_AF_PACKET_FRAMECOUNT_ARG, framecount, + ETH_AF_PACKET_BLOCKSIZE_ARG, ETH_AF_PACKET_FRAMESIZE_ARG, + blocksize, framesize, frames_per_block, blockcount); + } + + /* Below conditions may not cause errors but provide hints to improve */ + if (blocksize % framesize != 0) { + PMD_LOG(WARNING, "%s: %s=%u not evenly divisible by %s=%u, " + "may waste memory", name, + ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, + ETH_AF_PACKET_FRAMESIZE_ARG, framesize); + } + if (!rte_is_power_of_2(blocksize)) { + /* tp_block_size should be a power of two or there will be waste */ + PMD_LOG(WARNING, "%s: %s=%u should be a power of two " + "or there will be a waste of memory", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize); + } + PMD_LOG(DEBUG, "%s: AF_PACKET MMAP parameters:", name); PMD_LOG(DEBUG, "%s:\tblock size %d", name, blocksize); PMD_LOG(DEBUG, "%s:\tblock count %d", name, blockcount); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v3 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch 2026-01-28 19:10 ` [PATCH v3 0/4] af_packet correctness, performance, cksum scott.k.mitch1 2026-01-28 19:10 ` [PATCH v3 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 @ 2026-01-28 19:10 ` scott.k.mitch1 2026-01-29 1:07 ` Stephen Hemminger 2026-01-28 19:10 ` [PATCH v3 3/4] net/af_packet: tx poll control scott.k.mitch1 ` (2 subsequent siblings) 4 siblings, 1 reply; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 19:10 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> - Add rte_prefetch0() to prefetch next frame/mbuf while processing current packet, reducing cache miss latency - Use rte_pktmbuf_free_bulk() in TX path instead of individual rte_pktmbuf_free() calls for better batch efficiency - Add unlikely() hints for error paths (oversized packets, VLAN insertion failures, sendto errors) to optimize branch prediction - Remove unnecessary early nb_pkts == 0 when loop handles this and app may never call with 0 frames. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 65 ++++++++++++----------- 1 file changed, 34 insertions(+), 31 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 6c276bb7fc..e357ae168b 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -161,9 +161,6 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t tp_status; unsigned int framecount, framenum; - if (unlikely(nb_pkts == 0)) - return 0; - /* * Reads the given number of packets from the AF_PACKET socket one by * one and copies the packet data into a newly allocated mbuf. 
@@ -177,6 +174,14 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) if ((tp_status & TP_STATUS_USER) == 0) break; + unsigned int next_framenum = framenum + 1; + if (next_framenum >= framecount) + next_framenum = 0; + + /* prefetch the next frame for the next loop iteration */ + if (likely(i + 1 < nb_pkts)) + rte_prefetch0(pkt_q->rd[next_framenum].iov_base); + /* allocate the next mbuf */ mbuf = rte_pktmbuf_alloc(pkt_q->mb_pool); if (unlikely(mbuf == NULL)) { @@ -210,8 +215,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_KERNEL); - if (++framenum >= framecount) - framenum = 0; + framenum = next_framenum; mbuf->port = pkt_q->in_port; /* account for the receive frame */ @@ -261,9 +265,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - if (unlikely(nb_pkts == 0)) - return 0; - memset(&pfd, 0, sizeof(pfd)); pfd.fd = pkt_q->sockfd; pfd.events = POLLOUT; @@ -271,22 +272,25 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) framecount = pkt_q->framecount; framenum = pkt_q->framenum; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; for (i = 0; i < nb_pkts; i++) { - mbuf = *bufs++; - - /* drop oversized packets */ - if (mbuf->pkt_len > pkt_q->frame_data_size) { - rte_pktmbuf_free(mbuf); - continue; + unsigned int next_framenum = framenum + 1; + if (next_framenum >= framecount) + next_framenum = 0; + + /* prefetch the next source mbuf and destination TPACKET */ + if (likely(i + 1 < nb_pkts)) { + rte_prefetch0(bufs[i + 1]); + rte_prefetch0(pkt_q->rd[next_framenum].iov_base); } - /* insert vlan info if necessary */ - if (mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) { - if (rte_vlan_insert(&mbuf)) { - rte_pktmbuf_free(mbuf); - continue; - } + mbuf = bufs[i]; + ppd = (struct tpacket2_hdr *)pkt_q->rd[framenum].iov_base; + + /* Drop 
oversized packets. Insert VLAN if necessary */ + if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size || + ((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 && + rte_vlan_insert(&mbuf) != 0))) { + continue; } /* @@ -312,6 +316,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; + ppd->tp_len = mbuf->pkt_len; + ppd->tp_snaplen = mbuf->pkt_len; + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -320,23 +327,19 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) tmp_mbuf = tmp_mbuf->next; } while (tmp_mbuf); - ppd->tp_len = mbuf->pkt_len; - ppd->tp_snaplen = mbuf->pkt_len; - /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); - if (++framenum >= framecount) - framenum = 0; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - + framenum = next_framenum; num_tx++; num_tx_bytes += mbuf->pkt_len; - rte_pktmbuf_free(mbuf); } + rte_pktmbuf_free_bulk(&bufs[0], i); + /* kick-off transmits */ - if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && - errno != ENOBUFS && errno != EAGAIN) { + if (unlikely(num_tx > 0 && + sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && + errno != ENOBUFS && errno != EAGAIN)) { /* * In case of a ENOBUFS/EAGAIN error all of the enqueued * packets will be considered successful even though only some -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v3 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch
2026-01-28 19:10 ` [PATCH v3 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch scott.k.mitch1
@ 2026-01-29 1:07 ` Stephen Hemminger
2026-02-02 5:29 ` Scott Mitchell
0 siblings, 1 reply; 65+ messages in thread
From: Stephen Hemminger @ 2026-01-29 1:07 UTC (permalink / raw)
To: scott.k.mitch1; +Cc: dev
On Wed, 28 Jan 2026 11:10:30 -0800
scott.k.mitch1@gmail.com wrote:
> From: Scott Mitchell <scott.k.mitch1@gmail.com>
>
> - Add rte_prefetch0() to prefetch next frame/mbuf while processing
> current packet, reducing cache miss latency
> - Use rte_pktmbuf_free_bulk() in TX path instead of individual
> rte_pktmbuf_free() calls for better batch efficiency
> - Add unlikely() hints for error paths (oversized packets, VLAN
> insertion failures, sendto errors) to optimize branch prediction
> - Remove unnecessary early nb_pkts == 0 when loop handles this
> and app may never call with 0 frames.
>
> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>

Drop the prefetch stuff, it doesn't matter.

        Original    Prefetch    Quad/Dual
TX      1.427 Mpps  1.426 Mpps  1.426 Mpps
RX      0.529 Mpps  0.530 Mpps  0.533 Mpps
loss    87.93%      87.98%      88.0%

^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v3 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch 2026-01-29 1:07 ` Stephen Hemminger @ 2026-02-02 5:29 ` Scott Mitchell 0 siblings, 0 replies; 65+ messages in thread From: Scott Mitchell @ 2026-02-02 5:29 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev > Drop the prefetch stuff, it doesn't matter. Will do. ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v3 3/4] net/af_packet: tx poll control 2026-01-28 19:10 ` [PATCH v3 0/4] af_packet correctness, performance, cksum scott.k.mitch1 2026-01-28 19:10 ` [PATCH v3 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 2026-01-28 19:10 ` [PATCH v3 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch scott.k.mitch1 @ 2026-01-28 19:10 ` scott.k.mitch1 2026-01-28 19:10 ` [PATCH v3 4/4] net/af_packet: software checksum scott.k.mitch1 2026-02-02 8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 19:10 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add txpollnotrdy devarg (default=true) to control whether poll() is called when the TX ring is not ready. This allows users to avoid blocking behavior if application threads are in asynchronous poll mode where blocking the thread has negative side effects and backpressure is applied via different means. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- doc/guides/nics/af_packet.rst | 6 +++- drivers/net/af_packet/rte_eth_af_packet.c | 34 ++++++++++++++++++----- 2 files changed, 32 insertions(+), 8 deletions(-) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index 1505b98ff7..782a962c3f 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -29,6 +29,10 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings. * ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple of 16B); * ``framecnt`` - PACKET_MMAP frame count (optional, default 512). +* ``txpollnotrdy`` - Control behavior if tx is attempted but there is no + space available to write to the kernel. If 1, call poll() and block until + space is available to tx. If 0, don't call poll() and return from tx (optional, + default 1). 
For details regarding ``fanout_mode`` argument, you can consult the `PACKET_FANOUT documentation <https://www.man7.org/linux/man-pages/man7/packet.7.html>`_. @@ -75,7 +79,7 @@ framecnt=512): .. code-block:: console - --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash + --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash,txpollnotrdy=0 Features and Limitations ------------------------ diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index e357ae168b..be8e3260aa 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -18,6 +18,7 @@ #include <bus_vdev_driver.h> #include <errno.h> +#include <stdbool.h> #include <linux/if_ether.h> #include <linux/if_packet.h> #include <arpa/inet.h> @@ -39,9 +40,11 @@ #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" #define ETH_AF_PACKET_FANOUT_MODE_ARG "fanout_mode" +#define ETH_AF_PACKET_TX_POLL_NOT_READY_ARG "txpollnotrdy" #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +#define DFLT_TX_POLL_NOT_RDY true static const uint16_t ETH_AF_PACKET_FRAME_SIZE_MAX = RTE_IPV4_MAX_PKT_LEN; #define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) @@ -78,6 +81,9 @@ struct __rte_cache_aligned pkt_tx_queue { unsigned int framecount; unsigned int framenum; + bool txpollnotrdy; + bool sw_cksum; + volatile unsigned long tx_pkts; volatile unsigned long err_pkts; volatile unsigned long tx_bytes; @@ -106,6 +112,7 @@ static const char *valid_arguments[] = { ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, ETH_AF_PACKET_FANOUT_MODE_ARG, + ETH_AF_PACKET_TX_POLL_NOT_READY_ARG, NULL }; @@ -265,10 +272,12 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - memset(&pfd, 
0, sizeof(pfd)); - pfd.fd = pkt_q->sockfd; - pfd.events = POLLOUT; - pfd.revents = 0; + if (pkt_q->txpollnotrdy) { + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = pkt_q->sockfd; + pfd.events = POLLOUT; + pfd.revents = 0; + } framecount = pkt_q->framecount; framenum = pkt_q->framenum; @@ -308,8 +317,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * This results in poll() returning POLLOUT. */ if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && - (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || - !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + (!pkt_q->txpollnotrdy || poll(&pfd, 1, -1) < 0 || + (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { /* Ring is full, stop here. Don't process bufs[i]. */ break; } @@ -820,6 +830,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framecnt, unsigned int qdisc_bypass, const char *fanout_mode, + bool txpollnotrdy, struct pmd_internals **internals, struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist) @@ -1038,6 +1049,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; + tx_queue->txpollnotrdy = txpollnotrdy; rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr)); if (rc == -1) { @@ -1126,6 +1138,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev, unsigned int qpairs = 1; unsigned int qdisc_bypass = 1; const char *fanout_mode = NULL; + bool txpollnotrdy = DFLT_TX_POLL_NOT_RDY; /* do some parameter checking */ if (*sockfd < 0) @@ -1193,6 +1206,10 @@ rte_eth_from_packet(struct rte_vdev_device *dev, fanout_mode = pair->value; continue; } + if (strstr(pair->key, ETH_AF_PACKET_TX_POLL_NOT_READY_ARG) != NULL) { + txpollnotrdy = atoi(pair->value) != 0; + continue; + } } if (framesize > blocksize) { @@ -1261,12 +1278,14 @@ rte_eth_from_packet(struct rte_vdev_device *dev, 
PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, fanout_mode); else PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, "default PACKET_FANOUT_HASH"); + PMD_LOG(INFO, "%s:\ttxpollnotrdy %d", name, txpollnotrdy ? 1 : 0); if (rte_pmd_init_internals(dev, *sockfd, qpairs, blocksize, blockcount, framesize, framecount, qdisc_bypass, fanout_mode, + txpollnotrdy, &internals, ð_dev, kvlist) < 0) return -1; @@ -1364,4 +1383,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_packet, "framesz=<int> " "framecnt=<int> " "qdisc_bypass=<0|1> " - "fanout_mode=<hash|lb|cpu|rollover|rnd|qm>"); + "fanout_mode=<hash|lb|cpu|rollover|rnd|qm> " + "txpollnotrdy=<0|1>"); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v3 4/4] net/af_packet: software checksum 2026-01-28 19:10 ` [PATCH v3 0/4] af_packet correctness, performance, cksum scott.k.mitch1 ` (2 preceding siblings ...) 2026-01-28 19:10 ` [PATCH v3 3/4] net/af_packet: tx poll control scott.k.mitch1 @ 2026-01-28 19:10 ` scott.k.mitch1 2026-01-28 21:57 ` [REVIEW] " Stephen Hemminger 2026-02-02 8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1 4 siblings, 1 reply; 65+ messages in thread From: scott.k.mitch1 @ 2026-01-28 19:10 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add software checksum offload support to the af_packet PMD. Add rte_net_ip_udptcp_cksum_mbuf() in rte_net.h, shared between rte_eth_tap and rte_eth_af_packet, which computes IPv4/UDP/TCP checksums in software because hardware checksum offload and context propagation are not supported by these drivers. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 15 +++- drivers/net/tap/rte_eth_tap.c | 61 +-------------- lib/net/rte_net.h | 90 +++++++++++++++++++++++ 3 files changed, 106 insertions(+), 60 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index be8e3260aa..19bafc99a6 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -10,6 +10,8 @@ #include <rte_string_fns.h> #include <rte_mbuf.h> #include <rte_atomic.h> +#include <rte_ip.h> +#include <rte_net.h> #include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> @@ -102,6 +104,7 @@ struct pmd_internals { struct pkt_tx_queue *tx_queue; uint8_t vlan_strip; uint8_t timestamp_offloading; + bool tx_sw_cksum; }; static const char *valid_arguments[] = { @@ -329,6 +332,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; +
if (pkt_q->sw_cksum && !rte_net_ip_udptcp_cksum_mbuf(mbuf, false)) + continue; + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -413,10 +419,13 @@ eth_dev_configure(struct rte_eth_dev *dev __rte_unused) { struct rte_eth_conf *dev_conf = &dev->data->dev_conf; const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + const struct rte_eth_txmode *txmode = &dev_conf->txmode; struct pmd_internals *internals = dev->data->dev_private; internals->vlan_strip = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP); internals->timestamp_offloading = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP); + internals->tx_sw_cksum = !!(txmode->offloads & (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM)); return 0; } @@ -434,7 +443,10 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->max_tx_queues = (uint16_t)internals->nb_queues; dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | - RTE_ETH_TX_OFFLOAD_VLAN_INSERT; + RTE_ETH_TX_OFFLOAD_VLAN_INSERT | + RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | + RTE_ETH_TX_OFFLOAD_TCP_CKSUM; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | RTE_ETH_RX_OFFLOAD_TIMESTAMP; @@ -635,6 +647,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, { struct pmd_internals *internals = dev->data->dev_private; + internals->tx_queue[tx_queue_id].sw_cksum = internals->tx_sw_cksum; dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id]; return 0; diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c index 730f1859bd..55f496babe 100644 --- a/drivers/net/tap/rte_eth_tap.c +++ b/drivers/net/tap/rte_eth_tap.c @@ -560,70 +560,13 @@ tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs, if (txq->csum && (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM || l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM || l4_ol_flags == 
RTE_MBUF_F_TX_TCP_CKSUM)) { - unsigned int hdrlens = mbuf->l2_len + mbuf->l3_len; - uint16_t *l4_cksum; - void *l3_hdr; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) - hdrlens += sizeof(struct rte_udp_hdr); - else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) - hdrlens += sizeof(struct rte_tcp_hdr); - else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) - return -1; - - /* Support only packets with at least layer 4 - * header included in the first segment - */ - if (rte_pktmbuf_data_len(mbuf) < hdrlens) - return -1; - - /* To change checksums (considering that a mbuf can be - * indirect, for example), copy l2, l3 and l4 headers - * in a new segment and chain it to existing data - */ - seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); + /* Compute checksums in software, copying headers if needed */ + seg = rte_net_ip_udptcp_cksum_mbuf(mbuf, true); if (seg == NULL) return -1; - rte_pktmbuf_adj(mbuf, hdrlens); - rte_pktmbuf_chain(seg, mbuf); pmbufs[i] = mbuf = seg; - - l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); - if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { - struct rte_ipv4_hdr *iph = l3_hdr; - - iph->hdr_checksum = 0; - iph->hdr_checksum = rte_ipv4_cksum(iph); - } - - if (l4_ol_flags == RTE_MBUF_F_TX_L4_NO_CKSUM) - goto skip_l4_cksum; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) { - struct rte_udp_hdr *udp_hdr; - - udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &udp_hdr->dgram_cksum; - } else { - struct rte_tcp_hdr *tcp_hdr; - - tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &tcp_hdr->cksum; - } - - *l4_cksum = 0; - if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) { - *l4_cksum = rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } else { - *l4_cksum = rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } } -skip_l4_cksum: for (j = 0; j < mbuf->nb_segs; j++) { iovecs[k].iov_len = 
rte_pktmbuf_data_len(seg); iovecs[k].iov_base = rte_pktmbuf_mtod(seg, void *); diff --git a/lib/net/rte_net.h b/lib/net/rte_net.h index 65d724b84b..c640ccbba2 100644 --- a/lib/net/rte_net.h +++ b/lib/net/rte_net.h @@ -246,6 +246,96 @@ rte_net_intel_cksum_prepare(struct rte_mbuf *m) return rte_net_intel_cksum_flags_prepare(m, m->ol_flags); } +/** + * Compute IPv4 header and UDP/TCP checksums in software. + * + * Computes checksums based on mbuf offload flags: + * - RTE_MBUF_F_TX_IP_CKSUM: Compute IPv4 header checksum + * - RTE_MBUF_F_TX_UDP_CKSUM: Compute UDP checksum (IPv4 or IPv6) + * - RTE_MBUF_F_TX_TCP_CKSUM: Compute TCP checksum (IPv4 or IPv6) + * + * @param mbuf + * The packet mbuf. Must have l2_len and l3_len set correctly. + * @param copy + * If true, copy L2/L3/L4 headers to a new segment before computing + * checksums. This is safe for indirect mbufs but has overhead. + * If false, compute checksums in place. This is only safe if the + * mbuf will be copied afterward (e.g., to a device ring buffer). 
+ * @return + * - On success: Returns mbuf (new segment if copy=true, original if copy=false) + * - On error: Returns NULL (allocation failed or malformed packet) + */ +static inline struct rte_mbuf * +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf, bool copy) +{ + const uint64_t l4_ol_flags = mbuf->ol_flags & RTE_MBUF_F_TX_L4_MASK; + const uint64_t l4_offset = mbuf->l2_len + mbuf->l3_len; + uint32_t hdrlens = l4_offset; + + /* Determine total header length needed */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) + hdrlens += sizeof(struct rte_udp_hdr); + else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) + hdrlens += sizeof(struct rte_tcp_hdr); + else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) + return NULL; /* Unsupported L4 checksum type */ + else if (!(mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM)) + return mbuf; /* Nothing to do */ + + /* Validate we at least have L2+L3 headers before doing any work */ + if (unlikely(rte_pktmbuf_data_len(mbuf) < l4_offset)) + return NULL; + + if (copy) { + /* + * Copy headers to new segment to handle indirect mbufs. + * This ensures we can safely modify checksums without + * corrupting shared/read-only data. 
+ */ + struct rte_mbuf *seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); + if (!seg) + return NULL; + + rte_pktmbuf_adj(mbuf, hdrlens); + rte_pktmbuf_chain(seg, mbuf); + mbuf = seg; + } else if (unlikely(!RTE_MBUF_DIRECT(mbuf) || rte_mbuf_refcnt_read(mbuf) > 1)) + return NULL; + + void *l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); + + /* IPv4 header checksum */ + if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { + struct rte_ipv4_hdr *iph = (struct rte_ipv4_hdr *)l3_hdr; + iph->hdr_checksum = 0; + iph->hdr_checksum = rte_ipv4_cksum(iph); + } + + /* L4 checksum (UDP or TCP) - skip if headers not in first segment */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM && rte_pktmbuf_data_len(mbuf) >= hdrlens) { + struct rte_udp_hdr *udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, + l4_offset); + udp_hdr->dgram_cksum = 0; + udp_hdr->dgram_cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv4_hdr *)l3_hdr, + l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv6_hdr *)l3_hdr, + l4_offset); + } else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM && + rte_pktmbuf_data_len(mbuf) >= hdrlens) { + struct rte_tcp_hdr *tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, + l4_offset); + tcp_hdr->cksum = 0; + tcp_hdr->cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv4_hdr *)l3_hdr, + l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv6_hdr *)l3_hdr, + l4_offset); + } + + return mbuf; +} + #ifdef __cplusplus } #endif -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [REVIEW] net/af_packet: software checksum 2026-01-28 19:10 ` [PATCH v3 4/4] net/af_packet: software checksum scott.k.mitch1 @ 2026-01-28 21:57 ` Stephen Hemminger 2026-02-02 7:55 ` Scott Mitchell 0 siblings, 1 reply; 65+ messages in thread From: Stephen Hemminger @ 2026-01-28 21:57 UTC (permalink / raw) To: dev AI-generated review of bundle-1708-af-packet-v3.mbox Reviewed using Claude (claude-opus-4-5-20251101) This is an automated review. Please verify all suggestions. --- ## DPDK Patch Review: net/af_packet v3 Series ### Overview This is a 4-patch series improving the AF_PACKET PMD with thread safety fixes, performance optimizations, and new features. --- ## Patch 1/4: net/af_packet: fix thread safety and frame calculations ### Commit Message Issues **Warning: Subject line length** - Subject is 58 characters, within limit but close to the 60-character maximum. **Warning: Missing blank line before Reported-by/Suggested-by tags** - The `Fixes:` and `Cc:` tags should be followed by a blank line before `Signed-off-by:`. **Info: Depends-on format** - The `Depends-on:` line uses `patch-160468` format. The guidelines specify `series-NNNNN`. Verify this is the correct format for your workflow. ### Code Issues **Warning: Use of deprecated memory barrier functions** ```c static inline uint32_t tpacket_read_status(const volatile void *tp_status) { rte_smp_rmb(); ``` The guidelines state `rte_smp_rmb()` and `rte_smp_wmb()` are forbidden and should use `rte_atomic_thread_fence()`. However, the commit message explicitly justifies this due to kernel compatibility requirements with non-atomic `tp_status`. This justification appears valid given the kernel's memory model. **Warning: Variable declaration style** ```c const int pagesize = getpagesize(); blocksize = pagesize; ``` Declaring `pagesize` as `const` at point of use is acceptable C99 style, but `blocksize` was already declared earlier. This mixing of styles within the function is inconsistent. 
**Warning: Static const at file scope** ```c static const uint16_t ETH_AF_PACKET_FRAME_SIZE_MAX = RTE_IPV4_MAX_PKT_LEN; ``` Constants at file scope should use `#define` with `RTE_` prefix per naming conventions, or if a typed constant is needed, use lowercase naming (`eth_af_packet_frame_size_max`). **Info: Long lines** Several lines approach but stay within the 100-character limit. Lines 1231-1234 with the warning messages are acceptable. ### Documentation Issues **Warning: Missing release notes** This patch fixes regressions and adds validation warnings. Per guidelines, fixes that are backport candidates (`Cc: stable@dpdk.org`) should have release notes updated. --- ## Patch 2/4: net/af_packet: RX/TX unlikely, bulk free, prefetch ### Commit Message Issues **Error: Missing Signed-off-by email format validation** The Signed-off-by appears correct: `Scott Mitchell <scott.k.mitch1@gmail.com>` ### Code Issues **Warning: Variable declaration inside for loop scope** ```c for (i = 0; i < nb_pkts; i++) { unsigned int next_framenum = framenum + 1; ``` Declaring variables inside loop bodies is valid C99 but mixing with earlier declaration style (`uint16_t i;` at function start) is inconsistent within the function. **Warning: Removal of early return check** ```c - if (unlikely(nb_pkts == 0)) - return 0; ``` The commit message justifies this removal, but removing defensive checks could cause issues if callers ever pass 0. The loop handles it correctly, so this is acceptable but worth noting. **Error: Potential use-after-free with rte_pktmbuf_free_bulk** ```c rte_pktmbuf_free_bulk(&bufs[0], i); ``` When packets are dropped (oversized or VLAN insertion failure), they are skipped via `continue` but still freed in the bulk free. The dropped packets should still be freed, but the current logic will try to free them even though they weren't processed. 
However, looking closer, `i` is the loop counter, so all mbufs from 0 to i-1 will be freed, which includes dropped ones - this is actually correct behavior. **Warning: Missing space after comma** ```c rte_prefetch0(bufs[i + 1]); ``` This is fine - no issue here. --- ## Patch 3/4: net/af_packet: tx poll control ### Commit Message Issues **Info: Subject line is clear and within limits (32 chars)** ### Code Issues **Warning: Including stdbool.h** ```c +#include <stdbool.h> ``` DPDK typically uses `<stdbool.h>` through EAL includes. Verify this is needed or if `bool` is already available. **Warning: Uninitialized struct pollfd when txpollnotrdy is false** ```c if (pkt_q->txpollnotrdy) { memset(&pfd, 0, sizeof(pfd)); ... } ``` If `txpollnotrdy` is false, `pfd` is uninitialized but referenced later in the `poll()` call. This is a potential bug - the poll path won't be taken when `txpollnotrdy` is false, but the compiler may still warn. **Error: Inconsistent indentation in function declaration** ```c rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framecnt, unsigned int qdisc_bypass, ``` The continuation lines use mixed spaces/tabs. DPDK uses tabs for indentation, spaces for alignment. ### Documentation Issues **Info: Documentation updated correctly** The `doc/guides/nics/af_packet.rst` is properly updated with the new parameter. --- ## Patch 4/4: net/af_packet: software checksum ### Commit Message Issues **Warning: Subject is vague** "software checksum" could be more descriptive like "add software checksum offload support" ### Code Issues **Error: New public API function missing experimental tag and export macro** ```c static inline struct rte_mbuf * rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf, bool copy) ``` This function is added to a public header (`lib/net/rte_net.h`) with `rte_` prefix but: 1. Missing `__rte_experimental` tag for new API 2. 
Since it's `static inline`, no export macro is needed, but experimental tag is still required for new public APIs **Warning: Function should be marked experimental** ```c +__rte_experimental +static inline struct rte_mbuf * +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf, bool copy) ``` **Warning: Variable declarations mixed with code** ```c void *l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); ``` This is after conditional code blocks. C99 style is acceptable but inconsistent with function start. **Warning: Missing Doxygen for all parameters** The Doxygen comment is present but could be more detailed about error conditions. ### Documentation Issues **Error: Missing release notes** New TX offload capabilities (IPV4_CKSUM, UDP_CKSUM, TCP_CKSUM) and new public API function require release notes entries. **Warning: Missing feature matrix update** If af_packet now supports checksum offloads, `doc/guides/nics/features/af_packet.ini` should be updated. --- ## Summary | Severity | Count | Key Issues | |----------|-------|------------| | Error | 3 | Inconsistent indentation, missing experimental tag, missing release notes | | Warning | 12 | Deprecated barriers (justified), variable declaration style, missing documentation | | Info | 4 | Minor style preferences | ### Recommended Actions Before Merge 1. **Patch 4**: Add `__rte_experimental` to `rte_net_ip_udptcp_cksum_mbuf()` 2. **Patch 4**: Add release notes for new checksum offload support 3. **Patch 3**: Fix mixed tabs/spaces in function parameter continuation 4. **All patches**: Consider adding release notes entries for the fixes and new features 5. **Patch 4**: Update af_packet feature matrix if applicable ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [REVIEW] net/af_packet: software checksum 2026-01-28 21:57 ` [REVIEW] " Stephen Hemminger @ 2026-02-02 7:55 ` Scott Mitchell 2026-02-02 16:58 ` Stephen Hemminger 0 siblings, 1 reply; 65+ messages in thread From: Scott Mitchell @ 2026-02-02 7:55 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev Will resolve comments in v4. Also I've noticed the AI feedback often has paragraphs that end with "this is fine". Is there any way to suppress those as they take time to read/parse/understand? > ## Patch 3/4: net/af_packet: tx poll control > **Warning: Uninitialized struct pollfd when txpollnotrdy is false** > ```c > if (pkt_q->txpollnotrdy) { > memset(&pfd, 0, sizeof(pfd)); > ... > } > ``` > If `txpollnotrdy` is false, `pfd` is uninitialized but referenced later in the `poll()` call. This is a potential bug - the poll path won't be taken when `txpollnotrdy` is false, but the compiler may still warn. False positive: pfd is never used if pkt_q->txpollnotrdy is false (the condition below short-circuits). ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [REVIEW] net/af_packet: software checksum 2026-02-02 7:55 ` Scott Mitchell @ 2026-02-02 16:58 ` Stephen Hemminger 0 siblings, 0 replies; 65+ messages in thread From: Stephen Hemminger @ 2026-02-02 16:58 UTC (permalink / raw) To: Scott Mitchell; +Cc: dev On Sun, 1 Feb 2026 23:55:21 -0800 Scott Mitchell <scott.k.mitch1@gmail.com> wrote: > Will resolve comments in v4. > > Also I've noticed the AI feedback often has paragraphs that end with > "this is fine". Is there any way to suppress those as they take time > to read/parse/understand? I have been playing "whack-a-mole" to suppress these. There are magic words to say in the AI prompt that help but nothing seems to completely eliminate the "never mind" feedback. ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v4 0/4] af_packet correctness, performance, cksum 2026-01-28 19:10 ` [PATCH v3 0/4] af_packet correctness, performance, cksum scott.k.mitch1 ` (3 preceding siblings ...) 2026-01-28 19:10 ` [PATCH v3 4/4] net/af_packet: software checksum scott.k.mitch1 @ 2026-02-02 8:14 ` scott.k.mitch1 2026-02-02 8:14 ` [PATCH v4 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 ` (5 more replies) 4 siblings, 6 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-02 8:14 UTC (permalink / raw) To: dev; +Cc: stephen, Scott From: Scott <scott.k.mitch1@gmail.com> This series fixes critical thread safety bugs in the af_packet PMD and adds performance optimizations. Patch 1 fixes two major correctness issues: - Thread safety: tp_status was accessed without memory barriers, violating the kernel's PACKET_MMAP protocol. On aarch64 and other weakly-ordered architectures, this causes packet corruption due to missing memory ordering. The fix matches the kernel's memory model: volatile unaligned reads/writes with explicit rte_smp_rmb/wmb barriers and __may_alias__ protection. - Frame calculations: Fixed incorrect frame overhead and address calculations that caused memory corruption when frames don't evenly divide blocks. Patches 2-4 add performance improvements: - Patch 2: Bulk mbuf freeing and unlikely annotations - Patch 3: TX poll control to reduce syscall overhead - Patch 4: Software checksum offload support with shared rte_net utility v4 changes: - Remove prefetch (perf results didn't show benefit) - Fix variable style for consistency (declare at start of function) - Add release notes for af_packet and documentation for fixes v3 changes: - Patch 4: Fix compile error due to implicit cast with C++ compiler v2 changes: - Patch 1: Rewrote to use volatile + barriers instead of C11 atomics to match kernel's memory model. Added dependency on patch-160274 for __rte_may_alias attribute.
- Patch 4: Refactored to use shared rte_net_ip_udptcp_cksum_mbuf() utility function, eliminating code duplication with tap driver. Scott Mitchell (4): net/af_packet: fix thread safety and frame calculations net/af_packet: RX/TX bulk free, unlikely hint net/af_packet: tx poll control net/af_packet: add software checksum offload support doc/guides/nics/af_packet.rst | 6 +- doc/guides/nics/features/afpacket.ini | 2 + doc/guides/rel_notes/release_26_03.rst | 7 + drivers/net/af_packet/rte_eth_af_packet.c | 236 +++++++++++++++------- drivers/net/tap/rte_eth_tap.c | 61 +----- lib/net/rte_net.h | 92 +++++++++ 6 files changed, 274 insertions(+), 130 deletions(-) -- 2.39.5 (Apple Git-154) ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v4 1/4] net/af_packet: fix thread safety and frame calculations 2026-02-02 8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1 @ 2026-02-02 8:14 ` scott.k.mitch1 2026-02-02 8:14 ` [PATCH v4 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1 ` (4 subsequent siblings) 5 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-02 8:14 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell, linville, stable From: Scott Mitchell <scott.k.mitch1@gmail.com> Thread Safety: The tp_status field was accessed without proper memory barriers, violating the kernel's PACKET_MMAP synchronization protocol. The kernel implements this protocol in net/packet/af_packet.c: - __packet_get_status: smp_rmb() then READ_ONCE() (volatile read) - __packet_set_status: WRITE_ONCE() (volatile write) then smp_wmb() READ_ONCE/WRITE_ONCE use __may_alias__ attribute via __uXX_alias_t types to prevent compiler optimizations that assume type-based aliasing rules, which is critical for tp_status access that may be misaligned within the ring buffer. Userspace must use equivalent semantics: volatile unaligned_uint32_t (with __rte_may_alias) reads/writes with explicit memory barriers (rte_smp_rmb/rte_smp_wmb). On aarch64 and other weakly-ordered architectures, missing barriers cause packet corruption because: - RX: CPU may read stale packet data before seeing tp_status update - TX: CPU may reorder stores, causing kernel to see tp_status before packet data is fully written This becomes critical with io_uring SQPOLL mode where the kernel polling thread on a different CPU core asynchronously updates tp_status, making proper memory ordering essential. Note: Uses rte_smp_[r/w]mb which triggers checkpatch warnings, but C11 atomics cannot be used because tp_status is not declared _Atomic in the kernel's tpacket2_hdr structure. We must match the kernel's volatile + barrier memory model with __may_alias__ protection. 
Frame Calculation Issues: 1. Frame overhead incorrectly calculated as TPACKET_ALIGN(TPACKET2_HDRLEN) instead of TPACKET2_HDRLEN - sizeof(struct sockaddr_ll), causing incorrect usable frame data size. 2. Frame address calculation assumed sequential layout (frame_base + i * frame_size), but the kernel's packet_lookup_frame() uses block-based addressing: block_idx = position / frames_per_block frame_offset = position % frames_per_block address = block_start[block_idx] + (frame_offset * frame_size) This caused memory corruption when frames don't evenly divide blocks. Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") Cc: linville@tuxdriver.com Cc: stable@dpdk.org Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- Depends-on: patch-160679 ("eal: add __rte_may_alias and __rte_aligned to unaligned typedefs") doc/guides/rel_notes/release_26_03.rst | 4 + drivers/net/af_packet/rte_eth_af_packet.c | 151 ++++++++++++++++------ 2 files changed, 118 insertions(+), 37 deletions(-) diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 15dabee7a1..c7e7c7d25b 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -55,6 +55,10 @@ New Features Also, make sure to start the actual text at the margin. 
======================================================= +* **Updated af_packet net driver.** + + * Fixed kernel memory barrier protocol for memory availability + * Fixed shared memory frame overhead offset calculation Removed Items ------------- diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index ef11b8fb6b..d0cc2c419a 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -9,6 +9,8 @@ #include <rte_common.h> #include <rte_string_fns.h> #include <rte_mbuf.h> +#include <rte_atomic.h> +#include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> #include <rte_malloc.h> @@ -41,6 +43,10 @@ #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +static const uint16_t eth_af_packet_frame_size_max = RTE_IPV4_MAX_PKT_LEN; +#define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) +#define ETH_AF_PACKET_ETH_OVERHEAD (RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN) + static uint64_t timestamp_dynflag; static int timestamp_dynfield_offset = -1; @@ -120,6 +126,28 @@ RTE_LOG_REGISTER_DEFAULT(af_packet_logtype, NOTICE); RTE_LOG_LINE(level, AFPACKET, "%s(): " fmt ":%s", __func__, \ ## __VA_ARGS__, strerror(errno)) +/** + * Read tp_status from packet mmap ring. Matches kernel's READ_ONCE() with smp_rmb() + * ordering in af_packet.c __packet_get_status. + */ +static inline uint32_t +tpacket_read_status(const volatile void *tp_status) +{ + rte_smp_rmb(); + return *((const volatile unaligned_uint32_t *)tp_status); +} + +/** + * Write tp_status to packet mmap ring. Matches kernel's WRITE_ONCE() with smp_wmb() + * ordering in af_packet.c __packet_set_status. 
+ */ +static inline void +tpacket_write_status(volatile void *tp_status, uint32_t status) +{ + *((volatile unaligned_uint32_t *)tp_status) = status; + rte_smp_wmb(); +} + static uint16_t eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { @@ -129,7 +157,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint8_t *pbuf; struct pkt_rx_queue *pkt_q = queue; uint16_t num_rx = 0; - unsigned long num_rx_bytes = 0; + uint32_t num_rx_bytes = 0; + uint32_t tp_status; unsigned int framecount, framenum; if (unlikely(nb_pkts == 0)) @@ -144,7 +173,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) for (i = 0; i < nb_pkts; i++) { /* point at the next incoming frame */ ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - if ((ppd->tp_status & TP_STATUS_USER) == 0) + tp_status = tpacket_read_status(&ppd->tp_status); + if ((tp_status & TP_STATUS_USER) == 0) break; /* allocate the next mbuf */ @@ -160,7 +190,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, rte_pktmbuf_data_len(mbuf)); /* check for vlan info */ - if (ppd->tp_status & TP_STATUS_VLAN_VALID) { + if (tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED); @@ -179,7 +209,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_KERNEL; + tpacket_write_status(&ppd->tp_status, TP_STATUS_KERNEL); if (++framenum >= framecount) framenum = 0; mbuf->port = pkt_q->in_port; @@ -228,8 +258,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) struct pollfd pfd; struct pkt_tx_queue *pkt_q = queue; uint16_t num_tx = 0; - unsigned long num_tx_bytes = 0; - int i; + uint32_t num_tx_bytes = 0; + uint16_t i; if (unlikely(nb_pkts == 0)) return 0; @@ -259,16 +289,6 @@ 
eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } } - /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status)) { - if (poll(&pfd, 1, -1) < 0) - break; - - /* poll() can return POLLERR if the interface is down */ - if (pfd.revents & POLLERR) - break; - } - /* * poll() will almost always return POLLOUT, even if there * are no extra buffers available @@ -283,26 +303,28 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * * This results in poll() returning POLLOUT. */ - if (!tx_ring_status_available(ppd->tp_status)) + if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && + (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + /* Ring is full, stop here. Don't process bufs[i]. */ break; + } - /* copy the tx frame data */ - pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; struct rte_mbuf *tmp_mbuf = mbuf; - while (tmp_mbuf) { + do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); pbuf += data_len; tmp_mbuf = tmp_mbuf->next; - } + } while (tmp_mbuf); ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_SEND_REQUEST; + tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; @@ -392,10 +414,12 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->if_index = internals->if_index; dev_info->max_mac_addrs = 1; - dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN; + dev_info->max_rx_pktlen = (uint32_t)eth_af_packet_frame_size_max + + ETH_AF_PACKET_ETH_OVERHEAD; + dev_info->max_mtu = eth_af_packet_frame_size_max; dev_info->max_rx_queues = 
(uint16_t)internals->nb_queues; dev_info->max_tx_queues = (uint16_t)internals->nb_queues; - dev_info->min_rx_bufsize = 0; + dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | RTE_ETH_TX_OFFLOAD_VLAN_INSERT; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | @@ -572,8 +596,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, /* Now get the space available for data in the mbuf */ buf_size = rte_pktmbuf_data_room_size(pkt_q->mb_pool) - RTE_PKTMBUF_HEADROOM; - data_size = internals->req.tp_frame_size; - data_size -= TPACKET2_HDRLEN - sizeof(struct sockaddr_ll); + data_size = internals->req.tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; if (data_size > buf_size) { PMD_LOG(ERR, @@ -612,7 +635,7 @@ eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) int ret; int s; unsigned int data_size = internals->req.tp_frame_size - - TPACKET2_HDRLEN; + ETH_AF_PACKET_FRAME_OVERHEAD; if (mtu > data_size) return -EINVAL; @@ -977,25 +1000,38 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (rx_queue->rd == NULL) goto error; + /* Frame addresses must match kernel's packet_lookup_frame(): + * block_idx = position / frames_per_block + * frame_offset = position % frames_per_block + * address = block_start + (frame_offset * frame_size) + */ + const uint32_t frames_per_block = req->tp_block_size / req->tp_frame_size; for (i = 0; i < req->tp_frame_nr; ++i) { - rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + rx_queue->rd[i].iov_base = rx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); rx_queue->rd[i].iov_len = req->tp_frame_size; } rx_queue->sockfd = qsockfd; tx_queue = &((*internals)->tx_queue[q]); tx_queue->framecount = req->tp_frame_nr; - tx_queue->frame_data_size = req->tp_frame_size; - 
tx_queue->frame_data_size -= TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + tx_queue->frame_data_size = req->tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr; tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (tx_queue->rd == NULL) goto error; + /* See comment above rx_queue->rd initialization. */ for (i = 0; i < req->tp_frame_nr; ++i) { - tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + tx_queue->rd[i].iov_base = tx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; @@ -1081,7 +1117,8 @@ rte_eth_from_packet(struct rte_vdev_device *dev, struct rte_kvargs_pair *pair = NULL; unsigned k_idx; unsigned int blockcount; - unsigned int blocksize; + const int pagesize = getpagesize(); + unsigned int blocksize = pagesize; unsigned int framesize = DFLT_FRAME_SIZE; unsigned int framecount = DFLT_FRAME_COUNT; unsigned int qpairs = 1; @@ -1092,8 +1129,6 @@ rte_eth_from_packet(struct rte_vdev_device *dev, if (*sockfd < 0) return -1; - blocksize = getpagesize(); - /* * Walk arguments for configurable settings */ @@ -1162,13 +1197,55 @@ rte_eth_from_packet(struct rte_vdev_device *dev, return -1; } - blockcount = framecount / (blocksize / framesize); + const unsigned int frames_per_block = blocksize / framesize; + blockcount = framecount / frames_per_block; if (!blockcount) { PMD_LOG(ERR, "%s: invalid AF_PACKET MMAP parameters", name); return -1; } + /* + * https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt + * Check constraints that may be enforced by the kernel and cause failure + * to initialize the rings but explicit error messages aren't provided. 
+ * See packet_set_ring in linux kernel for enforcement: + * https://github.com/torvalds/linux/blob/master/net/packet/af_packet.c + */ + if (blocksize % pagesize != 0) { + /* tp_block_size must be a multiple of PAGE_SIZE */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of PAGE_SIZE=%d", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, pagesize); + } + if (framesize % TPACKET_ALIGNMENT != 0) { + /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of TPACKET_ALIGNMENT=%d", + name, ETH_AF_PACKET_FRAMESIZE_ARG, framesize, TPACKET_ALIGNMENT); + } + if (frames_per_block == 0 || frames_per_block > UINT_MAX / blockcount || + framecount != frames_per_block * blockcount) { + /* tp_frame_nr must be exactly frames_per_block*tp_block_nr */ + PMD_LOG(WARNING, "%s: %s=%u must be exactly " + "frames_per_block(%s/%s = %u/%u = %u) * blockcount(%u)", + name, ETH_AF_PACKET_FRAMECOUNT_ARG, framecount, + ETH_AF_PACKET_BLOCKSIZE_ARG, ETH_AF_PACKET_FRAMESIZE_ARG, + blocksize, framesize, frames_per_block, blockcount); + } + + /* Below conditions may not cause errors but provide hints to improve */ + if (blocksize % framesize != 0) { + PMD_LOG(WARNING, "%s: %s=%u not evenly divisible by %s=%u, " + "may waste memory", name, + ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, + ETH_AF_PACKET_FRAMESIZE_ARG, framesize); + } + if (!rte_is_power_of_2(blocksize)) { + /* tp_block_size should be a power of two or there will be waste */ + PMD_LOG(WARNING, "%s: %s=%u should be a power of two " + "or there will be a waste of memory", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize); + } + PMD_LOG(DEBUG, "%s: AF_PACKET MMAP parameters:", name); PMD_LOG(DEBUG, "%s:\tblock size %d", name, blocksize); PMD_LOG(DEBUG, "%s:\tblock count %d", name, blockcount); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
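The frame-address fix in patch 1/3 follows the kernel's packet_lookup_frame() layout: frames never straddle block boundaries, so an address is the containing block's start plus the frame's offset within that block. A minimal standalone sketch of that arithmetic (function name invented for illustration; the real code writes the result into `rd[i].iov_base`):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mirror of the kernel's frame layout: frame i lives in
 * block i / frames_per_block, at slot i % frames_per_block within it. */
static uint64_t frame_offset(uint32_t i, uint32_t block_size, uint32_t frame_size)
{
    const uint32_t frames_per_block = block_size / frame_size;
    const uint32_t block_idx = i / frames_per_block;
    const uint32_t frame_in_block = i % frames_per_block;

    return (uint64_t)block_idx * block_size +
           (uint64_t)frame_in_block * frame_size;
}
```

With blocksz=4096 and framesz=2048 this agrees with the old `i * framesize` formula, but the two diverge as soon as frames do not evenly fill a block: for framesz=3000 there is one frame per 4096-byte block, so frame 1 sits at offset 4096, not 3000 — the mismatch behind the memory corruption the patch describes.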
* [PATCH v4 2/4] net/af_packet: RX/TX bulk free, unlikely hint 2026-02-02 8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1 2026-02-02 8:14 ` [PATCH v4 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 @ 2026-02-02 8:14 ` scott.k.mitch1 2026-02-02 8:14 ` [PATCH v4 3/4] net/af_packet: tx poll control scott.k.mitch1 ` (3 subsequent siblings) 5 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-02 8:14 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> - Use rte_pktmbuf_free_bulk() in TX path instead of individual rte_pktmbuf_free() calls for better batch efficiency - Add unlikely() hints for error paths (oversized packets, VLAN insertion failures, sendto errors) to optimize branch prediction - Remove the unnecessary early return for nb_pkts == 0, since the loop already handles this case and applications rarely call with 0 frames. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 41 ++++++++--------------- 1 file changed, 14 insertions(+), 27 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index d0cc2c419a..51ac95ff5e 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -161,9 +161,6 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t tp_status; unsigned int framecount, framenum; - if (unlikely(nb_pkts == 0)) - return 0; - /* * Reads the given number of packets from the AF_PACKET socket one by * one and copies the packet data into a newly allocated mbuf.
@@ -261,9 +258,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - if (unlikely(nb_pkts == 0)) - return 0; - memset(&pfd, 0, sizeof(pfd)); pfd.fd = pkt_q->sockfd; pfd.events = POLLOUT; @@ -271,24 +265,17 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) framecount = pkt_q->framecount; framenum = pkt_q->framenum; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; for (i = 0; i < nb_pkts; i++) { - mbuf = *bufs++; + mbuf = bufs[i]; - /* drop oversized packets */ - if (mbuf->pkt_len > pkt_q->frame_data_size) { - rte_pktmbuf_free(mbuf); + /* Drop oversized packets. Insert VLAN if necessary */ + if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size || + ((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 && + rte_vlan_insert(&mbuf) != 0))) { continue; } - /* insert vlan info if necessary */ - if (mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) { - if (rte_vlan_insert(&mbuf)) { - rte_pktmbuf_free(mbuf); - continue; - } - } - + ppd = (struct tpacket2_hdr *)pkt_q->rd[framenum].iov_base; /* * poll() will almost always return POLLOUT, even if there * are no extra buffers available @@ -312,6 +299,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; + ppd->tp_len = mbuf->pkt_len; + ppd->tp_snaplen = mbuf->pkt_len; + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -320,23 +310,20 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) tmp_mbuf = tmp_mbuf->next; } while (tmp_mbuf); - ppd->tp_len = mbuf->pkt_len; - ppd->tp_snaplen = mbuf->pkt_len; - /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - num_tx++; num_tx_bytes += mbuf->pkt_len; - rte_pktmbuf_free(mbuf); } + 
rte_pktmbuf_free_bulk(&bufs[0], i); + /* kick-off transmits */ - if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && - errno != ENOBUFS && errno != EAGAIN) { + if (unlikely(num_tx > 0 && + sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && + errno != ENOBUFS && errno != EAGAIN)) { /* * In case of a ENOBUFS/EAGAIN error all of the enqueued * packets will be considered successful even though only some -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
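Both TX hunks above go through the `tpacket_read_status()`/`tpacket_write_status()` accessors that patch 1 introduced. The actual patch uses volatile accesses plus `rte_smp_rmb`/`rte_smp_wmb` barriers; the sketch below is a C11 `stdatomic` approximation of the same acquire/release contract with the kernel, with illustrative names and status values (the real constants live in `<linux/if_packet.h>`):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TP_STATUS_SEND_REQUEST_SKETCH 1u  /* illustrative value only */

/* Acquire-load: reads of the frame payload that follow cannot be
 * reordered before observing the status the kernel published. */
static uint32_t tpacket_read_status_sketch(_Atomic uint32_t *tp_status)
{
    return atomic_load_explicit(tp_status, memory_order_acquire);
}

/* Release-store: writes filling the frame become visible before the
 * status word hands the frame back to the kernel. */
static void tpacket_write_status_sketch(_Atomic uint32_t *tp_status, uint32_t status)
{
    atomic_store_explicit(tp_status, status, memory_order_release);
}

/* Round-trip demo used in the assertions below. */
static uint32_t status_roundtrip(uint32_t v)
{
    _Atomic uint32_t s = 0;
    tpacket_write_status_sketch(&s, v);
    return tpacket_read_status_sketch(&s);
}
```

On x86 both orderings compile to plain loads/stores; on aarch64 they emit the load-acquire/store-release instructions whose absence caused the corruption described in the cover letter.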
* [PATCH v4 3/4] net/af_packet: tx poll control 2026-02-02 8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1 2026-02-02 8:14 ` [PATCH v4 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 2026-02-02 8:14 ` [PATCH v4 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1 @ 2026-02-02 8:14 ` scott.k.mitch1 2026-02-02 8:14 ` [PATCH v4 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 ` (2 subsequent siblings) 5 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-02 8:14 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add txpollnotrdy devarg (default=true) to control whether poll() is called when the TX ring is not ready. This allows users to avoid blocking behavior if application threads are in asynchronous poll mode where blocking the thread has negative side effects and backpressure is applied via different means. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- doc/guides/nics/af_packet.rst | 6 ++++- doc/guides/rel_notes/release_26_03.rst | 1 + drivers/net/af_packet/rte_eth_af_packet.c | 33 ++++++++++++++++++----- 3 files changed, 32 insertions(+), 8 deletions(-) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index 1505b98ff7..782a962c3f 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -29,6 +29,10 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings. * ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple of 16B); * ``framecnt`` - PACKET_MMAP frame count (optional, default 512). +* ``txpollnotrdy`` - Control behavior if tx is attempted but there is no + space available to write to the kernel. If 1, call poll() and block until + space is available to tx. If 0, don't call poll() and return from tx (optional, + default 1). 
For details regarding ``fanout_mode`` argument, you can consult the `PACKET_FANOUT documentation <https://www.man7.org/linux/man-pages/man7/packet.7.html>`_. @@ -75,7 +79,7 @@ framecnt=512): .. code-block:: console - --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash + --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash,txpollnotrdy=0 Features and Limitations ------------------------ diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index c7e7c7d25b..3b6be19645 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -59,6 +59,7 @@ New Features * Fixed kernel memory barrier protocol for memory availability * Fixed shared memory frame overhead offset calculation + * Added ``txpollnotrdy`` devarg to avoid ``poll()`` blocking calls Removed Items ------------- diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 51ac95ff5e..9df1b1fd4c 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -39,9 +39,11 @@ #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" #define ETH_AF_PACKET_FANOUT_MODE_ARG "fanout_mode" +#define ETH_AF_PACKET_TX_POLL_NOT_READY_ARG "txpollnotrdy" #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +#define DFLT_TX_POLL_NOT_RDY true static const uint16_t eth_af_packet_frame_size_max = RTE_IPV4_MAX_PKT_LEN; #define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) @@ -78,6 +80,9 @@ struct __rte_cache_aligned pkt_tx_queue { unsigned int framecount; unsigned int framenum; + bool txpollnotrdy; + bool sw_cksum; + volatile unsigned long tx_pkts; volatile unsigned long err_pkts; volatile unsigned long tx_bytes; @@ -106,6 +111,7 @@ static const char 
*valid_arguments[] = { ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, ETH_AF_PACKET_FANOUT_MODE_ARG, + ETH_AF_PACKET_TX_POLL_NOT_READY_ARG, NULL }; @@ -258,10 +264,12 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - memset(&pfd, 0, sizeof(pfd)); - pfd.fd = pkt_q->sockfd; - pfd.events = POLLOUT; - pfd.revents = 0; + if (pkt_q->txpollnotrdy) { + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = pkt_q->sockfd; + pfd.events = POLLOUT; + pfd.revents = 0; + } framecount = pkt_q->framecount; framenum = pkt_q->framenum; @@ -291,8 +299,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * This results in poll() returning POLLOUT. */ if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && - (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || - !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + (!pkt_q->txpollnotrdy || poll(&pfd, 1, -1) < 0 || + (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { /* Ring is full, stop here. Don't process bufs[i]. 
*/ break; } @@ -804,6 +813,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framecnt, unsigned int qdisc_bypass, const char *fanout_mode, + bool txpollnotrdy, struct pmd_internals **internals, struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist) @@ -1022,6 +1032,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; + tx_queue->txpollnotrdy = txpollnotrdy; rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr)); if (rc == -1) { @@ -1111,6 +1122,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev, unsigned int qpairs = 1; unsigned int qdisc_bypass = 1; const char *fanout_mode = NULL; + bool txpollnotrdy = DFLT_TX_POLL_NOT_RDY; /* do some parameter checking */ if (*sockfd < 0) @@ -1175,6 +1187,10 @@ rte_eth_from_packet(struct rte_vdev_device *dev, fanout_mode = pair->value; continue; } + if (strstr(pair->key, ETH_AF_PACKET_TX_POLL_NOT_READY_ARG) != NULL) { + txpollnotrdy = atoi(pair->value) != 0; + continue; + } } if (framesize > blocksize) { @@ -1243,12 +1259,14 @@ rte_eth_from_packet(struct rte_vdev_device *dev, PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, fanout_mode); else PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, "default PACKET_FANOUT_HASH"); + PMD_LOG(INFO, "%s:\ttxpollnotrdy %d", name, txpollnotrdy ? 1 : 0); if (rte_pmd_init_internals(dev, *sockfd, qpairs, blocksize, blockcount, framesize, framecount, qdisc_bypass, fanout_mode, + txpollnotrdy, &internals, ð_dev, kvlist) < 0) return -1; @@ -1346,4 +1364,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_packet, "framesz=<int> " "framecnt=<int> " "qdisc_bypass=<0|1> " - "fanout_mode=<hash|lb|cpu|rollover|rnd|qm>"); + "fanout_mode=<hash|lb|cpu|rollover|rnd|qm> " + "txpollnotrdy=<0|1>"); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v4 4/4] net/af_packet: add software checksum offload support 2026-02-02 8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1 ` (2 preceding siblings ...) 2026-02-02 8:14 ` [PATCH v4 3/4] net/af_packet: tx poll control scott.k.mitch1 @ 2026-02-02 8:14 ` scott.k.mitch1 2026-02-02 17:00 ` Stephen Hemminger 2026-02-02 18:47 ` Stephen Hemminger 2026-02-02 18:53 ` [PATCH v4 0/4] af_packet correctness, performance, cksum Stephen Hemminger 2026-02-03 7:07 ` [PATCH v5 " scott.k.mitch1 5 siblings, 2 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-02 8:14 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add software checksum offload support to improve flexibility and performance. Add rte_net_ip_udptcp_cksum_mbuf() in rte_net.h, shared between rte_eth_tap and rte_eth_af_packet, which computes IPv4/UDP/TCP checksums in software since neither driver supports hardware offload or checksum context propagation.
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- doc/guides/nics/features/afpacket.ini | 2 + doc/guides/rel_notes/release_26_03.rst | 2 + drivers/net/af_packet/rte_eth_af_packet.c | 15 +++- drivers/net/tap/rte_eth_tap.c | 61 +-------------- lib/net/rte_net.h | 92 +++++++++++++++++++++++ 5 files changed, 112 insertions(+), 60 deletions(-) diff --git a/doc/guides/nics/features/afpacket.ini b/doc/guides/nics/features/afpacket.ini index 391f79b173..4bb81c84ff 100644 --- a/doc/guides/nics/features/afpacket.ini +++ b/doc/guides/nics/features/afpacket.ini @@ -7,5 +7,7 @@ Link status = Y Promiscuous mode = Y MTU update = Y +L3 checksum offload = Y +L4 checksum offload = Y Basic stats = Y Stats per queue = Y diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 3b6be19645..2946acce99 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -60,6 +60,8 @@ New Features * Fixed kernel memory barrier protocol for memory availability * Fixed shared memory frame overhead offset calculation * Added ``txpollnotrdy`` devarg to avoid ``poll()`` blocking calls + * Added checksum offload support for ``IPV4_CKSUM``, ``UDP_CKSUM``, + and ``TCP_CKSUM`` Removed Items ------------- diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 9df1b1fd4c..128f93bec6 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -10,6 +10,8 @@ #include <rte_string_fns.h> #include <rte_mbuf.h> #include <rte_atomic.h> +#include <rte_ip.h> +#include <rte_net.h> #include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> @@ -101,6 +103,7 @@ struct pmd_internals { struct pkt_tx_queue *tx_queue; uint8_t vlan_strip; uint8_t timestamp_offloading; + bool tx_sw_cksum; }; static const char *valid_arguments[] = { @@ -311,6 +314,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) 
ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; + if (pkt_q->sw_cksum && !rte_net_ip_udptcp_cksum_mbuf(mbuf, false)) + continue; + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -396,10 +402,13 @@ eth_dev_configure(struct rte_eth_dev *dev __rte_unused) { struct rte_eth_conf *dev_conf = &dev->data->dev_conf; const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + const struct rte_eth_txmode *txmode = &dev_conf->txmode; struct pmd_internals *internals = dev->data->dev_private; internals->vlan_strip = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP); internals->timestamp_offloading = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP); + internals->tx_sw_cksum = !!(txmode->offloads & (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM)); return 0; } @@ -417,7 +426,10 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->max_tx_queues = (uint16_t)internals->nb_queues; dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | - RTE_ETH_TX_OFFLOAD_VLAN_INSERT; + RTE_ETH_TX_OFFLOAD_VLAN_INSERT | + RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | + RTE_ETH_TX_OFFLOAD_TCP_CKSUM; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | RTE_ETH_RX_OFFLOAD_TIMESTAMP; @@ -618,6 +630,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, { struct pmd_internals *internals = dev->data->dev_private; + internals->tx_queue[tx_queue_id].sw_cksum = internals->tx_sw_cksum; dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id]; return 0; diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c index 730f1859bd..55f496babe 100644 --- a/drivers/net/tap/rte_eth_tap.c +++ b/drivers/net/tap/rte_eth_tap.c @@ -560,70 +560,13 @@ tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs, if (txq->csum && (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM || 
l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM || l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM)) { - unsigned int hdrlens = mbuf->l2_len + mbuf->l3_len; - uint16_t *l4_cksum; - void *l3_hdr; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) - hdrlens += sizeof(struct rte_udp_hdr); - else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) - hdrlens += sizeof(struct rte_tcp_hdr); - else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) - return -1; - - /* Support only packets with at least layer 4 - * header included in the first segment - */ - if (rte_pktmbuf_data_len(mbuf) < hdrlens) - return -1; - - /* To change checksums (considering that a mbuf can be - * indirect, for example), copy l2, l3 and l4 headers - * in a new segment and chain it to existing data - */ - seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); + /* Compute checksums in software, copying headers if needed */ + seg = rte_net_ip_udptcp_cksum_mbuf(mbuf, true); if (seg == NULL) return -1; - rte_pktmbuf_adj(mbuf, hdrlens); - rte_pktmbuf_chain(seg, mbuf); pmbufs[i] = mbuf = seg; - - l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); - if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { - struct rte_ipv4_hdr *iph = l3_hdr; - - iph->hdr_checksum = 0; - iph->hdr_checksum = rte_ipv4_cksum(iph); - } - - if (l4_ol_flags == RTE_MBUF_F_TX_L4_NO_CKSUM) - goto skip_l4_cksum; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) { - struct rte_udp_hdr *udp_hdr; - - udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &udp_hdr->dgram_cksum; - } else { - struct rte_tcp_hdr *tcp_hdr; - - tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &tcp_hdr->cksum; - } - - *l4_cksum = 0; - if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) { - *l4_cksum = rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } else { - *l4_cksum = rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } } -skip_l4_cksum: for 
(j = 0; j < mbuf->nb_segs; j++) { iovecs[k].iov_len = rte_pktmbuf_data_len(seg); iovecs[k].iov_base = rte_pktmbuf_mtod(seg, void *); diff --git a/lib/net/rte_net.h b/lib/net/rte_net.h index 65d724b84b..44f42010c8 100644 --- a/lib/net/rte_net.h +++ b/lib/net/rte_net.h @@ -246,6 +246,98 @@ rte_net_intel_cksum_prepare(struct rte_mbuf *m) return rte_net_intel_cksum_flags_prepare(m, m->ol_flags); } +/** + * Compute IPv4 header and UDP/TCP checksums in software. + * + * Computes checksums based on mbuf offload flags: + * - RTE_MBUF_F_TX_IP_CKSUM: Compute IPv4 header checksum + * - RTE_MBUF_F_TX_UDP_CKSUM: Compute UDP checksum (IPv4 or IPv6) + * - RTE_MBUF_F_TX_TCP_CKSUM: Compute TCP checksum (IPv4 or IPv6) + * + * @param mbuf + * The packet mbuf. Must have l2_len and l3_len set correctly. + * @param copy + * If true, copy L2/L3/L4 headers to a new segment before computing + * checksums. This is safe for indirect mbufs but has overhead. + * If false, compute checksums in place. This is only safe if the + * mbuf will be copied afterward (e.g., to a device ring buffer). 
+ * @return + * - On success: Returns mbuf (new segment if copy=true, original if copy=false) + * - On error: Returns NULL (allocation failed or malformed packet) + */ +__rte_experimental +static inline struct rte_mbuf * +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf, bool copy) +{ + const uint64_t l4_ol_flags = mbuf->ol_flags & RTE_MBUF_F_TX_L4_MASK; + const uint64_t l4_offset = mbuf->l2_len + mbuf->l3_len; + uint32_t hdrlens = l4_offset; + void *l3_hdr = NULL; + + /* Determine total header length needed */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) + hdrlens += sizeof(struct rte_udp_hdr); + else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) + hdrlens += sizeof(struct rte_tcp_hdr); + else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) + return NULL; /* Unsupported L4 checksum type */ + else if (!(mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM)) + return mbuf; /* Nothing to do */ + + /* Validate we at least have L2+L3 headers before doing any work */ + if (unlikely(rte_pktmbuf_data_len(mbuf) < l4_offset)) + return NULL; + + if (copy) { + /* + * Copy headers to new segment to handle indirect mbufs. + * This ensures we can safely modify checksums without + * corrupting shared/read-only data. 
+ */ + struct rte_mbuf *seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); + if (!seg) + return NULL; + + rte_pktmbuf_adj(mbuf, hdrlens); + rte_pktmbuf_chain(seg, mbuf); + mbuf = seg; + } else if (unlikely(!RTE_MBUF_DIRECT(mbuf) || rte_mbuf_refcnt_read(mbuf) > 1)) + return NULL; + + l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); + + /* IPv4 header checksum */ + if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { + struct rte_ipv4_hdr *iph = (struct rte_ipv4_hdr *)l3_hdr; + iph->hdr_checksum = 0; + iph->hdr_checksum = rte_ipv4_cksum(iph); + } + + /* L4 checksum (UDP or TCP) - skip if headers not in first segment */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM && rte_pktmbuf_data_len(mbuf) >= hdrlens) { + struct rte_udp_hdr *udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, + l4_offset); + udp_hdr->dgram_cksum = 0; + udp_hdr->dgram_cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv4_hdr *)l3_hdr, + l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv6_hdr *)l3_hdr, + l4_offset); + } else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM && + rte_pktmbuf_data_len(mbuf) >= hdrlens) { + struct rte_tcp_hdr *tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, + l4_offset); + tcp_hdr->cksum = 0; + tcp_hdr->cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv4_hdr *)l3_hdr, + l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv6_hdr *)l3_hdr, + l4_offset); + } + + return mbuf; +} + #ifdef __cplusplus } #endif -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v4 4/4] net/af_packet: add software checksum offload support 2026-02-02 8:14 ` [PATCH v4 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 @ 2026-02-02 17:00 ` Stephen Hemminger 2026-02-02 18:47 ` Stephen Hemminger 1 sibling, 0 replies; 65+ messages in thread From: Stephen Hemminger @ 2026-02-02 17:00 UTC (permalink / raw) To: scott.k.mitch1; +Cc: dev On Mon, 2 Feb 2026 00:14:56 -0800 scott.k.mitch1@gmail.com wrote: > + if (copy) { > + /* > + * Copy headers to new segment to handle indirect mbufs. > + * This ensures we can safely modify checksums without > + * corrupting shared/read-only data. > + */ > + struct rte_mbuf *seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); > + if (!seg) > + return NULL; > + > + rte_pktmbuf_adj(mbuf, hdrlens); > + rte_pktmbuf_chain(seg, mbuf); > + mbuf = seg; > + } else if (unlikely(!RTE_MBUF_DIRECT(mbuf) || rte_mbuf_refcnt_read(mbuf) > 1)) > + return NULL; Maybe just rte_pkmbuf_read() helper that already handles the case of getting the header if needed. ^ permalink raw reply [flat|nested] 65+ messages in thread
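The `rte_pktmbuf_read()` helper Stephen points to returns a direct pointer when the requested bytes are contiguous in one segment and gathers them into a caller buffer otherwise. A simplified model of that pattern over a hypothetical segment list (not the DPDK API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct seg { const uint8_t *data; size_t len; const struct seg *next; };

/* Return a zero-copy pointer when [off, off+len) fits in one segment,
 * otherwise gather the bytes into buf; NULL if the range runs out. */
static const uint8_t *seg_read(const struct seg *s, size_t off, size_t len, uint8_t *buf)
{
    while (s && off >= s->len) { off -= s->len; s = s->next; }
    if (!s)
        return NULL;
    if (off + len <= s->len)
        return s->data + off;              /* contiguous: no copy */
    for (size_t i = 0; i < len; ) {        /* spans segments: gather */
        if (!s)
            return NULL;
        size_t n = s->len - off;
        if (n > len - i)
            n = len - i;
        for (size_t j = 0; j < n; j++)
            buf[i + j] = s->data[off + j];
        i += n; off = 0; s = s->next;
    }
    return buf;
}

/* Demo: segments {1,2,3} and {4,5}; a read spanning the boundary is
 * gathered, a read inside the first segment is returned zero-copy. */
static int seg_read_demo(void)
{
    static const uint8_t a[] = { 1, 2, 3 }, b[] = { 4, 5 };
    const struct seg s1 = { b, sizeof(b), NULL };
    const struct seg s0 = { a, sizeof(a), &s1 };
    uint8_t tmp[4];
    const uint8_t *p = seg_read(&s0, 2, 2, tmp);  /* spans 3|4 */
    const uint8_t *q = seg_read(&s0, 0, 3, tmp);  /* fits in a */
    return p && p[0] == 3 && p[1] == 4 && q == a;
}
```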
* Re: [PATCH v4 4/4] net/af_packet: add software checksum offload support 2026-02-02 8:14 ` [PATCH v4 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 2026-02-02 17:00 ` Stephen Hemminger @ 2026-02-02 18:47 ` Stephen Hemminger 2026-02-03 6:41 ` Scott Mitchell 1 sibling, 1 reply; 65+ messages in thread From: Stephen Hemminger @ 2026-02-02 18:47 UTC (permalink / raw) To: scott.k.mitch1; +Cc: dev On Mon, 2 Feb 2026 00:14:56 -0800 scott.k.mitch1@gmail.com wrote: > +/** > + * Compute IPv4 header and UDP/TCP checksums in software. > + * > + * Computes checksums based on mbuf offload flags: > + * - RTE_MBUF_F_TX_IP_CKSUM: Compute IPv4 header checksum > + * - RTE_MBUF_F_TX_UDP_CKSUM: Compute UDP checksum (IPv4 or IPv6) > + * - RTE_MBUF_F_TX_TCP_CKSUM: Compute TCP checksum (IPv4 or IPv6) > + * > + * @param mbuf > + * The packet mbuf. Must have l2_len and l3_len set correctly. > + * @param copy > + * If true, copy L2/L3/L4 headers to a new segment before computing > + * checksums. This is safe for indirect mbufs but has overhead. > + * If false, compute checksums in place. This is only safe if the > + * mbuf will be copied afterward (e.g., to a device ring buffer). 
> + * @return > + * - On success: Returns mbuf (new segment if copy=true, original if copy=false) > + * - On error: Returns NULL (allocation failed or malformed packet) > + */ > +__rte_experimental > +static inline struct rte_mbuf * > +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf, bool copy) > +{ > + const uint64_t l4_ol_flags = mbuf->ol_flags & RTE_MBUF_F_TX_L4_MASK; > + const uint64_t l4_offset = mbuf->l2_len + mbuf->l3_len; > + uint32_t hdrlens = l4_offset; > + void *l3_hdr = NULL; > + > + /* Determine total header length needed */ > + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) > + hdrlens += sizeof(struct rte_udp_hdr); > + else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) > + hdrlens += sizeof(struct rte_tcp_hdr); > + else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) > + return NULL; /* Unsupported L4 checksum type */ > + else if (!(mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM)) > + return mbuf; /* Nothing to do */ > + > + /* Validate we at least have L2+L3 headers before doing any work */ > + if (unlikely(rte_pktmbuf_data_len(mbuf) < l4_offset)) > + return NULL; > + > + if (copy) { > + /* > + * Copy headers to new segment to handle indirect mbufs. > + * This ensures we can safely modify checksums without > + * corrupting shared/read-only data. 
> + */ > + struct rte_mbuf *seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); > + if (!seg) > + return NULL; > + > + rte_pktmbuf_adj(mbuf, hdrlens); > + rte_pktmbuf_chain(seg, mbuf); > + mbuf = seg; > + } else if (unlikely(!RTE_MBUF_DIRECT(mbuf) || rte_mbuf_refcnt_read(mbuf) > 1)) > + return NULL; > + > + l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); > + > + /* IPv4 header checksum */ > + if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { > + struct rte_ipv4_hdr *iph = (struct rte_ipv4_hdr *)l3_hdr; > + iph->hdr_checksum = 0; > + iph->hdr_checksum = rte_ipv4_cksum(iph); > + } > + > + /* L4 checksum (UDP or TCP) - skip if headers not in first segment */ > + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM && rte_pktmbuf_data_len(mbuf) >= hdrlens) { > + struct rte_udp_hdr *udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, > + l4_offset); > + udp_hdr->dgram_cksum = 0; > + udp_hdr->dgram_cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? > + rte_ipv4_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv4_hdr *)l3_hdr, > + l4_offset) : > + rte_ipv6_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv6_hdr *)l3_hdr, > + l4_offset); > + } else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM && > + rte_pktmbuf_data_len(mbuf) >= hdrlens) { > + struct rte_tcp_hdr *tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, > + l4_offset); > + tcp_hdr->cksum = 0; > + tcp_hdr->cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? > + rte_ipv4_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv4_hdr *)l3_hdr, > + l4_offset) : > + rte_ipv6_udptcp_cksum_mbuf(mbuf, (const struct rte_ipv6_hdr *)l3_hdr, > + l4_offset); > + } > + > + return mbuf; > +} > + This is getting a little large to be inline. Maybe split into inline that checks offload flags, and non-inline that does the checksum if necessary. The code could use rte_pktmbuf_linearize() if necessary. There is code duplication in multiple arms of the if statement Can the code just "do the right thing" based on the mbufs it is given. 
If the mbuf is indirect or has ref count > 1 then copy headers, otherwise do in place. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v4 4/4] net/af_packet: add software checksum offload support 2026-02-02 18:47 ` Stephen Hemminger @ 2026-02-03 6:41 ` Scott Mitchell 0 siblings, 0 replies; 65+ messages in thread From: Scott Mitchell @ 2026-02-03 6:41 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev > This is getting a little large to be inline. > Maybe split into inline that checks offload flags, and non-inline > that does the checksum if necessary. I'll move implementation into rte_net.c for now to avoid exposing additional surface area in .h. > > The code could use rte_pktmbuf_linearize() if necessary. > > > There is code duplication in multiple arms of the if statement > Can the code just "do the right thing" based on the mbufs it is given. > If the mbuf is indirect or has ref count > 1 then copy headers, otherwise > do in place. Sounds good. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v4 0/4] af_packet correctness, performance, cksum 2026-02-02 8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1 ` (3 preceding siblings ...) 2026-02-02 8:14 ` [PATCH v4 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 @ 2026-02-02 18:53 ` Stephen Hemminger 2026-02-03 7:07 ` [PATCH v5 " scott.k.mitch1 5 siblings, 0 replies; 65+ messages in thread From: Stephen Hemminger @ 2026-02-02 18:53 UTC (permalink / raw) To: scott.k.mitch1; +Cc: dev On Mon, 2 Feb 2026 00:14:52 -0800 scott.k.mitch1@gmail.com wrote: > From: Scott <scott.k.mitch1@gmail.com> > > This series fixes critical thread safety bugs in the af_packet PMD > and adds performance optimizations. > > Patch 1 fixes two major correctness issues: > - Thread safety: tp_status was accessed without memory barriers, > violating the kernel's PACKET_MMAP protocol. On aarch64 and other > weakly-ordered architectures, this causes packet corruption due to > missing memory ordering. The fix matches the kernel's memory model: > volatile unaligned reads/writes with explicit rte_smp_rmb/wmb > barriers and __may_alias__ protection. > > - Frame calculations: Fixed incorrect frame overhead and address > calculations that caused memory corruption when frames don't evenly > divide blocks. Fix patch 4 and resubmit. ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v5 0/4] af_packet correctness, performance, cksum 2026-02-02 8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1 ` (4 preceding siblings ...) 2026-02-02 18:53 ` [PATCH v4 0/4] af_packet correctness, performance, cksum Stephen Hemminger @ 2026-02-03 7:07 ` scott.k.mitch1 2026-02-03 7:07 ` [PATCH v5 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 ` (4 more replies) 5 siblings, 5 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-03 7:07 UTC (permalink / raw) To: dev; +Cc: stephen, Scott From: Scott <scott.k.mitch1@gmail.com> This series fixes critical thread safety bugs in the af_packet PMD and adds performance optimizations. Patch 1 fixes two major correctness issues: - Thread safety: tp_status was accessed without memory barriers, violating the kernel's PACKET_MMAP protocol. On aarch64 and other weakly-ordered architectures, this causes packet corruption due to missing memory ordering. The fix matches the kernel's memory model: volatile unaligned reads/writes with explicit rte_smp_rmb/wmb barriers and __may_alias__ protection. - Frame calculations: Fixed incorrect frame overhead and address calculations that caused memory corruption when frames don't evenly divide blocks. 
Patches 2-4 add performance improvements:
- Patch 2: Bulk mbuf freeing, unlikely annotations, and prefetching
- Patch 3: TX poll control to reduce syscall overhead
- Patch 4: Software checksum offload support with shared rte_net utility

v5 changes:
- rte_net_ip_udptcp_cksum_mbuf moved to rte_net.c (avoid forced inline)
- rte_net_ip_udptcp_cksum_mbuf remove copy arg, handle more mbuf types
- af_packet and tap calling code consistent for sw cksum

v4 changes:
- Remove prefetch (perf results didn't show benefit)
- Fix variable style for consistency (declare at start of function)
- Add release notes for af_packet and documentation for fixes

v3 changes:
- Patch 4: Fix compile error due to implicit cast with C++ compiler

v2 changes:
- Patch 1: Rewrote to use volatile + barriers instead of C11 atomics to
  match kernel's memory model. Added dependency on patch-160274 for
  __rte_may_alias attribute.
- Patch 4: Refactored to use shared rte_net_ip_udptcp_cksum_mbuf()
  utility function, eliminating code duplication with tap driver.

Scott Mitchell (4):
  net/af_packet: fix thread safety and frame calculations
  net/af_packet: RX/TX bulk free, unlikely hint
  net/af_packet: tx poll control
  net/af_packet: add software checksum offload support

 doc/guides/nics/af_packet.rst             |   6 +-
 doc/guides/nics/features/afpacket.ini     |   2 +
 doc/guides/rel_notes/release_26_03.rst    |   7 +
 drivers/net/af_packet/rte_eth_af_packet.c | 253 +++++++++++++++-------
 drivers/net/tap/rte_eth_tap.c             |  70 +-----
 lib/net/rte_net.c                         |  68 ++++++
 lib/net/rte_net.h                         |  22 ++
 7 files changed, 287 insertions(+), 141 deletions(-)

-- 
2.39.5 (Apple Git-154)

^ permalink raw reply	[flat|nested] 65+ messages in thread
* [PATCH v5 1/4] net/af_packet: fix thread safety and frame calculations 2026-02-03 7:07 ` [PATCH v5 " scott.k.mitch1 @ 2026-02-03 7:07 ` scott.k.mitch1 2026-02-03 7:07 ` [PATCH v5 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1 ` (3 subsequent siblings) 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-03 7:07 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell, linville, stable From: Scott Mitchell <scott.k.mitch1@gmail.com> Thread Safety: The tp_status field was accessed without proper memory barriers, violating the kernel's PACKET_MMAP synchronization protocol. The kernel implements this protocol in net/packet/af_packet.c: - __packet_get_status: smp_rmb() then READ_ONCE() (volatile read) - __packet_set_status: WRITE_ONCE() (volatile write) then smp_wmb() READ_ONCE/WRITE_ONCE use __may_alias__ attribute via __uXX_alias_t types to prevent compiler optimizations that assume type-based aliasing rules, which is critical for tp_status access that may be misaligned within the ring buffer. Userspace must use equivalent semantics: volatile unaligned_uint32_t (with __rte_may_alias) reads/writes with explicit memory barriers (rte_smp_rmb/rte_smp_wmb). On aarch64 and other weakly-ordered architectures, missing barriers cause packet corruption because: - RX: CPU may read stale packet data before seeing tp_status update - TX: CPU may reorder stores, causing kernel to see tp_status before packet data is fully written This becomes critical with io_uring SQPOLL mode where the kernel polling thread on a different CPU core asynchronously updates tp_status, making proper memory ordering essential. Note: Uses rte_smp_[r/w]mb which triggers checkpatch warnings, but C11 atomics cannot be used because tp_status is not declared _Atomic in the kernel's tpacket2_hdr structure. We must match the kernel's volatile + barrier memory model with __may_alias__ protection. Frame Calculation Issues: 1. 
Frame overhead incorrectly calculated as TPACKET_ALIGN(TPACKET2_HDRLEN) instead of TPACKET2_HDRLEN - sizeof(struct sockaddr_ll), causing incorrect usable frame data size. 2. Frame address calculation assumed sequential layout (frame_base + i * frame_size), but the kernel's packet_lookup_frame() uses block-based addressing: block_idx = position / frames_per_block frame_offset = position % frames_per_block address = block_start[block_idx] + (frame_offset * frame_size) This caused memory corruption when frames don't evenly divide blocks. Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") Cc: linville@tuxdriver.com Cc: stable@dpdk.org Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- Depends-on: patch-160679 ("eal: add __rte_may_alias and __rte_aligned to unaligned typedefs") doc/guides/rel_notes/release_26_03.rst | 4 + drivers/net/af_packet/rte_eth_af_packet.c | 151 ++++++++++++++++------ 2 files changed, 118 insertions(+), 37 deletions(-) diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 15dabee7a1..c7e7c7d25b 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -55,6 +55,10 @@ New Features Also, make sure to start the actual text at the margin. 
======================================================= +* **Updated af_packet net driver.** + + * Fixed kernel memory barrier protocol for memory availability + * Fixed shared memory frame overhead offset calculation Removed Items ------------- diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index ef11b8fb6b..d0cc2c419a 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -9,6 +9,8 @@ #include <rte_common.h> #include <rte_string_fns.h> #include <rte_mbuf.h> +#include <rte_atomic.h> +#include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> #include <rte_malloc.h> @@ -41,6 +43,10 @@ #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +static const uint16_t eth_af_packet_frame_size_max = RTE_IPV4_MAX_PKT_LEN; +#define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) +#define ETH_AF_PACKET_ETH_OVERHEAD (RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN) + static uint64_t timestamp_dynflag; static int timestamp_dynfield_offset = -1; @@ -120,6 +126,28 @@ RTE_LOG_REGISTER_DEFAULT(af_packet_logtype, NOTICE); RTE_LOG_LINE(level, AFPACKET, "%s(): " fmt ":%s", __func__, \ ## __VA_ARGS__, strerror(errno)) +/** + * Read tp_status from packet mmap ring. Matches kernel's READ_ONCE() with smp_rmb() + * ordering in af_packet.c __packet_get_status. + */ +static inline uint32_t +tpacket_read_status(const volatile void *tp_status) +{ + rte_smp_rmb(); + return *((const volatile unaligned_uint32_t *)tp_status); +} + +/** + * Write tp_status to packet mmap ring. Matches kernel's WRITE_ONCE() with smp_wmb() + * ordering in af_packet.c __packet_set_status. 
+ */ +static inline void +tpacket_write_status(volatile void *tp_status, uint32_t status) +{ + *((volatile unaligned_uint32_t *)tp_status) = status; + rte_smp_wmb(); +} + static uint16_t eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { @@ -129,7 +157,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint8_t *pbuf; struct pkt_rx_queue *pkt_q = queue; uint16_t num_rx = 0; - unsigned long num_rx_bytes = 0; + uint32_t num_rx_bytes = 0; + uint32_t tp_status; unsigned int framecount, framenum; if (unlikely(nb_pkts == 0)) @@ -144,7 +173,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) for (i = 0; i < nb_pkts; i++) { /* point at the next incoming frame */ ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - if ((ppd->tp_status & TP_STATUS_USER) == 0) + tp_status = tpacket_read_status(&ppd->tp_status); + if ((tp_status & TP_STATUS_USER) == 0) break; /* allocate the next mbuf */ @@ -160,7 +190,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, rte_pktmbuf_data_len(mbuf)); /* check for vlan info */ - if (ppd->tp_status & TP_STATUS_VLAN_VALID) { + if (tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED); @@ -179,7 +209,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_KERNEL; + tpacket_write_status(&ppd->tp_status, TP_STATUS_KERNEL); if (++framenum >= framecount) framenum = 0; mbuf->port = pkt_q->in_port; @@ -228,8 +258,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) struct pollfd pfd; struct pkt_tx_queue *pkt_q = queue; uint16_t num_tx = 0; - unsigned long num_tx_bytes = 0; - int i; + uint32_t num_tx_bytes = 0; + uint16_t i; if (unlikely(nb_pkts == 0)) return 0; @@ -259,16 +289,6 @@ 
eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } } - /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status)) { - if (poll(&pfd, 1, -1) < 0) - break; - - /* poll() can return POLLERR if the interface is down */ - if (pfd.revents & POLLERR) - break; - } - /* * poll() will almost always return POLLOUT, even if there * are no extra buffers available @@ -283,26 +303,28 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * * This results in poll() returning POLLOUT. */ - if (!tx_ring_status_available(ppd->tp_status)) + if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && + (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + /* Ring is full, stop here. Don't process bufs[i]. */ break; + } - /* copy the tx frame data */ - pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; struct rte_mbuf *tmp_mbuf = mbuf; - while (tmp_mbuf) { + do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); pbuf += data_len; tmp_mbuf = tmp_mbuf->next; - } + } while (tmp_mbuf); ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_SEND_REQUEST; + tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; @@ -392,10 +414,12 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->if_index = internals->if_index; dev_info->max_mac_addrs = 1; - dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN; + dev_info->max_rx_pktlen = (uint32_t)eth_af_packet_frame_size_max + + ETH_AF_PACKET_ETH_OVERHEAD; + dev_info->max_mtu = eth_af_packet_frame_size_max; dev_info->max_rx_queues = 
(uint16_t)internals->nb_queues; dev_info->max_tx_queues = (uint16_t)internals->nb_queues; - dev_info->min_rx_bufsize = 0; + dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | RTE_ETH_TX_OFFLOAD_VLAN_INSERT; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | @@ -572,8 +596,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, /* Now get the space available for data in the mbuf */ buf_size = rte_pktmbuf_data_room_size(pkt_q->mb_pool) - RTE_PKTMBUF_HEADROOM; - data_size = internals->req.tp_frame_size; - data_size -= TPACKET2_HDRLEN - sizeof(struct sockaddr_ll); + data_size = internals->req.tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; if (data_size > buf_size) { PMD_LOG(ERR, @@ -612,7 +635,7 @@ eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) int ret; int s; unsigned int data_size = internals->req.tp_frame_size - - TPACKET2_HDRLEN; + ETH_AF_PACKET_FRAME_OVERHEAD; if (mtu > data_size) return -EINVAL; @@ -977,25 +1000,38 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (rx_queue->rd == NULL) goto error; + /* Frame addresses must match kernel's packet_lookup_frame(): + * block_idx = position / frames_per_block + * frame_offset = position % frames_per_block + * address = block_start + (frame_offset * frame_size) + */ + const uint32_t frames_per_block = req->tp_block_size / req->tp_frame_size; for (i = 0; i < req->tp_frame_nr; ++i) { - rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + rx_queue->rd[i].iov_base = rx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); rx_queue->rd[i].iov_len = req->tp_frame_size; } rx_queue->sockfd = qsockfd; tx_queue = &((*internals)->tx_queue[q]); tx_queue->framecount = req->tp_frame_nr; - tx_queue->frame_data_size = req->tp_frame_size; - 
tx_queue->frame_data_size -= TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + tx_queue->frame_data_size = req->tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr; tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (tx_queue->rd == NULL) goto error; + /* See comment above rx_queue->rd initialization. */ for (i = 0; i < req->tp_frame_nr; ++i) { - tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + tx_queue->rd[i].iov_base = tx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; @@ -1081,7 +1117,8 @@ rte_eth_from_packet(struct rte_vdev_device *dev, struct rte_kvargs_pair *pair = NULL; unsigned k_idx; unsigned int blockcount; - unsigned int blocksize; + const int pagesize = getpagesize(); + unsigned int blocksize = pagesize; unsigned int framesize = DFLT_FRAME_SIZE; unsigned int framecount = DFLT_FRAME_COUNT; unsigned int qpairs = 1; @@ -1092,8 +1129,6 @@ rte_eth_from_packet(struct rte_vdev_device *dev, if (*sockfd < 0) return -1; - blocksize = getpagesize(); - /* * Walk arguments for configurable settings */ @@ -1162,13 +1197,55 @@ rte_eth_from_packet(struct rte_vdev_device *dev, return -1; } - blockcount = framecount / (blocksize / framesize); + const unsigned int frames_per_block = blocksize / framesize; + blockcount = framecount / frames_per_block; if (!blockcount) { PMD_LOG(ERR, "%s: invalid AF_PACKET MMAP parameters", name); return -1; } + /* + * https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt + * Check constraints that may be enforced by the kernel and cause failure + * to initialize the rings but explicit error messages aren't provided. 
+ * See packet_set_ring in linux kernel for enforcement: + * https://github.com/torvalds/linux/blob/master/net/packet/af_packet.c + */ + if (blocksize % pagesize != 0) { + /* tp_block_size must be a multiple of PAGE_SIZE */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of PAGE_SIZE=%d", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, pagesize); + } + if (framesize % TPACKET_ALIGNMENT != 0) { + /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of TPACKET_ALIGNMENT=%d", + name, ETH_AF_PACKET_FRAMESIZE_ARG, framesize, TPACKET_ALIGNMENT); + } + if (frames_per_block == 0 || frames_per_block > UINT_MAX / blockcount || + framecount != frames_per_block * blockcount) { + /* tp_frame_nr must be exactly frames_per_block*tp_block_nr */ + PMD_LOG(WARNING, "%s: %s=%u must be exactly " + "frames_per_block(%s/%s = %u/%u = %u) * blockcount(%u)", + name, ETH_AF_PACKET_FRAMECOUNT_ARG, framecount, + ETH_AF_PACKET_BLOCKSIZE_ARG, ETH_AF_PACKET_FRAMESIZE_ARG, + blocksize, framesize, frames_per_block, blockcount); + } + + /* Below conditions may not cause errors but provide hints to improve */ + if (blocksize % framesize != 0) { + PMD_LOG(WARNING, "%s: %s=%u not evenly divisible by %s=%u, " + "may waste memory", name, + ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, + ETH_AF_PACKET_FRAMESIZE_ARG, framesize); + } + if (!rte_is_power_of_2(blocksize)) { + /* tp_block_size should be a power of two or there will be waste */ + PMD_LOG(WARNING, "%s: %s=%u should be a power of two " + "or there will be a waste of memory", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize); + } + PMD_LOG(DEBUG, "%s: AF_PACKET MMAP parameters:", name); PMD_LOG(DEBUG, "%s:\tblock size %d", name, blocksize); PMD_LOG(DEBUG, "%s:\tblock count %d", name, blockcount); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v5 2/4] net/af_packet: RX/TX bulk free, unlikely hint 2026-02-03 7:07 ` [PATCH v5 " scott.k.mitch1 2026-02-03 7:07 ` [PATCH v5 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 @ 2026-02-03 7:07 ` scott.k.mitch1 2026-02-03 7:07 ` [PATCH v5 3/4] net/af_packet: tx poll control scott.k.mitch1 ` (2 subsequent siblings) 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-03 7:07 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> - Use rte_pktmbuf_free_bulk() in TX path instead of individual rte_pktmbuf_free() calls for better batch efficiency - Add unlikely() hints for error paths (oversized packets, VLAN insertion failures, sendto errors) to optimize branch prediction - Remove unnecessary early nb_pkts == 0 when loop handles this and app may never call with 0 frames. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 41 ++++++++--------------- 1 file changed, 14 insertions(+), 27 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index d0cc2c419a..51ac95ff5e 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -161,9 +161,6 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t tp_status; unsigned int framecount, framenum; - if (unlikely(nb_pkts == 0)) - return 0; - /* * Reads the given number of packets from the AF_PACKET socket one by * one and copies the packet data into a newly allocated mbuf. 
@@ -261,9 +258,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - if (unlikely(nb_pkts == 0)) - return 0; - memset(&pfd, 0, sizeof(pfd)); pfd.fd = pkt_q->sockfd; pfd.events = POLLOUT; @@ -271,24 +265,17 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) framecount = pkt_q->framecount; framenum = pkt_q->framenum; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; for (i = 0; i < nb_pkts; i++) { - mbuf = *bufs++; + mbuf = bufs[i]; - /* drop oversized packets */ - if (mbuf->pkt_len > pkt_q->frame_data_size) { - rte_pktmbuf_free(mbuf); + /* Drop oversized packets. Insert VLAN if necessary */ + if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size || + ((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 && + rte_vlan_insert(&mbuf) != 0))) { continue; } - /* insert vlan info if necessary */ - if (mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) { - if (rte_vlan_insert(&mbuf)) { - rte_pktmbuf_free(mbuf); - continue; - } - } - + ppd = (struct tpacket2_hdr *)pkt_q->rd[framenum].iov_base; /* * poll() will almost always return POLLOUT, even if there * are no extra buffers available @@ -312,6 +299,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; + ppd->tp_len = mbuf->pkt_len; + ppd->tp_snaplen = mbuf->pkt_len; + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -320,23 +310,20 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) tmp_mbuf = tmp_mbuf->next; } while (tmp_mbuf); - ppd->tp_len = mbuf->pkt_len; - ppd->tp_snaplen = mbuf->pkt_len; - /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - num_tx++; num_tx_bytes += mbuf->pkt_len; - rte_pktmbuf_free(mbuf); } + 
rte_pktmbuf_free_bulk(&bufs[0], i); + /* kick-off transmits */ - if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && - errno != ENOBUFS && errno != EAGAIN) { + if (unlikely(num_tx > 0 && + sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && + errno != ENOBUFS && errno != EAGAIN)) { /* * In case of a ENOBUFS/EAGAIN error all of the enqueued * packets will be considered successful even though only some -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v5 3/4] net/af_packet: tx poll control 2026-02-03 7:07 ` [PATCH v5 " scott.k.mitch1 2026-02-03 7:07 ` [PATCH v5 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 2026-02-03 7:07 ` [PATCH v5 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1 @ 2026-02-03 7:07 ` scott.k.mitch1 2026-02-03 7:07 ` [PATCH v5 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 2026-02-06 1:11 ` [PATCH v6 0/4] af_packet correctness, performance, cksum scott.k.mitch1 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-03 7:07 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add txpollnotrdy devarg (default=true) to control whether poll() is called when the TX ring is not ready. This allows users to avoid blocking behavior if application threads are in asynchronous poll mode where blocking the thread has negative side effects and backpressure is applied via different means. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- doc/guides/nics/af_packet.rst | 6 ++++- doc/guides/rel_notes/release_26_03.rst | 1 + drivers/net/af_packet/rte_eth_af_packet.c | 33 ++++++++++++++++++----- 3 files changed, 32 insertions(+), 8 deletions(-) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index 1505b98ff7..782a962c3f 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -29,6 +29,10 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings. * ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple of 16B); * ``framecnt`` - PACKET_MMAP frame count (optional, default 512). +* ``txpollnotrdy`` - Control behavior if tx is attempted but there is no + space available to write to the kernel. If 1, call poll() and block until + space is available to tx. If 0, don't call poll() and return from tx (optional, + default 1). 
For details regarding ``fanout_mode`` argument, you can consult the `PACKET_FANOUT documentation <https://www.man7.org/linux/man-pages/man7/packet.7.html>`_. @@ -75,7 +79,7 @@ framecnt=512): .. code-block:: console - --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash + --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash,txpollnotrdy=0 Features and Limitations ------------------------ diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index c7e7c7d25b..3b6be19645 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -59,6 +59,7 @@ New Features * Fixed kernel memory barrier protocol for memory availability * Fixed shared memory frame overhead offset calculation + * Added ``txpollnotrdy`` devarg to avoid ``poll()`` blocking calls Removed Items ------------- diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 51ac95ff5e..9df1b1fd4c 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -39,9 +39,11 @@ #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" #define ETH_AF_PACKET_FANOUT_MODE_ARG "fanout_mode" +#define ETH_AF_PACKET_TX_POLL_NOT_READY_ARG "txpollnotrdy" #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +#define DFLT_TX_POLL_NOT_RDY true static const uint16_t eth_af_packet_frame_size_max = RTE_IPV4_MAX_PKT_LEN; #define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) @@ -78,6 +80,9 @@ struct __rte_cache_aligned pkt_tx_queue { unsigned int framecount; unsigned int framenum; + bool txpollnotrdy; + bool sw_cksum; + volatile unsigned long tx_pkts; volatile unsigned long err_pkts; volatile unsigned long tx_bytes; @@ -106,6 +111,7 @@ static const char 
*valid_arguments[] = { ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, ETH_AF_PACKET_FANOUT_MODE_ARG, + ETH_AF_PACKET_TX_POLL_NOT_READY_ARG, NULL }; @@ -258,10 +264,12 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - memset(&pfd, 0, sizeof(pfd)); - pfd.fd = pkt_q->sockfd; - pfd.events = POLLOUT; - pfd.revents = 0; + if (pkt_q->txpollnotrdy) { + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = pkt_q->sockfd; + pfd.events = POLLOUT; + pfd.revents = 0; + } framecount = pkt_q->framecount; framenum = pkt_q->framenum; @@ -291,8 +299,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * This results in poll() returning POLLOUT. */ if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && - (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || - !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + (!pkt_q->txpollnotrdy || poll(&pfd, 1, -1) < 0 || + (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { /* Ring is full, stop here. Don't process bufs[i]. 
*/ break; } @@ -804,6 +813,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framecnt, unsigned int qdisc_bypass, const char *fanout_mode, + bool txpollnotrdy, struct pmd_internals **internals, struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist) @@ -1022,6 +1032,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; + tx_queue->txpollnotrdy = txpollnotrdy; rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr)); if (rc == -1) { @@ -1111,6 +1122,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev, unsigned int qpairs = 1; unsigned int qdisc_bypass = 1; const char *fanout_mode = NULL; + bool txpollnotrdy = DFLT_TX_POLL_NOT_RDY; /* do some parameter checking */ if (*sockfd < 0) @@ -1175,6 +1187,10 @@ rte_eth_from_packet(struct rte_vdev_device *dev, fanout_mode = pair->value; continue; } + if (strstr(pair->key, ETH_AF_PACKET_TX_POLL_NOT_READY_ARG) != NULL) { + txpollnotrdy = atoi(pair->value) != 0; + continue; + } } if (framesize > blocksize) { @@ -1243,12 +1259,14 @@ rte_eth_from_packet(struct rte_vdev_device *dev, PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, fanout_mode); else PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, "default PACKET_FANOUT_HASH"); + PMD_LOG(INFO, "%s:\ttxpollnotrdy %d", name, txpollnotrdy ? 1 : 0); if (rte_pmd_init_internals(dev, *sockfd, qpairs, blocksize, blockcount, framesize, framecount, qdisc_bypass, fanout_mode, + txpollnotrdy, &internals, ð_dev, kvlist) < 0) return -1; @@ -1346,4 +1364,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_packet, "framesz=<int> " "framecnt=<int> " "qdisc_bypass=<0|1> " - "fanout_mode=<hash|lb|cpu|rollover|rnd|qm>"); + "fanout_mode=<hash|lb|cpu|rollover|rnd|qm> " + "txpollnotrdy=<0|1>"); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v5 4/4] net/af_packet: add software checksum offload support 2026-02-03 7:07 ` [PATCH v5 " scott.k.mitch1 ` (2 preceding siblings ...) 2026-02-03 7:07 ` [PATCH v5 3/4] net/af_packet: tx poll control scott.k.mitch1 @ 2026-02-03 7:07 ` scott.k.mitch1 2026-02-03 8:20 ` Scott Mitchell 2026-02-03 14:13 ` Stephen Hemminger 2026-02-06 1:11 ` [PATCH v6 0/4] af_packet correctness, performance, cksum scott.k.mitch1 4 siblings, 2 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-03 7:07 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add software checksum offload support and configurable TX poll behavior to improve flexibility and performance. Add rte_net_ip_udptcp_cksum_mbuf in rte_net.h which is shared between rte_eth_tap and rte_eth_af_packet that supports IPv4/UDP/TCP checksums in software due to hardware offload and context propagation not being supported. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- Depends-on: patch-160679 ("eal: add __rte_may_alias and __rte_aligned to unaligned typedefs") doc/guides/nics/features/afpacket.ini | 2 + doc/guides/rel_notes/release_26_03.rst | 2 + drivers/net/af_packet/rte_eth_af_packet.c | 42 ++++++++++---- drivers/net/tap/rte_eth_tap.c | 70 ++--------------------- lib/net/rte_net.c | 68 ++++++++++++++++++++++ lib/net/rte_net.h | 22 +++++++ 6 files changed, 130 insertions(+), 76 deletions(-) diff --git a/doc/guides/nics/features/afpacket.ini b/doc/guides/nics/features/afpacket.ini index 391f79b173..4bb81c84ff 100644 --- a/doc/guides/nics/features/afpacket.ini +++ b/doc/guides/nics/features/afpacket.ini @@ -7,5 +7,7 @@ Link status = Y Promiscuous mode = Y MTU update = Y +L3 checksum offload = Y +L4 checksum offload = Y Basic stats = Y Stats per queue = Y diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 3b6be19645..2946acce99 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ 
b/doc/guides/rel_notes/release_26_03.rst @@ -60,6 +60,8 @@ New Features * Fixed kernel memory barrier protocol for memory availability * Fixed shared memory frame overhead offset calculation * Added ``txpollnotrdy`` devarg to avoid ``poll()`` blocking calls + * Added checksum offload support for ``IPV4_CKSUM``, ``UDP_CKSUM``, + and ``TCP_CKSUM`` Removed Items ------------- diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 9df1b1fd4c..662341ffc7 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -10,6 +10,8 @@ #include <rte_string_fns.h> #include <rte_mbuf.h> #include <rte_atomic.h> +#include <rte_ip.h> +#include <rte_net.h> #include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> @@ -101,6 +103,7 @@ struct pmd_internals { struct pkt_tx_queue *tx_queue; uint8_t vlan_strip; uint8_t timestamp_offloading; + bool tx_sw_cksum; }; static const char *valid_arguments[] = { @@ -220,7 +223,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) /* account for the receive frame */ bufs[i] = mbuf; num_rx++; - num_rx_bytes += mbuf->pkt_len; + num_rx_bytes += rte_pktmbuf_pkt_len(mbuf); } pkt_q->framenum = framenum; pkt_q->rx_pkts += num_rx; @@ -256,6 +259,7 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { struct tpacket2_hdr *ppd; struct rte_mbuf *mbuf; + struct rte_mbuf *seg; uint8_t *pbuf; unsigned int framecount, framenum; struct pollfd pfd; @@ -277,7 +281,7 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) mbuf = bufs[i]; /* Drop oversized packets. 
Insert VLAN if necessary */ - if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size || + if (unlikely(rte_pktmbuf_pkt_len(mbuf) > pkt_q->frame_data_size || ((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 && rte_vlan_insert(&mbuf) != 0))) { continue; @@ -308,23 +312,32 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; - ppd->tp_len = mbuf->pkt_len; - ppd->tp_snaplen = mbuf->pkt_len; + if (pkt_q->sw_cksum) { + seg = rte_net_ip_udptcp_cksum_mbuf(mbuf); + if (!seg) + continue; - struct rte_mbuf *tmp_mbuf = mbuf; + mbuf = seg; + bufs[i] = seg; + } + + ppd->tp_len = rte_pktmbuf_pkt_len(mbuf); + ppd->tp_snaplen = rte_pktmbuf_pkt_len(mbuf); + + seg = mbuf; do { - uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); - memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); + uint16_t data_len = rte_pktmbuf_data_len(seg); + memcpy(pbuf, rte_pktmbuf_mtod(seg, void*), data_len); pbuf += data_len; - tmp_mbuf = tmp_mbuf->next; - } while (tmp_mbuf); + seg = seg->next; + } while (seg); /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; num_tx++; - num_tx_bytes += mbuf->pkt_len; + num_tx_bytes += rte_pktmbuf_pkt_len(mbuf); } rte_pktmbuf_free_bulk(&bufs[0], i); @@ -396,10 +409,13 @@ eth_dev_configure(struct rte_eth_dev *dev __rte_unused) { struct rte_eth_conf *dev_conf = &dev->data->dev_conf; const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + const struct rte_eth_txmode *txmode = &dev_conf->txmode; struct pmd_internals *internals = dev->data->dev_private; internals->vlan_strip = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP); internals->timestamp_offloading = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP); + internals->tx_sw_cksum = !!(txmode->offloads & (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM)); return 0; } @@ -417,7 +433,10 @@ 
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->max_tx_queues = (uint16_t)internals->nb_queues; dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | - RTE_ETH_TX_OFFLOAD_VLAN_INSERT; + RTE_ETH_TX_OFFLOAD_VLAN_INSERT | + RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | + RTE_ETH_TX_OFFLOAD_TCP_CKSUM; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | RTE_ETH_RX_OFFLOAD_TIMESTAMP; @@ -618,6 +637,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, { struct pmd_internals *internals = dev->data->dev_private; + internals->tx_queue[tx_queue_id].sw_cksum = internals->tx_sw_cksum; dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id]; return 0; diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c index 730f1859bd..c7ed6dfb8b 100644 --- a/drivers/net/tap/rte_eth_tap.c +++ b/drivers/net/tap/rte_eth_tap.c @@ -525,7 +525,6 @@ tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs, struct iovec iovecs[mbuf->nb_segs + 2]; struct tun_pi pi = { .flags = 0, .proto = 0x00 }; struct rte_mbuf *seg = mbuf; - uint64_t l4_ol_flags; int proto; int n; int j; @@ -556,74 +555,15 @@ tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs, iovecs[k].iov_len = sizeof(pi); k++; - l4_ol_flags = mbuf->ol_flags & RTE_MBUF_F_TX_L4_MASK; - if (txq->csum && (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM || - l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM || - l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM)) { - unsigned int hdrlens = mbuf->l2_len + mbuf->l3_len; - uint16_t *l4_cksum; - void *l3_hdr; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) - hdrlens += sizeof(struct rte_udp_hdr); - else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) - hdrlens += sizeof(struct rte_tcp_hdr); - else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) + if (txq->csum) { + seg = rte_net_ip_udptcp_cksum_mbuf(mbuf); + if (!seg) return -1; - /* Support only packets with at least layer 4 - * 
header included in the first segment - */ - if (rte_pktmbuf_data_len(mbuf) < hdrlens) - return -1; - - /* To change checksums (considering that a mbuf can be - * indirect, for example), copy l2, l3 and l4 headers - * in a new segment and chain it to existing data - */ - seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); - if (seg == NULL) - return -1; - rte_pktmbuf_adj(mbuf, hdrlens); - rte_pktmbuf_chain(seg, mbuf); - pmbufs[i] = mbuf = seg; - - l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); - if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { - struct rte_ipv4_hdr *iph = l3_hdr; - - iph->hdr_checksum = 0; - iph->hdr_checksum = rte_ipv4_cksum(iph); - } - - if (l4_ol_flags == RTE_MBUF_F_TX_L4_NO_CKSUM) - goto skip_l4_cksum; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) { - struct rte_udp_hdr *udp_hdr; - - udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &udp_hdr->dgram_cksum; - } else { - struct rte_tcp_hdr *tcp_hdr; - - tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &tcp_hdr->cksum; - } - - *l4_cksum = 0; - if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) { - *l4_cksum = rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } else { - *l4_cksum = rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } + mbuf = seg; + pmbufs[i] = seg; } -skip_l4_cksum: for (j = 0; j < mbuf->nb_segs; j++) { iovecs[k].iov_len = rte_pktmbuf_data_len(seg); iovecs[k].iov_base = rte_pktmbuf_mtod(seg, void *); diff --git a/lib/net/rte_net.c b/lib/net/rte_net.c index 458b4814a9..1a0397bcd7 100644 --- a/lib/net/rte_net.c +++ b/lib/net/rte_net.c @@ -615,3 +615,71 @@ uint32_t rte_net_get_ptype(const struct rte_mbuf *m, return pkt_type; } + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_net_ip_udptcp_cksum_mbuf, 26.03) +struct rte_mbuf * +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf) +{ + const uint64_t l4_ol_flags = mbuf->ol_flags & 
RTE_MBUF_F_TX_L4_MASK; + const uint32_t l4_offset = mbuf->l2_len + mbuf->l3_len; + uint32_t hdrlens = l4_offset; + unaligned_uint16_t *l4_cksum = NULL; + void *l3_hdr; + + /* Quick check - nothing to do if no checksum offloads requested */ + if (!(mbuf->ol_flags & (RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_L4_MASK))) + return mbuf; + + /* Determine total header length needed */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) + hdrlens += sizeof(struct rte_udp_hdr); + else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) + hdrlens += sizeof(struct rte_tcp_hdr); + else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) + return NULL; /* Unsupported L4 checksum type */ + + /* Validate we at least have L2+L3 headers */ + if (unlikely(rte_pktmbuf_data_len(mbuf) < l4_offset)) + return NULL; + + if (!RTE_MBUF_DIRECT(mbuf) || rte_mbuf_refcnt_read(mbuf) > 1) { + /* Indirect or shared - must copy, cannot modify in-place */ + struct rte_mbuf *seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); + if (!seg) + return NULL; + + rte_pktmbuf_adj(mbuf, hdrlens); + rte_pktmbuf_chain(seg, mbuf); + mbuf = seg; + } else if (rte_pktmbuf_data_len(mbuf) < hdrlens && + (rte_pktmbuf_linearize(mbuf) < 0 || rte_pktmbuf_data_len(mbuf) < hdrlens)) { + /* failed: direct, non-shared, but segmented headers linearize in-place */ + return NULL; + } + /* else: Direct, non-shared, contiguous - can modify in-place, nothing to do */ + + l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); + + /* IPv4 header checksum */ + if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { + struct rte_ipv4_hdr *iph = (struct rte_ipv4_hdr *)l3_hdr; + iph->hdr_checksum = 0; + iph->hdr_checksum = rte_ipv4_cksum(iph); + } + + /* L4 checksum */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) + l4_cksum = &rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, + l4_offset)->dgram_cksum; + else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) + l4_cksum = &rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, l4_offset)->cksum; + + if (l4_cksum) { 
+ *l4_cksum = 0; + *l4_cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, l4_offset); + } + + return mbuf; +} diff --git a/lib/net/rte_net.h b/lib/net/rte_net.h index 65d724b84b..b258a86928 100644 --- a/lib/net/rte_net.h +++ b/lib/net/rte_net.h @@ -246,6 +246,28 @@ rte_net_intel_cksum_prepare(struct rte_mbuf *m) return rte_net_intel_cksum_flags_prepare(m, m->ol_flags); } +/** + * Compute IP and L4 checksums in software for mbufs with + * RTE_MBUF_F_TX_IP_CKSUM, RTE_MBUF_F_TX_UDP_CKSUM, or + * RTE_MBUF_F_TX_TCP_CKSUM offload flags set. + * + * On success, this function takes ownership of the input mbuf. The mbuf may be + * modified in-place (for direct, non-shared mbufs) or a new mbuf chain may be + * created (for indirect/shared mbufs) with the original becoming part of the chain. + * + * @param mbuf + * The packet mbuf to checksum. + * @return + * - On success: pointer to mbuf with checksums computed (may be same as input + * or a new mbuf chain). Caller must free only this returned pointer; the input + * mbuf pointer should not be freed separately as it may be part of the returned + * chain or may be the same as the returned pointer. + * - On error: NULL. Original mbuf remains valid and owned by caller. + */ +__rte_experimental +struct rte_mbuf * +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf); + #ifdef __cplusplus } #endif -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v5 4/4] net/af_packet: add software checksum offload support 2026-02-03 7:07 ` [PATCH v5 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 @ 2026-02-03 8:20 ` Scott Mitchell 2026-02-03 14:12 ` Stephen Hemminger 2026-02-03 14:13 ` Stephen Hemminger 1 sibling, 1 reply; 65+ messages in thread From: Scott Mitchell @ 2026-02-03 8:20 UTC (permalink / raw) To: dev; +Cc: stephen gcc warnings are resolved when built with Depends-on patch (https://patches.dpdk.org/project/dpdk/patch/20260202044841.90945-2-scott.k.mitch1@gmail.com/). Did I indicate the dependency correctly (reference https://doc.dpdk.org/guides/contributing/patches.html#patch-dependencies), and is CI expected to apply dependent patches before the current patch series? ../lib/net/rte_net.c:672:28: error: taking address of packed member of ‘struct rte_udp_hdr’ may result in an unaligned pointer value [-Werror=address-of-packed-member] 672 | l4_cksum = &rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 673 | l4_offset)->dgram_cksum; | ~~~~~~~~~~~~~~~~~~~~~~~ ../lib/net/rte_net.c:675:28: error: taking address of packed member of ‘struct rte_tcp_hdr’ may result in an unaligned pointer value [-Werror=address-of-packed-member] 675 | l4_cksum = &rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, l4_offset)->cksum; | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v5 4/4] net/af_packet: add software checksum offload support 2026-02-03 8:20 ` Scott Mitchell @ 2026-02-03 14:12 ` Stephen Hemminger 2026-02-04 2:59 ` Scott Mitchell 0 siblings, 1 reply; 65+ messages in thread From: Stephen Hemminger @ 2026-02-03 14:12 UTC (permalink / raw) To: Scott Mitchell; +Cc: dev On Tue, 3 Feb 2026 00:20:43 -0800 Scott Mitchell <scott.k.mitch1@gmail.com> wrote: > gcc warnings are resolved when built with Depends-on patch > (https://patches.dpdk.org/project/dpdk/patch/20260202044841.90945-2-scott.k.mitch1@gmail.com/). > Did I indicate the dependency correctly (reference > https://doc.dpdk.org/guides/contributing/patches.html#patch-dependencies), > and is CI expected to apply dependent patches before the current patch > series? > > ../lib/net/rte_net.c:672:28: error: taking address of packed member of > ‘struct rte_udp_hdr’ may result in an unaligned pointer value > [-Werror=address-of-packed-member] > 672 | l4_cksum = &rte_pktmbuf_mtod_offset(mbuf, > struct rte_udp_hdr *, > | > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > 673 | l4_offset)->dgram_cksum; > | ~~~~~~~~~~~~~~~~~~~~~~~ > ../lib/net/rte_net.c:675:28: error: taking address of packed member of > ‘struct rte_tcp_hdr’ may result in an unaligned pointer value > [-Werror=address-of-packed-member] > 675 | l4_cksum = &rte_pktmbuf_mtod_offset(mbuf, > struct rte_tcp_hdr *, l4_offset)->cksum; > | > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CI system doesn't know what Depends-on is yet. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v5 4/4] net/af_packet: add software checksum offload support 2026-02-03 14:12 ` Stephen Hemminger @ 2026-02-04 2:59 ` Scott Mitchell 0 siblings, 0 replies; 65+ messages in thread From: Scott Mitchell @ 2026-02-04 2:59 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev > CI system doesn't know what Depends-on is yet. Thanks for confirming. Should I wait for the depends-on patch to be merged before posting updates to this series? If I have other not-yet-submitted patches for af_packet, should I wait for this series to be merged? I'd like to post them in parallel to get early feedback and I can rebase as depends-on patches are merged. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v5 4/4] net/af_packet: add software checksum offload support 2026-02-03 7:07 ` [PATCH v5 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 2026-02-03 8:20 ` Scott Mitchell @ 2026-02-03 14:13 ` Stephen Hemminger 2026-02-04 1:39 ` Scott Mitchell 1 sibling, 1 reply; 65+ messages in thread From: Stephen Hemminger @ 2026-02-03 14:13 UTC (permalink / raw) To: scott.k.mitch1; +Cc: dev On Mon, 2 Feb 2026 23:07:40 -0800 scott.k.mitch1@gmail.com wrote: > > +/** > + * Compute IP and L4 checksums in software for mbufs with > + * RTE_MBUF_F_TX_IP_CKSUM, RTE_MBUF_F_TX_UDP_CKSUM, or > + * RTE_MBUF_F_TX_TCP_CKSUM offload flags set. > + * > + * On success, this function takes ownership of the input mbuf. The mbuf may be > + * modified in-place (for direct, non-shared mbufs) or a new mbuf chain may be > + * created (for indirect/shared mbufs) with the original becoming part of the chain. > + * > + * @param mbuf > + * The packet mbuf to checksum. > + * @return > + * - On success: pointer to mbuf with checksums computed (may be same as input > + * or a new mbuf chain). Caller must free only this returned pointer; the input > + * mbuf pointer should not be freed separately as it may be part of the returned > + * chain or may be the same as the returned pointer. > + * - On error: NULL. Original mbuf remains valid and owned by caller. > + */ > +__rte_experimental > +struct rte_mbuf * > +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf); Probably need to add EXPERIMENTAL into docbook comment as well. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v5 4/4] net/af_packet: add software checksum offload support 2026-02-03 14:13 ` Stephen Hemminger @ 2026-02-04 1:39 ` Scott Mitchell 2026-02-05 21:27 ` Stephen Hemminger 0 siblings, 1 reply; 65+ messages in thread From: Scott Mitchell @ 2026-02-04 1:39 UTC (permalink / raw) To: Stephen Hemminger; +Cc: dev > Probably need to add EXPERIMENTAL into docbook comment as well. Done. ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v5 4/4] net/af_packet: add software checksum offload support 2026-02-04 1:39 ` Scott Mitchell @ 2026-02-05 21:27 ` Stephen Hemminger 0 siblings, 0 replies; 65+ messages in thread From: Stephen Hemminger @ 2026-02-05 21:27 UTC (permalink / raw) To: Scott Mitchell; +Cc: dev On Tue, 3 Feb 2026 17:39:37 -0800 Scott Mitchell <scott.k.mitch1@gmail.com> wrote: > > Probably need to add EXPERIMENTAL into docbook comment as well. > > Done. Rebase and resubmit, the dependent patch is already in main. ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v6 0/4] af_packet correctness, performance, cksum 2026-02-03 7:07 ` [PATCH v5 " scott.k.mitch1 ` (3 preceding siblings ...) 2026-02-03 7:07 ` [PATCH v5 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 @ 2026-02-06 1:11 ` scott.k.mitch1 2026-02-06 1:11 ` [PATCH v6 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 ` (4 more replies) 4 siblings, 5 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-06 1:11 UTC (permalink / raw) To: dev; +Cc: stephen, Scott From: Scott <scott.k.mitch1@gmail.com> This series fixes critical thread safety bugs in the af_packet PMD and adds performance optimizations. Patch 1 fixes two major correctness issues: - Thread safety: tp_status was accessed without memory barriers, violating the kernel's PACKET_MMAP protocol. On aarch64 and other weakly-ordered architectures, this causes packet corruption due to missing memory ordering. The fix matches the kernel's memory model: volatile unaligned reads/writes with explicit rte_smp_rmb/wmb barriers and __may_alias__ protection. - Frame calculations: Fixed incorrect frame overhead and address calculations that caused memory corruption when frames don't evenly divide blocks. 
Patches 2-4 add performance improvements: - Patch 2: Bulk mbuf freeing, unlikely annotations - Patch 3: TX poll control to reduce syscall overhead - Patch 4: Software checksum offload support with shared rte_net utility v6 changes: - rte_net_ip_udptcp_cksum_mbuf doxygen EXPERIMENTAL tag v5 changes: - rte_net_ip_udptcp_cksum_mbuf moved to rte_net.c (avoid forced inline) - rte_net_ip_udptcp_cksum_mbuf remove copy arg, handle more mbuf types - af_packet and tap calling code consistent for sw cksum v4 changes: - Remove prefetch (perf results didn't show benefit) - Fix variable style for consistency (declare at start of method) - Add release notes for af_packet and documentation for fixes v3 changes: - Patch 4: Fix compile error due to implicit cast with C++ compiler v2 changes: - Patch 1: Rewrote to use volatile + barriers instead of C11 atomics to match kernel's memory model. Added dependency on patch-160274 for __rte_may_alias attribute. - Patch 4: Refactored to use shared rte_net_ip_udptcp_cksum_mbuf() utility function, eliminating code duplication with tap driver. Scott Mitchell (4): net/af_packet: fix thread safety and frame calculations net/af_packet: RX/TX bulk free, unlikely hint net/af_packet: tx poll control net/af_packet: add software checksum offload support doc/guides/nics/af_packet.rst | 6 +- doc/guides/nics/features/afpacket.ini | 2 + doc/guides/rel_notes/release_26_03.rst | 9 +- drivers/net/af_packet/rte_eth_af_packet.c | 253 +++++++++++++++------- drivers/net/tap/rte_eth_tap.c | 70 +----- lib/net/rte_net.c | 69 ++++++ lib/net/rte_net.h | 25 +++ 7 files changed, 292 insertions(+), 142 deletions(-) -- 2.39.5 (Apple Git-154) ^ permalink raw reply [flat|nested] 65+ messages in thread
* [PATCH v6 1/4] net/af_packet: fix thread safety and frame calculations 2026-02-06 1:11 ` [PATCH v6 0/4] af_packet correctness, performance, cksum scott.k.mitch1 @ 2026-02-06 1:11 ` scott.k.mitch1 2026-02-06 1:11 ` [PATCH v6 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1 ` (3 subsequent siblings) 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-06 1:11 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell, linville, stable From: Scott Mitchell <scott.k.mitch1@gmail.com> Thread Safety: The tp_status field was accessed without proper memory barriers, violating the kernel's PACKET_MMAP synchronization protocol. The kernel implements this protocol in net/packet/af_packet.c: - __packet_get_status: smp_rmb() then READ_ONCE() (volatile read) - __packet_set_status: WRITE_ONCE() (volatile write) then smp_wmb() READ_ONCE/WRITE_ONCE use __may_alias__ attribute via __uXX_alias_t types to prevent compiler optimizations that assume type-based aliasing rules, which is critical for tp_status access that may be misaligned within the ring buffer. Userspace must use equivalent semantics: volatile unaligned_uint32_t (with __rte_may_alias) reads/writes with explicit memory barriers (rte_smp_rmb/rte_smp_wmb). On aarch64 and other weakly-ordered architectures, missing barriers cause packet corruption because: - RX: CPU may read stale packet data before seeing tp_status update - TX: CPU may reorder stores, causing kernel to see tp_status before packet data is fully written This becomes critical with io_uring SQPOLL mode where the kernel polling thread on a different CPU core asynchronously updates tp_status, making proper memory ordering essential. Note: Uses rte_smp_[r/w]mb which triggers checkpatch warnings, but C11 atomics cannot be used because tp_status is not declared _Atomic in the kernel's tpacket2_hdr structure. We must match the kernel's volatile + barrier memory model with __may_alias__ protection. 
Frame Calculation Issues: 1. Frame overhead incorrectly calculated as TPACKET_ALIGN(TPACKET2_HDRLEN) instead of TPACKET2_HDRLEN - sizeof(struct sockaddr_ll), causing incorrect usable frame data size. 2. Frame address calculation assumed sequential layout (frame_base + i * frame_size), but the kernel's packet_lookup_frame() uses block-based addressing: block_idx = position / frames_per_block frame_offset = position % frames_per_block address = block_start[block_idx] + (frame_offset * frame_size) This caused memory corruption when frames don't evenly divide blocks. Fixes: 364e08f2bbc0 ("af_packet: add PMD for AF_PACKET-based virtual devices") Cc: linville@tuxdriver.com Cc: stable@dpdk.org Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- doc/guides/rel_notes/release_26_03.rst | 6 +- drivers/net/af_packet/rte_eth_af_packet.c | 151 ++++++++++++++++------ 2 files changed, 119 insertions(+), 38 deletions(-) diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 031eaa657e..5eebed5023 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -55,6 +55,11 @@ New Features Also, make sure to start the actual text at the margin. ======================================================= +* **Updated af_packet net driver.** + + * Fixed kernel memory barrier protocol for memory availability + * Fixed shared memory frame overhead offset calculation + * **Updated AMD axgbe ethernet driver.** * Added support for V4000 Krackan2e. @@ -63,7 +68,6 @@ New Features * Added support for pre and post VF reset callbacks. 
- Removed Items ------------- diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index c0ba3381ea..78ed3fb858 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -9,6 +9,8 @@ #include <rte_common.h> #include <rte_string_fns.h> #include <rte_mbuf.h> +#include <rte_atomic.h> +#include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> #include <rte_malloc.h> @@ -41,6 +43,10 @@ #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +static const uint16_t eth_af_packet_frame_size_max = RTE_IPV4_MAX_PKT_LEN; +#define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) +#define ETH_AF_PACKET_ETH_OVERHEAD (RTE_ETHER_HDR_LEN + RTE_ETHER_CRC_LEN) + static uint64_t timestamp_dynflag; static int timestamp_dynfield_offset = -1; @@ -120,6 +126,28 @@ RTE_LOG_REGISTER_DEFAULT(af_packet_logtype, NOTICE); RTE_LOG_LINE(level, AFPACKET, "%s(): " fmt ":%s", __func__, \ ## __VA_ARGS__, strerror(errno)) +/** + * Read tp_status from packet mmap ring. Matches kernel's READ_ONCE() with smp_rmb() + * ordering in af_packet.c __packet_get_status. + */ +static inline uint32_t +tpacket_read_status(const volatile void *tp_status) +{ + rte_smp_rmb(); + return *((const volatile unaligned_uint32_t *)tp_status); +} + +/** + * Write tp_status to packet mmap ring. Matches kernel's WRITE_ONCE() with smp_wmb() + * ordering in af_packet.c __packet_set_status. 
+ */ +static inline void +tpacket_write_status(volatile void *tp_status, uint32_t status) +{ + *((volatile unaligned_uint32_t *)tp_status) = status; + rte_smp_wmb(); +} + static uint16_t eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { @@ -129,7 +157,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint8_t *pbuf; struct pkt_rx_queue *pkt_q = queue; uint16_t num_rx = 0; - unsigned long num_rx_bytes = 0; + uint32_t num_rx_bytes = 0; + uint32_t tp_status; unsigned int framecount, framenum; if (unlikely(nb_pkts == 0)) @@ -144,7 +173,8 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) for (i = 0; i < nb_pkts; i++) { /* point at the next incoming frame */ ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - if ((ppd->tp_status & TP_STATUS_USER) == 0) + tp_status = tpacket_read_status(&ppd->tp_status); + if ((tp_status & TP_STATUS_USER) == 0) break; /* allocate the next mbuf */ @@ -160,7 +190,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, rte_pktmbuf_data_len(mbuf)); /* check for vlan info */ - if (ppd->tp_status & TP_STATUS_VLAN_VALID) { + if (tp_status & TP_STATUS_VLAN_VALID) { mbuf->vlan_tci = ppd->tp_vlan_tci; mbuf->ol_flags |= (RTE_MBUF_F_RX_VLAN | RTE_MBUF_F_RX_VLAN_STRIPPED); @@ -179,7 +209,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_KERNEL; + tpacket_write_status(&ppd->tp_status, TP_STATUS_KERNEL); if (++framenum >= framecount) framenum = 0; mbuf->port = pkt_q->in_port; @@ -228,8 +258,8 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) struct pollfd pfd; struct pkt_tx_queue *pkt_q = queue; uint16_t num_tx = 0; - unsigned long num_tx_bytes = 0; - int i; + uint32_t num_tx_bytes = 0; + uint16_t i; if (unlikely(nb_pkts == 0)) return 0; @@ -259,16 +289,6 @@ 
eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) } } - /* point at the next incoming frame */ - if (!tx_ring_status_available(ppd->tp_status)) { - if (poll(&pfd, 1, -1) < 0) - break; - - /* poll() can return POLLERR if the interface is down */ - if (pfd.revents & POLLERR) - break; - } - /* * poll() will almost always return POLLOUT, even if there * are no extra buffers available @@ -283,26 +303,28 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * * This results in poll() returning POLLOUT. */ - if (!tx_ring_status_available(ppd->tp_status)) + if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && + (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + /* Ring is full, stop here. Don't process bufs[i]. */ break; + } - /* copy the tx frame data */ - pbuf = (uint8_t *) ppd + TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; struct rte_mbuf *tmp_mbuf = mbuf; - while (tmp_mbuf) { + do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); pbuf += data_len; tmp_mbuf = tmp_mbuf->next; - } + } while (tmp_mbuf); ppd->tp_len = mbuf->pkt_len; ppd->tp_snaplen = mbuf->pkt_len; /* release incoming frame and advance ring buffer */ - ppd->tp_status = TP_STATUS_SEND_REQUEST; + tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; @@ -392,10 +414,12 @@ eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->if_index = internals->if_index; dev_info->max_mac_addrs = 1; - dev_info->max_rx_pktlen = RTE_ETHER_MAX_LEN; + dev_info->max_rx_pktlen = (uint32_t)eth_af_packet_frame_size_max + + ETH_AF_PACKET_ETH_OVERHEAD; + dev_info->max_mtu = eth_af_packet_frame_size_max; dev_info->max_rx_queues = 
(uint16_t)internals->nb_queues; dev_info->max_tx_queues = (uint16_t)internals->nb_queues; - dev_info->min_rx_bufsize = 0; + dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | RTE_ETH_TX_OFFLOAD_VLAN_INSERT; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | @@ -572,8 +596,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev, /* Now get the space available for data in the mbuf */ buf_size = rte_pktmbuf_data_room_size(pkt_q->mb_pool) - RTE_PKTMBUF_HEADROOM; - data_size = internals->req.tp_frame_size; - data_size -= TPACKET2_HDRLEN - sizeof(struct sockaddr_ll); + data_size = internals->req.tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; if (data_size > buf_size) { PMD_LOG(ERR, @@ -612,7 +635,7 @@ eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu) int ret; int s; unsigned int data_size = internals->req.tp_frame_size - - TPACKET2_HDRLEN; + ETH_AF_PACKET_FRAME_OVERHEAD; if (mtu > data_size) return -EINVAL; @@ -977,25 +1000,38 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, rx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (rx_queue->rd == NULL) goto error; + /* Frame addresses must match kernel's packet_lookup_frame(): + * block_idx = position / frames_per_block + * frame_offset = position % frames_per_block + * address = block_start + (frame_offset * frame_size) + */ + const uint32_t frames_per_block = req->tp_block_size / req->tp_frame_size; for (i = 0; i < req->tp_frame_nr; ++i) { - rx_queue->rd[i].iov_base = rx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + rx_queue->rd[i].iov_base = rx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); rx_queue->rd[i].iov_len = req->tp_frame_size; } rx_queue->sockfd = qsockfd; tx_queue = &((*internals)->tx_queue[q]); tx_queue->framecount = req->tp_frame_nr; - tx_queue->frame_data_size = req->tp_frame_size; - 
tx_queue->frame_data_size -= TPACKET2_HDRLEN - - sizeof(struct sockaddr_ll); + tx_queue->frame_data_size = req->tp_frame_size - ETH_AF_PACKET_FRAME_OVERHEAD; tx_queue->map = rx_queue->map + req->tp_block_size * req->tp_block_nr; tx_queue->rd = rte_zmalloc_socket(name, rdsize, 0, numa_node); if (tx_queue->rd == NULL) goto error; + /* See comment above rx_queue->rd initialization. */ for (i = 0; i < req->tp_frame_nr; ++i) { - tx_queue->rd[i].iov_base = tx_queue->map + (i * framesize); + const uint32_t block_idx = i / frames_per_block; + const uint32_t frame_in_block = i % frames_per_block; + tx_queue->rd[i].iov_base = tx_queue->map + + (block_idx * req->tp_block_size) + + (frame_in_block * req->tp_frame_size); tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; @@ -1081,7 +1117,8 @@ rte_eth_from_packet(struct rte_vdev_device *dev, struct rte_kvargs_pair *pair = NULL; unsigned k_idx; unsigned int blockcount; - unsigned int blocksize; + const int pagesize = getpagesize(); + unsigned int blocksize = pagesize; unsigned int framesize = DFLT_FRAME_SIZE; unsigned int framecount = DFLT_FRAME_COUNT; unsigned int qpairs = 1; @@ -1092,8 +1129,6 @@ rte_eth_from_packet(struct rte_vdev_device *dev, if (*sockfd < 0) return -1; - blocksize = getpagesize(); - /* * Walk arguments for configurable settings */ @@ -1162,13 +1197,55 @@ rte_eth_from_packet(struct rte_vdev_device *dev, return -1; } - blockcount = framecount / (blocksize / framesize); + const unsigned int frames_per_block = blocksize / framesize; + blockcount = framecount / frames_per_block; if (!blockcount) { PMD_LOG(ERR, "%s: invalid AF_PACKET MMAP parameters", name); return -1; } + /* + * https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt + * Check constraints that may be enforced by the kernel and cause failure + * to initialize the rings but explicit error messages aren't provided. 
+ * See packet_set_ring in linux kernel for enforcement: + * https://github.com/torvalds/linux/blob/master/net/packet/af_packet.c + */ + if (blocksize % pagesize != 0) { + /* tp_block_size must be a multiple of PAGE_SIZE */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of PAGE_SIZE=%d", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, pagesize); + } + if (framesize % TPACKET_ALIGNMENT != 0) { + /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */ + PMD_LOG(WARNING, "%s: %s=%u must be a multiple of TPACKET_ALIGNMENT=%d", + name, ETH_AF_PACKET_FRAMESIZE_ARG, framesize, TPACKET_ALIGNMENT); + } + if (frames_per_block == 0 || frames_per_block > UINT_MAX / blockcount || + framecount != frames_per_block * blockcount) { + /* tp_frame_nr must be exactly frames_per_block*tp_block_nr */ + PMD_LOG(WARNING, "%s: %s=%u must be exactly " + "frames_per_block(%s/%s = %u/%u = %u) * blockcount(%u)", + name, ETH_AF_PACKET_FRAMECOUNT_ARG, framecount, + ETH_AF_PACKET_BLOCKSIZE_ARG, ETH_AF_PACKET_FRAMESIZE_ARG, + blocksize, framesize, frames_per_block, blockcount); + } + + /* Below conditions may not cause errors but provide hints to improve */ + if (blocksize % framesize != 0) { + PMD_LOG(WARNING, "%s: %s=%u not evenly divisible by %s=%u, " + "may waste memory", name, + ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize, + ETH_AF_PACKET_FRAMESIZE_ARG, framesize); + } + if (!rte_is_power_of_2(blocksize)) { + /* tp_block_size should be a power of two or there will be waste */ + PMD_LOG(WARNING, "%s: %s=%u should be a power of two " + "or there will be a waste of memory", + name, ETH_AF_PACKET_BLOCKSIZE_ARG, blocksize); + } + PMD_LOG(DEBUG, "%s: AF_PACKET MMAP parameters:", name); PMD_LOG(DEBUG, "%s:\tblock size %d", name, blocksize); PMD_LOG(DEBUG, "%s:\tblock count %d", name, blockcount); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v6 2/4] net/af_packet: RX/TX bulk free, unlikely hint 2026-02-06 1:11 ` [PATCH v6 0/4] af_packet correctness, performance, cksum scott.k.mitch1 2026-02-06 1:11 ` [PATCH v6 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 @ 2026-02-06 1:11 ` scott.k.mitch1 2026-02-06 1:11 ` [PATCH v6 3/4] net/af_packet: tx poll control scott.k.mitch1 ` (2 subsequent siblings) 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-06 1:11 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> - Use rte_pktmbuf_free_bulk() in TX path instead of individual rte_pktmbuf_free() calls for better batch efficiency - Add unlikely() hints for error paths (oversized packets, VLAN insertion failures, sendto errors) to optimize branch prediction - Remove unnecessary early nb_pkts == 0 when loop handles this and app may never call with 0 frames. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- drivers/net/af_packet/rte_eth_af_packet.c | 41 ++++++++--------------- 1 file changed, 14 insertions(+), 27 deletions(-) diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 78ed3fb858..9acce990d1 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -161,9 +161,6 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t tp_status; unsigned int framecount, framenum; - if (unlikely(nb_pkts == 0)) - return 0; - /* * Reads the given number of packets from the AF_PACKET socket one by * one and copies the packet data into a newly allocated mbuf. 
@@ -261,9 +258,6 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - if (unlikely(nb_pkts == 0)) - return 0; - memset(&pfd, 0, sizeof(pfd)); pfd.fd = pkt_q->sockfd; pfd.events = POLLOUT; @@ -271,24 +265,17 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) framecount = pkt_q->framecount; framenum = pkt_q->framenum; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; for (i = 0; i < nb_pkts; i++) { - mbuf = *bufs++; + mbuf = bufs[i]; - /* drop oversized packets */ - if (mbuf->pkt_len > pkt_q->frame_data_size) { - rte_pktmbuf_free(mbuf); + /* Drop oversized packets. Insert VLAN if necessary */ + if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size || + ((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 && + rte_vlan_insert(&mbuf) != 0))) { continue; } - /* insert vlan info if necessary */ - if (mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) { - if (rte_vlan_insert(&mbuf)) { - rte_pktmbuf_free(mbuf); - continue; - } - } - + ppd = (struct tpacket2_hdr *)pkt_q->rd[framenum].iov_base; /* * poll() will almost always return POLLOUT, even if there * are no extra buffers available @@ -312,6 +299,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; + ppd->tp_len = mbuf->pkt_len; + ppd->tp_snaplen = mbuf->pkt_len; + struct rte_mbuf *tmp_mbuf = mbuf; do { uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); @@ -320,23 +310,20 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) tmp_mbuf = tmp_mbuf->next; } while (tmp_mbuf); - ppd->tp_len = mbuf->pkt_len; - ppd->tp_snaplen = mbuf->pkt_len; - /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; - ppd = (struct tpacket2_hdr *) pkt_q->rd[framenum].iov_base; - num_tx++; num_tx_bytes += mbuf->pkt_len; - rte_pktmbuf_free(mbuf); } + 
rte_pktmbuf_free_bulk(&bufs[0], i); + /* kick-off transmits */ - if (sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && - errno != ENOBUFS && errno != EAGAIN) { + if (unlikely(num_tx > 0 && + sendto(pkt_q->sockfd, NULL, 0, MSG_DONTWAIT, NULL, 0) == -1 && + errno != ENOBUFS && errno != EAGAIN)) { /* * In case of a ENOBUFS/EAGAIN error all of the enqueued * packets will be considered successful even though only some -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v6 3/4] net/af_packet: tx poll control 2026-02-06 1:11 ` [PATCH v6 0/4] af_packet correctness, performance, cksum scott.k.mitch1 2026-02-06 1:11 ` [PATCH v6 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1 2026-02-06 1:11 ` [PATCH v6 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1 @ 2026-02-06 1:11 ` scott.k.mitch1 2026-02-06 1:11 ` [PATCH v6 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 2026-02-06 1:49 ` [PATCH v6 0/4] af_packet correctness, performance, cksum Stephen Hemminger 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-06 1:11 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add txpollnotrdy devarg (default=true) to control whether poll() is called when the TX ring is not ready. This allows users to avoid blocking behavior if application threads are in asynchronous poll mode where blocking the thread has negative side effects and backpressure is applied via different means. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- doc/guides/nics/af_packet.rst | 6 ++++- doc/guides/rel_notes/release_26_03.rst | 1 + drivers/net/af_packet/rte_eth_af_packet.c | 33 ++++++++++++++++++----- 3 files changed, 32 insertions(+), 8 deletions(-) diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst index 1505b98ff7..782a962c3f 100644 --- a/doc/guides/nics/af_packet.rst +++ b/doc/guides/nics/af_packet.rst @@ -29,6 +29,10 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings. * ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple of 16B); * ``framecnt`` - PACKET_MMAP frame count (optional, default 512). +* ``txpollnotrdy`` - Control behavior if tx is attempted but there is no + space available to write to the kernel. If 1, call poll() and block until + space is available to tx. 
If 0, don't call poll() and return from tx (optional, + default 1). For details regarding ``fanout_mode`` argument, you can consult the `PACKET_FANOUT documentation <https://www.man7.org/linux/man-pages/man7/packet.7.html>`_. @@ -75,7 +79,7 @@ framecnt=512): .. code-block:: console - --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash + --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,fanout_mode=hash,txpollnotrdy=0 Features and Limitations ------------------------ diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 5eebed5023..6a173e2e82 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -59,6 +59,7 @@ New Features * Fixed kernel memory barrier protocol for memory availability * Fixed shared memory frame overhead offset calculation + * Added ``txpollnotrdy`` devarg to avoid ``poll()`` blocking calls * **Updated AMD axgbe ethernet driver.** diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 9acce990d1..07e7e3cd4a 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -39,9 +39,11 @@ #define ETH_AF_PACKET_FRAMECOUNT_ARG "framecnt" #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass" #define ETH_AF_PACKET_FANOUT_MODE_ARG "fanout_mode" +#define ETH_AF_PACKET_TX_POLL_NOT_READY_ARG "txpollnotrdy" #define DFLT_FRAME_SIZE (1 << 11) #define DFLT_FRAME_COUNT (1 << 9) +#define DFLT_TX_POLL_NOT_RDY true static const uint16_t eth_af_packet_frame_size_max = RTE_IPV4_MAX_PKT_LEN; #define ETH_AF_PACKET_FRAME_OVERHEAD (TPACKET2_HDRLEN - sizeof(struct sockaddr_ll)) @@ -78,6 +80,9 @@ struct __rte_cache_aligned pkt_tx_queue { unsigned int framecount; unsigned int framenum; + bool txpollnotrdy; + bool sw_cksum; + volatile unsigned long tx_pkts; volatile unsigned long err_pkts; 
volatile unsigned long tx_bytes; @@ -106,6 +111,7 @@ static const char *valid_arguments[] = { ETH_AF_PACKET_FRAMECOUNT_ARG, ETH_AF_PACKET_QDISC_BYPASS_ARG, ETH_AF_PACKET_FANOUT_MODE_ARG, + ETH_AF_PACKET_TX_POLL_NOT_READY_ARG, NULL }; @@ -258,10 +264,12 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) uint32_t num_tx_bytes = 0; uint16_t i; - memset(&pfd, 0, sizeof(pfd)); - pfd.fd = pkt_q->sockfd; - pfd.events = POLLOUT; - pfd.revents = 0; + if (pkt_q->txpollnotrdy) { + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = pkt_q->sockfd; + pfd.events = POLLOUT; + pfd.revents = 0; + } framecount = pkt_q->framecount; framenum = pkt_q->framenum; @@ -291,8 +299,9 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) * This results in poll() returning POLLOUT. */ if (unlikely(!tx_ring_status_available(tpacket_read_status(&ppd->tp_status)) && - (poll(&pfd, 1, -1) < 0 || (pfd.revents & POLLERR) != 0 || - !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { + (!pkt_q->txpollnotrdy || poll(&pfd, 1, -1) < 0 || + (pfd.revents & POLLERR) != 0 || + !tx_ring_status_available(tpacket_read_status(&ppd->tp_status))))) { /* Ring is full, stop here. Don't process bufs[i]. 
*/ break; } @@ -804,6 +813,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, unsigned int framecnt, unsigned int qdisc_bypass, const char *fanout_mode, + bool txpollnotrdy, struct pmd_internals **internals, struct rte_eth_dev **eth_dev, struct rte_kvargs *kvlist) @@ -1022,6 +1032,7 @@ rte_pmd_init_internals(struct rte_vdev_device *dev, tx_queue->rd[i].iov_len = req->tp_frame_size; } tx_queue->sockfd = qsockfd; + tx_queue->txpollnotrdy = txpollnotrdy; rc = bind(qsockfd, (const struct sockaddr*)&sockaddr, sizeof(sockaddr)); if (rc == -1) { @@ -1111,6 +1122,7 @@ rte_eth_from_packet(struct rte_vdev_device *dev, unsigned int qpairs = 1; unsigned int qdisc_bypass = 1; const char *fanout_mode = NULL; + bool txpollnotrdy = DFLT_TX_POLL_NOT_RDY; /* do some parameter checking */ if (*sockfd < 0) @@ -1175,6 +1187,10 @@ rte_eth_from_packet(struct rte_vdev_device *dev, fanout_mode = pair->value; continue; } + if (strstr(pair->key, ETH_AF_PACKET_TX_POLL_NOT_READY_ARG) != NULL) { + txpollnotrdy = atoi(pair->value) != 0; + continue; + } } if (framesize > blocksize) { @@ -1243,12 +1259,14 @@ rte_eth_from_packet(struct rte_vdev_device *dev, PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, fanout_mode); else PMD_LOG(DEBUG, "%s:\tfanout mode %s", name, "default PACKET_FANOUT_HASH"); + PMD_LOG(INFO, "%s:\ttxpollnotrdy %d", name, txpollnotrdy ? 1 : 0); if (rte_pmd_init_internals(dev, *sockfd, qpairs, blocksize, blockcount, framesize, framecount, qdisc_bypass, fanout_mode, + txpollnotrdy, &internals, ð_dev, kvlist) < 0) return -1; @@ -1346,4 +1364,5 @@ RTE_PMD_REGISTER_PARAM_STRING(net_af_packet, "framesz=<int> " "framecnt=<int> " "qdisc_bypass=<0|1> " - "fanout_mode=<hash|lb|cpu|rollover|rnd|qm>"); + "fanout_mode=<hash|lb|cpu|rollover|rnd|qm> " + "txpollnotrdy=<0|1>"); -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* [PATCH v6 4/4] net/af_packet: add software checksum offload support 2026-02-06 1:11 ` [PATCH v6 0/4] af_packet correctness, performance, cksum scott.k.mitch1 ` (2 preceding siblings ...) 2026-02-06 1:11 ` [PATCH v6 3/4] net/af_packet: tx poll control scott.k.mitch1 @ 2026-02-06 1:11 ` scott.k.mitch1 2026-02-06 1:49 ` [PATCH v6 0/4] af_packet correctness, performance, cksum Stephen Hemminger 4 siblings, 0 replies; 65+ messages in thread From: scott.k.mitch1 @ 2026-02-06 1:11 UTC (permalink / raw) To: dev; +Cc: stephen, Scott Mitchell From: Scott Mitchell <scott.k.mitch1@gmail.com> Add software checksum offload support and configurable TX poll behavior to improve flexibility and performance. Add rte_net_ip_udptcp_cksum_mbuf in rte_net.h which is shared between rte_eth_tap and rte_eth_af_packet that supports IPv4/UDP/TCP checksums in software due to hardware offload and context propagation not being supported. Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com> --- doc/guides/nics/features/afpacket.ini | 2 + doc/guides/rel_notes/release_26_03.rst | 2 + drivers/net/af_packet/rte_eth_af_packet.c | 42 ++++++++++---- drivers/net/tap/rte_eth_tap.c | 70 ++--------------------- lib/net/rte_net.c | 69 ++++++++++++++++++++++ lib/net/rte_net.h | 25 ++++++++ 6 files changed, 134 insertions(+), 76 deletions(-) diff --git a/doc/guides/nics/features/afpacket.ini b/doc/guides/nics/features/afpacket.ini index 391f79b173..4bb81c84ff 100644 --- a/doc/guides/nics/features/afpacket.ini +++ b/doc/guides/nics/features/afpacket.ini @@ -7,5 +7,7 @@ Link status = Y Promiscuous mode = Y MTU update = Y +L3 checksum offload = Y +L4 checksum offload = Y Basic stats = Y Stats per queue = Y diff --git a/doc/guides/rel_notes/release_26_03.rst b/doc/guides/rel_notes/release_26_03.rst index 6a173e2e82..9211a83226 100644 --- a/doc/guides/rel_notes/release_26_03.rst +++ b/doc/guides/rel_notes/release_26_03.rst @@ -60,6 +60,8 @@ New Features * Fixed kernel memory barrier protocol for memory 
availability * Fixed shared memory frame overhead offset calculation * Added ``txpollnotrdy`` devarg to avoid ``poll()`` blocking calls + * Added checksum offload support for ``IPV4_CKSUM``, ``UDP_CKSUM``, + and ``TCP_CKSUM`` * **Updated AMD axgbe ethernet driver.** diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c index 07e7e3cd4a..1c5f17af34 100644 --- a/drivers/net/af_packet/rte_eth_af_packet.c +++ b/drivers/net/af_packet/rte_eth_af_packet.c @@ -10,6 +10,8 @@ #include <rte_string_fns.h> #include <rte_mbuf.h> #include <rte_atomic.h> +#include <rte_ip.h> +#include <rte_net.h> #include <rte_bitops.h> #include <ethdev_driver.h> #include <ethdev_vdev.h> @@ -101,6 +103,7 @@ struct pmd_internals { struct pkt_tx_queue *tx_queue; uint8_t vlan_strip; uint8_t timestamp_offloading; + bool tx_sw_cksum; }; static const char *valid_arguments[] = { @@ -220,7 +223,7 @@ eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) /* account for the receive frame */ bufs[i] = mbuf; num_rx++; - num_rx_bytes += mbuf->pkt_len; + num_rx_bytes += rte_pktmbuf_pkt_len(mbuf); } pkt_q->framenum = framenum; pkt_q->rx_pkts += num_rx; @@ -256,6 +259,7 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) { struct tpacket2_hdr *ppd; struct rte_mbuf *mbuf; + struct rte_mbuf *seg; uint8_t *pbuf; unsigned int framecount, framenum; struct pollfd pfd; @@ -277,7 +281,7 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) mbuf = bufs[i]; /* Drop oversized packets. 
Insert VLAN if necessary */ - if (unlikely(mbuf->pkt_len > pkt_q->frame_data_size || + if (unlikely(rte_pktmbuf_pkt_len(mbuf) > pkt_q->frame_data_size || ((mbuf->ol_flags & RTE_MBUF_F_TX_VLAN) != 0 && rte_vlan_insert(&mbuf) != 0))) { continue; @@ -308,23 +312,32 @@ eth_af_packet_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts) pbuf = (uint8_t *)ppd + ETH_AF_PACKET_FRAME_OVERHEAD; - ppd->tp_len = mbuf->pkt_len; - ppd->tp_snaplen = mbuf->pkt_len; + if (pkt_q->sw_cksum) { + seg = rte_net_ip_udptcp_cksum_mbuf(mbuf); + if (!seg) + continue; - struct rte_mbuf *tmp_mbuf = mbuf; + mbuf = seg; + bufs[i] = seg; + } + + ppd->tp_len = rte_pktmbuf_pkt_len(mbuf); + ppd->tp_snaplen = rte_pktmbuf_pkt_len(mbuf); + + seg = mbuf; do { - uint16_t data_len = rte_pktmbuf_data_len(tmp_mbuf); - memcpy(pbuf, rte_pktmbuf_mtod(tmp_mbuf, void*), data_len); + uint16_t data_len = rte_pktmbuf_data_len(seg); + memcpy(pbuf, rte_pktmbuf_mtod(seg, void*), data_len); pbuf += data_len; - tmp_mbuf = tmp_mbuf->next; - } while (tmp_mbuf); + seg = seg->next; + } while (seg); /* release incoming frame and advance ring buffer */ tpacket_write_status(&ppd->tp_status, TP_STATUS_SEND_REQUEST); if (++framenum >= framecount) framenum = 0; num_tx++; - num_tx_bytes += mbuf->pkt_len; + num_tx_bytes += rte_pktmbuf_pkt_len(mbuf); } rte_pktmbuf_free_bulk(&bufs[0], i); @@ -396,10 +409,13 @@ eth_dev_configure(struct rte_eth_dev *dev) { struct rte_eth_conf *dev_conf = &dev->data->dev_conf; const struct rte_eth_rxmode *rxmode = &dev_conf->rxmode; + const struct rte_eth_txmode *txmode = &dev_conf->txmode; struct pmd_internals *internals = dev->data->dev_private; internals->vlan_strip = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_VLAN_STRIP); internals->timestamp_offloading = !!(rxmode->offloads & RTE_ETH_RX_OFFLOAD_TIMESTAMP); + internals->tx_sw_cksum = !!(txmode->offloads & (RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | RTE_ETH_TX_OFFLOAD_TCP_CKSUM)); return 0; } @@ -417,7 +433,10 @@ 
eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info) dev_info->max_tx_queues = (uint16_t)internals->nb_queues; dev_info->min_rx_bufsize = ETH_AF_PACKET_ETH_OVERHEAD; dev_info->tx_offload_capa = RTE_ETH_TX_OFFLOAD_MULTI_SEGS | - RTE_ETH_TX_OFFLOAD_VLAN_INSERT; + RTE_ETH_TX_OFFLOAD_VLAN_INSERT | + RTE_ETH_TX_OFFLOAD_IPV4_CKSUM | + RTE_ETH_TX_OFFLOAD_UDP_CKSUM | + RTE_ETH_TX_OFFLOAD_TCP_CKSUM; dev_info->rx_offload_capa = RTE_ETH_RX_OFFLOAD_VLAN_STRIP | RTE_ETH_RX_OFFLOAD_TIMESTAMP; @@ -618,6 +637,7 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, { struct pmd_internals *internals = dev->data->dev_private; + internals->tx_queue[tx_queue_id].sw_cksum = internals->tx_sw_cksum; dev->data->tx_queues[tx_queue_id] = &internals->tx_queue[tx_queue_id]; return 0; diff --git a/drivers/net/tap/rte_eth_tap.c b/drivers/net/tap/rte_eth_tap.c index 7a8a98cddb..388317699e 100644 --- a/drivers/net/tap/rte_eth_tap.c +++ b/drivers/net/tap/rte_eth_tap.c @@ -525,7 +525,6 @@ tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs, struct iovec iovecs[mbuf->nb_segs + 2]; struct tun_pi pi = { .flags = 0, .proto = 0x00 }; struct rte_mbuf *seg = mbuf; - uint64_t l4_ol_flags; int proto; int n; int j; @@ -556,74 +555,15 @@ tap_write_mbufs(struct tx_queue *txq, uint16_t num_mbufs, iovecs[k].iov_len = sizeof(pi); k++; - l4_ol_flags = mbuf->ol_flags & RTE_MBUF_F_TX_L4_MASK; - if (txq->csum && (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM || - l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM || - l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM)) { - unsigned int hdrlens = mbuf->l2_len + mbuf->l3_len; - uint16_t *l4_cksum; - void *l3_hdr; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) - hdrlens += sizeof(struct rte_udp_hdr); - else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) - hdrlens += sizeof(struct rte_tcp_hdr); - else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) + if (txq->csum) { + seg = rte_net_ip_udptcp_cksum_mbuf(mbuf); + if (!seg) return -1; - /* Support only packets with at least layer 4 - * 
header included in the first segment - */ - if (rte_pktmbuf_data_len(mbuf) < hdrlens) - return -1; - - /* To change checksums (considering that a mbuf can be - * indirect, for example), copy l2, l3 and l4 headers - * in a new segment and chain it to existing data - */ - seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); - if (seg == NULL) - return -1; - rte_pktmbuf_adj(mbuf, hdrlens); - rte_pktmbuf_chain(seg, mbuf); - pmbufs[i] = mbuf = seg; - - l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); - if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { - struct rte_ipv4_hdr *iph = l3_hdr; - - iph->hdr_checksum = 0; - iph->hdr_checksum = rte_ipv4_cksum(iph); - } - - if (l4_ol_flags == RTE_MBUF_F_TX_L4_NO_CKSUM) - goto skip_l4_cksum; - - if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) { - struct rte_udp_hdr *udp_hdr; - - udp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_udp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &udp_hdr->dgram_cksum; - } else { - struct rte_tcp_hdr *tcp_hdr; - - tcp_hdr = rte_pktmbuf_mtod_offset(mbuf, struct rte_tcp_hdr *, - mbuf->l2_len + mbuf->l3_len); - l4_cksum = &tcp_hdr->cksum; - } - - *l4_cksum = 0; - if (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) { - *l4_cksum = rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } else { - *l4_cksum = rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, - mbuf->l2_len + mbuf->l3_len); - } + mbuf = seg; + pmbufs[i] = seg; } -skip_l4_cksum: for (j = 0; j < mbuf->nb_segs; j++) { iovecs[k].iov_len = rte_pktmbuf_data_len(seg); iovecs[k].iov_base = rte_pktmbuf_mtod(seg, void *); diff --git a/lib/net/rte_net.c b/lib/net/rte_net.c index 458b4814a9..be09d9825c 100644 --- a/lib/net/rte_net.c +++ b/lib/net/rte_net.c @@ -615,3 +615,72 @@ uint32_t rte_net_get_ptype(const struct rte_mbuf *m, return pkt_type; } + +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_net_ip_udptcp_cksum_mbuf, 26.03) +struct rte_mbuf * +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf) +{ + const uint64_t l4_ol_flags = mbuf->ol_flags & 
RTE_MBUF_F_TX_L4_MASK; + const uint32_t l4_offset = mbuf->l2_len + mbuf->l3_len; + uint32_t hdrlens = l4_offset; + unaligned_uint16_t *l4_cksum = NULL; + void *l3_hdr; + + /* Quick check - nothing to do if no checksum offloads requested */ + if (!(mbuf->ol_flags & (RTE_MBUF_F_TX_IP_CKSUM | RTE_MBUF_F_TX_L4_MASK))) + return mbuf; + + /* Determine total header length needed */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) + hdrlens += sizeof(struct rte_udp_hdr); + else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) + hdrlens += sizeof(struct rte_tcp_hdr); + else if (l4_ol_flags != RTE_MBUF_F_TX_L4_NO_CKSUM) + return NULL; /* Unsupported L4 checksum type */ + + /* Validate we at least have L2+L3 headers */ + if (unlikely(rte_pktmbuf_data_len(mbuf) < l4_offset)) + return NULL; + + if (!RTE_MBUF_DIRECT(mbuf) || rte_mbuf_refcnt_read(mbuf) > 1) { + /* Indirect or shared - must copy, cannot modify in-place */ + struct rte_mbuf *seg = rte_pktmbuf_copy(mbuf, mbuf->pool, 0, hdrlens); + if (!seg) + return NULL; + + rte_pktmbuf_adj(mbuf, hdrlens); + rte_pktmbuf_chain(seg, mbuf); + mbuf = seg; + } else if (rte_pktmbuf_data_len(mbuf) < hdrlens && + (rte_pktmbuf_linearize(mbuf) < 0 || rte_pktmbuf_data_len(mbuf) < hdrlens)) { + /* failed: direct, non-shared, but segmented headers linearize in-place */ + return NULL; + } + /* else: Direct, non-shared, contiguous - can modify in-place, nothing to do */ + + l3_hdr = rte_pktmbuf_mtod_offset(mbuf, void *, mbuf->l2_len); + + /* IPv4 header checksum */ + if (mbuf->ol_flags & RTE_MBUF_F_TX_IP_CKSUM) { + struct rte_ipv4_hdr *iph = (struct rte_ipv4_hdr *)l3_hdr; + iph->hdr_checksum = 0; + iph->hdr_checksum = rte_ipv4_cksum(iph); + } + + /* L4 checksum */ + if (l4_ol_flags == RTE_MBUF_F_TX_UDP_CKSUM) + l4_cksum = (unaligned_uint16_t *)&rte_pktmbuf_mtod_offset(mbuf, + struct rte_udp_hdr *, l4_offset)->dgram_cksum; + else if (l4_ol_flags == RTE_MBUF_F_TX_TCP_CKSUM) + l4_cksum = (unaligned_uint16_t *)&rte_pktmbuf_mtod_offset(mbuf, + struct 
rte_tcp_hdr *, l4_offset)->cksum; + + if (l4_cksum) { + *l4_cksum = 0; + *l4_cksum = (mbuf->ol_flags & RTE_MBUF_F_TX_IPV4) ? + rte_ipv4_udptcp_cksum_mbuf(mbuf, l3_hdr, l4_offset) : + rte_ipv6_udptcp_cksum_mbuf(mbuf, l3_hdr, l4_offset); + } + + return mbuf; +} diff --git a/lib/net/rte_net.h b/lib/net/rte_net.h index 65d724b84b..82fa277919 100644 --- a/lib/net/rte_net.h +++ b/lib/net/rte_net.h @@ -246,6 +246,31 @@ rte_net_intel_cksum_prepare(struct rte_mbuf *m) return rte_net_intel_cksum_flags_prepare(m, m->ol_flags); } +/** + * @warning + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice + * + * Compute IP and L4 checksums in software for mbufs with + * RTE_MBUF_F_TX_IP_CKSUM, RTE_MBUF_F_TX_UDP_CKSUM, or + * RTE_MBUF_F_TX_TCP_CKSUM offload flags set. + * + * On success, this function takes ownership of the input mbuf. The mbuf may be + * modified in-place (for direct, non-shared mbufs) or a new mbuf chain may be + * created (for indirect/shared mbufs) with the original becoming part of the chain. + * + * @param mbuf + * The packet mbuf to checksum. + * @return + * - On success: pointer to mbuf with checksums computed (may be same as input + * or a new mbuf chain). Caller must free only this returned pointer; the input + * mbuf pointer should not be freed separately as it may be part of the returned + * chain or may be the same as the returned pointer. + * - On error: NULL. Original mbuf remains valid and owned by caller. + */ +__rte_experimental +struct rte_mbuf * +rte_net_ip_udptcp_cksum_mbuf(struct rte_mbuf *mbuf); + #ifdef __cplusplus } #endif -- 2.39.5 (Apple Git-154) ^ permalink raw reply related [flat|nested] 65+ messages in thread
* Re: [PATCH v6 0/4] af_packet correctness, performance, cksum 2026-02-06 1:11 ` [PATCH v6 0/4] af_packet correctness, performance, cksum scott.k.mitch1 ` (3 preceding siblings ...) 2026-02-06 1:11 ` [PATCH v6 4/4] net/af_packet: add software checksum offload support scott.k.mitch1 @ 2026-02-06 1:49 ` Stephen Hemminger 2026-02-06 4:45 ` Scott Mitchell 2026-02-06 14:36 ` Morten Brørup 4 siblings, 2 replies; 65+ messages in thread From: Stephen Hemminger @ 2026-02-06 1:49 UTC (permalink / raw) To: scott.k.mitch1; +Cc: dev On Thu, 5 Feb 2026 17:11:37 -0800 scott.k.mitch1@gmail.com wrote: > From: Scott <scott.k.mitch1@gmail.com> > > This series fixes critical thread safety bugs in the af_packet PMD > and adds performance optimizations. > > Patch 1 fixes two major correctness issues: > - Thread safety: tp_status was accessed without memory barriers, > violating the kernel's PACKET_MMAP protocol. On aarch64 and other > weakly-ordered architectures, this causes packet corruption due to > missing memory ordering. The fix matches the kernel's memory model: > volatile unaligned reads/writes with explicit rte_smp_rmb/wmb > barriers and __may_alias__ protection. > > - Frame calculations: Fixed incorrect frame overhead and address > calculations that caused memory corruption when frames don't evenly > divide blocks. 
> > Patches 2-4 add performance improvements: > - Patch 2: Bulk mbuf freeing, unlikely annotations > - Patch 3: TX poll control to reduce syscall overhead > - Patch 4: Software checksum offload support with shared rte_net > utility > > v6 changes: > - rte_net_ip_udptcp_cksum_mbuf doxygen EXPERIMENTAL tag > > v5 changes: > - rte_net_ip_udptcp_cksum_mbuf moved to rte_net.c (avoid forced inline) > - rte_net_ip_udptcp_cksum_mbuf remove copy arg, handle more mbuf types > - af_packet and tap calling code consistent for sw cksum > > v4 changes: > - Remove prefetch (perf results didn't show benefit) > - Fix variable sytle for consistency (declare at start of method) > - Add release notes for af_packet and documentation for fixes > > v3 changes: > - Patch 4: Fix compile error due to implict cast with c++ compiler > > v2 changes: > - Patch 1: Rewrote to use volatile + barriers instead of C11 atomics > to match kernel's memory model. Added dependency on patch-160274 > for __rte_may_alias attribute. > - Patch 4: Refactored to use shared rte_net_ip_udptcp_cksum_mbuf() > utility function, eliminating code duplication with tap driver. > > Scott Mitchell (4): > net/af_packet: fix thread safety and frame calculations > net/af_packet: RX/TX bulk free, unlikely hint > net/af_packet: tx poll control > net/af_packet: add software checksum offload support > > doc/guides/nics/af_packet.rst | 6 +- > doc/guides/nics/features/afpacket.ini | 2 + > doc/guides/rel_notes/release_26_03.rst | 9 +- > drivers/net/af_packet/rte_eth_af_packet.c | 253 +++++++++++++++------- > drivers/net/tap/rte_eth_tap.c | 70 +----- > lib/net/rte_net.c | 69 ++++++ > lib/net/rte_net.h | 25 +++ > 7 files changed, 292 insertions(+), 142 deletions(-) > Why are the header structures marked packed, that is bogus, BSD and Linux don't do it. Windows probably does but Windows code seems to love packed even when it is not necessary. 
This is failing compile on FreeBsd OS: FreeBSD14-64 Target: x86_64-native-bsdapp-gcc FAILED: lib/librte_net.a.p/net_rte_net.c.o gcc -Ilib/librte_net.a.p -Ilib -I../lib -Ilib/net -I../lib/net -Ilib/eal/common -I../lib/eal/common -I. -I.. -Iconfig -I../config -Ilib/eal/include -I../lib/eal/include -Ilib/eal/freebsd/include -I../lib/eal/freebsd/include -Ilib/eal/x86/include -I../lib/eal/x86/include -Ilib/eal -I../lib/eal -Ilib/kvargs -I../lib/kvargs -Ilib/log -I../lib/log -Ilib/metrics -I../lib/metrics -Ilib/telemetry -I../lib/telemetry -Ilib/argparse -I../lib/argparse -Ilib/mbuf -I../lib/mbuf -Ilib/mempool -I../lib/mempool -Ilib/ring -I../lib/ring -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -Werror -std=c11 -O3 -include rte_config.h -Wvla -Wcast-qual -Wdeprecated -Wformat -Wformat-nonliteral -Wformat-security -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wold-style-definition -Wpointer-arith -Wshadow -Wsign-compare -Wstrict-prototypes -Wundef -Wwrite-strings -Wno-packed-not-aligned -Wno-missing-field-initializers -D_GNU_SOURCE -D__BSD_VISIBLE -fPIC -march=native -mno-avx512f -mrtm -DALLOW_EXPERIMENTAL_API -DALLOW_INTERNAL_API -mpclmul -maes -DRTE_LOG_DEFAULT_LOGTYPE=lib.net -MD -MQ lib/librte_net.a.p/net_rte_net.c.o -MF lib/librte_net.a.p/net_rte_net.c.o.d -o lib/librte_net.a.p/net_rte_net.c.o -c ../lib/net/rte_net.c ../lib/net/rte_net.c: In function 'rte_net_ip_udptcp_cksum_mbuf': ../lib/net/rte_net.c:672:50: error: taking address of packed member of 'struct rte_udp_hdr' may result in an unaligned pointer value [-Werror=address-of-packed-member] 672 | l4_cksum = (unaligned_uint16_t *)&rte_pktmbuf_mtod_offset(mbuf, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 673 | struct rte_udp_hdr *, l4_offset)->dgram_cksum; | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../lib/net/rte_net.c:675:50: error: taking address of packed member of 'struct rte_tcp_hdr' may result in an unaligned pointer value [-Werror=address-of-packed-member] 675 | 
l4_cksum = (unaligned_uint16_t *)&rte_pktmbuf_mtod_offset(mbuf, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 676 | struct rte_tcp_hdr *, l4_offset)->cksum; | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors ^ permalink raw reply [flat|nested] 65+ messages in thread
* Re: [PATCH v6 0/4] af_packet correctness, performance, cksum
2026-02-06 1:49 ` [PATCH v6 0/4] af_packet correctness, performance, cksum Stephen Hemminger
@ 2026-02-06 4:45 ` Scott Mitchell
0 siblings, 0 replies; 65+ messages in thread
From: Scott Mitchell @ 2026-02-06 4:45 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev

> Why are the header structures marked packed, that is bogus, BSD and Linux don't do it.
> Windows probably does but Windows code seems to love packed even when it is not necessary.

Agreed. I'm not sure why rte_udp_hdr and rte_tcp_hdr are marked as packed.

> This is failing compile on FreeBsd

I forgot to carry the "Depends-on: patch-160679 ("eal: add __rte_may_alias and __rte_aligned to unaligned typedefs")" (https://patches.dpdk.org/project/dpdk/patch/20260202044841.90945-2-scott.k.mitch1@gmail.com/) forward from the previous patches. I also see this failure locally with GCC when the Depends-on patch is not applied. I expect it will resolve on FreeBSD once that patch is applied and I rebase this series.

^ permalink raw reply [flat|nested] 65+ messages in thread
* RE: [PATCH v6 0/4] af_packet correctness, performance, cksum
  2026-02-06  1:49 ` [PATCH v6 0/4] af_packet correctness, performance, cksum Stephen Hemminger
  2026-02-06  4:45 ` Scott Mitchell
@ 2026-02-06 14:36 ` Morten Brørup
  2026-02-06 16:11 ` Stephen Hemminger
  1 sibling, 1 reply; 65+ messages in thread
From: Morten Brørup @ 2026-02-06 14:36 UTC (permalink / raw)
To: Stephen Hemminger, scott.k.mitch1; +Cc: dev

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 6 February 2026 02.49
>
> Why are the header structures marked packed, that is bogus, BSD and
> Linux don't do it.

They have been packed since the first public release in 2013 [1].

I guess it's because the IP and TCP headers contain 4-byte fields, which make those structures 4-byte aligned; but since the IP header follows a 14 byte Ethernet header (without the magic 2-byte pre-padding done by the kernel), the instances of the IP header are not 4-byte aligned, but 2-byte aligned. Marking them packed is a way of stripping the alignment.

BTW, the IPv4 header was bumped (from no alignment) to 2-byte alignment with patch [2].

[1]: https://github.com/DPDK/dpdk/commit/af75078fece3615088e561357c1e97603e43a5fe#diff-620c2b2031359304a7f26328a52035c9f8ddf722b9280f957047dcb81467777f
[2]: https://github.com/DPDK/dpdk/commit/c14fba68edfa4aeba7c0dfb5dbc3b4f23affbb81

> Windows probably does

Yes, probably.
The Microsoft compiler is more pedantic (leading to fewer bugs), and many of those structures should formally be packed (or more correctly: unaligned).

> but Windows code seems to love packed even when
> it is not necessary.

I guess packing (without thinking about the need for it) has become a bad habit for some Windows programmers.
* Re: [PATCH v6 0/4] af_packet correctness, performance, cksum
  2026-02-06 14:36 ` Morten Brørup
@ 2026-02-06 16:11 ` Stephen Hemminger
  0 siblings, 0 replies; 65+ messages in thread
From: Stephen Hemminger @ 2026-02-06 16:11 UTC (permalink / raw)
To: Morten Brørup; +Cc: scott.k.mitch1, dev

On Fri, 6 Feb 2026 15:36:37 +0100
Morten Brørup <mb@smartsharesystems.com> wrote:

> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Friday, 6 February 2026 02.49
> >
> > Why are the header structures marked packed, that is bogus, BSD and
> > Linux don't do it.
>
> They have been packed since the first public release in 2013 [1].
>
> I guess it's because the IP and TCP headers contain 4-byte fields, which make those structures 4-byte aligned; but since the IP header follows a 14 byte Ethernet header (without the magic 2-byte pre-padding done by the kernel), the instances of the IP header are not 4-byte aligned, but 2-byte aligned. Marking them packed is a way of stripping the alignment.
>
> BTW, the IPv4 header was bumped (from no alignment) to 2-byte alignment with patch [2].
>
> [1]: https://github.com/DPDK/dpdk/commit/af75078fece3615088e561357c1e97603e43a5fe#diff-620c2b2031359304a7f26328a52035c9f8ddf722b9280f957047dcb81467777f
> [2]: https://github.com/DPDK/dpdk/commit/c14fba68edfa4aeba7c0dfb5dbc3b4f23affbb81
>
> > Windows probably does
>
> Yes, probably.
> The Microsoft compiler is more pedantic (leading to fewer bugs), and many of those structures should formally be packed (or more correctly: unaligned).
>
> > but Windows code seems to love packed even when
> > it is not necessary.
>
> I guess packing (without thinking about the need for it) has become a bad habit for some Windows programmers.

Making a structure packed (in the past) made code slower on some architectures because it required generating multiple load/store operations.
end of thread, other threads: [~2026-02-06 19:18 UTC | newest]

Thread overview: 65+ messages
2026-01-27 18:13 [PATCH v1 0/3] net/af_packet: correctness fixes and improvements scott.k.mitch1
2026-01-27 18:13 ` [PATCH v1 1/3] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
2026-01-27 18:39   ` Stephen Hemminger
2026-01-28  1:35     ` Scott Mitchell
2026-01-27 18:13 ` [PATCH v1 2/3] net/af_packet: RX/TX rte_memcpy, bulk free, prefetch scott.k.mitch1
2026-01-27 18:54   ` Stephen Hemminger
2026-01-28  1:23     ` Scott Mitchell
2026-01-28  9:49       ` Morten Brørup
2026-01-28 15:37         ` Scott Mitchell
2026-01-28 16:57           ` Stephen Hemminger
2026-01-27 18:13 ` [PATCH v1 3/3] net/af_packet: software checksum and tx poll control scott.k.mitch1
2026-01-27 18:57   ` Stephen Hemminger
2026-01-28  7:05     ` Scott Mitchell
2026-01-28 17:36       ` Stephen Hemminger
2026-01-28 18:59         ` Scott Mitchell
2026-01-27 20:45 ` [REVIEW] " Stephen Hemminger
2026-01-28  9:36 ` [PATCH v2 0/4] af_packet correctness, performance, cksum scott.k.mitch1
2026-01-28  9:36 ` [PATCH v2 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
2026-01-28 16:59   ` Stephen Hemminger
2026-01-28 18:00     ` Scott Mitchell
2026-01-28 18:28       ` Stephen Hemminger
2026-01-28  9:36 ` [PATCH v2 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch scott.k.mitch1
2026-01-28  9:36 ` [PATCH v2 3/4] net/af_packet: tx poll control scott.k.mitch1
2026-01-28  9:36 ` [PATCH v2 4/4] net/af_packet: software checksum scott.k.mitch1
2026-01-28 18:27   ` Stephen Hemminger
2026-01-28 19:08     ` Scott Mitchell
2026-01-28 19:10 ` [PATCH v3 0/4] af_packet correctness, performance, cksum scott.k.mitch1
2026-01-28 19:10 ` [PATCH v3 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
2026-01-28 19:10 ` [PATCH v3 2/4] net/af_packet: RX/TX unlikely, bulk free, prefetch scott.k.mitch1
2026-01-29  1:07   ` Stephen Hemminger
2026-02-02  5:29     ` Scott Mitchell
2026-01-28 19:10 ` [PATCH v3 3/4] net/af_packet: tx poll control scott.k.mitch1
2026-01-28 19:10 ` [PATCH v3 4/4] net/af_packet: software checksum scott.k.mitch1
2026-01-28 21:57 ` [REVIEW] " Stephen Hemminger
2026-02-02  7:55   ` Scott Mitchell
2026-02-02 16:58     ` Stephen Hemminger
2026-02-02  8:14 ` [PATCH v4 0/4] af_packet correctness, performance, cksum scott.k.mitch1
2026-02-02  8:14 ` [PATCH v4 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
2026-02-02  8:14 ` [PATCH v4 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1
2026-02-02  8:14 ` [PATCH v4 3/4] net/af_packet: tx poll control scott.k.mitch1
2026-02-02  8:14 ` [PATCH v4 4/4] net/af_packet: add software checksum offload support scott.k.mitch1
2026-02-02 17:00   ` Stephen Hemminger
2026-02-02 18:47   ` Stephen Hemminger
2026-02-03  6:41     ` Scott Mitchell
2026-02-02 18:53 ` [PATCH v4 0/4] af_packet correctness, performance, cksum Stephen Hemminger
2026-02-03  7:07 ` [PATCH v5 " scott.k.mitch1
2026-02-03  7:07 ` [PATCH v5 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
2026-02-03  7:07 ` [PATCH v5 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1
2026-02-03  7:07 ` [PATCH v5 3/4] net/af_packet: tx poll control scott.k.mitch1
2026-02-03  7:07 ` [PATCH v5 4/4] net/af_packet: add software checksum offload support scott.k.mitch1
2026-02-03  8:20   ` Scott Mitchell
2026-02-03 14:12     ` Stephen Hemminger
2026-02-04  2:59       ` Scott Mitchell
2026-02-03 14:13   ` Stephen Hemminger
2026-02-04  1:39     ` Scott Mitchell
2026-02-05 21:27       ` Stephen Hemminger
2026-02-06  1:11 ` [PATCH v6 0/4] af_packet correctness, performance, cksum scott.k.mitch1
2026-02-06  1:11 ` [PATCH v6 1/4] net/af_packet: fix thread safety and frame calculations scott.k.mitch1
2026-02-06  1:11 ` [PATCH v6 2/4] net/af_packet: RX/TX bulk free, unlikely hint scott.k.mitch1
2026-02-06  1:11 ` [PATCH v6 3/4] net/af_packet: tx poll control scott.k.mitch1
2026-02-06  1:11 ` [PATCH v6 4/4] net/af_packet: add software checksum offload support scott.k.mitch1
2026-02-06  1:49 ` [PATCH v6 0/4] af_packet correctness, performance, cksum Stephen Hemminger
2026-02-06  4:45   ` Scott Mitchell
2026-02-06 14:36   ` Morten Brørup
2026-02-06 16:11     ` Stephen Hemminger