* Re: [PATCH v2 2/3] event/cnxk: add pause to spinloops
From: Stephen Hemminger @ 2026-06-08 17:44 UTC (permalink / raw)
To: Jerin Jacob; +Cc: dev, Pavan Nikhilesh, Shijith Thotton
In-Reply-To: <CALBAE1PzEtyBzKiKaU83q0OtwRwaCTmqDETsc5NNODDQ9hzauw@mail.gmail.com>
On Mon, 8 Jun 2026 21:19:20 +0530
Jerin Jacob <jerinjacobk@gmail.com> wrote:
> On Mon, Apr 13, 2026 at 10:36 PM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
> >
> > On SMT systems when a spinloop is done without a pause
> > it may cause excessive latency. This problem was found
> > by the fix_empty_spinloops coccinelle script.
> >
> > Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>
> rte_pause() translates to the YIELD instruction. Since cnxk is an
> integrated SoC with single-threaded cores, it won't help with
> anything; it only adds one instruction's worth of extra latency.
> In general the 3/3 devtool patch is good. Please send it as a separate
> version so that it can be merged through the main tree.
It matters if your SoC has SMT, where two hardware threads share a
core and one thread is waiting for its partner.
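For illustration only (not part of the patch): a minimal sketch of a
bounded spin-wait that issues a CPU relax hint on every iteration. The
names cpu_relax() and spin_wait_flag() are hypothetical, and the hint
selection assumes GCC or Clang; in DPDK the hint would be rte_pause().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: emit the architecture's spin-wait hint, if any. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_ia32_pause();
#elif defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#endif
	/* Other architectures: no hint, the loop simply re-reads the flag. */
}

/*
 * Spin until *flag becomes true or max_spins iterations have elapsed.
 * Returns true if the flag was observed set, false on giving up.
 */
static bool spin_wait_flag(atomic_bool *flag, unsigned long max_spins)
{
	for (unsigned long i = 0; i < max_spins; i++) {
		if (atomic_load_explicit(flag, memory_order_acquire))
			return true;
		cpu_relax(); /* let a sibling hardware thread make progress */
	}
	return false;
}
```

On a core without SMT the hint costs roughly one extra instruction per
iteration; on an SMT core it gives the sibling thread a chance to run.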
* Re: [PATCH 2/2] net/cnxk: add PMD API to support custom profile setup
From: Jerin Jacob @ 2026-06-08 17:27 UTC (permalink / raw)
To: rkudurumalla
Cc: Nithin Dabilpuram, Kiran Kumar K, Sunil Kumar Kori, Satha Rao,
Harman Kalra, dev, jerinj
In-Reply-To: <20260505085037.3961519-2-rkudurumalla@marvell.com>
On Tue, May 5, 2026 at 2:36 PM rkudurumalla <rkudurumalla@marvell.com> wrote:
>
> From: Rakesh Kudurumalla <rkudurumalla@marvell.com>
>
> Add new PMD APIs to create a custom profile and to update the
> SA table created during profile setup, based on the profile ID.
>
> Signed-off-by: Rakesh Kudurumalla <rkudurumalla@marvell.com>
Series applied to dpdk-next-net-mrvl/for-main. Thanks
* [PATCH] net/iavf: fix misleading AQ failure logging
From: Anurag Mandal @ 2026-06-08 17:15 UTC (permalink / raw)
To: dev
Cc: bruce.richardson, vladimir.medvedkin, ciara.loftus, Anurag Mandal,
stable
iavf_handle_virtchnl_msg() drains the admin receive queue in a loop
until iavf_clean_arq_element() reports that no descriptors are
pending. When the queue becomes empty, the base driver returns
IAVF_ERR_ADMIN_QUEUE_NO_WORK (-57), which is the documented
terminator for the drain loop, and is not an error.
The current loop treats every non-IAVF_SUCCESS return as a failure
and logs it as follows:
"Failed to read msg from AdminQ, ret: -57"
This message floods the logs on every interrupt cycle and misleads
triage during VF reset: anyone chasing a real virtchnl problem sees
these spurious -57 AQ failure lines and assumes the admin queue is
broken, when in fact it has just been drained.
This patch fixes the issue by treating IAVF_ERR_ADMIN_QUEUE_NO_WORK
in the virtchnl message drain as a normal empty-queue loop exit and
by not logging it as a misleading AQ failure.
Fixes: 02d212ca3125 ("net/iavf: rename remaining avf strings")
Cc: stable@dpdk.org
Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
---
drivers/net/intel/iavf/iavf_vchnl.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/drivers/net/intel/iavf/iavf_vchnl.c b/drivers/net/intel/iavf/iavf_vchnl.c
index 94ccfb5d6e..870d5c1820 100644
--- a/drivers/net/intel/iavf/iavf_vchnl.c
+++ b/drivers/net/intel/iavf/iavf_vchnl.c
@@ -570,7 +570,15 @@ iavf_handle_virtchnl_msg(struct rte_eth_dev *dev)
while (pending) {
ret = iavf_clean_arq_element(hw, &info, &pending);
- if (ret != IAVF_SUCCESS) {
+ /*
+ * IAVF_ERR_ADMIN_QUEUE_NO_WORK (-57) means AQ is empty
+ * and is a normal way to terminate the drain loop.
+ * Log error only for genuine other failure codes.
+ * Incorrect logging like this during VF resets might
+ * mislead into chasing a non-existent AQ failure.
+ */
+ if (ret != IAVF_SUCCESS &&
+ ret != IAVF_ERR_ADMIN_QUEUE_NO_WORK) {
PMD_DRV_LOG(INFO, "Failed to read msg from AdminQ,"
"ret: %d", ret);
break;
--
2.34.1
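For illustration only (not part of the patch): a minimal sketch of a
drain loop that treats "no work" as the normal terminator and logs only
genuine failures. The queue type, fake_clean_element() and drain() are
hypothetical stand-ins, not the iavf base driver API; only the -57 value
is taken from the commit message above.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical status codes; -57 mirrors the value quoted above. */
enum { AQ_SUCCESS = 0, AQ_ERR_NO_WORK = -57, AQ_ERR_IO = -5 };

/* Hypothetical queue model used only to exercise the loop. */
struct fake_aq {
	unsigned int queued; /* messages still waiting */
	int fail_after;      /* inject an I/O error after N reads, <0 = never */
	unsigned int reads;  /* successful reads so far */
};

static int fake_clean_element(struct fake_aq *q, uint16_t *pending)
{
	if (q->fail_after >= 0 && q->reads == (unsigned int)q->fail_after)
		return AQ_ERR_IO;
	if (q->queued == 0)
		return AQ_ERR_NO_WORK;
	q->queued--;
	q->reads++;
	*pending = (uint16_t)q->queued;
	return AQ_SUCCESS;
}

/* Drain the queue; returns how many failure lines were logged. */
static unsigned int drain(struct fake_aq *q, unsigned int *handled)
{
	uint16_t pending = 1;
	unsigned int logged = 0;

	*handled = 0;
	while (pending) {
		int ret = fake_clean_element(q, &pending);

		if (ret == AQ_ERR_NO_WORK)
			break; /* queue is empty: normal loop exit, no log */
		if (ret != AQ_SUCCESS) {
			fprintf(stderr, "Failed to read msg from AdminQ, ret: %d\n", ret);
			logged++;
			break;
		}
		(*handled)++;
	}
	return logged;
}
```

With an empty queue the sketch logs nothing; only the injected I/O error
produces a failure line.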
* Re: [PATCH 6/7] pcapng: add user-supplied timestamp support
From: Stephen Hemminger @ 2026-06-08 17:09 UTC (permalink / raw)
To: Dawid Wesierski
Cc: dev, thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, Marek Kasiewicz
In-Reply-To: <20260608164059.65420-7-dawid.wesierski@intel.com>
On Mon, 8 Jun 2026 12:40:58 -0400
Dawid Wesierski <dawid.wesierski@intel.com> wrote:
> @@ -737,16 +736,6 @@ rte_pcapng_write_packets(rte_pcapng_t *self,
> return -1;
> }
>
> - /*
> - * When data is captured by pcapng_copy the current TSC is stored.
> - * Adjust the value recorded in file to PCAP epoch units.
> - */
> - cycles = (uint64_t)epb->timestamp_hi << 32;
> - cycles += epb->timestamp_lo;
> - timestamp = tsc_to_ns_epoch(&self->clock, cycles);
> - epb->timestamp_hi = timestamp >> 32;
> - epb->timestamp_lo = (uint32_t)timestamp;
> -
You can't generate valid pcapng timestamps without this.
* Re: [PATCH] test/event_eth_rx_intr_adapter: support NICs with fewer int vectors
From: Jerin Jacob @ 2026-06-08 17:08 UTC (permalink / raw)
To: Loftus, Ciara; +Cc: Richardson, Bruce, dev@dpdk.org
In-Reply-To: <DM3PPF7D18F34A115BCA37E02527DC8C08B8E2D2@DM3PPF7D18F34A1.namprd11.prod.outlook.com>
On Wed, Apr 22, 2026 at 4:06 PM Loftus, Ciara <ciara.loftus@intel.com> wrote:
>
> > Subject: [PATCH] test/event_eth_rx_intr_adapter: support NICs with fewer int
> > vectors
> >
> > Some NICs may not be able to support interrupts on all queues that are
> > advertised, which will cause the test to fail if the queues supporting
> > interrupts are fewer than 64. We can work around this by retrying the
> > NIC configuration multiple times with fewer queues in case of failure.
> > This allows the test to pass with NICs using ixgbe driver, for example.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
> LGTM.
> Acked-by: Ciara Loftus <ciara.loftus@intel.com>
Applied to dpdk-next-eventdev/for-main. Thanks
>
> > ---
> > app/test/test_event_eth_rx_adapter.c | 52 +++++++++++++++++-----------
> > 1 file changed, 32 insertions(+), 20 deletions(-)
> >
> > diff --git a/app/test/test_event_eth_rx_adapter.c
> > b/app/test/test_event_eth_rx_adapter.c
> > index ae428b3333..7b38935bec 100644
> > --- a/app/test/test_event_eth_rx_adapter.c
> > +++ b/app/test/test_event_eth_rx_adapter.c
> > @@ -60,6 +60,7 @@ port_init_common(uint16_t port, const struct
> > rte_eth_conf *port_conf,
> > {
> > const uint16_t rx_ring_size = 512, tx_ring_size = 512;
> > int retval;
> > + bool started = false;
> > uint16_t q;
> > struct rte_eth_dev_info dev_info;
> >
> > @@ -76,32 +77,43 @@ port_init_common(uint16_t port, const struct
> > rte_eth_conf *port_conf,
> > MAX_NUM_RX_QUEUE);
> > default_params.tx_rings = 1;
> >
> > - /* Configure the Ethernet device. */
> > - retval = rte_eth_dev_configure(port, default_params.rx_rings,
> > + while (!started) {
> > + /* Configure the Ethernet device. */
> > + retval = rte_eth_dev_configure(port, default_params.rx_rings,
> > default_params.tx_rings, port_conf);
> > - if (retval != 0)
> > - return retval;
> > -
> > - for (q = 0; q < default_params.rx_rings; q++) {
> > - retval = rte_eth_rx_queue_setup(port, q, rx_ring_size,
> > - rte_eth_dev_socket_id(port), NULL, mp);
> > - if (retval < 0)
> > + if (retval != 0)
> > return retval;
> > - }
> >
> > - /* Allocate and set up 1 TX queue per Ethernet port. */
> > - for (q = 0; q < default_params.tx_rings; q++) {
> > - retval = rte_eth_tx_queue_setup(port, q, tx_ring_size,
> > - rte_eth_dev_socket_id(port), NULL);
> > - if (retval < 0)
> > + for (q = 0; q < default_params.rx_rings; q++) {
> > + retval = rte_eth_rx_queue_setup(port, q, rx_ring_size,
> > + rte_eth_dev_socket_id(port), NULL,
> > mp);
> > + if (retval < 0)
> > + return retval;
> > + }
> > +
> > + /* Allocate and set up 1 TX queue per Ethernet port. */
> > + for (q = 0; q < default_params.tx_rings; q++) {
> > + retval = rte_eth_tx_queue_setup(port, q, tx_ring_size,
> > + rte_eth_dev_socket_id(port), NULL);
> > + if (retval < 0)
> > + return retval;
> > + }
> > +
> > + /* Start the Ethernet port. */
> > + retval = rte_eth_dev_start(port);
> > + if (retval < 0) {
> > + /* Some NICs may not support interrupts on all
> > reported queues.
> > + * Therefore try to reconfigure and start with fewer
> > queues
> > + */
> > + if (default_params.rx_rings > 2) {
> > + default_params.rx_rings /= 2;
> > + continue;
> > + }
> > return retval;
> > + }
> > + started = true;
> > }
> >
> > - /* Start the Ethernet port. */
> > - retval = rte_eth_dev_start(port);
> > - if (retval < 0)
> > - return retval;
> > -
> > /* Display the port MAC address. */
> > struct rte_ether_addr addr;
> > retval = rte_eth_macaddr_get(port, &addr);
> > --
> > 2.51.0
>
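For illustration only (not part of the patch): a minimal sketch of the
retry strategy described in the commit message, halving the Rx queue
count after each failed start and giving up once the count is down to 2.
start_with_fallback() and fake_start() are hypothetical; in the test the
start step is the configure/queue-setup/rte_eth_dev_start() sequence.

```c
#include <stdint.h>

/* Hypothetical callback: try to bring the port up with nb_rxq Rx queues. */
typedef int (*start_fn)(uint16_t nb_rxq, void *ctx);

/*
 * Retry the start with half as many Rx queues after each failure.
 * On success *nb_rxq holds the queue count that worked; on failure the
 * last error is returned once the count cannot be halved any further.
 */
static int start_with_fallback(uint16_t *nb_rxq, start_fn try_start, void *ctx)
{
	for (;;) {
		int ret = try_start(*nb_rxq, ctx);

		if (ret == 0)
			return 0;
		if (*nb_rxq <= 2)
			return ret; /* same floor as the patch: stop at 2 queues */
		*nb_rxq /= 2;
	}
}

/* Fake device: succeeds only if nb_rxq fits its interrupt vector budget. */
static int fake_start(uint16_t nb_rxq, void *ctx)
{
	uint16_t max_intr_queues = *(uint16_t *)ctx;

	return nb_rxq <= max_intr_queues ? 0 : -1;
}
```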
* Re: [v3] net/cksum: compute raw cksum for several segments
From: Stephen Hemminger @ 2026-06-08 17:02 UTC (permalink / raw)
To: Su Sai; +Cc: dev
In-Reply-To: <20250804035430.4058391-1-spiderdetective.ss@gmail.com>
On Mon, 4 Aug 2025 11:54:30 +0800
Su Sai <spiderdetective.ss@gmail.com> wrote:
> rte_raw_cksum_mbuf() computes the raw checksum of a packet.
> If the packet payload is stored in multiple mbufs, the function
> takes the hard case. In the hard case, the variable 'tmp' is a
> uint32_t, so rte_bswap16() drops its high 16 bits.
> Meanwhile, the variable 'sum' is also a uint32_t,
> so 'sum += tmp' drops the carry on overflow.
> Both losses make the checksum incorrect.
> This commit fixes the above bugs.
>
> Signed-off-by: Su Sai <spiderdetective.ss@gmail.com>
> ---
> .mailmap | 1 +
> app/test/test_cksum.c | 106 ++++++++++++++++++++++++++++++++++++++++++
> lib/net/rte_cksum.h | 27 +++++++++--
> 3 files changed, 130 insertions(+), 4 deletions(-)
>
> diff --git a/.mailmap b/.mailmap
> index 34a99f93a1..1da1d9f8e1 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1552,6 +1552,7 @@ Sunil Kumar Kori <skori@marvell.com> <sunil.kori@nxp.com>
> Sunil Pai G <sunil.pai.g@intel.com>
> Sunil Uttarwar <sunilprakashrao.uttarwar@amd.com>
> Sun Jiajia <sunx.jiajia@intel.com>
> +Su Sai <spiderdetective.ss@gmail.com> <susai.ss@bytedance.com>
> Sunyang Wu <sunyang.wu@jaguarmicro.com>
> Surabhi Boob <surabhi.boob@intel.com>
> Suyang Ju <sju@paloaltonetworks.com>
> diff --git a/app/test/test_cksum.c b/app/test/test_cksum.c
> index f2ab5af5a7..fb2e3cf9e6 100644
> --- a/app/test/test_cksum.c
> +++ b/app/test/test_cksum.c
> @@ -85,6 +85,42 @@ static const char test_cksum_ipv4_opts_udp[] = {
> 0x00, 0x35, 0x00, 0x09, 0x89, 0x6f, 0x78,
> };
>
> +/*
> + * generated in scapy with
> + * Ether()/IP()/TCP(options=[NOP,NOP,Timestamps])/os.urandom(113))
> + */
> +static const char test_cksum_ipv4_tcp_multi_segs[] = {
> + 0x00, 0x16, 0x3e, 0x0b, 0x6b, 0xd2, 0xee, 0xff,
> + 0xff, 0xff, 0xff, 0xff, 0x08, 0x00, 0x45, 0x00,
> + 0x00, 0xa5, 0x46, 0x10, 0x40, 0x00, 0x40, 0x06,
> + 0x80, 0xb5, 0xc0, 0xa8, 0xf9, 0x1d, 0xc0, 0xa8,
> + 0xf9, 0x1e, 0xdc, 0xa2, 0x14, 0x51, 0xbb, 0x8f,
> + 0xa0, 0x00, 0xe4, 0x7c, 0xe4, 0xb8, 0x80, 0x10,
> + 0x02, 0x00, 0x4b, 0xc1, 0x00, 0x00, 0x01, 0x01,
> + 0x08, 0x0a, 0x90, 0x60, 0xf4, 0xff, 0x03, 0xc5,
> + 0xb4, 0x19, 0x77, 0x34, 0xd4, 0xdc, 0x84, 0x86,
> + 0xff, 0x44, 0x09, 0x63, 0x36, 0x2e, 0x26, 0x9b,
> + 0x90, 0x70, 0xf2, 0xed, 0xc8, 0x5b, 0x87, 0xaa,
> + 0xb4, 0x67, 0x6b, 0x32, 0x3d, 0xc4, 0xbf, 0x15,
> + 0xa9, 0x16, 0x6c, 0x2a, 0x9d, 0xb2, 0xb7, 0x6b,
> + 0x58, 0x44, 0x58, 0x12, 0x4b, 0x8f, 0xe5, 0x12,
> + 0x11, 0x90, 0x94, 0x68, 0x37, 0xad, 0x0a, 0x9b,
> + 0xd6, 0x79, 0xf2, 0xb7, 0x31, 0xcf, 0x44, 0x22,
> + 0xc8, 0x99, 0x3f, 0xe5, 0xe7, 0xac, 0xc7, 0x0b,
> + 0x86, 0xdf, 0xda, 0xed, 0x0a, 0x0f, 0x86, 0xd7,
> + 0x48, 0xe2, 0xf1, 0xc2, 0x43, 0xed, 0x47, 0x3a,
> + 0xea, 0x25, 0x2d, 0xd6, 0x60, 0x38, 0x30, 0x07,
> + 0x28, 0xdd, 0x1f, 0x0c, 0xdd, 0x7b, 0x7c, 0xd9,
> + 0x35, 0x9d, 0x14, 0xaa, 0xc6, 0x35, 0xd1, 0x03,
> + 0x38, 0xb1, 0xf5,
> +};
> +
> +static const uint8_t test_cksum_ipv4_tcp_multi_segs_len[] = {
> + 66, /* the first seg contains all headers, including L2 to L4 */
> + 61, /* the second seg length is odd, test byte order independent */
> + 52, /* three segs are sufficient to test the most complex scenarios */
> +};
> +
> /* test l3/l4 checksum api */
> static int
> test_l4_cksum(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len)
> @@ -223,6 +259,70 @@ test_l4_cksum(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len)
> return -1;
> }
>
> +/* test l4 checksum api for a packet with multiple mbufs */
> +static int
> +test_l4_cksum_multi_mbufs(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len,
> + const uint8_t *segs, size_t segs_len)
> +{
> + struct rte_mbuf *m[NB_MBUF] = {0};
> + struct rte_mbuf *m_hdr = NULL;
> + struct rte_net_hdr_lens hdr_lens;
> + size_t i, off = 0;
> + uint32_t packet_type, l3;
> + void *l3_hdr;
> + char *data;
> +
> + for (i = 0; i < segs_len; i++) {
> + m[i] = rte_pktmbuf_alloc(pktmbuf_pool);
> + if (m[i] == NULL)
> + GOTO_FAIL("Cannot allocate mbuf");
> +
> + data = rte_pktmbuf_append(m[i], segs[i]);
> + if (data == NULL)
> + GOTO_FAIL("Cannot append data");
> +
> + rte_memcpy(data, pktdata + off, segs[i]);
Tests (except rte_memcpy test) should not use rte_memcpy, instead use
regular memcpy which has better coverage from analyzers.
> + off += segs[i];
> +
> + if (m_hdr) {
> + if (rte_pktmbuf_chain(m_hdr, m[i]))
> + GOTO_FAIL("Cannot chain mbuf");
> + } else {
> + m_hdr = m[i];
> + }
> + }
> +
> + if (off != len)
> + GOTO_FAIL("Invalid segs");
> +
> + packet_type = rte_net_get_ptype(m_hdr, &hdr_lens, RTE_PTYPE_ALL_MASK);
> + l3 = packet_type & RTE_PTYPE_L3_MASK;
> +
> + l3_hdr = rte_pktmbuf_mtod_offset(m_hdr, void *, hdr_lens.l2_len);
> + off = hdr_lens.l2_len + hdr_lens.l3_len;
> +
> + if (l3 == RTE_PTYPE_L3_IPV4 || l3 == RTE_PTYPE_L3_IPV4_EXT) {
> + if (rte_ipv4_udptcp_cksum_mbuf_verify(m_hdr, l3_hdr, off) != 0)
> + GOTO_FAIL("Invalid L4 checksum verification for multiple mbufs");
> + } else if (l3 == RTE_PTYPE_L3_IPV6 || l3 == RTE_PTYPE_L3_IPV6_EXT) {
> + if (rte_ipv6_udptcp_cksum_mbuf_verify(m_hdr, l3_hdr, off) != 0)
> + GOTO_FAIL("Invalid L4 checksum verification for multiple mbufs");
> + }
> +
> + for (i = 0; i < segs_len; i++)
> + rte_pktmbuf_free(m[i]);
Can avoid the loop here and elsewhere by using rte_pktmbuf_free_bulk()
> + return 0;
> +
> +fail:
> + for (i = 0; i < segs_len; i++) {
> + if (m[i])
> + rte_pktmbuf_free(m[i]);
> + }
rte_pktmbuf_free_bulk() will work here as well.
> + return -1;
> +}
> +
> static int
> test_cksum(void)
> {
> @@ -256,6 +356,12 @@ test_cksum(void)
> sizeof(test_cksum_ipv4_opts_udp)) < 0)
> GOTO_FAIL("checksum error on ipv4_opts_udp");
>
> + if (test_l4_cksum_multi_mbufs(pktmbuf_pool, test_cksum_ipv4_tcp_multi_segs,
> + sizeof(test_cksum_ipv4_tcp_multi_segs),
> + test_cksum_ipv4_tcp_multi_segs_len,
> + sizeof(test_cksum_ipv4_tcp_multi_segs_len)) < 0)
> + GOTO_FAIL("checksum error on multi mbufs check");
> +
> rte_mempool_free(pktmbuf_pool);
>
> return 0;
> diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
> index a8e8927952..679ba82eb6 100644
> --- a/lib/net/rte_cksum.h
> +++ b/lib/net/rte_cksum.h
> @@ -80,6 +80,25 @@ __rte_raw_cksum_reduce(uint32_t sum)
> return (uint16_t)sum;
> }
>
> +/**
> + * @internal Reduce a sum to the non-complemented checksum.
> + * Helper routine for the rte_raw_cksum_mbuf().
> + *
> + * @param sum
> + * Value of the sum.
> + * @return
> + * The non-complemented checksum.
> + */
> +static inline uint16_t
> +__rte_raw_cksum_reduce_u64(uint64_t sum)
> +{
> + uint32_t tmp;
> +
> + tmp = __rte_raw_cksum_reduce((uint32_t)sum);
> + tmp += __rte_raw_cksum_reduce((uint32_t)(sum >> 32));
> + return __rte_raw_cksum_reduce(tmp);
> +}
> +
> /**
> * Process the non-complemented checksum of a buffer.
> *
> @@ -119,8 +138,8 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
> {
> const struct rte_mbuf *seg;
> const char *buf;
> - uint32_t sum, tmp;
> - uint32_t seglen, done;
> + uint32_t seglen, done, tmp;
> + uint64_t sum;
>
> /* easy case: all data in the first segment */
> if (off + len <= rte_pktmbuf_data_len(m)) {
> @@ -157,7 +176,7 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
> for (;;) {
> tmp = __rte_raw_cksum(buf, seglen, 0);
> if (done & 1)
> - tmp = rte_bswap16((uint16_t)tmp);
> + tmp = rte_bswap32(tmp);
> sum += tmp;
> done += seglen;
> if (done == len)
> @@ -169,7 +188,7 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
> seglen = len - done;
> }
>
> - *cksum = __rte_raw_cksum_reduce(sum);
> + *cksum = __rte_raw_cksum_reduce_u64(sum);
> return 0;
> }
>
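For illustration only (not part of the patch): a standalone sketch of the
technique the fix relies on, namely a 64-bit accumulator so carries are
not lost, and a full 32-bit byte swap of the partial sum for segments
that start at an odd offset. All names (seg_sum, bswap32, fold64,
raw_cksum_segs) are hypothetical, and words are summed in network byte
order rather than host order as the DPDK helper does.

```c
#include <stddef.h>
#include <stdint.h>

/* Unfolded sum of big-endian 16-bit words; a trailing byte is the high byte. */
static uint32_t seg_sum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)
		sum += (uint32_t)buf[i] << 8;
	return sum;
}

static uint32_t bswap32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t fold64(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

struct seg {
	const uint8_t *data;
	size_t len;
};

/* Non-complemented checksum over a chain of segments. */
static uint16_t raw_cksum_segs(const struct seg *segs, size_t nsegs)
{
	uint64_t sum = 0; /* 64-bit so that 'sum += tmp' cannot lose a carry */
	size_t done = 0;

	for (size_t i = 0; i < nsegs; i++) {
		uint32_t tmp = seg_sum(segs[i].data, segs[i].len);

		if (done & 1)
			tmp = bswap32(tmp); /* swap all 32 bits, keep the high half */
		sum += tmp;
		done += segs[i].len;
	}
	return fold64(sum);
}
```

Splitting the same bytes into odd-length segments should give the same
result as checksumming them as one flat buffer.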
* Re: [PATCH 0/7] intel network and pcapng updates
From: Thomas Monjalon @ 2026-06-08 16:59 UTC (permalink / raw)
To: Dawid Wesierski
Cc: dev, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Wesierski, Dawid
In-Reply-To: <20260608164059.65420-1-dawid.wesierski@intel.com>
08/06/2026 18:40, Dawid Wesierski:
> From: "Wesierski, Dawid" <dawid.wesierski@intel.com>
>
> These patches provide various updates for Intel iavf/ice drivers and pcapng.
Please would you mind sending the pcapng changes in a separate series?
Thank you
* Re: [PATCH 6/7] pcapng: add user-supplied timestamp support
From: Stephen Hemminger @ 2026-06-08 16:38 UTC (permalink / raw)
To: Dawid Wesierski
Cc: dev, bruce.richardson, anatoly.burakov, vladimir.medvedkin,
reshma.pattan, thomas, andrew.rybchenko, marek.kasiewicz
In-Reply-To: <20260429073111.3712950-7-dawid.wesierski@intel.com>
On Wed, 29 Apr 2026 03:31:10 -0400
Dawid Wesierski <dawid.wesierski@intel.com> wrote:
> @@ -737,16 +736,6 @@ rte_pcapng_write_packets(rte_pcapng_t *self,
> return -1;
> }
>
> - /*
> - * When data is captured by pcapng_copy the current TSC is stored.
> - * Adjust the value recorded in file to PCAP epoch units.
> - */
> - cycles = (uint64_t)epb->timestamp_hi << 32;
> - cycles += epb->timestamp_lo;
> - timestamp = tsc_to_ns_epoch(&self->clock, cycles);
> - epb->timestamp_hi = timestamp >> 32;
> - epb->timestamp_lo = (uint32_t)timestamp;
> -
> /*
> * Handle case of highly fragmented and large burst size
> * Note: this assumes that max segments per mbuf < IOV_MAX
> diff --git a/lib/pcapng/rte_pcapng.h b/lib/pcapng/rte_pcapng.h
NAK
You need to keep the correct timestamp correction.
PCAPNG specifies times as nanoseconds since 1/1/1970.
Any new API needs a test as well.
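For illustration only: a minimal sketch of the kind of correction the
removed hunk performs, converting a raw TSC cycle count to nanoseconds
since the Unix epoch and splitting the result into the two 32-bit
timestamp fields. struct clock_ref and both functions are hypothetical;
this is not the internal tsc_to_ns_epoch() implementation.

```c
#include <stdint.h>

#define NS_PER_S 1000000000ULL

/* Hypothetical calibration: a TSC reading paired with wall-clock time. */
struct clock_ref {
	uint64_t tsc_base; /* TSC value sampled at calibration */
	uint64_t ns_base;  /* nanoseconds since 1/1/1970 at calibration */
	uint64_t tsc_hz;   /* TSC frequency in Hz */
};

/* Convert a TSC cycle count to nanoseconds since the epoch. */
static uint64_t cycles_to_ns_epoch(const struct clock_ref *c, uint64_t cycles)
{
	uint64_t delta = cycles - c->tsc_base;
	uint64_t secs = delta / c->tsc_hz;
	uint64_t rem = delta % c->tsc_hz;

	/* Whole seconds and the remainder are scaled separately to avoid overflow. */
	return c->ns_base + secs * NS_PER_S + rem * NS_PER_S / c->tsc_hz;
}

/* Split a 64-bit timestamp into the high and low 32-bit block fields. */
static void split_timestamp(uint64_t ns, uint32_t *hi, uint32_t *lo)
{
	*hi = (uint32_t)(ns >> 32);
	*lo = (uint32_t)ns;
}
```

Without this step the file would carry raw cycle counts, which a reader
cannot interpret as wall-clock time.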
* Re: [PATCH] eal: fix core_index for non-EAL registered threads
From: David Marchand @ 2026-06-08 16:35 UTC (permalink / raw)
To: Maxime Peim; +Cc: dev
In-Reply-To: <CAJFAV8wn26hRUTzCG4ai+apT8+bWL7+cEE1N-gV1tJt8B9a4hQ@mail.gmail.com>
On Mon, 8 Jun 2026 at 18:10, David Marchand <david.marchand@redhat.com> wrote:
>
> On Wed, 22 Apr 2026 at 09:54, Maxime Peim <maxime.peim@gmail.com> wrote:
> >
> > Threads registered via rte_thread_register() are assigned a valid
> > lcore_id by eal_lcore_non_eal_allocate(), but their core_index in
> > lcore_config is left at -1. This value was set during rte_eal_cpu_init()
> > for lcores with ROLE_OFF (undetected CPUs) and is never updated when the
> > lcore is later allocated to a non-EAL thread.
> >
> > As a result, rte_lcore_index() returns -1 for registered non-EAL
> > threads. Libraries that use rte_lcore_index() to select per-lcore
> > caches fall back to a shared global path when it returns -1, causing
> > severe contention under concurrent access from multiple registered
> > threads.
> >
> > A concrete example is the mlx5 indexed memory pool (mlx5_ipool), which
> > uses rte_lcore_index() in mlx5_ipool_malloc_cache() to select a per-core
> > cache slot. When core_index is -1, all registered threads are funneled
> > into a single shared slot protected by a spinlock. In testing with VPP
> > (which registers worker threads via rte_thread_register()), this caused
> > async flow rule insertion throughput to drop from ~6.4M rules/sec to
> > ~1.2M rules/sec with 4 workers -- a 5x regression attributable entirely
> > to spinlock contention in the ipool allocator.
> >
> > Fix by setting core_index to the next sequential index (cfg->lcore_count)
> > in eal_lcore_non_eal_allocate() before incrementing the count. Also reset
> > core_index back to -1 on the error rollback path and in
> > eal_lcore_non_eal_release() for correctness.
> >
> > Fixes: 5c307ba2a5b1 ("eal: register non-EAL threads as lcores")
> Cc: stable@dpdk.org
>
> > Signed-off-by: Maxime Peim <maxime.peim@gmail.com>
> Acked-by: David Marchand <david.marchand@redhat.com>
>
Hum, I did not push the change.
Re-reading this code, we have an issue if some external thread
unregisters in the middle.
What do you think of the additional hunk:
$ git diff
diff --git a/lib/eal/common/eal_common_lcore.c
b/lib/eal/common/eal_common_lcore.c
index ae085d73e4..6f53f20d90 100644
--- a/lib/eal/common/eal_common_lcore.c
+++ b/lib/eal/common/eal_common_lcore.c
@@ -372,13 +372,16 @@ eal_lcore_non_eal_allocate(void)
struct rte_config *cfg = rte_eal_get_configuration();
struct lcore_callback *callback;
struct lcore_callback *prev;
+ unsigned int index = 0;
unsigned int lcore_id;
rte_rwlock_write_lock(&lcore_lock);
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
- if (cfg->lcore_role[lcore_id] != ROLE_OFF)
+ if (cfg->lcore_role[lcore_id] != ROLE_OFF) {
+ index++;
continue;
- lcore_config[lcore_id].core_index = cfg->lcore_count;
+ }
+ lcore_config[lcore_id].core_index = index;
cfg->lcore_role[lcore_id] = ROLE_NON_EAL;
cfg->lcore_count++;
break;
--
David Marchand
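For illustration only: a standalone sketch of the position-based index
from the hunk above, using a hypothetical slot table instead of the EAL
structures. It shows the case raised here: after a thread in the middle
unregisters, an index taken from the running count would collide with an
index that is still in use, while counting the occupied slots in front
of the free one does not.

```c
/* Hypothetical roles and table; not the EAL definitions. */
enum role { ROLE_OFF = 0, ROLE_EAL, ROLE_NON_EAL };

#define MAX_SLOTS 8

struct slots {
	enum role role[MAX_SLOTS];
	int index[MAX_SLOTS]; /* -1 when the slot is unused */
	unsigned int count;   /* number of slots in use */
};

/* Take the first free slot; its index is the number of used slots before it. */
static int slot_alloc(struct slots *s)
{
	unsigned int index = 0;

	for (unsigned int id = 0; id < MAX_SLOTS; id++) {
		if (s->role[id] != ROLE_OFF) {
			index++;
			continue;
		}
		s->index[id] = (int)index;
		s->role[id] = ROLE_NON_EAL;
		s->count++;
		return (int)id;
	}
	return -1; /* table full */
}

/* Release a slot and reset its index. */
static void slot_release(struct slots *s, unsigned int id)
{
	s->role[id] = ROLE_OFF;
	s->index[id] = -1;
	s->count--;
}
```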
* Re: [PATCH] common/cnxk: fix VFIO MSI-X interrupt setup
From: Jerin Jacob @ 2026-06-08 16:35 UTC (permalink / raw)
To: pbhagavatula
Cc: jerinj, Nithin Dabilpuram, Kiran Kumar K, Sunil Kumar Kori,
Satha Rao, Harman Kalra, Stephen Hemminger, Tejasree Kondoj, dev,
stable
In-Reply-To: <20260521082647.71442-1-pbhagavatula@marvell.com>
On Thu, May 21, 2026 at 2:25 PM <pbhagavatula@marvell.com> wrote:
>
> From: Pavan Nikhilesh <pbhagavatula@marvell.com>
>
> Use heap allocation sized by the configured maximum interrupt count for the
> VFIO irq_set buffer, correct handling when irq.count is zero, and use a
> minimal stack buffer for per-vector configuration.
>
> Fixes: 1fb9f4ab14b3 ("common/cnxk: remove VLA in interrupt configuration")
> Cc: stable@dpdk.org
>
> Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Series applied to dpdk-next-net-mrvl/for-main. Thanks
* RE: [PATCH v3] net/iavf: fix duplicate VF reset during PF reset recovery
From: Mandal, Anurag @ 2026-06-08 16:32 UTC (permalink / raw)
To: Loftus, Ciara, dev@dpdk.org
Cc: Richardson, Bruce, Medvedkin, Vladimir, stable@dpdk.org
In-Reply-To: <IA4PR11MB9278AB660100B34D0001DC478E1C2@IA4PR11MB9278.namprd11.prod.outlook.com>
> -----Original Message-----
> From: Loftus, Ciara <ciara.loftus@intel.com>
> Sent: 08 June 2026 15:29
> To: Mandal, Anurag <anurag.mandal@intel.com>; dev@dpdk.org
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Medvedkin, Vladimir
> <vladimir.medvedkin@intel.com>; Mandal, Anurag
> <anurag.mandal@intel.com>; stable@dpdk.org
> Subject: RE: [PATCH v3] net/iavf: fix duplicate VF reset during PF reset recovery
>
> > Subject: [PATCH v3] net/iavf: fix duplicate VF reset during PF reset
> > recovery
> >
> > During PF initiated reset recovery, iavf_dev_close() sending an extra
> > VIRTCHNL_OP_RESET_VF while recovery is already in progress.
> > That second reset can leave PF/VF virtchnl state inconsistent and
> > cause VIRTCHNL_OP_CONFIG_VSI_QUEUES to fail with ERR_PARAM after ToR
> > link flap/power-cycle, leaving the VF unable to recover.
> > This results in connection loss.
> >
> > Skipped close-time VF reset and related close-time virtchnl operations
> > when PF triggered reset recovery is set. This is done to avoid a
> > duplicate VF reset, and keep normal behavior for application-driven
> > close.
> > Handled link-change events through a common static function that reads
> > the correct advanced & legacy link fields properly and updates
> > no-poll/watchdog/LSC state consistently.
> > Also added IAVF_ERR_ADMIN_QUEUE_NO_WORK in virtchnl message drain
> as a
> > normal empty-queue condition and avoid logging it as an misleading AQ
> > failure.
> >
> > Fixes: 675a104e2e94 ("net/iavf: fix abnormal disable HW interrupt")
> > Fixes: b34fe66ea893 ("net/iavf: delay VF reset command")
> > Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without
> > interrupt")
> > Fixes: 5c8ca9f13c78 ("net/iavf: fix no polling mode switching")
> > Fixes: 48de41ca11f0 ("net/avf: enable link status update")
> > Fixes: 02d212ca3125 ("net/iavf: rename remaining avf strings")
> > Cc: stable@dpdk.org
>
> Hi Anurag,
>
> Thanks for the patch. There seems to be multiple logical fixes/changes in here
> and I think it would be good to split them into individual patches, each with
> their own Fixes tag where relevant. Having multiple fixes in one patch with
> multiple Fixes tags makes backporting tricky.
> I think at least logic which prevents the RESET_VF during a PF initiated reset
> should be split out from the link-change logic.
>
> Thanks,
> Ciara
>
Hi Ciara,
Thank you for your review comments. I have split out the VF reset changes and sent them as patch v4. I will send the other changes as a fresh patch.
Kindly review.
Thanks,
Anurag
> >
> > Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
> > ---
> > V3: Addressed latest ai-code-review comments
> > V2: Addressed ai-code-review comments
> >
> > doc/guides/rel_notes/release_26_07.rst | 3 +
> > drivers/net/intel/iavf/iavf_ethdev.c | 37 +++---
> > drivers/net/intel/iavf/iavf_vchnl.c | 155 ++++++++++++++++---------
> > 3 files changed, 123 insertions(+), 72 deletions(-)
> >
> > diff --git a/doc/guides/rel_notes/release_26_07.rst
> > b/doc/guides/rel_notes/release_26_07.rst
> > index b8a3e2ced9..e7ac730369 100644
> > --- a/doc/guides/rel_notes/release_26_07.rst
> > +++ b/doc/guides/rel_notes/release_26_07.rst
> > @@ -89,6 +89,9 @@ New Features
> >
> > * Added support for transmitting LLDP packets based on mbuf packet type.
> > * Implemented AVX2 context descriptor transmit paths.
> > + * Prevented duplicate 'VIRTCHNL_OP_RESET_VF' during a PF-initiated
> > + reset recovery, which earlier caused virtchnl state corruption
> > + and connection loss after a top-of-rack (ToR) link flap/power-cycle.
> >
> > * **Updated PCAP ethernet driver.**
> >
> > diff --git a/drivers/net/intel/iavf/iavf_ethdev.c
> > b/drivers/net/intel/iavf/iavf_ethdev.c
> > index bdf650b822..fb6f287d3c 100644
> > --- a/drivers/net/intel/iavf/iavf_ethdev.c
> > +++ b/drivers/net/intel/iavf/iavf_ethdev.c
> > @@ -3166,24 +3166,27 @@ iavf_dev_close(struct rte_eth_dev *dev)
> >
> > ret = iavf_dev_stop(dev);
> >
> > - /*
> > - * Release redundant queue resource when close the dev
> > - * so that other vfs can re-use the queues.
> > - */
> > - if (vf->lv_enabled) {
> > - ret = iavf_request_queues(dev,
> > IAVF_MAX_NUM_QUEUES_DFLT);
> > - if (ret)
> > - PMD_DRV_LOG(ERR, "Reset the num of queues
> > failed");
> > + /* Skip RESET_VF on a PF-initiated reset */
> > + if (!adapter->closed && !vf->in_reset_recovery) {
>
> adapter->closed will always be false here so the check is redundant.
>
> > + /*
> > + * Release redundant queue resource when close the dev
> > + * so that other vfs can re-use the queues.
> > + */
> > + if (vf->lv_enabled) {
> > + ret = iavf_request_queues(dev,
> > IAVF_MAX_NUM_QUEUES_DFLT);
> > + if (ret)
> > + PMD_DRV_LOG(ERR, "Reset the num of
> > queues failed");
> > + vf->max_rss_qregion =
> > IAVF_MAX_NUM_QUEUES_DFLT;
> > + }
> >
>
* [PATCH v4] net/iavf: fix duplicate VF reset during PF reset recovery
From: Anurag Mandal @ 2026-06-08 16:29 UTC (permalink / raw)
To: dev
Cc: bruce.richardson, vladimir.medvedkin, ciara.loftus, Anurag Mandal,
stable
In-Reply-To: <20260605202911.314359-1-anurag.mandal@intel.com>
During PF-initiated reset recovery, iavf_dev_close() sends
an extra VIRTCHNL_OP_RESET_VF while recovery is already in progress.
That second reset can leave PF/VF virtchnl state inconsistent and
cause VIRTCHNL_OP_CONFIG_VSI_QUEUES to fail with ERR_PARAM after a
ToR link flap/power-cycle, leaving the VF unable to recover.
This results in connection loss.
This patch skips the close-time VF reset and the related close-time
virtchnl operations when PF-triggered reset recovery is in progress.
This avoids a duplicate VF reset and keeps the normal behavior for
an application-driven close.
Fixes: 675a104e2e94 ("net/iavf: fix abnormal disable HW interrupt")
Fixes: b34fe66ea893 ("net/iavf: delay VF reset command")
Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without interrupt")
Cc: stable@dpdk.org
Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
---
V4: Addressed Ciara Loftus comments
- split VF reset from other code changes
V3: Addressed latest ai-code-review comments
V2: Addressed ai-code-review comments
doc/guides/rel_notes/release_26_07.rst | 3 +++
drivers/net/intel/iavf/iavf_ethdev.c | 37 +++++++++++++++-----------
drivers/net/intel/iavf/iavf_vchnl.c | 18 ++++++++++---
3 files changed, 39 insertions(+), 19 deletions(-)
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index d2563ac503..f6899a78c3 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -95,6 +95,9 @@ New Features
* Added support for transmitting LLDP packets based on mbuf packet type.
* Implemented AVX2 context descriptor transmit paths.
+ * Prevented duplicate 'VIRTCHNL_OP_RESET_VF' during a PF-initiated
+ reset recovery, which earlier caused virtchnl state corruption
+ and connection loss after a top-of-rack (ToR) link flap/power-cycle.
* **Updated PCAP ethernet driver.**
diff --git a/drivers/net/intel/iavf/iavf_ethdev.c b/drivers/net/intel/iavf/iavf_ethdev.c
index a8031e23a5..99457ae510 100644
--- a/drivers/net/intel/iavf/iavf_ethdev.c
+++ b/drivers/net/intel/iavf/iavf_ethdev.c
@@ -3166,24 +3166,27 @@ iavf_dev_close(struct rte_eth_dev *dev)
ret = iavf_dev_stop(dev);
- /*
- * Release redundant queue resource when close the dev
- * so that other vfs can re-use the queues.
- */
- if (vf->lv_enabled) {
- ret = iavf_request_queues(dev, IAVF_MAX_NUM_QUEUES_DFLT);
- if (ret)
- PMD_DRV_LOG(ERR, "Reset the num of queues failed");
+ /* Skip RESET_VF on a PF-initiated reset */
+ if (!adapter->closed && !vf->in_reset_recovery) {
+ /*
+ * Release redundant queue resource when close the dev
+ * so that other vfs can re-use the queues.
+ */
+ if (vf->lv_enabled) {
+ ret = iavf_request_queues(dev, IAVF_MAX_NUM_QUEUES_DFLT);
+ if (ret)
+ PMD_DRV_LOG(ERR, "Reset the num of queues failed");
+ vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
+ }
- vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
+ /*
+ * Disable promiscuous mode before resetting the VF. This is to avoid
+ * potential issues when the PF is bound to the kernel driver.
+ */
+ if (vf->promisc_unicast_enabled || vf->promisc_multicast_enabled)
+ iavf_config_promisc(adapter, false, false);
}
- /* Disable promiscuous mode before resetting the VF. This is to avoid
- * potential issues when the PF is bound to the kernel driver.
- */
- if (vf->promisc_unicast_enabled || vf->promisc_multicast_enabled)
- iavf_config_promisc(adapter, false, false);
-
adapter->closed = true;
/* free iAVF security device context all related resources */
@@ -3195,7 +3198,9 @@ iavf_dev_close(struct rte_eth_dev *dev)
iavf_flow_flush(dev, NULL);
iavf_flow_uninit(adapter);
- iavf_vf_reset(hw);
+ /* Skip RESET_VF on a PF-initiated reset */
+ if (!vf->in_reset_recovery)
+ iavf_vf_reset(hw);
vf->aq_intr_enabled = false;
iavf_shutdown_adminq(hw);
if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_WB_ON_ITR) {
diff --git a/drivers/net/intel/iavf/iavf_vchnl.c b/drivers/net/intel/iavf/iavf_vchnl.c
index 94ccfb5d6e..cf3513ef94 100644
--- a/drivers/net/intel/iavf/iavf_vchnl.c
+++ b/drivers/net/intel/iavf/iavf_vchnl.c
@@ -283,9 +283,21 @@ iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
vf->link_up ? "up" : "down");
break;
case VIRTCHNL_EVENT_RESET_IMPENDING:
- vf->vf_reset = true;
- iavf_set_no_poll(adapter, false);
- PMD_DRV_LOG(INFO, "VF is resetting");
+ /*
+ * Force link down on impending reset to drop
+ * the cached link-up state; a fresh LSC up
+ * event will be re-issued by the PF once the
+ * VF is reinitialised.
+ */
+ vf->link_up = false;
+ if (!vf->vf_reset) {
+ vf->vf_reset = true;
+ iavf_set_no_poll(adapter, false);
+ iavf_dev_event_post(vf->eth_dev,
+ RTE_ETH_EVENT_INTR_RESET,
+ NULL, 0);
+ }
+ PMD_DRV_LOG(DEBUG, "VF is resetting");
break;
case VIRTCHNL_EVENT_PF_DRIVER_CLOSE:
vf->dev_closed = true;
--
2.34.1
^ permalink raw reply related
* Re: [PATCH] lib/net: Add ICMP support to rte_net_get_ptype()
From: Stephen Hemminger @ 2026-06-08 16:28 UTC (permalink / raw)
To: Eimear Morrissey; +Cc: dev@dpdk.org
In-Reply-To: <03abaccc37f14f4c8955580784f30cbe@huawei.com>
On Fri, 31 Oct 2025 12:32:18 +0000
Eimear Morrissey <eimear.morrissey@huawei.com> wrote:
> > -----Original Message-----
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Sent: Wednesday 15 October 2025 18:30
> > To: Eimear Morrissey <eimear.morrissey@huawei.com>
> > Cc: dev@dpdk.org
> > Subject: Re: [PATCH] lib/net: Add ICMP support to rte_net_get_ptype()
> >
> > On Thu, 9 Oct 2025 16:27:40 +0100
> > Eimear Morrissey <eimear.morrissey@huawei.com> wrote:
> >
> > > Set RTE_PTYPE_L4_ICMP for ICMP packets.
> > >
> > > Signed-off-by: Eimear Morrissey <eimear.morrissey@huawei.com>
> >
> > Would be good to do ICMP6 as well.
>
>
> Should an ICMPv6 packet be RTE_PTYPE_L4_ICMP/RTE_PTYPE_INNER_L4_ICMP as well?
>
> The comments in rte_mbuf_ptype.h are inconsistent: the top-level comment with examples suggests that 'version'=6, 'next header'=0x3A
> should be RTE_PTYPE_INNER_L4_ICMP, but the comment on RTE_PTYPE_INNER_L4_ICMP itself says 'version'=6, 'next header'=1?
>
> -Eimear
static const uint32_t ptype_l4_proto[256] = {
[IPPROTO_ICMP] = RTE_PTYPE_L4_ICMP,
[IPPROTO_ICMPV6] = RTE_PTYPE_L4_ICMP,
[IPPROTO_UDP] = RTE_PTYPE_L4_UDP,
...
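To make the table idea concrete, here is a minimal standalone sketch of a
256-entry lookup keyed by the IPv4 protocol / IPv6 next-header byte. The EX_*
names and their values are stand-ins invented for this example, not the real
RTE_PTYPE_* or IPPROTO_* definitions. Protocols not listed fall through to
zero, which the sketch treats as "unknown".

```c
#include <assert.h>
#include <stdint.h>

/* IANA protocol numbers, spelled out so the sketch has no OS dependency. */
#define EX_IPPROTO_ICMP    1
#define EX_IPPROTO_TCP     6
#define EX_IPPROTO_UDP    17
#define EX_IPPROTO_ICMPV6 58

/* Stand-in packet type values (hypothetical, for illustration only). */
#define EX_PTYPE_UNKNOWN  0x0u
#define EX_PTYPE_L4_TCP   0x1u
#define EX_PTYPE_L4_UDP   0x2u
#define EX_PTYPE_L4_ICMP  0x5u

/* Designated initializers: every index not named here is zero. */
static const uint32_t ex_ptype_l4_proto[256] = {
	[EX_IPPROTO_ICMP]   = EX_PTYPE_L4_ICMP,
	[EX_IPPROTO_ICMPV6] = EX_PTYPE_L4_ICMP,
	[EX_IPPROTO_TCP]    = EX_PTYPE_L4_TCP,
	[EX_IPPROTO_UDP]    = EX_PTYPE_L4_UDP,
};

/* A uint8_t index can never run past the end of a 256-entry table. */
static inline uint32_t
ex_l4_ptype(uint8_t proto)
{
	return ex_ptype_l4_proto[proto];
}
```

With this layout ICMP and ICMPv6 both resolve to the same L4 type, and the
lookup stays branch-free.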
^ permalink raw reply
* Re: [PATCH v1 2/2] dma/odm: avoid zero length DMA transfers
From: Jerin Jacob @ 2026-06-08 16:26 UTC (permalink / raw)
To: Shijith Thotton; +Cc: Gowrishankar Muthukrishnan, Vidya Sagar Velumuri, dev
In-Reply-To: <20260601101559.1925302-3-sthotton@marvell.com>
On Mon, Jun 1, 2026 at 3:46 PM Shijith Thotton <sthotton@marvell.com> wrote:
>
> Add validation to reject zero-length DMA operations early
> with -EINVAL, preventing queue disable.
>
> Signed-off-by: Shijith Thotton <sthotton@marvell.com>
Added Fixes to 2/2 patch.
Series applied to dpdk-next-net-mrvl/for-main. Thanks
> ---
> drivers/dma/odm/odm_dmadev.c | 18 +++++++++++++++---
> 1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/dma/odm/odm_dmadev.c b/drivers/dma/odm/odm_dmadev.c
> index 0211133bd4..7488b960fd 100644
> --- a/drivers/dma/odm/odm_dmadev.c
> +++ b/drivers/dma/odm/odm_dmadev.c
> @@ -110,6 +110,9 @@ odm_dmadev_copy(void *dev_private, uint16_t vchan, rte_iova_t src, rte_iova_t ds
> vq = &odm->vq[vchan];
> hdr.s.xtype = vq->xtype;
>
> + if (unlikely(!length))
> + return -EINVAL;
> +
> h = length;
> h |= ((uint64_t)length << 32);
>
> @@ -262,14 +265,20 @@ odm_dmadev_copy_sg(void *dev_private, uint16_t vchan, const struct rte_dma_sge *
> pending_submit_len = vq->pending_submit_len;
> pending_submit_cnt = vq->pending_submit_cnt;
>
> - if (unlikely(nb_src > 4 || nb_dst > 4))
> + if (unlikely(!nb_src || nb_src > 4 || !nb_dst || nb_dst > 4))
> return -EINVAL;
>
> - for (i = 0; i < nb_src; i++)
> + for (i = 0; i < nb_src; i++) {
> + if (unlikely(!src[i].length))
> + return -EINVAL;
> s_sz += src[i].length;
> + }
>
> - for (i = 0; i < nb_dst; i++)
> + for (i = 0; i < nb_dst; i++) {
> + if (unlikely(!dst[i].length))
> + return -EINVAL;
> d_sz += dst[i].length;
> + }
>
> if (s_sz != d_sz)
> return -EINVAL;
> @@ -342,6 +351,9 @@ odm_dmadev_fill(void *dev_private, uint16_t vchan, uint64_t pattern, rte_iova_t
> .s.nlst = 1,
> };
>
> + if (unlikely(!length))
> + return -EINVAL;
> +
> h = (uint64_t)length;
>
> switch (pattern) {
> --
> 2.25.1
>
^ permalink raw reply
* Re: [PATCH v16 4/5] vhost: add mem region add/remove handlers
From: Maxime Coquelin @ 2026-06-08 16:16 UTC (permalink / raw)
To: pravin.bathija; +Cc: dev, stephen, fengchengwen, thomas
In-Reply-To: <20260606025211.1082615-5-pravin.bathija@dell.com>
On Sat, Jun 6, 2026 at 4:52 AM <pravin.bathija@dell.com> wrote:
>
> From: Pravin M Bathija <pravin.bathija@dell.com>
>
> Add support for VHOST_USER_ADD_MEM_REG, VHOST_USER_REM_MEM_REG and
> VHOST_USER_GET_MAX_MEM_SLOTS. Refactor memory initialization into
> common helper and add supporting functions for dynamic memory management.
>
> Signed-off-by: Pravin M Bathija <pravin.bathija@dell.com>
> ---
> lib/vhost/vhost_user.c | 266 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 266 insertions(+)
>
> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> index 94fca8b589..020c993b29 100644
> --- a/lib/vhost/vhost_user.c
> +++ b/lib/vhost/vhost_user.c
> @@ -71,6 +71,9 @@ VHOST_MESSAGE_HANDLER(VHOST_USER_SET_FEATURES, vhost_user_set_features, false, t
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_OWNER, vhost_user_set_owner, false, true) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_RESET_OWNER, vhost_user_reset_owner, false, false) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_MEM_TABLE, vhost_user_set_mem_table, true, true) \
> +VHOST_MESSAGE_HANDLER(VHOST_USER_GET_MAX_MEM_SLOTS, vhost_user_get_max_mem_slots, false, false) \
> +VHOST_MESSAGE_HANDLER(VHOST_USER_ADD_MEM_REG, vhost_user_add_mem_reg, true, true) \
> +VHOST_MESSAGE_HANDLER(VHOST_USER_REM_MEM_REG, vhost_user_rem_mem_reg, false, true) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_LOG_BASE, vhost_user_set_log_base, true, true) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_LOG_FD, vhost_user_set_log_fd, true, true) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_VRING_NUM, vhost_user_set_vring_num, false, true) \
> @@ -1167,6 +1170,24 @@ add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg,
> return 0;
> }
>
> +static void
> +remove_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg)
> +{
> + uint64_t reg_start = reg->host_user_addr;
> + uint64_t reg_end = reg_start + reg->size;
> + uint32_t i, j = 0;
> +
> + for (i = 0; i < dev->nr_guest_pages; i++) {
> + if (dev->guest_pages[i].host_user_addr >= reg_start &&
> + dev->guest_pages[i].host_user_addr < reg_end)
> + continue;
> + if (j != i)
> + dev->guest_pages[j] = dev->guest_pages[i];
> + j++;
> + }
> + dev->nr_guest_pages = j;
> +}
> +
> #ifdef RTE_LIBRTE_VHOST_DEBUG
> /* TODO: enable it only in debug mode? */
> static void
> @@ -1591,6 +1612,251 @@ vhost_user_set_mem_table(struct virtio_net **pdev,
> return RTE_VHOST_MSG_RESULT_ERR;
> }
>
> +
> +static int
> +vhost_user_get_max_mem_slots(struct virtio_net **pdev __rte_unused,
> + struct vhu_msg_context *ctx,
> + int main_fd __rte_unused)
> +{
> + uint32_t max_mem_slots = VHOST_MEMORY_MAX_NREGIONS;
> +
> + ctx->msg.payload.u64 = max_mem_slots;
> + ctx->msg.size = sizeof(ctx->msg.payload.u64);
> + ctx->fd_num = 0;
> +
> + return RTE_VHOST_MSG_RESULT_REPLY;
> +}
> +
> +/*
> + * Invalidate and re-translate all vring addresses after the memory table
> + * has been modified (add/remove region).
> + *
> + * translate_ring_addresses() may call numa_realloc(), which can reallocate
> + * the device structure. The updated pointer is written back through *pdev
> + * so callers must refresh their local "dev" afterwards: dev = *pdev.
> + */
> +static void
> +vhost_user_invalidate_vrings(struct virtio_net **pdev)
> +{
> + struct virtio_net *dev = *pdev;
> + uint32_t i;
> +
> + for (i = 0; i < dev->nr_vring; i++) {
> + struct vhost_virtqueue *vq = dev->virtqueue[i];
> +
> + if (!vq)
> + continue;
> +
> + if (vq->desc || vq->avail || vq->used) {
> + vq_assert_lock(dev, vq);
> +
> + vring_invalidate(dev, vq);
> +
> + translate_ring_addresses(&dev, &vq);
> + }
> + }
> +
> + *pdev = dev;
> +}
> +
> +/*
> + * Macro wrapper that performs the compile-time lock assertion with the
> + * correct message ID at the call site, then calls the implementation.
> + */
> +#define dev_invalidate_vrings(pdev, id) do { \
> + static_assert(id ## _LOCK_ALL_QPS, \
> + #id " handler is not declared as locking all queue pairs"); \
> + vhost_user_invalidate_vrings(pdev); \
> +} while (0)
> +
> +static int
> +vhost_user_add_mem_reg(struct virtio_net **pdev,
> + struct vhu_msg_context *ctx,
> + int main_fd __rte_unused)
> +{
> + struct VhostUserMemoryRegion *region = &ctx->msg.payload.memreg.region;
> + struct virtio_net *dev = *pdev;
> + uint32_t i;
> +
> + /* convert first region add to normal memory table set */
> + if (dev->mem == NULL) {
> + if (vhost_user_initialize_memory(pdev) < 0)
> + goto close_msg_fds;
> + }
> +
> + /* make sure new region will fit */
> + if (dev->mem->nregions >= VHOST_MEMORY_MAX_NREGIONS) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "too many memory regions already (%u)",
> + dev->mem->nregions);
> + goto close_msg_fds;
> + }
> +
> + /* make sure supplied memory fd present */
> + if (ctx->fd_num != 1) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "fd count makes no sense (%u)", ctx->fd_num);
> + goto close_msg_fds;
> + }
> +
> + /* Make sure no overlap in guest virtual address space */
> + for (i = 0; i < dev->mem->nregions; i++) {
> + struct rte_vhost_mem_region *cur = &dev->mem->regions[i];
> + uint64_t cur_start = cur->guest_user_addr;
> + uint64_t cur_end = cur_start + cur->size - 1;
> + uint64_t new_start = region->userspace_addr;
> + uint64_t new_end = new_start + region->memory_size - 1;
> +
> + if (new_end >= cur_start && new_start <= cur_end) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "requested memory region overlaps with another region");
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "\tRequested region address:0x%" PRIx64,
> + region->userspace_addr);
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "\tRequested region size:0x%" PRIx64,
> + region->memory_size);
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "\tOverlapping region address:0x%" PRIx64,
> + cur->guest_user_addr);
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "\tOverlapping region size:0x%" PRIx64,
> + cur->size);
> + goto close_msg_fds;
> + }
> + }
> +
> + /* New region goes at the end of the contiguous array */
> + struct rte_vhost_mem_region *reg = &dev->mem->regions[dev->mem->nregions];
> +
> + reg->guest_phys_addr = region->guest_phys_addr;
> + reg->guest_user_addr = region->userspace_addr;
> + reg->size = region->memory_size;
> + reg->fd = ctx->fds[0];
> + ctx->fds[0] = -1;
> +
> + if (vhost_user_mmap_region(dev, reg, region->mmap_offset) < 0) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "failed to mmap region");
> + if (reg->mmap_addr) {
> + /* mmap succeeded but a later step (e.g. add_guest_pages)
> + * failed; undo the mapping and any guest-page entries.
> + */
> + remove_guest_pages(dev, reg);
> + free_mem_region(reg);
> + } else {
> + close(reg->fd);
> + reg->fd = -1;
> + }
> + goto close_msg_fds;
> + }
> +
> + dev->mem->nregions++;
> +
> + if (dev->async_copy && rte_vfio_is_enabled("vfio")) {
> + if (async_dma_map_region(dev, reg, true) < 0)
> + goto free_new_region_no_dma;
> + }
> +
> + if (dev->postcopy_listening) {
> + /*
> + * Cannot use vhost_user_postcopy_register() here because it
> + * reads ctx->msg.payload.memory (SET_MEM_TABLE layout), but
> + * ADD_MEM_REG uses the memreg payload. Register the
> + * single new region directly instead.
> + */
> + if (vhost_user_postcopy_region_register(dev, reg) < 0)
> + goto free_new_region;
> + }
> +
> + dev_invalidate_vrings(pdev, VHOST_USER_ADD_MEM_REG);
> + dev = *pdev;
> + dump_guest_pages(dev);
> +
> + /*
> + * In postcopy mode the front-end expects the back-end to reply with
> + * the base of the mapped region (see VHOST_USER_SET_MEM_TABLE, which
> + * applies here accordingly). No reply is expected otherwise.
> + *
> + * translate_ring_addresses() above may have reallocated dev->mem via
> + * numa_realloc(), so re-derive the region pointer from the refreshed
> + * dev rather than using the now-stale reg. The new region is the last
> + * entry in the contiguous array.
> + */
> + if (dev->postcopy_listening) {
> + reg = &dev->mem->regions[dev->mem->nregions - 1];
> + ctx->msg.payload.memreg.region.userspace_addr = reg->host_user_addr;
> + ctx->msg.size = sizeof(ctx->msg.payload.memreg);
> + ctx->fd_num = 0;
> + return RTE_VHOST_MSG_RESULT_REPLY;
> + }
Thanks Stephen, good catch by the AI.
I did some digging into the QEMU code, which the AI later confirmed. If the
series was tested, the test passed "by accident" when postcopy was not
enabled: QEMU would read the padding field of the payload and treat its
value as RTE_VHOST_MSG_RESULT_OK because it is zero-initialized...
With this fix:
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> +
> + return RTE_VHOST_MSG_RESULT_OK;
> +
> +free_new_region:
> + if (dev->async_copy && rte_vfio_is_enabled("vfio"))
> + async_dma_map_region(dev, reg, false);
> +free_new_region_no_dma:
> + remove_guest_pages(dev, reg);
> + free_mem_region(reg);
> + dev->mem->nregions--;
> +close_msg_fds:
> + close_msg_fds(ctx);
> + return RTE_VHOST_MSG_RESULT_ERR;
> +}
> +
> +static int
> +vhost_user_rem_mem_reg(struct virtio_net **pdev,
> + struct vhu_msg_context *ctx,
> + int main_fd __rte_unused)
> +{
> + struct VhostUserMemoryRegion *region = &ctx->msg.payload.memreg.region;
> + struct virtio_net *dev = *pdev;
> + uint32_t i;
> +
> + if (dev->mem == NULL || dev->mem->nregions == 0) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "no memory regions to remove");
> + return RTE_VHOST_MSG_RESULT_ERR;
> + }
> +
> + for (i = 0; i < dev->mem->nregions; i++) {
> + struct rte_vhost_mem_region *current_region = &dev->mem->regions[i];
> +
> + /*
> + * According to the vhost-user specification:
> + * The memory region to be removed is identified by its GPA,
> + * user address and size. The mmap offset is ignored.
> + */
> + if (region->userspace_addr == current_region->guest_user_addr
> + && region->guest_phys_addr == current_region->guest_phys_addr
> + && region->memory_size == current_region->size) {
> + if (dev->async_copy && rte_vfio_is_enabled("vfio"))
> + async_dma_map_region(dev, current_region, false);
> + if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
> + vhost_user_iotlb_cache_remove(dev,
> + current_region->guest_phys_addr,
> + current_region->size);
> + remove_guest_pages(dev, current_region);
> + free_mem_region(current_region);
> +
> + /* Compact the regions array to keep it contiguous */
> + if (i < dev->mem->nregions - 1) {
> + memmove(&dev->mem->regions[i],
> + &dev->mem->regions[i + 1],
> + (dev->mem->nregions - 1 - i) *
> + sizeof(struct rte_vhost_mem_region));
> + memset(&dev->mem->regions[dev->mem->nregions - 1],
> + 0, sizeof(struct rte_vhost_mem_region));
> + }
> +
> + dev->mem->nregions--;
> + dev_invalidate_vrings(pdev, VHOST_USER_REM_MEM_REG);
> + dev = *pdev;
> + return RTE_VHOST_MSG_RESULT_OK;
> + }
> + }
> +
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "failed to find region");
> + return RTE_VHOST_MSG_RESULT_ERR;
> +}
> +
> static bool
> vq_is_ready(struct virtio_net *dev, struct vhost_virtqueue *vq)
> {
> --
> 2.43.0
>
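As a side note on the overlap test in vhost_user_add_mem_reg(), the
inclusive-end comparison can be sketched standalone as follows. The function
name is hypothetical, and the sketch assumes non-zero sizes and no wraparound
of start + size, which the quoted code does not check explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Two ranges [start, start + size - 1] overlap when each one begins at or
 * before the other one's last byte. Assumes size != 0 and that
 * start + size - 1 does not wrap around (assumption of this sketch).
 */
static bool
ex_regions_overlap(uint64_t a_start, uint64_t a_size,
		   uint64_t b_start, uint64_t b_size)
{
	uint64_t a_end, b_end;

	assert(a_size != 0 && b_size != 0);

	a_end = a_start + a_size - 1;
	b_end = b_start + b_size - 1;

	return a_end >= b_start && a_start <= b_end;
}
```

With inclusive ends, two regions that merely touch (one ending at 0x1fff, the
next starting at 0x2000) are not reported as overlapping.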
^ permalink raw reply
* [PATCH 7/7] net/ice: add header split mbuf callback support
From: Dawid Wesierski @ 2026-06-08 16:40 UTC (permalink / raw)
To: dev
Cc: thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Marek Kasiewicz,
Dawid Wesierski
In-Reply-To: <20260608164059.65420-1-dawid.wesierski@intel.com>
From: Marek Kasiewicz <marek.kasiewicz@intel.com>
Add an ethdev API rte_eth_hdrs_set_mbuf_callback() that allows
applications to register a callback providing custom payload mbufs
for header split RX mode. When registered, the ICE PMD calls this
callback at mbuf allocation points to obtain user-provided payload
buffers instead of allocating from the mempool.
This enables zero-copy RX for header split: the NIC DMAs the payload
directly into application-managed buffers (e.g., mapped frame buffers
with known IOVA), bypassing an extra memcpy from the mempool mbuf.
The callback is invoked at three allocation points in the ICE driver:
initial queue setup, bulk buffer allocation, and single-packet
receive path.
Signed-off-by: Marek Kasiewicz <marek.kasiewicz@intel.com>
Signed-off-by: Dawid Wesierski <dawid.wesierski@intel.com>
---
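For illustration only, a minimal sketch of what an application-side callback
might look like. The ex_* types below are hypothetical stand-ins for
struct rte_eth_hdrs_mbuf and rte_iova_t so that the example builds without
DPDK; the frame pool and its fields are assumptions of the sketch, not part of
this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t ex_iova_t; /* stand-in for rte_iova_t */

/* Stand-in for struct rte_eth_hdrs_mbuf from this patch. */
struct ex_hdrs_mbuf {
	void *buf_addr;
	ex_iova_t buf_iova;
};

/* Hypothetical application pool: one contiguous, pre-mapped area. */
struct ex_frame_pool {
	uint8_t *va_base;    /* virtual address of the area */
	ex_iova_t iova_base; /* IOVA of the same area */
	size_t buf_size;     /* bytes per payload buffer */
	uint32_t nb_bufs;    /* number of buffers in the area */
	uint32_t next;       /* next buffer to hand out */
};

/* Matches the callback shape: fill the descriptor, 0 on success. */
static int
ex_payload_cb(void *priv, struct ex_hdrs_mbuf *out)
{
	struct ex_frame_pool *pool = priv;
	size_t off;

	assert(pool != NULL && out != NULL);

	if (pool->next >= pool->nb_bufs)
		return -ENOMEM; /* pool exhausted */

	off = (size_t)pool->next * pool->buf_size;
	out->buf_addr = pool->va_base + off;
	out->buf_iova = pool->iova_base + off;
	pool->next++;
	return 0;
}
```

In the driver hunks below, a negative return leaves the mempool-provided
buffer in place, and data_off is still set to RTE_PKTMBUF_HEADROOM, so
application buffers presumably need to leave room for that headroom.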
drivers/net/intel/common/rx.h | 2 +
drivers/net/intel/ice/ice_ethdev.c | 1 +
drivers/net/intel/ice/ice_rxtx.c | 63 ++++++++++++++++++++++++++++++
drivers/net/intel/ice/ice_rxtx.h | 2 +
lib/ethdev/ethdev_driver.h | 10 +++++
lib/ethdev/rte_ethdev.c | 17 ++++++++
lib/ethdev/rte_ethdev.h | 46 ++++++++++++++++++++++
7 files changed, 141 insertions(+)
diff --git a/drivers/net/intel/common/rx.h b/drivers/net/intel/common/rx.h
index e0bf520ebd..8abb2a3ce9 100644
--- a/drivers/net/intel/common/rx.h
+++ b/drivers/net/intel/common/rx.h
@@ -113,6 +113,8 @@ struct ci_rx_queue {
uint32_t hw_time_low; /* low 32 bits of timestamp */
int ts_offset; /* dynamic mbuf timestamp field offset */
uint64_t ts_flag; /* dynamic mbuf timestamp flag */
+ rte_eth_hdrs_mbuf_callback_fn hdrs_mbuf_cb; /* hdr split mbuf cb */
+ void *hdrs_mbuf_cb_priv; /* hdr split mbuf cb priv */
};
struct { /* iavf specific values */
const struct iavf_rxq_ops *ops; /**< queue ops */
diff --git a/drivers/net/intel/ice/ice_ethdev.c b/drivers/net/intel/ice/ice_ethdev.c
index b7cea3bfc1..fb15438dbc 100644
--- a/drivers/net/intel/ice/ice_ethdev.c
+++ b/drivers/net/intel/ice/ice_ethdev.c
@@ -282,6 +282,7 @@ static const struct eth_dev_ops ice_eth_dev_ops = {
.dev_set_link_down = ice_dev_set_link_down,
.dev_led_on = ice_dev_led_on,
.dev_led_off = ice_dev_led_off,
+ .hdrs_mbuf_set_cb = ice_hdrs_mbuf_set_cb,
.rx_queue_start = ice_rx_queue_start,
.rx_queue_stop = ice_rx_queue_stop,
.tx_queue_start = ice_tx_queue_start,
diff --git a/drivers/net/intel/ice/ice_rxtx.c b/drivers/net/intel/ice/ice_rxtx.c
index 8d709125f7..867f595291 100644
--- a/drivers/net/intel/ice/ice_rxtx.c
+++ b/drivers/net/intel/ice/ice_rxtx.c
@@ -487,6 +487,17 @@ ice_alloc_rx_queue_mbufs(struct ci_rx_queue *rxq)
return -ENOMEM;
}
+ if (rxq->hdrs_mbuf_cb) {
+ struct rte_eth_hdrs_mbuf hdrs_mbuf = {0};
+ int ret = rxq->hdrs_mbuf_cb(rxq->hdrs_mbuf_cb_priv,
+ &hdrs_mbuf);
+
+ if (ret >= 0) {
+ mbuf_pay->buf_addr = hdrs_mbuf.buf_addr;
+ mbuf_pay->buf_iova = hdrs_mbuf.buf_iova;
+ }
+ }
+
mbuf_pay->next = NULL;
mbuf_pay->data_off = RTE_PKTMBUF_HEADROOM;
mbuf_pay->nb_segs = 1;
@@ -2126,6 +2137,16 @@ ice_rx_alloc_bufs(struct ci_rx_queue *rxq)
rxdp[i].read.pkt_addr = dma_addr;
} else {
mb->next = rxq->sw_split_buf[i].mbuf;
+ if (rxq->hdrs_mbuf_cb && mb->next) {
+ struct rte_eth_hdrs_mbuf hdrs_mbuf = {0};
+ int ret = rxq->hdrs_mbuf_cb(rxq->hdrs_mbuf_cb_priv,
+ &hdrs_mbuf);
+
+ if (ret >= 0) {
+ mb->next->buf_addr = hdrs_mbuf.buf_addr;
+ mb->next->buf_iova = hdrs_mbuf.buf_iova;
+ }
+ }
pay_addr = rte_cpu_to_le_64(rte_mbuf_data_iova_default(mb->next));
rxdp[i].read.hdr_addr = dma_addr;
rxdp[i].read.pkt_addr = pay_addr;
@@ -2810,6 +2831,17 @@ ice_recv_pkts(void *rx_queue,
break;
}
+ if (rxq->hdrs_mbuf_cb) {
+ struct rte_eth_hdrs_mbuf hdrs_mbuf = {0};
+ int ret = rxq->hdrs_mbuf_cb(rxq->hdrs_mbuf_cb_priv,
+ &hdrs_mbuf);
+
+ if (ret >= 0) {
+ nmb_pay->buf_addr = hdrs_mbuf.buf_addr;
+ nmb_pay->buf_iova = hdrs_mbuf.buf_iova;
+ }
+ }
+
nmb->next = nmb_pay;
nmb_pay->next = NULL;
@@ -4533,3 +4565,34 @@ ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc)
}
+
+int
+ice_hdrs_mbuf_set_cb(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+ void *priv, rte_eth_hdrs_mbuf_callback_fn cb)
+{
+ struct ci_rx_queue *rxq;
+
+ if (rx_queue_id >= dev->data->nb_rx_queues) {
+ PMD_DRV_LOG(ERR, "RX queue %u out of range", rx_queue_id);
+ return -EINVAL;
+ }
+
+ rxq = dev->data->rx_queues[rx_queue_id];
+ if (rxq == NULL) {
+ PMD_DRV_LOG(ERR, "RX queue %u not available or setup", rx_queue_id);
+ return -EINVAL;
+ }
+
+ if (rxq->hdrs_mbuf_cb) {
+ PMD_DRV_LOG(ERR, "RX queue %u has hdrs mbuf cb already",
+ rx_queue_id);
+ return -EEXIST;
+ }
+
+ rxq->hdrs_mbuf_cb_priv = priv;
+ rxq->hdrs_mbuf_cb = cb;
+ PMD_DRV_LOG(NOTICE, "RX queue %u register hdrs mbuf cb at %p",
+ rx_queue_id, cb);
+
+ return 0;
+}
diff --git a/drivers/net/intel/ice/ice_rxtx.h b/drivers/net/intel/ice/ice_rxtx.h
index 999b6b30d6..7ed114ee94 100644
--- a/drivers/net/intel/ice/ice_rxtx.h
+++ b/drivers/net/intel/ice/ice_rxtx.h
@@ -303,6 +303,8 @@ uint16_t ice_xmit_pkts_vec_avx512_offload(void *tx_queue,
int ice_fdir_programming(struct ice_pf *pf, struct ice_fltr_desc *fdir_desc);
int ice_tx_done_cleanup(void *txq, uint32_t free_cnt);
int ice_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
+int ice_hdrs_mbuf_set_cb(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+ void *priv, rte_eth_hdrs_mbuf_callback_fn cb);
enum rte_vect_max_simd ice_get_max_simd_bitwidth(void);
#define FDIR_PARSING_ENABLE_PER_QUEUE(ad, on) do { \
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 0f336f9567..b48681268c 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -1292,6 +1292,13 @@ typedef int (*eth_cman_config_set_t)(struct rte_eth_dev *dev,
typedef int (*eth_cman_config_get_t)(struct rte_eth_dev *dev,
struct rte_eth_cman_config *config);
+/** @internal
+ * Set header split payload mbuf callback for a receive queue.
+ */
+typedef int (*eth_hdrs_mbuf_set_cb_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id, void *priv,
+ rte_eth_hdrs_mbuf_callback_fn cb);
+
/**
* @internal
* Dump Rx descriptor info to a file.
@@ -1652,6 +1659,9 @@ struct eth_dev_ops {
/** Dump Tx descriptor info */
eth_tx_descriptor_dump_t eth_tx_descriptor_dump;
+ /** Set header split mbuf callback */
+ eth_hdrs_mbuf_set_cb_t hdrs_mbuf_set_cb;
+
/** Get congestion management information */
eth_cman_info_get_t cman_info_get;
/** Initialize congestion management structure with default values */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 9efeaf77cb..d5820ccd22 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -7316,6 +7316,23 @@ rte_eth_ip_reassembly_conf_set(uint16_t port_id,
return ret;
}
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_eth_hdrs_set_mbuf_callback, 26.07)
+int
+rte_eth_hdrs_set_mbuf_callback(uint16_t port_id, uint16_t rx_queue_id,
+ void *priv, rte_eth_hdrs_mbuf_callback_fn cb)
+{
+ struct rte_eth_dev *dev;
+
+ RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+ dev = &rte_eth_devices[port_id];
+
+ if (dev->dev_ops->hdrs_mbuf_set_cb == NULL)
+ return -ENOTSUP;
+
+ return eth_err(port_id,
+ dev->dev_ops->hdrs_mbuf_set_cb(dev, rx_queue_id, priv, cb));
+}
+
RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_eth_dev_priv_dump, 22.03)
int
rte_eth_dev_priv_dump(uint16_t port_id, FILE *file)
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index ee400b386f..dbf2c23a35 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -6985,6 +6985,52 @@ rte_eth_tx_buffer(uint16_t port_id, uint16_t queue_id,
return rte_eth_tx_buffer_flush(port_id, queue_id, buffer);
}
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Buffer descriptor for header split payload mbuf callback.
+ */
+struct rte_eth_hdrs_mbuf {
+ void *buf_addr; /**< Virtual address of payload buffer. */
+ rte_iova_t buf_iova; /**< IOVA of payload buffer. */
+};
+
+/**
+ * Callback function type for providing custom payload mbufs
+ * in header split mode.
+ *
+ * @param priv
+ * User-provided private context.
+ * @param mbuf
+ * Pointer to buffer descriptor to be filled by the callback.
+ * @return
+ * 0 on success, negative errno on failure.
+ */
+typedef int (*rte_eth_hdrs_mbuf_callback_fn)(void *priv,
+ struct rte_eth_hdrs_mbuf *mbuf);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change, or be removed, without prior notice.
+ *
+ * Register a callback to provide custom payload mbufs for header split RX.
+ *
+ * @param port_id
+ * The port identifier of the Ethernet device.
+ * @param rx_queue_id
+ * The index of the receive queue.
+ * @param priv
+ * User-provided private context passed to the callback.
+ * @param cb
+ * Callback function that provides payload buffer descriptors.
+ * @return
+ * 0 on success, negative errno on failure.
+ */
+__rte_experimental
+int rte_eth_hdrs_set_mbuf_callback(uint16_t port_id, uint16_t rx_queue_id,
+ void *priv, rte_eth_hdrs_mbuf_callback_fn cb);
+
/**
* @warning
* @b EXPERIMENTAL: this API may change, or be removed, without prior notice
--
2.47.3
---------------------------------------------------------------------
Intel Technology Poland sp. z o.o.
ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | Kapital zakladowy 200.000 PLN.
Spolka oswiadcza, ze posiada status duzego przedsiebiorcy w rozumieniu ustawy z dnia 8 marca 2013 r. o przeciwdzialaniu nadmiernym opoznieniom w transakcjach handlowych.
Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i moze zawierac informacje poufne. W razie przypadkowego otrzymania tej wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; jakiekolwiek przegladanie lub rozpowszechnianie jest zabronione.
This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). If you are not the intended recipient, please contact the sender and delete all copies; any review or distribution by others is strictly prohibited.
^ permalink raw reply related
* [PATCH 6/7] pcapng: add user-supplied timestamp support
From: Dawid Wesierski @ 2026-06-08 16:40 UTC (permalink / raw)
To: dev
Cc: thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Marek Kasiewicz,
Dawid Wesierski
In-Reply-To: <20260608164059.65420-1-dawid.wesierski@intel.com>
From: Marek Kasiewicz <marek.kasiewicz@intel.com>
Add rte_pcapng_copy_ts() which accepts an optional timestamp parameter
in nanoseconds. When the timestamp is non-zero, it is used directly
instead of reading the TSC. This allows applications to provide
hardware PTP timestamps from the NIC, enabling accurate packet capture
with PTP-domain timing rather than host-local TSC values.
The existing rte_pcapng_copy() function is preserved as a static inline
wrapper that passes zero for backward compatibility.
The TSC-to-epoch conversion in the write path is removed since callers
providing hardware timestamps have already performed the conversion.
Signed-off-by: Marek Kasiewicz <marek.kasiewicz@intel.com>
Signed-off-by: Dawid Wesierski <dawid.wesierski@intel.com>
---
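For illustration only, the timestamp selection and the hi/lo split used in the
enhanced packet block can be sketched standalone as below. The ex_* names are
hypothetical; the fallback argument stands in for the value that
rte_get_tsc_cycles() would return.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the two 32-bit timestamp fields of the packet block. */
struct ex_epb_ts {
	uint32_t hi;
	uint32_t lo;
};

/* Zero means "no timestamp supplied": use the fallback clock value. */
static uint64_t
ex_select_ts(uint64_t user_ts, uint64_t fallback)
{
	return user_ts != 0 ? user_ts : fallback;
}

/* Split a 64-bit timestamp into the high and low 32-bit halves. */
static struct ex_epb_ts
ex_split_ts(uint64_t ts)
{
	struct ex_epb_ts out = {
		.hi = (uint32_t)(ts >> 32),
		.lo = (uint32_t)ts,
	};

	return out;
}

/* Inverse of ex_split_ts(), as a reader of the block would rebuild it. */
static uint64_t
ex_join_ts(struct ex_epb_ts t)
{
	return ((uint64_t)t.hi << 32) | t.lo;
}
```

Because zero is the sentinel, a caller cannot pass a genuine timestamp of 0 ns
through this interface.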
lib/pcapng/rte_pcapng.c | 19 ++++---------------
lib/pcapng/rte_pcapng.h | 41 +++++++++++++++++++++++++++++++++++++++--
2 files changed, 43 insertions(+), 17 deletions(-)
diff --git a/lib/pcapng/rte_pcapng.c b/lib/pcapng/rte_pcapng.c
index b5d1026891..96b3aafeb6 100644
--- a/lib/pcapng/rte_pcapng.c
+++ b/lib/pcapng/rte_pcapng.c
@@ -546,14 +546,14 @@ pcapng_vlan_insert(struct rte_mbuf *m, uint16_t ether_type, uint16_t tci)
*/
/* Make a copy of original mbuf with pcapng header and options */
-RTE_EXPORT_SYMBOL(rte_pcapng_copy)
+RTE_EXPORT_SYMBOL(rte_pcapng_copy_ts)
struct rte_mbuf *
-rte_pcapng_copy(uint16_t port_id, uint32_t queue,
+rte_pcapng_copy_ts(uint16_t port_id, uint32_t queue,
const struct rte_mbuf *md,
struct rte_mempool *mp,
uint32_t length,
enum rte_pcapng_direction direction,
- const char *comment)
+ const char *comment, uint64_t ts)
{
struct pcapng_enhance_packet_block *epb;
uint32_t orig_len, pkt_len, padding, flags;
@@ -691,7 +691,7 @@ rte_pcapng_copy(uint16_t port_id, uint32_t queue,
mc->port = port_id;
/* Put timestamp in cycles here - adjust in packet write */
- timestamp = rte_get_tsc_cycles();
+ timestamp = ts ? ts : rte_get_tsc_cycles();
epb->timestamp_hi = timestamp >> 32;
epb->timestamp_lo = (uint32_t)timestamp;
epb->capture_length = pkt_len;
@@ -720,7 +720,6 @@ rte_pcapng_write_packets(rte_pcapng_t *self,
for (i = 0; i < nb_pkts; i++) {
struct rte_mbuf *m = pkts[i];
struct pcapng_enhance_packet_block *epb;
- uint64_t cycles, timestamp;
/* sanity check that is really a pcapng mbuf */
epb = rte_pktmbuf_mtod(m, struct pcapng_enhance_packet_block *);
@@ -737,16 +736,6 @@ rte_pcapng_write_packets(rte_pcapng_t *self,
return -1;
}
- /*
- * When data is captured by pcapng_copy the current TSC is stored.
- * Adjust the value recorded in file to PCAP epoch units.
- */
- cycles = (uint64_t)epb->timestamp_hi << 32;
- cycles += epb->timestamp_lo;
- timestamp = tsc_to_ns_epoch(&self->clock, cycles);
- epb->timestamp_hi = timestamp >> 32;
- epb->timestamp_lo = (uint32_t)timestamp;
-
/*
* Handle case of highly fragmented and large burst size
* Note: this assumes that max segments per mbuf < IOV_MAX
diff --git a/lib/pcapng/rte_pcapng.h b/lib/pcapng/rte_pcapng.h
index d8d328f710..3d735e4ebe 100644
--- a/lib/pcapng/rte_pcapng.h
+++ b/lib/pcapng/rte_pcapng.h
@@ -109,7 +109,7 @@ enum rte_pcapng_direction {
};
/**
- * Format an mbuf for writing to file.
+ * Format an mbuf with time stamp for writing to file.
*
* @param port_id
* The Ethernet port on which packet was received
@@ -129,16 +129,53 @@ enum rte_pcapng_direction {
* @param comment
* Optional per packet comment.
* Truncated to UINT16_MAX characters.
+ * @param ts
+ * Optional timestamp in nanoseconds. If zero, the current TSC is used.
*
* @return
* - The pointer to the new mbuf formatted for pcapng_write
* - NULL on error such as invalid port or out of memory.
*/
struct rte_mbuf *
+rte_pcapng_copy_ts(uint16_t port_id, uint32_t queue,
+ const struct rte_mbuf *m, struct rte_mempool *mp,
+ uint32_t length,
+ enum rte_pcapng_direction direction, const char *comment, uint64_t ts);
+
+/**
+ * Format an mbuf for writing to file.
+ *
+ * @param port_id
+ * The Ethernet port on which packet was received
+ * or is going to be transmitted.
+ * @param queue
+ * The queue on the Ethernet port where packet was received
+ * or is going to be transmitted.
+ * @param mp
+ * The mempool from which the "clone" mbufs are allocated.
+ * @param m
+ * The mbuf to copy
+ * @param length
+ * The upper limit on bytes to copy. Passing UINT32_MAX
+ * means all data (after offset).
+ * @param direction
+ * The direction of the packet: receive, transmit or unknown.
+ * @param comment
+ * Packet comment.
+ *
+ * @return
+ * - The pointer to the new mbuf formatted for pcapng_write
+ * - NULL if allocation fails.
+ */
+static inline struct rte_mbuf *
rte_pcapng_copy(uint16_t port_id, uint32_t queue,
const struct rte_mbuf *m, struct rte_mempool *mp,
uint32_t length,
- enum rte_pcapng_direction direction, const char *comment);
+ enum rte_pcapng_direction direction, const char *comment)
+{
+ return rte_pcapng_copy_ts(port_id, queue, m, mp, length, direction,
+ comment, 0);
+}
/**
--
2.47.3
^ permalink raw reply related
* [PATCH 5/7] net/iavf: disable runtime queue setup capability
From: Dawid Wesierski @ 2026-06-08 16:40 UTC (permalink / raw)
To: dev
Cc: thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Marek Kasiewicz,
Dawid Wesierski
In-Reply-To: <20260608164059.65420-1-dawid.wesierski@intel.com>
From: Marek Kasiewicz <marek.kasiewicz@intel.com>
Remove the advertisement of RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP
and RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP capabilities from the
iavf VF driver.
Runtime queue setup on E810 VFs causes queue state corruption when
queues are dynamically reconfigured while the hardware rate limiter
is actively pacing TX queues. Queue configuration messages to the PF
via virtchnl can race with ongoing TX operations, leading to undefined
behavior.
By not advertising these capabilities, all queues are configured at
port start and remain stable throughout the port lifecycle.
Signed-off-by: Marek Kasiewicz <marek.kasiewicz@intel.com>
Signed-off-by: Dawid Wesierski <dawid.wesierski@intel.com>
---
drivers/net/intel/iavf/iavf_ethdev.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/net/intel/iavf/iavf_ethdev.c b/drivers/net/intel/iavf/iavf_ethdev.c
index a8031e23a5..4f6325ef78 100644
--- a/drivers/net/intel/iavf/iavf_ethdev.c
+++ b/drivers/net/intel/iavf/iavf_ethdev.c
@@ -1159,9 +1159,6 @@ iavf_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
dev_info->reta_size = vf->vf_res->rss_lut_size;
dev_info->flow_type_rss_offloads = IAVF_RSS_OFFLOAD_ALL;
dev_info->max_mac_addrs = IAVF_NUM_MACADDR_MAX;
- dev_info->dev_capa =
- RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP |
- RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP;
dev_info->rx_offload_capa =
RTE_ETH_RX_OFFLOAD_VLAN_STRIP |
RTE_ETH_RX_OFFLOAD_QINQ_STRIP |
--
2.47.3
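A minimal sketch of what dropping the two capability bits means for an application, assuming the usual ethdev rule that queue setup on a started port is only permitted when the matching runtime capability is advertised. The `CAPA_*` macros are local stand-ins for `RTE_ETH_DEV_CAPA_RUNTIME_RX_QUEUE_SETUP` and `RTE_ETH_DEV_CAPA_RUNTIME_TX_QUEUE_SETUP` (bit positions assumed), and `can_setup_tx_queue()` is a hypothetical helper, not a DPDK API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the RTE_ETH_DEV_CAPA_RUNTIME_*_QUEUE_SETUP flags.
 * The bit positions are an assumption made for this sketch. */
#define CAPA_RUNTIME_RX_QUEUE_SETUP (UINT64_C(1) << 0)
#define CAPA_RUNTIME_TX_QUEUE_SETUP (UINT64_C(1) << 1)

/* Hypothetical helper: may a TX queue be (re)configured right now?
 * A stopped port can always be configured; a started port only if the
 * driver advertises the runtime TX queue setup capability. */
static bool
can_setup_tx_queue(uint64_t dev_capa, bool port_started)
{
	if (!port_started)
		return true;
	return (dev_capa & CAPA_RUNTIME_TX_QUEUE_SETUP) != 0;
}
```

With `dev_capa` left at zero, as after this patch, the helper only allows queue setup while the port is stopped, which matches the stated intent that all queues are configured at port start.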
^ permalink raw reply related
* [PATCH 4/7] net/ice: timestamp all received packets when PTP is enabled
From: Dawid Wesierski @ 2026-06-08 16:40 UTC (permalink / raw)
To: dev
Cc: thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Marek Kasiewicz,
Dawid Wesierski
In-Reply-To: <20260608164059.65420-1-dawid.wesierski@intel.com>
From: Marek Kasiewicz <marek.kasiewicz@intel.com>
When PTP is enabled on the ICE PMD, hardware RX timestamps are only
applied to packets classified as IEEE 1588 (Ethertype 0x88F7). This
prevents applications from obtaining hardware timestamps on regular
UDP/IP traffic.
Remove the TIMESYNC packet type filter so that all received packets
get hardware timestamps when PTP is enabled. This is required for
time-sensitive networking applications that need per-packet arrival
timing on media traffic, such as ST 2110-21 receiver compliance
monitoring.
The change affects all three RX paths: scan, scattered, and single
packet receive functions.
Signed-off-by: Marek Kasiewicz <marek.kasiewicz@intel.com>
Signed-off-by: Dawid Wesierski <dawid.wesierski@intel.com>
---
drivers/net/intel/ice/ice_rxtx.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/drivers/net/intel/ice/ice_rxtx.c b/drivers/net/intel/ice/ice_rxtx.c
index c4b5454c53..8d709125f7 100644
--- a/drivers/net/intel/ice/ice_rxtx.c
+++ b/drivers/net/intel/ice/ice_rxtx.c
@@ -2023,8 +2023,7 @@ ice_rx_scan_hw_ring(struct ci_rx_queue *rxq)
pkt_flags |= rxq->ts_flag;
}
- if (ad->ptp_ena && ((mb->packet_type &
- RTE_PTYPE_L2_MASK) == RTE_PTYPE_L2_ETHER_TIMESYNC)) {
+ if (ad->ptp_ena) {
rxq->time_high =
rte_le_to_cpu_32(rxdp[j].wb.flex_ts.ts_high);
mb->timesync = rxq->queue_id;
@@ -2390,8 +2389,7 @@ ice_recv_scattered_pkts(void *rx_queue,
pkt_flags |= rxq->ts_flag;
}
- if (ad->ptp_ena && ((first_seg->packet_type & RTE_PTYPE_L2_MASK)
- == RTE_PTYPE_L2_ETHER_TIMESYNC)) {
+ if (ad->ptp_ena) {
rxq->time_high =
rte_le_to_cpu_32(rxd.wb.flex_ts.ts_high);
first_seg->timesync = rxq->queue_id;
@@ -2881,8 +2879,7 @@ ice_recv_pkts(void *rx_queue,
pkt_flags |= rxq->ts_flag;
}
- if (ad->ptp_ena && ((rxm->packet_type & RTE_PTYPE_L2_MASK) ==
- RTE_PTYPE_L2_ETHER_TIMESYNC)) {
+ if (ad->ptp_ena) {
rxq->time_high =
rte_le_to_cpu_32(rxd.wb.flex_ts.ts_high);
rxm->timesync = rxq->queue_id;
--
2.47.3
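A minimal sketch of the predicate change made in all three RX paths, written as two standalone functions so the before and after behaviour can be compared. The `PTYPE_*` macros are local copies of the corresponding `RTE_PTYPE_L2_*` constants with values assumed for this sketch, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the RTE_PTYPE_L2_* constants; values assumed here. */
#define PTYPE_L2_MASK           0x0000000fu
#define PTYPE_L2_ETHER          0x00000001u
#define PTYPE_L2_ETHER_TIMESYNC 0x00000002u

/* Condition before the patch: only IEEE 1588 frames get a timestamp. */
static bool
rx_timestamp_before(bool ptp_ena, uint32_t packet_type)
{
	return ptp_ena &&
	       (packet_type & PTYPE_L2_MASK) == PTYPE_L2_ETHER_TIMESYNC;
}

/* Condition after the patch: every packet is stamped once PTP is on. */
static bool
rx_timestamp_after(bool ptp_ena, uint32_t packet_type)
{
	(void)packet_type; /* packet type no longer matters */
	return ptp_ena;
}
```

A plain Ethernet frame carrying UDP/IP has `PTYPE_L2_ETHER` in its L2 nibble, so it fails the old check and passes the new one; with PTP disabled neither variant applies a timestamp.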
^ permalink raw reply related
* [PATCH 3/7] net/ice/base: reduce default scheduler burst size
From: Dawid Wesierski @ 2026-06-08 16:40 UTC (permalink / raw)
To: dev
Cc: thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Marek Kasiewicz,
Dawid Wesierski
In-Reply-To: <20260608164059.65420-1-dawid.wesierski@intel.com>
From: Marek Kasiewicz <marek.kasiewicz@intel.com>
Reduce ICE_SCHED_DFLT_BURST_SIZE from 15 KB to 2 KB to improve
TX rate limiter granularity. The E810 TX scheduler uses a token
bucket algorithm where the burst size controls the maximum bytes
sent in a single burst before the rate limiter throttles.
A 15 KB burst allows micro-bursts of ~10 max-size frames, which
violates tight inter-packet spacing requirements in time-sensitive
networking applications such as SMPTE ST 2110-21 narrow-sender
compliance. Reducing to 2 KB forces near-constant-rate output
matching the configured shaper profile.
Signed-off-by: Marek Kasiewicz <marek.kasiewicz@intel.com>
Signed-off-by: Dawid Wesierski <dawid.wesierski@intel.com>
---
drivers/net/intel/ice/base/ice_type.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/intel/ice/base/ice_type.h b/drivers/net/intel/ice/base/ice_type.h
index 6d8c187689..39569ff3e3 100644
--- a/drivers/net/intel/ice/base/ice_type.h
+++ b/drivers/net/intel/ice/base/ice_type.h
@@ -1100,7 +1100,7 @@ enum ice_rl_type {
#define ICE_SCHED_NO_SHARED_RL_PROF_ID 0xFFFF
#define ICE_SCHED_DFLT_BW_WT 4
#define ICE_SCHED_INVAL_PROF_ID 0xFFFF
-#define ICE_SCHED_DFLT_BURST_SIZE (15 * 1024) /* in bytes (15k) */
+#define ICE_SCHED_DFLT_BURST_SIZE (2 * 1024) /* in bytes (2k) */
/* Access Macros for Tx Sched RL Profile data */
#define ICE_TXSCHED_GET_RL_PROF_ID(p) LE16_TO_CPU((p)->info.profile_id)
--
2.47.3
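The "~10 max-size frames" figure in the commit message can be reproduced with a generic token bucket. This is only a sketch under stated assumptions: it is not the E810 scheduler's implementation, the 1518-byte frame size is an assumed maximum standard Ethernet frame (no VLAN tag, no jumbo frames), and all names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Burst sizes taken from the diff above. */
#define OLD_BURST_BYTES (15 * 1024)
#define NEW_BURST_BYTES (2 * 1024)
/* Assumption: maximum standard Ethernet frame, no VLAN tag, no jumbo. */
#define MAX_FRAME_BYTES 1518

/* Generic token bucket; not the actual hardware scheduler. */
struct token_bucket {
	uint32_t tokens; /* bytes currently available */
	uint32_t burst;  /* bucket depth in bytes */
};

/* Consume tokens for one frame; false means the shaper would throttle. */
static bool
tb_try_send(struct token_bucket *tb, uint32_t frame_bytes)
{
	if (frame_bytes > tb->tokens)
		return false;
	tb->tokens -= frame_bytes;
	return true;
}

/* Count frames that can leave back to back starting from a full bucket. */
static uint32_t
frames_per_full_bucket(uint32_t burst_bytes, uint32_t frame_bytes)
{
	struct token_bucket tb = { .tokens = burst_bytes, .burst = burst_bytes };
	uint32_t sent = 0;

	if (frame_bytes == 0)
		return 0; /* avoid looping forever on a zero-length frame */
	while (tb_try_send(&tb, frame_bytes))
		sent++;
	return sent;
}
```

Under these assumptions a full 15 KB bucket releases 10 maximum-size frames back to back, while a 2 KB bucket releases one, which is the micro-burst reduction the patch is after.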
^ permalink raw reply related
* [PATCH 2/7] net/iavf: allow runtime queue rate limit configuration
From: Dawid Wesierski @ 2026-06-08 16:40 UTC (permalink / raw)
To: dev
Cc: thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Marek Kasiewicz,
Dawid Wesierski
In-Reply-To: <20260608164059.65420-1-dawid.wesierski@intel.com>
From: Marek Kasiewicz <marek.kasiewicz@intel.com>
Allow per-queue bandwidth rate limiting to be configured without
stopping the port when only a single TC node and single QoS element
are involved. This enables dynamic session management where individual
queue pacing rates can be changed while other queues continue
transmitting.
Also fix the queue ID assignment in the bandwidth configuration to
use the actual TM node ID rather than a sequential counter index, and
only mark the TM hierarchy as committed when the port is stopped to
permit subsequent reconfiguration.
Signed-off-by: Marek Kasiewicz <marek.kasiewicz@intel.com>
Signed-off-by: Dawid Wesierski <dawid.wesierski@intel.com>
---
drivers/net/intel/iavf/iavf_tm.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/net/intel/iavf/iavf_tm.c b/drivers/net/intel/iavf/iavf_tm.c
index 1cf7bfb106..43d7a44337 100644
--- a/drivers/net/intel/iavf/iavf_tm.c
+++ b/drivers/net/intel/iavf/iavf_tm.c
@@ -804,8 +804,10 @@ static int iavf_hierarchy_commit(struct rte_eth_dev *dev,
int index = 0, node_committed = 0;
int i, ret_val = IAVF_SUCCESS;
- /* check if port is stopped */
- if (adapter->stopped != 1) {
+ /* check if port is stopped, except for setting queue bandwidth */
+ if (vf->tm_conf.nb_tc_node != 1 &&
+ vf->qos_cap->num_elem != 1 &&
+ adapter->stopped != 1) {
PMD_DRV_LOG(ERR, "Please stop port first");
ret_val = IAVF_ERR_NOT_READY;
goto err;
@@ -856,7 +858,7 @@ static int iavf_hierarchy_commit(struct rte_eth_dev *dev,
q_tc_mapping->tc[tm_node->tc].req.queue_count++;
if (tm_node->shaper_profile) {
- q_bw->cfg[node_committed].queue_id = node_committed;
+ q_bw->cfg[node_committed].queue_id = tm_node->id;
q_bw->cfg[node_committed].shaper.peak =
tm_node->shaper_profile->profile.peak.rate /
1000 * IAVF_BITS_PER_BYTE;
@@ -900,7 +902,8 @@ static int iavf_hierarchy_commit(struct rte_eth_dev *dev,
goto fail_clear;
vf->qtc_map = qtc_map;
- vf->tm_conf.committed = true;
+ if (adapter->stopped == 1)
+ vf->tm_conf.committed = true;
return ret_val;
fail_clear:
--
2.47.3
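A minimal sketch of the relaxed guard and the peak-rate arithmetic from the diff, reduced to plain C so the combinations can be checked in isolation. The function names and scalar parameters are hypothetical stand-ins for `vf->tm_conf.nb_tc_node`, `vf->qos_cap->num_elem` and `adapter->stopped`; `BITS_PER_BYTE` stands in for `IAVF_BITS_PER_BYTE` and is assumed to be 8, and the input rate is assumed to be in bytes per second.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_PER_BYTE 8 /* assumed value of IAVF_BITS_PER_BYTE */

/* Mirrors the condition in the diff: true means the commit is rejected
 * with "Please stop port first". */
static bool
commit_requires_stop(uint32_t nb_tc_node, uint32_t num_elem, bool stopped)
{
	return nb_tc_node != 1 && num_elem != 1 && !stopped;
}

/* Mirrors the peak-rate expression in the diff. The integer division
 * runs first, so any remainder below 1000 bytes/s is dropped. */
static uint64_t
peak_rate_field(uint64_t bytes_per_sec)
{
	return bytes_per_sec / 1000 * BITS_PER_BYTE;
}
```

As written in the diff, the stop check is skipped whenever either count equals 1, because all three inequalities must hold for the error path to be taken. For the rate, 1,250,000,000 bytes per second becomes 10,000,000, while 1,999 bytes per second becomes 8.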
^ permalink raw reply related
* [PATCH 1/7] net/iavf: increase max ring descriptors to hardware limit
From: Dawid Wesierski @ 2026-06-08 16:40 UTC (permalink / raw)
To: dev
Cc: thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Marek Kasiewicz,
Dawid Wesierski
In-Reply-To: <20260608164059.65420-1-dawid.wesierski@intel.com>
From: Marek Kasiewicz <marek.kasiewicz@intel.com>
The Intel E810 hardware supports up to 8160 (8K - 32) descriptors per
TX/RX ring, but IAVF_MAX_RING_DESC caps it at 4096. Applications that
need deep descriptor rings for hardware rate-limited pacing (e.g.,
ST 2110 video with thousands of packets per frame) cannot queue enough
packets before the pacing epoch begins.
Increase IAVF_MAX_RING_DESC to the hardware maximum of 8160 to allow
full utilization of the ring depth on E810 VFs.
Signed-off-by: Marek Kasiewicz <marek.kasiewicz@intel.com>
Signed-off-by: Dawid Wesierski <dawid.wesierski@intel.com>
---
drivers/net/intel/iavf/iavf_rxtx.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/intel/iavf/iavf_rxtx.h b/drivers/net/intel/iavf/iavf_rxtx.h
index 8449236d4d..22ea415f44 100644
--- a/drivers/net/intel/iavf/iavf_rxtx.h
+++ b/drivers/net/intel/iavf/iavf_rxtx.h
@@ -16,7 +16,7 @@
/* In QLEN must be whole number of 32 descriptors. */
#define IAVF_ALIGN_RING_DESC 32
#define IAVF_MIN_RING_DESC 64
-#define IAVF_MAX_RING_DESC 4096
+#define IAVF_MAX_RING_DESC (8192 - 32)
#define IAVF_DMA_MEM_ALIGN 4096
/* Base address of the HW descriptor ring should be 128B aligned. */
#define IAVF_RING_BASE_ALIGN 128
--
2.47.3
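The new limit can be sanity-checked against the alignment rule quoted in the header comment ("QLEN must be whole number of 32 descriptors"). The sketch below copies the three macros from the diff; `ring_desc_count_is_valid()` is a hypothetical helper that assumes the driver accepts any 32-aligned count between the minimum and maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the diff above. */
#define IAVF_ALIGN_RING_DESC 32
#define IAVF_MIN_RING_DESC   64
#define IAVF_MAX_RING_DESC   (8192 - 32)

/* The new maximum must itself be a whole number of 32-descriptor units. */
_Static_assert(IAVF_MAX_RING_DESC % IAVF_ALIGN_RING_DESC == 0,
	       "maximum ring size must be 32-aligned");

/* Hypothetical helper: is this descriptor count within range and aligned? */
static bool
ring_desc_count_is_valid(uint16_t nb_desc)
{
	return nb_desc >= IAVF_MIN_RING_DESC &&
	       nb_desc <= IAVF_MAX_RING_DESC &&
	       nb_desc % IAVF_ALIGN_RING_DESC == 0;
}
```

8160 is 255 units of 32 descriptors, so it passes the alignment check, whereas a full 8192 would exceed the stated hardware maximum.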
^ permalink raw reply related
* [PATCH 0/7] intel network and pcapng updates
From: Dawid Wesierski @ 2026-06-08 16:40 UTC (permalink / raw)
To: dev
Cc: thomas, david.marchand, vladimir.medvedkin, bruce.richardson,
anatoly.burakov, reshma.pattan, stephen, Wesierski, Dawid
From: "Wesierski, Dawid" <dawid.wesierski@intel.com>
These patches provide various updates for Intel iavf/ice drivers and pcapng.
The changes include:
- Raise the iavf maximum ring descriptor count to the hardware limit.
- Allow runtime queue rate limit configuration for iavf.
- Reduce the default scheduler burst size in ice base.
- Timestamp all received packets in ice when PTP is enabled.
- Disable the runtime queue setup capability for iavf.
- Add user-supplied timestamp support to pcapng.
- Add header split mbuf callback support for ice.
Marek Kasiewicz (7):
net/iavf: increase max ring descriptors to hardware limit
net/iavf: allow runtime queue rate limit configuration
net/ice/base: reduce default scheduler burst size
net/ice: timestamp all received packets when PTP is enabled
net/iavf: disable runtime queue setup capability
pcapng: add user-supplied timestamp support
net/ice: add header split mbuf callback support
drivers/net/intel/common/rx.h | 2 +
drivers/net/intel/iavf/iavf_ethdev.c | 3 --
drivers/net/intel/iavf/iavf_rxtx.h | 2 +-
drivers/net/intel/iavf/iavf_tm.c | 11 ++--
drivers/net/intel/ice/base/ice_type.h | 2 +-
drivers/net/intel/ice/ice_ethdev.c | 1 +
drivers/net/intel/ice/ice_rxtx.c | 72 ++++++++++++++++++++++++---
drivers/net/intel/ice/ice_rxtx.h | 2 +
lib/ethdev/ethdev_driver.h | 10 ++++
lib/ethdev/rte_ethdev.c | 17 +++++++
lib/ethdev/rte_ethdev.h | 46 +++++++++++++++++
lib/pcapng/rte_pcapng.c | 19 ++-----
lib/pcapng/rte_pcapng.h | 41 ++++++++++++++-
13 files changed, 196 insertions(+), 32 deletions(-)
--
2.47.3
^ permalink raw reply
* Re: [PATCH] eal: fix core_index for non-EAL registered threads
From: David Marchand @ 2026-06-08 16:10 UTC (permalink / raw)
To: Maxime Peim; +Cc: dev
In-Reply-To: <20260422075414.2528455-1-maxime.peim@gmail.com>
On Wed, 22 Apr 2026 at 09:54, Maxime Peim <maxime.peim@gmail.com> wrote:
>
> Threads registered via rte_thread_register() are assigned a valid
> lcore_id by eal_lcore_non_eal_allocate(), but their core_index in
> lcore_config is left at -1. This value was set during rte_eal_cpu_init()
> for lcores with ROLE_OFF (undetected CPUs) and is never updated when the
> lcore is later allocated to a non-EAL thread.
>
> As a result, rte_lcore_index() returns -1 for registered non-EAL
> threads. Libraries that use rte_lcore_index() to select per-lcore
> caches fall back to a shared global path when it returns -1, causing
> severe contention under concurrent access from multiple registered
> threads.
>
> A concrete example is the mlx5 indexed memory pool (mlx5_ipool), which
> uses rte_lcore_index() in mlx5_ipool_malloc_cache() to select a per-core
> cache slot. When core_index is -1, all registered threads are funneled
> into a single shared slot protected by a spinlock. In testing with VPP
> (which registers worker threads via rte_thread_register()), this caused
> async flow rule insertion throughput to drop from ~6.4M rules/sec to
> ~1.2M rules/sec with 4 workers -- a 5x regression attributable entirely
> to spinlock contention in the ipool allocator.
>
> Fix by setting core_index to the next sequential index (cfg->lcore_count)
> in eal_lcore_non_eal_allocate() before incrementing the count. Also reset
> core_index back to -1 on the error rollback path and in
> eal_lcore_non_eal_release() for correctness.
>
> Fixes: 5c307ba2a5b1 ("eal: register non-EAL threads as lcores")
Cc: stable@dpdk.org
> Signed-off-by: Maxime Peim <maxime.peim@gmail.com>
Acked-by: David Marchand <david.marchand@redhat.com>
Applied, thanks.
--
David Marchand
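The fix described in the quoted commit message can be illustrated with a small standalone model. This is a sketch only: the `model_*` names, the four-slot table and the empty starting state are hypothetical simplifications, not the EAL data structures, and the real lcore count also includes EAL lcores.

```c
#include <assert.h>

#define MODEL_MAX_LCORE 4 /* hypothetical table size for the sketch */

enum model_role { MODEL_ROLE_OFF, MODEL_ROLE_NON_EAL };

struct model_lcore {
	enum model_role role;
	int core_index; /* -1 means no index assigned */
};

struct model_config {
	unsigned int lcore_count;
	struct model_lcore lcore[MODEL_MAX_LCORE];
};

static void
model_init(struct model_config *cfg)
{
	cfg->lcore_count = 0;
	for (unsigned int i = 0; i < MODEL_MAX_LCORE; i++) {
		cfg->lcore[i].role = MODEL_ROLE_OFF;
		cfg->lcore[i].core_index = -1;
	}
}

/* Returns the allocated lcore id, or -1 if no slot is free. */
static int
model_non_eal_allocate(struct model_config *cfg)
{
	for (unsigned int i = 0; i < MODEL_MAX_LCORE; i++) {
		if (cfg->lcore[i].role != MODEL_ROLE_OFF)
			continue;
		cfg->lcore[i].role = MODEL_ROLE_NON_EAL;
		/* The fix: take the next sequential index before counting. */
		cfg->lcore[i].core_index = (int)cfg->lcore_count;
		cfg->lcore_count++;
		return (int)i;
	}
	return -1;
}

/* Release resets core_index to -1, as the commit message describes. */
static void
model_non_eal_release(struct model_config *cfg, int lcore_id)
{
	if (lcore_id < 0 || lcore_id >= MODEL_MAX_LCORE)
		return;
	if (cfg->lcore[lcore_id].role != MODEL_ROLE_NON_EAL)
		return;
	cfg->lcore[lcore_id].role = MODEL_ROLE_OFF;
	cfg->lcore[lcore_id].core_index = -1;
	cfg->lcore_count--;
}

/* Counterpart of rte_lcore_index() in this model. */
static int
model_lcore_index(const struct model_config *cfg, int lcore_id)
{
	if (lcore_id < 0 || lcore_id >= MODEL_MAX_LCORE)
		return -1;
	return cfg->lcore[lcore_id].core_index;
}
```

In this model each registered thread gets a distinct non-negative index, so a per-lcore cache keyed on it no longer collapses onto the shared fallback slot; after release the index reads -1 again.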
^ permalink raw reply
* Re: [PATCH] dma/cnxk: fix crash on secondary process cleanup
From: Jerin Jacob @ 2026-06-08 16:09 UTC (permalink / raw)
To: pbhagavatula
Cc: jerinj, Vamsi Attunuru, Anatoly Burakov, Radha Mohan Chintakuntla,
dev, stable
In-Reply-To: <20260605081620.97056-1-pbhagavatula@marvell.com>
On Fri, Jun 5, 2026 at 2:11 PM <pbhagavatula@marvell.com> wrote:
>
> From: Pavan Nikhilesh <pbhagavatula@marvell.com>
>
> cnxk_dmadev_probe() ran in secondary processes too, overwriting the
> shared rdpi->pci_dev with a process-local pointer and marking the
> device ready. With buses now cleaned up on shutdown, the primary's
> roc_dpi_dev_fini() dereferences that stale pointer and crashes.
>
> Skip HW init in secondary processes: attach to the shared device data
> and return, leaving rdpi and the device state untouched.
>
> Fixes: 53f6d7328bf4 ("dma/cnxk: create and initialize device on PCI probing")
> Cc: stable@dpdk.org
>
> Signed-off-by: Pavan Nikhilesh <pbhagavatula@marvell.com>
Applied to dpdk-next-net-mrvl/for-main. Thanks
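The probe behaviour described in the quoted commit message can be sketched with a small model. Everything here is hypothetical: `model_probe()`, `struct model_shared_dev` and the process-type enum stand in for `cnxk_dmadev_probe()`, the shared `rdpi` state and the EAL process type, and the hardware initialisation is reduced to two assignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_proc_type { MODEL_PROC_PRIMARY, MODEL_PROC_SECONDARY };

/* Stand-in for device state that lives in memory shared by all processes. */
struct model_shared_dev {
	const void *pci_dev; /* only meaningful in the primary's address space */
	bool ready;
};

/* Secondary processes attach and return without touching shared state;
 * only the primary records its PCI device pointer and marks the device
 * ready. Always returns 0 in this simplified model. */
static int
model_probe(enum model_proc_type type, struct model_shared_dev *shared,
	    const void *local_pci_dev)
{
	if (type != MODEL_PROC_PRIMARY)
		return 0; /* attach only, no hardware init */

	shared->pci_dev = local_pci_dev;
	shared->ready = true;
	return 0;
}
```

Because the secondary path leaves `pci_dev` alone, the pointer the primary dereferences during cleanup is still the one it stored itself.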
^ permalink raw reply