* RE: [PATCH v7] net/idpf: update for new mempool cache algorithm
From: Morten Brørup @ 2026-06-10 11:31 UTC (permalink / raw)
To: dev
In-Reply-To: <20260601183621.252920-1-mb@smartsharesystems.com>
Recheck-request: iol-unit-arm64-testing
Unrelated CI failure.
CI log:
==== 20 line log output for Ubuntu 20.04 (lpm_autotest): ====
[7/407] Compiling C object lib/librte_log.a.p/log_log_journal.c.o
[8/407] Linking static target lib/librte_log.a
[9/407] Generating rte_argparse_map with a custom command
[10/407] Compiling C object lib/librte_kvargs.a.p/kvargs_rte_kvargs.c.o
[11/407] Linking static target lib/librte_kvargs.a
[12/407] Compiling C object lib/librte_argparse.a.p/argparse_rte_argparse.c.o
[13/407] Generating kvargs.sym_chk with a custom command (wrapped by meson to capture output)
[14/407] Linking static target lib/librte_argparse.a
[15/407] Generating log.sym_chk with a custom command (wrapped by meson to capture output)
[16/407] Linking target lib/librte_log.so.26.2
[17/407] Compiling C object lib/librte_telemetry.a.p/telemetry_telemetry_data.c.o
[18/407] Compiling C object lib/librte_telemetry.a.p/telemetry_telemetry.c.o
[19/407] Generating rte_telemetry_map with a custom command
FAILED: lib/telemetry_exports.map
/usr/bin/python3 ../buildtools/gen-version-map.py --linker gnu --abi-version ../ABI_VERSION --output lib/telemetry_exports.map --source ../lib/telemetry/telemetry.c ../lib/telemetry/telemetry_data.c ../lib/telemetry/telemetry_legacy.c
Segmentation fault (core dumped)
* [PATCH v2 0/2] ethdev: fix out-of-bounds writes in rte_flow_conv()
From: James Raphael Tiovalen @ 2026-06-10 11:33 UTC (permalink / raw)
To: dev
Cc: orika, thomas, andrew.rybchenko, stephen, stable,
James Raphael Tiovalen
rte_flow_conv() is documented to truncate output to the caller-supplied
buffer size, but two paths handling variable-length trailing data
ignored that contract and copied the full payload whenever the
destination pointer was non-NULL. A caller passing a buffer just large
enough for the fixed-size header had adjacent memory clobbered:
- GENEVE_OPT: up to option_len * 4 bytes
- FLEX: up to 4 GiB, since src->length is a uint32_t and the API places
no bounds on it
Patch 1 aligns the GENEVE_OPT guard with the sibling RAW branch, which
already gates its copy on the remaining buffer size.
Patch 2 plumbs the remaining buffer size into the flex-item desc_fn
callback (which previously took no size argument at all) and gates the
inner rte_memcpy() on it.
v2 fixes the merge conflict between patch 1 and the main branch.
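The truncation contract described above can be illustrated with a minimal, self-contained sketch. It is not the rte_flow code: `struct demo_item` and `demo_conv()` are hypothetical names, and, as a simplification, the header is copied only when it fits completely. The point is the guard `size >= off + tmp`, which lets the trailing payload be written only when the header and the payload both fit in the caller's buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical item: fixed-size header that points at a variable payload. */
struct demo_item {
	uint8_t option_len;  /* payload length in 4-byte words */
	const uint8_t *data; /* payload, may be NULL */
};

/*
 * Copy src into buf without writing more than size bytes.
 * Returns the number of bytes a complete copy would need, so that a
 * caller can query the size first and retry with a larger buffer.
 */
static size_t
demo_conv(void *buf, size_t size, const struct demo_item *src)
{
	assert(src != NULL);

	const size_t off = sizeof(*src);                 /* header size */
	const size_t tmp = (size_t)src->option_len << 2; /* payload size */
	struct demo_item hdr = *src;

	hdr.data = NULL;
	if (buf == NULL || size < off)
		return off + tmp;
	/* Payload is copied only if header AND payload fit in the buffer. */
	if (tmp > 0 && src->data != NULL && size >= off + tmp)
		hdr.data = memcpy((uint8_t *)buf + off, src->data, tmp);
	memcpy(buf, &hdr, off);
	return off + tmp;
}
```

With a buffer of exactly `sizeof(struct demo_item)` bytes, the sketch writes the header, leaves the payload pointer NULL, and touches nothing past the header.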
James Raphael Tiovalen (2):
ethdev: fix out-of-bounds write in GENEVE option conversion
ethdev: fix out-of-bounds write in flex item conversion
lib/ethdev/rte_flow.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
--
2.43.0
* [PATCH v2 1/2] ethdev: fix out-of-bounds write in GENEVE option conversion
From: James Raphael Tiovalen @ 2026-06-10 11:33 UTC (permalink / raw)
To: dev
Cc: orika, thomas, andrew.rybchenko, stephen, stable,
James Raphael Tiovalen
In-Reply-To: <20260610113334.277895-1-jamestiotio@gmail.com>
rte_flow_conv_item_spec() is documented to truncate output to the
caller-supplied buffer size. For RTE_FLOW_ITEM_TYPE_GENEVE_OPT, the
deep-copy of the variable-length option data was gated on `size > 0`
instead of `size >= off + tmp`, the form used by the sibling RAW
branch. A caller passing a buffer just large enough for the header
struct had adjacent memory clobbered by up to `option_len * 4` bytes of
option payload.
Align the GENEVE_OPT guard with the RAW one.
Fixes: 841a0445442d ("ethdev: fix GENEVE option item conversion")
Cc: stable@dpdk.org
Signed-off-by: James Raphael Tiovalen <jamestiotio@gmail.com>
---
lib/ethdev/rte_flow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/ethdev/rte_flow.c b/lib/ethdev/rte_flow.c
index ec0fe08355..e534f2295b 100644
--- a/lib/ethdev/rte_flow.c
+++ b/lib/ethdev/rte_flow.c
@@ -701,7 +701,7 @@ rte_flow_conv_item_spec(void *buf, const size_t size,
src.geneve_opt = data;
dst.geneve_opt = buf;
tmp = spec.geneve_opt ? (spec.geneve_opt->option_len << 2) : 0;
- if (size > 0 && tmp > 0 && src.geneve_opt->data) {
+ if (size >= off + tmp && tmp > 0 && src.geneve_opt->data) {
deep_src = (void *)((uintptr_t)(dst.geneve_opt + 1));
dst.geneve_opt->data = rte_memcpy(deep_src,
src.geneve_opt->data,
--
2.43.0
* [PATCH v2 2/2] ethdev: fix out-of-bounds write in flex item conversion
From: James Raphael Tiovalen @ 2026-06-10 11:33 UTC (permalink / raw)
To: dev
Cc: orika, thomas, andrew.rybchenko, stephen, stable,
James Raphael Tiovalen
In-Reply-To: <20260610113334.277895-1-jamestiotio@gmail.com>
rte_flow_item_flex_conv() is dispatched from rte_flow_conv_copy() to
deep-copy the variable-length pattern that follows a flex item header.
The function took no size argument at all, so the trailing rte_memcpy()
of `src->length` bytes was gated only on `buf != NULL`, violating the
documented contract that output is truncated to the caller-supplied
buffer size. A caller passing a buffer just large enough for the header
struct had adjacent memory clobbered by up to 4 GiB of pattern data,
since `src->length` is uint32_t and unbounded.
Propagate the remaining buffer size `size - sz` from
rte_flow_conv_copy() into the desc_fn callback and gate the inner
memcpy on it.
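As a rough illustration of the approach (not the actual rte_flow.c code), the sketch below passes the space left after the fixed-size header into a conversion callback, which then refuses to copy a pattern that does not fit. All `demo_*` names and the `struct demo_flex` layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flex-like item: header followed by a trailing pattern. */
struct demo_flex {
	uint32_t length;        /* pattern length in bytes */
	const uint8_t *pattern; /* pattern data */
};

/* Callback type that receives the space left after the header. */
typedef size_t (*demo_desc_fn)(void *dst, const void *src, size_t size);

static size_t
demo_flex_conv(void *dst, const void *src, size_t size)
{
	const struct demo_flex *flex = src;

	/* Copy the pattern only when the remaining space can hold it. */
	if (dst != NULL && flex->length > 0 && size >= flex->length) {
		const uint8_t *tail = memcpy((uint8_t *)dst + sizeof(*flex),
					     flex->pattern, flex->length);
		/* Point the copied header at the copied pattern. */
		memcpy((uint8_t *)dst + offsetof(struct demo_flex, pattern),
		       &tail, sizeof(tail));
	}
	return flex->length;
}

static size_t
demo_conv_copy(void *buf, const void *data, size_t size, size_t sz,
	       demo_desc_fn desc_fn)
{
	assert(data != NULL);
	if (buf != NULL)
		memcpy(buf, data, size > sz ? sz : size); /* truncated header */
	if (desc_fn != NULL)
		sz += desc_fn(size > 0 ? buf : NULL, data,
			      size > sz ? size - sz : 0); /* remaining space */
	return sz;
}
```

The return value still reports the full size needed, so a caller can detect truncation by comparing it with the buffer size it supplied.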
Fixes: dc4d860e8a89 ("ethdev: introduce configurable flexible item")
Cc: stable@dpdk.org
Signed-off-by: James Raphael Tiovalen <jamestiotio@gmail.com>
---
lib/ethdev/rte_flow.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/lib/ethdev/rte_flow.c b/lib/ethdev/rte_flow.c
index e534f2295b..60c9a3d06f 100644
--- a/lib/ethdev/rte_flow.c
+++ b/lib/ethdev/rte_flow.c
@@ -36,7 +36,7 @@ uint64_t rte_flow_dynf_metadata_mask;
struct rte_flow_desc_data {
const char *name;
size_t size;
- size_t (*desc_fn)(void *dst, const void *src);
+ size_t (*desc_fn)(void *dst, const void *src, size_t size);
};
/**
@@ -68,16 +68,17 @@ rte_flow_conv_copy(void *buf, const void *data, const size_t size,
if (buf != NULL)
rte_memcpy(buf, data, (size > sz ? sz : size));
if (rte_type && desc[type].desc_fn)
- sz += desc[type].desc_fn(size > 0 ? buf : NULL, data);
+ sz += desc[type].desc_fn(size > 0 ? buf : NULL, data,
+ size > sz ? size - sz : 0);
return sz;
}
static size_t
-rte_flow_item_flex_conv(void *buf, const void *data)
+rte_flow_item_flex_conv(void *buf, const void *data, size_t size)
{
struct rte_flow_item_flex *dst = buf;
const struct rte_flow_item_flex *src = data;
- if (buf) {
+ if (buf && size >= src->length) {
dst->pattern = rte_memcpy
((void *)((uintptr_t)(dst + 1)), src->pattern,
src->length);
--
2.43.0
* Re: [PATCH v7] net/idpf: update for new mempool cache algorithm
From: Thomas Monjalon @ 2026-06-10 12:17 UTC (permalink / raw)
To: Morten Brørup, Bruce Richardson; +Cc: Jingjing Wu, Praveen Shetty, dev
In-Reply-To: <ailLDte3Xo8jsUWf@bricha3-mobl1.ger.corp.intel.com>
10/06/2026 13:31, Bruce Richardson:
> On Wed, Jun 10, 2026 at 01:21:38PM +0200, Morten Brørup wrote:
> > Intel idpf maintainers,
> >
> > PING for review.
> >
> > The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
> >
> > [1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
> >
> >
> > Venlig hilsen / Kind regards,
> > -Morten Brørup
> >
> Yep. I was waiting to see what happened to the mempool patch before
> considering this for next-net-intel.
I've merged it directly in main with dpaa mempool patches.
* Re: [PATCH v3] cmdline: prevent out-of-bounds read in completion buffer
From: Thomas Monjalon @ 2026-06-10 12:28 UTC (permalink / raw)
To: Daniil Iskhakov; +Cc: stable, dev, rrv, sdl.dpdk
In-Reply-To: <20260430170111.1557768-1-dish@amicon.ru>
30/04/2026 19:01, Daniil Iskhakov:
> tmp_buf is populated by the completion callback and is not guaranteed
> to be NUL-terminated.
>
> The code already accounts for this when computing tmp_size with
> strnlen(tmp_buf, sizeof(tmp_buf)). However, another loop in the same
> path still walks tmp_buf until a NUL byte is found, without checking
> the buffer limit.
>
> If the callback writes a full-sized non-NUL-terminated string, the loop
> may read past the end of tmp_buf.
>
> Fix this by computing a bounded length for each completion choice before
> printing it.
>
> Found by Linux Verification Center (linuxtesting.org) with SVACE.
>
> Fixes: af75078fece3 ("first public release")
> Cc: stable@dpdk.org
>
> Signed-off-by: Daniil Iskhakov <dish@amicon.ru>
Applied, thanks.
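For readers unfamiliar with the issue in the quoted commit message, here is a minimal sketch of computing a bounded length for a buffer that may not be NUL-terminated. It is not the cmdline patch itself; `demo_bounded_len()` is a hypothetical helper, and `memchr()` is used as a portable stand-in for `strnlen()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Length of the string in buf, never reading more than cap bytes.
 * If no NUL byte is found within cap bytes, cap is returned.
 */
static size_t
demo_bounded_len(const char *buf, size_t cap)
{
	assert(buf != NULL);

	const char *nul = memchr(buf, '\0', cap);

	return nul != NULL ? (size_t)(nul - buf) : cap;
}
```

A caller could then print at most that many bytes, for example with `printf("%.*s", (int)len, buf)`, instead of walking the buffer until a NUL byte appears.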
* Re: [PATCH v7] net/idpf: update for new mempool cache algorithm
From: Bruce Richardson @ 2026-06-10 12:34 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: Morten Brørup, Jingjing Wu, Praveen Shetty, dev
In-Reply-To: <b0-eVG8rSLyIiCcsmRSskA@monjalon.net>
On Wed, Jun 10, 2026 at 02:17:43PM +0200, Thomas Monjalon wrote:
> 10/06/2026 13:31, Bruce Richardson:
> > On Wed, Jun 10, 2026 at 01:21:38PM +0200, Morten Brørup wrote:
> > > Intel idpf maintainers,
> > >
> > > PING for review.
> > >
> > > The mempool library has been improved [1], so the idpf PMD - which bypasses the mempool API - must be updated to match the library implementation. This patch does that.
> > >
> > > [1]: https://git.dpdk.org/dpdk/commit/?id=f5e1310f16e0909e7e7f71807123644c63b23cba
> > >
> > >
> > > Venlig hilsen / Kind regards,
> > > -Morten Brørup
> > >
> > Yep. I was waiting to see what happened to the mempool patch before
> > considering this for next-net-intel.
>
> I've merged it directly in main with dpaa mempool patches.
>
Ok, thanks.
/Bruce
* RE: [RFC v4 0/3] lib/fastmem: fast small-object allocator
From: Konstantin Ananyev @ 2026-06-10 12:35 UTC (permalink / raw)
To: Mattias Rönnblom, dev@dpdk.org
Cc: Morten Brørup, Mattias Rönnblom, Yogaraj Baskaravel,
Stephen Hemminger, Bruce Richardson
In-Reply-To: <20260530092634.46218-1-hofors@lysator.liu.se>
Hi Mattias,
> This RFC introduces fastmem, a general-purpose small-object allocator
> for DPDK. It is intended to replace per-type mempools with a single
> allocator that handles arbitrary sizes, grows on demand, and matches
> mempool-level performance on the hot path.
As stated before, I submitted an RFC for the one we use internally:
https://patchwork.dpdk.org/project/dpdk/patch/20260610103918.96857-1-konstantin.ananyev@huawei.com/
Many things and ideas are similar, some are not.
Below I tried to summarize the main differences (as I see them).
I do understand that our use cases and requirements are different,
but maybe we can find a blend that will fit all of us.
Another two things that are probably necessary to move forward:
- a unified set of stress/performance test cases that we can run
against all three: mempool/fastmem/memtank.
- some sort of guinea pig: a DPDK sub-component where this new lib
can be applied. We could try the mbuf straight away, but that's probably
quite an ambitious choice for the first integration. Again, this is just one
of the possible usage scenarios.
Let me know your thoughts on this.
Thanks
Konstantin
> Motivation
> ----------
>
> DPDK applications commonly maintain many mempools — one per object
> type (connections, sessions, timers, work items). Each must be sized
> up front, wastes memory when over-provisioned, and cannot serve
> objects of a different size. Fastmem eliminates this by accepting
> arbitrary sizes at runtime, backed by a slab allocator that
> repurposes memory across size classes as demand shifts.
I agree about the first one - it is a big problem that you have to over-provision
everything with the mempool these days.
About forcing the user to explicitly create multiple pools - for me it is not such a big problem;
after all, in most cases the user knows upfront the size of the objects he needs to alloc/free.
AFAIK the majority of SLAB-based allocators these days support both flavors:
the user can create/maintain his own SLAB for some specific object types, or use a generic
alloc/free which is backed by a bunch of SLABs underneath, each serving a specific size.
Maybe we can do the same and support both too.
>
> Design
> ------
>
> Three-layer architecture:
>
> 1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
> reserved lazily (or pre-reserved for deterministic latency).
> 2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
In many cases, users don't need DMA-capable memory for such objects,
so a simple rte_malloc or even libc malloc might be enough.
I understand the intention - it is probably the fastest way to do things,
but I think it is way too constrained.
Maybe the best approach is to do what memtank does -
allow the user to define his own alloc/free/init callbacks; then the fastmem
approach becomes just one of the possible cases.
> The alignment enables O(1) slab lookup from any object pointer
> via bitmask — no radix tree or index structure. Slabs move
> freely between 18 power-of-2 size classes (8 B to 1 MiB).
That is a cool idea, and I thought about doing the same: limit the possible
SLAB sizes to power-of-two values.
But then I realized that we still need to store some extra metadata inside the object
for stats and sanity-checking. So an extra 8B for a SLAB
pointer doesn't make much difference.
But again - I think we can support both and make it configurable at creation time.
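The alignment trick from the quoted RFC text can be sketched as follows. This is only an illustration under stated assumptions, not fastmem code: the `demo_*` names are hypothetical, and the sketch assumes slabs are exactly 2 MiB and 2 MiB-aligned, as the RFC describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slab size taken from the RFC text: 2 MiB, 2 MiB-aligned. */
#define DEMO_SLAB_SIZE ((uintptr_t)2 << 20)

/* Masking only works if the slab size is a power of two. */
static_assert((DEMO_SLAB_SIZE & (DEMO_SLAB_SIZE - 1)) == 0,
	      "slab size must be a power of two");

/* Address of the slab that contains obj: clear the low 21 bits. */
static inline uintptr_t
demo_slab_base(const void *obj)
{
	return (uintptr_t)obj & ~(DEMO_SLAB_SIZE - 1);
}

/* Byte offset of obj inside its slab: keep only the low 21 bits. */
static inline size_t
demo_slab_offset(const void *obj)
{
	return (size_t)((uintptr_t)obj & (DEMO_SLAB_SIZE - 1));
}
```

Because the lookup is a single AND, no per-object back pointer is needed to find the slab, which is the trade-off being discussed here against storing an 8-byte SLAB pointer in each object.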
> 3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
> path). Cache misses trigger bulk transfers to/from the shared
> bin under a spinlock.
memtank lacks per-lcore caches right now, mostly due to lack of time
to implement them. It is definitely a good feature - the way to go.
> Key properties:
>
> - Zero per-object metadata in the production build.
> - NUMA-aware, with per-socket bins and free-slab pools.
> - DMA-usable memory with O(1) virt-to-IOVA translation.
> - Bulk alloc/free with all-or-nothing semantics.
Personally, I don't find it very convenient.
For most cases we care about, we use best-effort semantics.
Again, we can probably support both, same as the rte_ring API.
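The difference between the two semantics can be shown with a toy LIFO pool. This is a sketch only; `struct demo_pool`, `demo_get_bulk()` and `demo_get_burst()` are hypothetical and belong to neither fastmem nor memtank. The naming follows the bulk (all-or-nothing) versus burst (best-effort) split known from rte_ring.

```c
#include <assert.h>
#include <stddef.h>

/* Toy pool: a LIFO stack of free object pointers. */
struct demo_pool {
	void **objs;  /* storage for free objects */
	size_t count; /* number of free objects currently held */
};

/* All-or-nothing: returns n, or 0 without touching the pool. */
static size_t
demo_get_bulk(struct demo_pool *p, void **out, size_t n)
{
	assert(p != NULL && out != NULL);
	if (n > p->count)
		return 0;
	for (size_t i = 0; i < n; i++)
		out[i] = p->objs[--p->count];
	return n;
}

/* Best-effort: returns as many objects as are available, up to n. */
static size_t
demo_get_burst(struct demo_pool *p, void **out, size_t n)
{
	assert(p != NULL && out != NULL);
	if (n > p->count)
		n = p->count;
	return demo_get_bulk(p, out, n);
}
```

Offering both entry points over the same underlying pool costs little, since the best-effort variant only clamps the request before calling the strict one.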
> - Backing memory never returned during lifetime (slabs recycled).
For our case it is important to have the ability to return memory back to the system.
The memtank lib supports it (though of course some fragmentation is possible).
Again, it is much easier with separate pools.
What I really like about fastmem is that one SLAB can re-use memory from a different
one; that seems useful and might mitigate memory footprint growth to some extent.
Again, I suppose both flavors can coexist:
individual pools can grow/shrink, while fastmem (a bunch of predefined pools) will not.
> - Non-EAL threads supported (bypass cache, take bin lock).
> - Secondary process support (lazy attach, no per-lcore caches).
>
* Re: [PATCH v6 0/7] Cuckoo hash optimization for small key sizes
From: Thomas Monjalon @ 2026-06-10 13:46 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev, vladimir.medvedkin
In-Reply-To: <20260329232409.205940-1-stephen@networkplumber.org>
> Stephen Hemminger (7):
> hash: move table of hash compare functions out of header
> hash: use static_assert
> hash: remove spurious warnings in CRC32 init
> hash: simplify key comparison across architectures
> hash: add support for common small key sizes
> app/test: convert hash test to use test suite runner
> test/hash: add test for key compare functions
That's a pity we didn't have more reviews.
Applied, thanks.
* Re: [PATCH v1 00/20] net/sxe2: added Linkdata sxe2 ethernet driver
From: Thomas Monjalon @ 2026-06-10 14:02 UTC (permalink / raw)
To: Jie Liu; +Cc: stephen, dev
In-Reply-To: <20260610013936.3634968-1-liujie5@linkdatatechnology.com>
It should be v15.
10/06/2026 03:39, liujie5@linkdatatechnology.com:
> From: Jie Liu <liujie5@linkdatatechnology.com>
>
> This patch set implements core functionality for the SXE2 PMD,
> including basic driver framework, data path setup, and advanced
> offload features (VLAN, RSS,TM, PTP etc.).
>
> Jie Liu (20):
> net/sxe2: support AVX512 vectorized path for Rx and Tx
> net/sxe2: add AVX2 vector data path for Rx and Tx
> drivers: add supported packet types get callback
> net/sxe2: support L2 filtering and MAC config
> drivers: support RSS feature
> net/sxe2: support TM hierarchy and shaping
> net/sxe2: support IPsec inline protocol offload
> net/sxe2: support statistics and multi-process
> drivers: interrupt handling
> net/sxe2: add NEON vec Rx/Tx burst functions
> drivers: add support for VF representors
> net/sxe2: add support for custom UDP tunnel ports
> net/sxe2: support firmware version reading
> net/sxe2: implement get monitor address
> common/sxe2: add shared SFP module definitions
> net/sxe2: support SFP module info and EEPROM access
> net/sxe2: implement private dump info
> net/sxe2: add mbuf validation in Tx debug mode
> drivers: add testpmd commands for private features
There is a good chance that this commit is too controversial.
Please remove it from this series so we can discuss it separately.
It should also be split: 1 commit per feature.
> net/sxe2: update sxe2 feature matrix docs
Please update the documentation atomically in previous commits.
When you support a new feature, the commit should include the related doc changes.
Thank you
* [PATCH v3] net/iavf: fix to consolidate link change event handling
From: Anurag Mandal @ 2026-06-10 14:12 UTC (permalink / raw)
To: dev
Cc: bruce.richardson, vladimir.medvedkin, ciara.loftus, Anurag Mandal,
stable
In-Reply-To: <20260609001022.357509-1-anurag.mandal@intel.com>
Handled link-change events through a common static function that
reads the correct advanced & legacy link fields properly and
updates no-poll/watchdog/LSC state consistently.
Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without interrupt")
Fixes: 48de41ca11f0 ("net/avf: enable link status update")
Cc: stable@dpdk.org
Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
---
V3: Addressed Ciara Loftus's review comments
- removed two unnecessary NULL checks
V2: Addressed Ciara Loftus's review comments
- removed unnecessary NULL checks which were overly defensive checks
drivers/net/intel/iavf/iavf_vchnl.c | 115 +++++++++++++++-------------
1 file changed, 63 insertions(+), 52 deletions(-)
diff --git a/drivers/net/intel/iavf/iavf_vchnl.c b/drivers/net/intel/iavf/iavf_vchnl.c
index 0643a835d5..36b7ee9526 100644
--- a/drivers/net/intel/iavf/iavf_vchnl.c
+++ b/drivers/net/intel/iavf/iavf_vchnl.c
@@ -216,6 +216,67 @@ iavf_convert_link_speed(enum virtchnl_link_speed virt_link_speed)
return speed;
}
+/*
+ * iavf_handle_link_change_event: common handler for VIRTCHNL link change events
+ *
+ * @dev: pointer to rte_eth_dev for this VF
+ * @vpe: pointer to the virtchnl_pf_event payload received from the PF
+ *
+ * Handle PF link-change event: decode adv/legacy link info, update VF
+ * link state, sync no-poll/watchdog behavior & notify app via LSC event.
+ */
+static void
+iavf_handle_link_change_event(struct rte_eth_dev *dev,
+ struct virtchnl_pf_event *vpe)
+{
+ struct iavf_adapter *adapter =
+ IAVF_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
+ struct iavf_info *vf = &adapter->vf;
+ bool adv_link_speed;
+
+ adv_link_speed = (vf->vf_res != NULL) &&
+ ((vf->vf_res->vf_cap_flags & VIRTCHNL_VF_CAP_ADV_LINK_SPEED) != 0);
+
+ if (adv_link_speed) {
+ vf->link_up = vpe->event_data.link_event_adv.link_status;
+ vf->link_speed = vpe->event_data.link_event_adv.link_speed;
+ } else {
+ enum virtchnl_link_speed speed;
+
+ vf->link_up = vpe->event_data.link_event.link_status;
+ speed = vpe->event_data.link_event.link_speed;
+ vf->link_speed = iavf_convert_link_speed(speed);
+ }
+
+ iavf_dev_link_update(dev, 0);
+
+ /*
+ * Update watchdog/no_poll state BEFORE notifying the application via
+ * the LSC event. Otherwise the application's link-up callback could
+ * race with stale (link-down) no_poll/watchdog state and either
+ * continue to drop traffic or trigger a spurious reset detection.
+ *
+ * Keeping the watchdog enabled whenever the link cannot be trusted
+ * (link is down or a VF reset is in progress); the watchdog drives
+ * auto-reset recovery, so it must remain armed in those cases.
+ */
+ if (vf->link_up && !vf->vf_reset)
+ iavf_dev_watchdog_disable(adapter);
+ else
+ iavf_dev_watchdog_enable(adapter);
+
+ if (adapter->devargs.no_poll_on_link_down) {
+ iavf_set_no_poll(adapter, true);
+ PMD_DRV_LOG(DEBUG, "VF no poll turned %s",
+ adapter->no_poll ? "on" : "off");
+ }
+
+ iavf_dev_event_post(dev, RTE_ETH_EVENT_INTR_LSC, NULL, 0);
+
+ PMD_DRV_LOG(INFO, "Link status update:%s",
+ vf->link_up ? "up" : "down");
+}
+
/* Read data in admin queue to get msg from pf driver */
static enum iavf_aq_result
iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
@@ -253,34 +314,7 @@ iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
result = IAVF_MSG_SYS;
switch (vpe->event) {
case VIRTCHNL_EVENT_LINK_CHANGE:
- vf->link_up =
- vpe->event_data.link_event.link_status;
- if (vf->vf_res != NULL &&
- vf->vf_res->vf_cap_flags & VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
- vf->link_speed =
- vpe->event_data.link_event_adv.link_speed;
- } else {
- enum virtchnl_link_speed speed;
- speed = vpe->event_data.link_event.link_speed;
- vf->link_speed = iavf_convert_link_speed(speed);
- }
- iavf_dev_link_update(vf->eth_dev, 0);
- iavf_dev_event_post(vf->eth_dev, RTE_ETH_EVENT_INTR_LSC, NULL, 0);
- if (vf->link_up && !vf->vf_reset) {
- iavf_dev_watchdog_disable(adapter);
- } else {
- if (!vf->link_up)
- iavf_dev_watchdog_enable(adapter);
- }
- if (adapter->devargs.no_poll_on_link_down) {
- iavf_set_no_poll(adapter, true);
- if (adapter->no_poll)
- PMD_DRV_LOG(DEBUG, "VF no poll turned on");
- else
- PMD_DRV_LOG(DEBUG, "VF no poll turned off");
- }
- PMD_DRV_LOG(INFO, "Link status update:%s",
- vf->link_up ? "up" : "down");
+ iavf_handle_link_change_event(vf->eth_dev, vpe);
break;
case VIRTCHNL_EVENT_RESET_IMPENDING:
vf->vf_reset = true;
@@ -525,30 +559,7 @@ iavf_handle_pf_event_msg(struct rte_eth_dev *dev, uint8_t *msg,
break;
case VIRTCHNL_EVENT_LINK_CHANGE:
PMD_DRV_LOG(DEBUG, "VIRTCHNL_EVENT_LINK_CHANGE event");
- vf->link_up = pf_msg->event_data.link_event.link_status;
- if (vf->vf_res->vf_cap_flags & VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
- vf->link_speed =
- pf_msg->event_data.link_event_adv.link_speed;
- } else {
- enum virtchnl_link_speed speed;
- speed = pf_msg->event_data.link_event.link_speed;
- vf->link_speed = iavf_convert_link_speed(speed);
- }
- iavf_dev_link_update(dev, 0);
- if (vf->link_up && !vf->vf_reset) {
- iavf_dev_watchdog_disable(adapter);
- } else {
- if (!vf->link_up)
- iavf_dev_watchdog_enable(adapter);
- }
- if (adapter->devargs.no_poll_on_link_down) {
- iavf_set_no_poll(adapter, true);
- if (adapter->no_poll)
- PMD_DRV_LOG(DEBUG, "VF no poll turned on");
- else
- PMD_DRV_LOG(DEBUG, "VF no poll turned off");
- }
- iavf_dev_event_post(dev, RTE_ETH_EVENT_INTR_LSC, NULL, 0);
+ iavf_handle_link_change_event(dev, pf_msg);
break;
case VIRTCHNL_EVENT_PF_DRIVER_CLOSE:
PMD_DRV_LOG(DEBUG, "VIRTCHNL_EVENT_PF_DRIVER_CLOSE event");
--
2.34.1
* RE: [PATCH v2] net/iavf: fix to consolidate link change event handling
From: Mandal, Anurag @ 2026-06-10 14:13 UTC (permalink / raw)
To: Loftus, Ciara, dev@dpdk.org
Cc: Richardson, Bruce, Medvedkin, Vladimir, stable@dpdk.org
In-Reply-To: <IA4PR11MB92788A795E02D08A3287C3628E1A2@IA4PR11MB9278.namprd11.prod.outlook.com>
> -----Original Message-----
> From: Loftus, Ciara <ciara.loftus@intel.com>
> Sent: 10 June 2026 15:13
> To: Mandal, Anurag <anurag.mandal@intel.com>; dev@dpdk.org
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Medvedkin, Vladimir
> <vladimir.medvedkin@intel.com>; stable@dpdk.org
> Subject: RE: [PATCH v2] net/iavf: fix to consolidate link change event handling
>
> > Subject: [PATCH v2] net/iavf: fix to consolidate link change event
> > handling
> >
>
> [snip]
>
> > +
> > /* Read data in admin queue to get msg from pf driver */ static enum
> > iavf_aq_result iavf_read_msg_from_pf(struct iavf_adapter *adapter,
> > uint16_t buf_len, @@ -249,38 +310,15 @@ iavf_read_msg_from_pf(struct
> > iavf_adapter *adapter, uint16_t buf_len,
> > if (opcode == VIRTCHNL_OP_EVENT) {
> > struct virtchnl_pf_event *vpe =
> > (struct virtchnl_pf_event *)event.msg_buf;
> > + if (vpe == NULL) {
> > + PMD_DRV_LOG(ERR, "Invalid PF event message");
> > + return IAVF_MSG_ERR;
> > + }
>
> This check can be removed.
> iavf_read_msg_from_pf is called from iavf_wait_for_msg which performs a
> NULL check on the same location (args->out_buffer) before passing it to
> iavf_read_msg_from_pf
>
> >
> > result = IAVF_MSG_SYS;
> > switch (vpe->event) {
> > case VIRTCHNL_EVENT_LINK_CHANGE:
> > - vf->link_up =
> > - vpe->event_data.link_event.link_status;
> > - if (vf->vf_res != NULL &&
> > - vf->vf_res->vf_cap_flags &
> > VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
> > - vf->link_speed =
> > - vpe-
> > >event_data.link_event_adv.link_speed;
> > - } else {
> > - enum virtchnl_link_speed speed;
> > - speed = vpe-
> > >event_data.link_event.link_speed;
> > - vf->link_speed =
> > iavf_convert_link_speed(speed);
> > - }
> > - iavf_dev_link_update(vf->eth_dev, 0);
> > - iavf_dev_event_post(vf->eth_dev,
> > RTE_ETH_EVENT_INTR_LSC, NULL, 0);
> > - if (vf->link_up && !vf->vf_reset) {
> > - iavf_dev_watchdog_disable(adapter);
> > - } else {
> > - if (!vf->link_up)
> > - iavf_dev_watchdog_enable(adapter);
> > - }
> > - if (adapter->devargs.no_poll_on_link_down) {
> > - iavf_set_no_poll(adapter, true);
> > - if (adapter->no_poll)
> > - PMD_DRV_LOG(DEBUG, "VF no poll
> > turned on");
> > - else
> > - PMD_DRV_LOG(DEBUG, "VF no poll
> > turned off");
> > - }
> > - PMD_DRV_LOG(INFO, "Link status update:%s",
> > - vf->link_up ? "up" : "down");
> > + iavf_handle_link_change_event(vf->eth_dev, vpe);
> > break;
> > case VIRTCHNL_EVENT_RESET_IMPENDING:
> > vf->vf_reset = true;
> > @@ -505,6 +543,12 @@ iavf_handle_pf_event_msg(struct rte_eth_dev *dev,
> > uint8_t *msg,
> > PMD_DRV_LOG(DEBUG, "Error event");
> > return;
> > }
> > +
> > + if (pf_msg == NULL) {
> > + PMD_DRV_LOG(ERR, "Invalid PF event message");
> > + return;
> > + }
>
> This too can be removed.
> pf_msg resolves to vf->aq_resp which is a fixed buffer allocated at driver init
> time. It cannot be NULL here.
>
> With those two changes I think the patch will be good to go.
>
> > +
> > switch (pf_msg->event) {
> > case VIRTCHNL_EVENT_RESET_IMPENDING:
> > PMD_DRV_LOG(DEBUG,
> > "VIRTCHNL_EVENT_RESET_IMPENDING event"); @@ -518,30 +562,7 @@
> > iavf_handle_pf_event_msg(struct rte_eth_dev *dev, uint8_t *msg,
> > break;
> > case VIRTCHNL_EVENT_LINK_CHANGE:
> > PMD_DRV_LOG(DEBUG, "VIRTCHNL_EVENT_LINK_CHANGE
> event");
> > - vf->link_up = pf_msg->event_data.link_event.link_status;
> > - if (vf->vf_res->vf_cap_flags &
> > VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
> > - vf->link_speed =
> > - pf_msg-
> > >event_data.link_event_adv.link_speed;
> > - } else {
> > - enum virtchnl_link_speed speed;
> > - speed = pf_msg->event_data.link_event.link_speed;
> > - vf->link_speed = iavf_convert_link_speed(speed);
> > - }
> > - iavf_dev_link_update(dev, 0);
> > - if (vf->link_up && !vf->vf_reset) {
> > - iavf_dev_watchdog_disable(adapter);
> > - } else {
> > - if (!vf->link_up)
> > - iavf_dev_watchdog_enable(adapter);
> > - }
> > - if (adapter->devargs.no_poll_on_link_down) {
> > - iavf_set_no_poll(adapter, true);
> > - if (adapter->no_poll)
> > - PMD_DRV_LOG(DEBUG, "VF no poll turned
> > on");
> > - else
> > - PMD_DRV_LOG(DEBUG, "VF no poll turned
> > off");
> > - }
> > - iavf_dev_event_post(dev, RTE_ETH_EVENT_INTR_LSC, NULL,
> > 0);
> > + iavf_handle_link_change_event(dev, pf_msg);
> > break;
> > case VIRTCHNL_EVENT_PF_DRIVER_CLOSE:
> > PMD_DRV_LOG(DEBUG,
> > "VIRTCHNL_EVENT_PF_DRIVER_CLOSE event");
> > --
> > 2.34.1
Hi Ciara,
Thank you for the review. Removed those two unwarranted NULL checks.
Sent v3. Please check.
Thanks,
Anurag
* Re: [PATCH] bus/fslmc: fix shadowed variables in queue storage macros
From: Thomas Monjalon @ 2026-06-10 14:15 UTC (permalink / raw)
To: Weijun Pan, Hemant Agrawal, Weijun Pan
Cc: dev, Sachin Saxena, Jun Yang, stable, Stephen Hemminger
In-Reply-To: <20260407070940.225de771@phoenix.local>
07/04/2026 16:09, Stephen Hemminger:
> Why are these not inline functions?
> Macros with lower-case names are a likely place for confusion like this.
Hemant, Sachin, Weijun, please could you consider this comment?
* Re: [PATCH] fib6: fix error code propagation on next hop update
From: Thomas Monjalon @ 2026-06-10 14:26 UTC (permalink / raw)
To: Vladimir Medvedkin; +Cc: dev, stable
In-Reply-To: <20260605164757.927661-1-vladimir.medvedkin@intel.com>
05/06/2026 18:47, Vladimir Medvedkin:
> When updating the next hop of an existing prefix, trie_modify() ignored
> the return value of modify_dp() and always returned 0. An out-of-range
> next hop is rejected by modify_dp() with -EINVAL but was reported to
> the caller as success. Return the actual result.
>
> Fixes: c3e12e0f0354 ("fib: add dataplane algorithm for IPv6")
> Cc: stable@dpdk.org
> Signed-off-by: Vladimir Medvedkin <vladimir.medvedkin@intel.com>
Applied, thanks.
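The bug class described in the quoted commit message, a helper's error code being dropped by its caller, can be shown with a small sketch. It is not the fib6 code: the `demo_*` functions and the next-hop limit are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_NH 0xffffu /* hypothetical largest valid next hop */

/* Rejects an out-of-range next hop and leaves the slot untouched. */
static int
demo_modify_dp(uint64_t *slot, uint64_t next_hop)
{
	if (next_hop > DEMO_MAX_NH)
		return -EINVAL;
	*slot = next_hop;
	return 0;
}

/* Propagate the helper's result instead of always returning 0. */
static int
demo_update_next_hop(uint64_t *slot, uint64_t next_hop)
{
	assert(slot != NULL);
	return demo_modify_dp(slot, next_hop);
}
```

With the result propagated, a caller that passes an invalid next hop sees `-EINVAL` rather than a false success.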
* RE: [PATCH v3] net/iavf: fix to consolidate link change event handling
From: Loftus, Ciara @ 2026-06-10 14:34 UTC (permalink / raw)
To: Mandal, Anurag, dev@dpdk.org
Cc: Richardson, Bruce, Medvedkin, Vladimir, stable@dpdk.org
In-Reply-To: <20260610141230.369232-1-anurag.mandal@intel.com>
> Subject: [PATCH v3] net/iavf: fix to consolidate link change event handling
>
> Handled link-change events through a common static function that
> reads the correct advanced & legacy link fields properly and
> updates no-poll/watchdog/LSC state consistently.
>
> Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without
> interrupt")
> Fixes: 48de41ca11f0 ("net/avf: enable link status update")
> Cc: stable@dpdk.org
>
> Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
Thanks Anurag.
Acked-by: Ciara Loftus <ciara.loftus@intel.com>
> ---
> V3: Addressed Ciara Loftus's review comments
> - removed two unnecessary NULL checks
> V2: Addressed Ciara Loftus's review comments
> - removed unnecessary NULL checks which were overly defensive checks
>
> drivers/net/intel/iavf/iavf_vchnl.c | 115 +++++++++++++++-------------
> 1 file changed, 63 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/net/intel/iavf/iavf_vchnl.c
> b/drivers/net/intel/iavf/iavf_vchnl.c
> index 0643a835d5..36b7ee9526 100644
> --- a/drivers/net/intel/iavf/iavf_vchnl.c
> +++ b/drivers/net/intel/iavf/iavf_vchnl.c
> @@ -216,6 +216,67 @@ iavf_convert_link_speed(enum virtchnl_link_speed
> virt_link_speed)
> return speed;
> }
>
> +/*
> + * iavf_handle_link_change_event: common handler for VIRTCHNL link
> change events
> + *
> + * @dev: pointer to rte_eth_dev for this VF
> + * @vpe: pointer to the virtchnl_pf_event payload received from the PF
> + *
> + * Handle PF link-change event: decode adv/legacy link info, update VF
> + * link state, sync no-poll/watchdog behavior & notify app via LSC event.
> + */
> +static void
> +iavf_handle_link_change_event(struct rte_eth_dev *dev,
> + struct virtchnl_pf_event *vpe)
> +{
> + struct iavf_adapter *adapter =
> + IAVF_DEV_PRIVATE_TO_ADAPTER(dev->data->dev_private);
> + struct iavf_info *vf = &adapter->vf;
> + bool adv_link_speed;
> +
> + adv_link_speed = (vf->vf_res != NULL) &&
> + ((vf->vf_res->vf_cap_flags &
> VIRTCHNL_VF_CAP_ADV_LINK_SPEED) != 0);
> +
> + if (adv_link_speed) {
> + vf->link_up = vpe->event_data.link_event_adv.link_status;
> + vf->link_speed = vpe->event_data.link_event_adv.link_speed;
> + } else {
> + enum virtchnl_link_speed speed;
> +
> + vf->link_up = vpe->event_data.link_event.link_status;
> + speed = vpe->event_data.link_event.link_speed;
> + vf->link_speed = iavf_convert_link_speed(speed);
> + }
> +
> + iavf_dev_link_update(dev, 0);
> +
> + /*
> + * Update watchdog/no_poll state BEFORE notifying the application
> via
> + * the LSC event. Otherwise the application's link-up callback could
> + * race with stale (link-down) no_poll/watchdog state and either
> + * continue to drop traffic or trigger a spurious reset detection.
> + *
> + * Keeping the watchdog enabled whenever the link cannot be trusted
> + * (link is down or a VF reset is in progress); the watchdog drives
> + * auto-reset recovery, so it must remain armed in those cases.
> + */
> + if (vf->link_up && !vf->vf_reset)
> + iavf_dev_watchdog_disable(adapter);
> + else
> + iavf_dev_watchdog_enable(adapter);
> +
> + if (adapter->devargs.no_poll_on_link_down) {
> + iavf_set_no_poll(adapter, true);
> + PMD_DRV_LOG(DEBUG, "VF no poll turned %s",
> + adapter->no_poll ? "on" : "off");
> + }
> +
> + iavf_dev_event_post(dev, RTE_ETH_EVENT_INTR_LSC, NULL, 0);
> +
> + PMD_DRV_LOG(INFO, "Link status update:%s",
> + vf->link_up ? "up" : "down");
> +}
> +
> /* Read data in admin queue to get msg from pf driver */
> static enum iavf_aq_result
> iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
> @@ -253,34 +314,7 @@ iavf_read_msg_from_pf(struct iavf_adapter
> *adapter, uint16_t buf_len,
> result = IAVF_MSG_SYS;
> switch (vpe->event) {
> case VIRTCHNL_EVENT_LINK_CHANGE:
> - vf->link_up =
> - vpe->event_data.link_event.link_status;
> - if (vf->vf_res != NULL &&
> - vf->vf_res->vf_cap_flags &
> VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
> - vf->link_speed =
> - vpe-
> >event_data.link_event_adv.link_speed;
> - } else {
> - enum virtchnl_link_speed speed;
> - speed = vpe-
> >event_data.link_event.link_speed;
> - vf->link_speed =
> iavf_convert_link_speed(speed);
> - }
> - iavf_dev_link_update(vf->eth_dev, 0);
> - iavf_dev_event_post(vf->eth_dev,
> RTE_ETH_EVENT_INTR_LSC, NULL, 0);
> - if (vf->link_up && !vf->vf_reset) {
> - iavf_dev_watchdog_disable(adapter);
> - } else {
> - if (!vf->link_up)
> - iavf_dev_watchdog_enable(adapter);
> - }
> - if (adapter->devargs.no_poll_on_link_down) {
> - iavf_set_no_poll(adapter, true);
> - if (adapter->no_poll)
> - PMD_DRV_LOG(DEBUG, "VF no poll
> turned on");
> - else
> - PMD_DRV_LOG(DEBUG, "VF no poll
> turned off");
> - }
> - PMD_DRV_LOG(INFO, "Link status update:%s",
> - vf->link_up ? "up" : "down");
> + iavf_handle_link_change_event(vf->eth_dev, vpe);
> break;
> case VIRTCHNL_EVENT_RESET_IMPENDING:
> vf->vf_reset = true;
> @@ -525,30 +559,7 @@ iavf_handle_pf_event_msg(struct rte_eth_dev *dev,
> uint8_t *msg,
> break;
> case VIRTCHNL_EVENT_LINK_CHANGE:
> PMD_DRV_LOG(DEBUG, "VIRTCHNL_EVENT_LINK_CHANGE
> event");
> - vf->link_up = pf_msg->event_data.link_event.link_status;
> - if (vf->vf_res->vf_cap_flags &
> VIRTCHNL_VF_CAP_ADV_LINK_SPEED) {
> - vf->link_speed =
> - pf_msg-
> >event_data.link_event_adv.link_speed;
> - } else {
> - enum virtchnl_link_speed speed;
> - speed = pf_msg->event_data.link_event.link_speed;
> - vf->link_speed = iavf_convert_link_speed(speed);
> - }
> - iavf_dev_link_update(dev, 0);
> - if (vf->link_up && !vf->vf_reset) {
> - iavf_dev_watchdog_disable(adapter);
> - } else {
> - if (!vf->link_up)
> - iavf_dev_watchdog_enable(adapter);
> - }
> - if (adapter->devargs.no_poll_on_link_down) {
> - iavf_set_no_poll(adapter, true);
> - if (adapter->no_poll)
> - PMD_DRV_LOG(DEBUG, "VF no poll turned
> on");
> - else
> - PMD_DRV_LOG(DEBUG, "VF no poll turned
> off");
> - }
> - iavf_dev_event_post(dev, RTE_ETH_EVENT_INTR_LSC, NULL,
> 0);
> + iavf_handle_link_change_event(dev, pf_msg);
> break;
> case VIRTCHNL_EVENT_PF_DRIVER_CLOSE:
> PMD_DRV_LOG(DEBUG,
> "VIRTCHNL_EVENT_PF_DRIVER_CLOSE event");
> --
> 2.34.1
^ permalink raw reply
* Re: [PATCH v4 02/11] bpf: introduce extensible load API
From: Thomas Monjalon @ 2026-06-10 14:46 UTC (permalink / raw)
To: Marat Khalili; +Cc: Konstantin Ananyev, Wathsala Vithanage, dev
In-Reply-To: <20260520124922.42445-3-marat.khalili@huawei.com>
20/05/2026 14:49, Marat Khalili:
> Introduce new BPF load parameters struct rte_bpf_prm_ex that can be
> extended without breaking backward or forward compatibility. Introduce
> new function rte_bpf_load_ex consolidating in one code path loading from
> both ELF file and raw memory image, with possibility to add more options
> in the future.
Unfortunately, compilation is failing on this patch:
lib/bpf/bpf_load.c:274:40: error: 'struct rte_bpf_jit' has no member named 'raw'
^ permalink raw reply
* Re: [PATCH v2 2/3] ring: use GCC builtin as alternative to rte_atomic32
From: Thomas Monjalon @ 2026-06-10 15:41 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev, Konstantin Ananyev, Wathsala Vithanage
In-Reply-To: <20260604163656.1226902-3-stephen@networkplumber.org>
04/06/2026 18:32, Stephen Hemminger:
> This patch replaces use of the deprecated rte_atomic32 code with
> GCC builtin atomic operations.
It compiles fine with GCC, but there is an issue with clang:
ninja: Entering directory `build-gcc-static'
ninja: no work to do.
ninja: Entering directory `build-gcc-shared'
ninja: no work to do.
ninja: Entering directory `build-clang-static'
ninja: no work to do.
ninja: Entering directory `build-clang-shared'
[1/3069] Compiling C object lib/librte_ring.a.p/ring_soring.c.o
rte_ring_gcc_pvt.h:43:2: error: address argument to atomic operation must be a pointer to integer or pointer ('volatile _Atomic(uint32_t) *' invalid)
43 | __atomic_store_n(&ht->tail, new_val, __ATOMIC_RELEASE);
| ^ ~~~~~~~~~
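For reference, a minimal standalone sketch of the mismatch clang reports, assuming (as the error message indicates) that the tail field carries the C11 _Atomic qualifier. The struct and function names below are made up for illustration and are not the ring code; the C11 generic functions are shown only as one possible way to satisfy clang, not as the intended fix for this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the ring head/tail pair; not the DPDK struct. */
struct headtail_sketch {
	volatile _Atomic(uint32_t) head;
	volatile _Atomic(uint32_t) tail;
};

/*
 * clang rejects __atomic_store_n(&ht->tail, ...) because the GCC-style
 * builtins expect a pointer to a plain integer or pointer type, not to an
 * _Atomic object. The C11 generic function takes the _Atomic field directly.
 */
static inline void
tail_store_release(struct headtail_sketch *ht, uint32_t new_val)
{
	assert(ht != NULL);
	atomic_store_explicit(&ht->tail, new_val, memory_order_release);
}

/* Matching acquire load, so a reader observes everything before the store. */
static inline uint32_t
tail_load_acquire(struct headtail_sketch *ht)
{
	assert(ht != NULL);
	return atomic_load_explicit(&ht->tail, memory_order_acquire);
}
```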
> Although it would be preferable to use C11 version on all architectures,
> there is a performance loss if we do it that way:
>
> Measured on i9-13900H, two physical cores MP/MC bulk n=128, 10 runs:
> with C11 builtin: 5.86 cycles/elem
> with __sync builtin: 5.36 cycles/elem (-9.4%)
You don't compare with the current rte_atomic functions?
^ permalink raw reply
* [PATCH v6] net/iavf: fix duplicate VF reset during PF reset recovery
From: Anurag Mandal @ 2026-06-10 15:43 UTC (permalink / raw)
To: dev
Cc: bruce.richardson, vladimir.medvedkin, ciara.loftus, Anurag Mandal,
stable
In-Reply-To: <20260605202911.314359-1-anurag.mandal@intel.com>
During PF initiated reset recovery, iavf_dev_close() sends an
extra 'VIRTCHNL_OP_RESET_VF' while recovery is already in progress.
That second reset can leave PF/VF virtchnl state inconsistent and
cause 'VIRTCHNL_OP_CONFIG_VSI_QUEUES' to fail with 'ERR_PARAM' after
ToR link flap/power-cycle, leaving the VF unable to recover.
This results in connection loss.
This patch introduces a new flag 'pf_reset_in_progress', which
is set only when iavf_handle_hw_reset() is entered for a
PF-initiated reset (vf_initiated_reset is false), and
it is cleared on exit.
The flag is used to prevent sending the close-time VF
reset and related close-time virtchnl operation messages to the
AdminQ while a PF-triggered reset recovery is in progress.
This is done to avoid duplicate VF reset requests while preserving
normal behavior for application-driven close or VF-initiated reinit.
Fixes: 675a104e2e94 ("net/iavf: fix abnormal disable HW interrupt")
Fixes: b34fe66ea893 ("net/iavf: delay VF reset command")
Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without interrupt")
Cc: stable@dpdk.org
Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
---
V6: Addressed Ciara Loftus's review comments
- changed to concise release note
- removed unwarranted comment
- added proper comments in two places
- aligned commits with latest 'next-net-intel-for-next-net' branch
V5: Addressed Ciara Loftus's review comments
- added separate flag for PF initiated reset recovery
V4: Addressed Ciara Loftus's review comments
- split VF reset from other code changes
V3: Addressed latest ai-code-review comments
V2: Addressed ai-code-review comments
doc/guides/rel_notes/release_26_07.rst | 2 ++
drivers/net/intel/iavf/iavf.h | 1 +
drivers/net/intel/iavf/iavf_ethdev.c | 42 +++++++++++++++++---------
drivers/net/intel/iavf/iavf_vchnl.c | 18 +++++++++--
4 files changed, 46 insertions(+), 17 deletions(-)
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index b5285af5fe..3832410363 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -121,6 +121,8 @@ New Features
* Added support for transmitting LLDP packets based on mbuf packet type.
* Implemented AVX2 context descriptor transmit paths.
+ * Fixed duplicate send of 'VIRTCHNL_OP_RESET_VF' during PF reset recovery
+ which could cause virtchnl state corruption.
* **Updated NVIDIA mlx5 ethernet driver.**
diff --git a/drivers/net/intel/iavf/iavf.h b/drivers/net/intel/iavf/iavf.h
index 4444602a30..293adaf6c9 100644
--- a/drivers/net/intel/iavf/iavf.h
+++ b/drivers/net/intel/iavf/iavf.h
@@ -292,6 +292,7 @@ struct iavf_info {
bool in_reset_recovery;
bool reset_pending;
+ bool pf_reset_in_progress;
uint32_t ptp_caps;
rte_spinlock_t phc_time_aq_lock;
diff --git a/drivers/net/intel/iavf/iavf_ethdev.c b/drivers/net/intel/iavf/iavf_ethdev.c
index ec1ad02826..4c8a1895e4 100644
--- a/drivers/net/intel/iavf/iavf_ethdev.c
+++ b/drivers/net/intel/iavf/iavf_ethdev.c
@@ -3168,22 +3168,29 @@ iavf_dev_close(struct rte_eth_dev *dev)
ret = iavf_dev_stop(dev);
/*
- * Release redundant queue resource when close the dev
- * so that other vfs can re-use the queues.
+ * Prevent sending close-time virtchnl messages to the AdminQ
+ * during PF-initiated reset recovery.
*/
- if (vf->lv_enabled) {
- ret = iavf_request_queues(dev, IAVF_MAX_NUM_QUEUES_DFLT);
- if (ret)
- PMD_DRV_LOG(ERR, "Reset the num of queues failed");
+ if (!vf->pf_reset_in_progress) {
- vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
- }
+ /*
+ * Release redundant queue resource when close the dev
+ * so that other vfs can re-use the queues.
+ */
+ if (vf->lv_enabled) {
+ ret = iavf_request_queues(dev, IAVF_MAX_NUM_QUEUES_DFLT);
+ if (ret)
+ PMD_DRV_LOG(ERR, "Reset the num of queues failed");
+ vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
+ }
- /* Disable promiscuous mode before resetting the VF. This is to avoid
- * potential issues when the PF is bound to the kernel driver.
- */
- if (vf->promisc_unicast_enabled || vf->promisc_multicast_enabled)
- iavf_config_promisc(adapter, false, false);
+ /*
+ * Disable promiscuous mode before resetting the VF. This is to avoid
+ * potential issues when the PF is bound to the kernel driver.
+ */
+ if (vf->promisc_unicast_enabled || vf->promisc_multicast_enabled)
+ iavf_config_promisc(adapter, false, false);
+ }
adapter->closed = true;
@@ -3196,7 +3203,12 @@ iavf_dev_close(struct rte_eth_dev *dev)
iavf_flow_flush(dev, NULL);
iavf_flow_uninit(adapter);
- iavf_vf_reset(hw);
+ /*
+ * Prevent sending VIRTCHNL_OP_RESET_VF during PF-initiated
+ * reset recovery.
+ */
+ if (!vf->pf_reset_in_progress)
+ iavf_vf_reset(hw);
/*
* If a reset is pending, wait for the PF to disable the VF's admin
* receive queue (its first reset action) before we shut it down
@@ -3380,6 +3392,7 @@ iavf_handle_hw_reset(struct rte_eth_dev *dev, bool vf_initiated_reset)
}
vf->in_reset_recovery = true;
+ vf->pf_reset_in_progress = !vf_initiated_reset;
iavf_set_no_poll(adapter, false);
/* Call the pre reset callback */
@@ -3430,6 +3443,7 @@ iavf_handle_hw_reset(struct rte_eth_dev *dev, bool vf_initiated_reset)
vf->post_reset_cb(dev->data->port_id, ret, vf->post_reset_cb_arg);
vf->in_reset_recovery = false;
+ vf->pf_reset_in_progress = false;
iavf_set_no_poll(adapter, false);
return;
diff --git a/drivers/net/intel/iavf/iavf_vchnl.c b/drivers/net/intel/iavf/iavf_vchnl.c
index 0643a835d5..08ab11ccf1 100644
--- a/drivers/net/intel/iavf/iavf_vchnl.c
+++ b/drivers/net/intel/iavf/iavf_vchnl.c
@@ -283,9 +283,21 @@ iavf_read_msg_from_pf(struct iavf_adapter *adapter, uint16_t buf_len,
vf->link_up ? "up" : "down");
break;
case VIRTCHNL_EVENT_RESET_IMPENDING:
- vf->vf_reset = true;
- iavf_set_no_poll(adapter, false);
- PMD_DRV_LOG(INFO, "VF is resetting");
+ /*
+ * Force link down on impending reset to drop
+ * the cached link-up state; a fresh LSC up
+ * event will be re-issued by the PF once the
+ * VF is reinitialised.
+ */
+ vf->link_up = false;
+ if (!vf->vf_reset) {
+ vf->vf_reset = true;
+ iavf_set_no_poll(adapter, false);
+ iavf_dev_event_post(vf->eth_dev,
+ RTE_ETH_EVENT_INTR_RESET,
+ NULL, 0);
+ }
+ PMD_DRV_LOG(DEBUG, "VF is resetting");
break;
case VIRTCHNL_EVENT_PF_DRIVER_CLOSE:
vf->dev_closed = true;
--
2.34.1
^ permalink raw reply related
* RE: [PATCH v5] net/iavf: fix duplicate VF reset during PF reset recovery
From: Mandal, Anurag @ 2026-06-10 15:45 UTC (permalink / raw)
To: Loftus, Ciara, dev@dpdk.org
Cc: Richardson, Bruce, Medvedkin, Vladimir, stable@dpdk.org
In-Reply-To: <IA4PR11MB9278C1F04D8BB54089B361E58E1A2@IA4PR11MB9278.namprd11.prod.outlook.com>
> -----Original Message-----
> From: Loftus, Ciara <ciara.loftus@intel.com>
> Sent: 10 June 2026 16:20
> To: Mandal, Anurag <anurag.mandal@intel.com>; dev@dpdk.org
> Cc: Richardson, Bruce <bruce.richardson@intel.com>; Medvedkin, Vladimir
> <vladimir.medvedkin@intel.com>; stable@dpdk.org
> Subject: RE: [PATCH v5] net/iavf: fix duplicate VF reset during PF reset recovery
>
> > Subject: [PATCH v5] net/iavf: fix duplicate VF reset during PF reset
> > recovery
> >
> > During PF initiated reset recovery, iavf_dev_close() sends an extra
> > VIRTCHNL_OP_RESET_VF while recovery is already in progress.
> > That second reset can leave PF/VF virtchnl state inconsistent and
> > cause VIRTCHNL_OP_CONFIG_VSI_QUEUES to fail with ERR_PARAM after ToR
> > link flap/power-cycle, leaving the VF unable to recover.
> > This results in connection loss.
> >
> > This patch introduces a new flag 'pf_reset_in_progress', that is set
> > only when iavf_handle_hw_reset() is entered with vf_initiated_reset as
> > false and is cleared on exit.
> > Also, close-time VF reset and related close-time virtchnl operations
> > are skipped when PF triggered reset recovery is set.
> > This is done to avoid a duplicate VF reset, and keep normal behavior
> > for application-driven close or VF-initiated reinit.
> >
> > Fixes: 675a104e2e94 ("net/iavf: fix abnormal disable HW interrupt")
> > Fixes: b34fe66ea893 ("net/iavf: delay VF reset command")
> > Fixes: 5e03e316c753 ("net/iavf: handle virtchnl event message without
> > interrupt")
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Anurag Mandal <anurag.mandal@intel.com>
>
> Acked-by: Ciara Loftus <ciara.loftus@intel.com>
>
> I think you may need to respin due to patch application failure.
> I have some suggestions for improving the comments/release notes that you
> could include in the next version. Code looks good to me.
>
> > ---
> > V5: Addressed Ciara Loftus's comments
> > - added separate flag for PF initiated reset recovery
> > V4: Addressed Ciara Loftus's comments
> > - split VF reset from other code changes
> > V3: Addressed latest ai-code-review comments
> > V2: Addressed ai-code-review comments
> >
> > doc/guides/rel_notes/release_26_07.rst | 3 ++
> > drivers/net/intel/iavf/iavf.h | 7 +++++
> > drivers/net/intel/iavf/iavf_ethdev.c | 40 +++++++++++++++-----------
> > drivers/net/intel/iavf/iavf_vchnl.c | 18 ++++++++++--
> > 4 files changed, 49 insertions(+), 19 deletions(-)
> >
> > diff --git a/doc/guides/rel_notes/release_26_07.rst
> > b/doc/guides/rel_notes/release_26_07.rst
> > index d2563ac503..f6899a78c3 100644
> > --- a/doc/guides/rel_notes/release_26_07.rst
> > +++ b/doc/guides/rel_notes/release_26_07.rst
> > @@ -95,6 +95,9 @@ New Features
> >
> > * Added support for transmitting LLDP packets based on mbuf packet type.
> > * Implemented AVX2 context descriptor transmit paths.
> > + * Prevented duplicate 'VIRTCHNL_OP_RESET_VF' during a PF-initiated
> > + reset recovery, which earlier caused virtchnl state corruption
> > + and connection loss after a top-of-rack (ToR) link flap/power-cycle.
>
> I think something more concise here would be better.
> eg. "Fixed duplicate send of 'VIRTCHNL_OP_RESET_VF' during PF reset recovery
> which could cause virtchnl state corruption"
>
> >
> > * **Updated PCAP ethernet driver.**
> >
> > diff --git a/drivers/net/intel/iavf/iavf.h
> > b/drivers/net/intel/iavf/iavf.h index 2615b6f034..67aacbe7a6 100644
> > --- a/drivers/net/intel/iavf/iavf.h
> > +++ b/drivers/net/intel/iavf/iavf.h
> > @@ -292,6 +292,13 @@ struct iavf_info {
> >
> > bool in_reset_recovery;
> >
> > + /*
> > + * Set only while iavf_handle_hw_reset()
> > + * is processing a PF-initiated reset
> > + * (vf_initiated_reset == false).
> > + */
>
> I don't think a comment is warranted here, the variable name is self-
> explanatory.
>
> > + bool pf_reset_in_progress;
> > +
> > uint32_t ptp_caps;
> > rte_spinlock_t phc_time_aq_lock;
> > };
> > diff --git a/drivers/net/intel/iavf/iavf_ethdev.c
> > b/drivers/net/intel/iavf/iavf_ethdev.c
> > index a8031e23a5..2b6f4daa99 100644
> > --- a/drivers/net/intel/iavf/iavf_ethdev.c
> > +++ b/drivers/net/intel/iavf/iavf_ethdev.c
> > @@ -3166,23 +3166,27 @@ iavf_dev_close(struct rte_eth_dev *dev)
> >
> > ret = iavf_dev_stop(dev);
> >
> > - /*
> > - * Release redundant queue resource when close the dev
> > - * so that other vfs can re-use the queues.
> > - */
> > - if (vf->lv_enabled) {
> > - ret = iavf_request_queues(dev,
> > IAVF_MAX_NUM_QUEUES_DFLT);
> > - if (ret)
> > - PMD_DRV_LOG(ERR, "Reset the num of queues
> > failed");
> > + /* Skip RESET_VF on a PF-initiated reset */
>
> Regarding the comment above, here we're not skipping RESET_VF rather
> preventing sending virtchnl messages to the adminq during the PF-initiated
> reset. I suggest rewording the comment to reflect that.
>
> > + if (!vf->pf_reset_in_progress) {
> >
> > - vf->max_rss_qregion = IAVF_MAX_NUM_QUEUES_DFLT;
> > - }
> > + /*
> > + * Release redundant queue resource when close the dev
> > + * so that other vfs can re-use the queues.
> > + */
> > + if (vf->lv_enabled) {
> > + ret = iavf_request_queues(dev,
> > IAVF_MAX_NUM_QUEUES_DFLT);
> > + if (ret)
> > + PMD_DRV_LOG(ERR, "Reset the num of
> > queues failed");
> > + vf->max_rss_qregion =
Hi Ciara,
Thank you for the detailed review. I have addressed all the review comments.
Sent v6. Kindly review.
Thanks,
Anurag M
^ permalink raw reply
* Re: [PATCH v5] ethdev: support inline calculating masked item value
From: Stephen Hemminger @ 2026-06-10 15:46 UTC (permalink / raw)
To: Bing Zhao
Cc: viacheslavo, dev, rasland, orika, dsosnowski, suanmingm, matan,
thomas
In-Reply-To: <20260610052729.5637-1-bingz@nvidia.com>
On Wed, 10 Jun 2026 08:27:29 +0300
Bing Zhao <bingz@nvidia.com> wrote:
> In the asynchronous API definition and some drivers, the
> rte_flow_item spec value may not be calculated by the driver due to the
> reason of speed of light rule insertion rate and sometimes the input
> parameters will be copied and changed internally.
>
> After copying, the spec and last will be protected by the keyword
> const and cannot be changed in the code itself. And also the driver
> needs some extra memory to do the calculation and extra conditions
> to understand the length of each item spec. This is not efficient.
>
> To solve the issue and support usage of the following fix, a new OP
> was introduced to calculate the spec and last values after applying
> the mask inline.
>
> Signed-off-by: Bing Zhao <bingz@nvidia.com>
> Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
> ---
Detailed AI review still sees issues here:
On Wed, 10 Jun 2026 08:27:29 +0300, Bing Zhao wrote:
> Subject: [PATCH v5] ethdev: support inline calculating masked item value
> v5: handle some items separately and add test for them
The v5 lib/ethdev/rte_flow.c and app/test/test_ethdev_api.c hunks are
identical to v4 -- the masking loop is unchanged and the test still only
covers ETH. The changelog says items are handled separately and a test was
added, but no such change is present in the diff. The v4 issue is still open.
Error: byte-wise masking corrupts embedded pointers in deep-copy item
types (RAW, FLEX, GENEVE_OPT).
In rte_flow_conv_pattern(), the mask is applied over the fixed item struct:
size_t item_mask_size = mask ? rte_flow_conv_item_mask_size(src) : 0;
...
size_t mask_size = RTE_MIN(ret, item_mask_size);
for (j = 0; j < mask_size; j++)
c_spec[j] &= mask[j];
item_mask_size is rte_flow_desc_item[type].size, the fixed item struct size.
For RTE_FLOW_ITEM_TYPE_RAW, FLEX, and GENEVE_OPT, that fixed struct ends in an
embedded pointer that rte_flow_conv_item_spec() has just populated to point at
the deep-copied trailing data (rte_flow_item_raw.pattern,
rte_flow_item_flex.pattern, rte_flow_item_geneve_opt.data). Because the masked
range covers the whole fixed struct, the loop ANDs the bytes of that pointer
with the mask's corresponding bytes (typically a NULL mask pointer), zeroing
or garbling it.
The converted item's pattern/data pointer is clobbered while the copied
payload it should reference is left unreachable. A consumer that follows
conv->pattern then dereferences NULL or a corrupt address. Plain value items
(eth, ipv4, ...) are unaffected; only the deep-copy item types break.
Suggested fix: do not blind-mask the entire fixed struct for items that carry
an embedded pointer / desc_fn deep copy. Either skip masking when
rte_flow_desc_item[type].desc_fn != NULL, or mask only the leading plain-data
region and leave the pointer field (and trailing copied bytes) intact.
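To illustrate the second option, here is a minimal sketch of masking only the leading plain-data region. The struct layout and the helper name are hypothetical and only mimic an item that ends in an embedded pointer; the real item types and rte_flow_conv_pattern() are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical item layout: plain fields followed by an embedded pointer. */
struct item_sketch {
	uint32_t flags;
	uint16_t length;
	const uint8_t *pattern; /* would point at the deep-copied trailing data */
};

/*
 * AND the spec with the mask over the plain-data prefix only.
 * plain_len is the number of leading bytes that hold no pointer; for
 * item_sketch that is offsetof(struct item_sketch, pattern).
 */
static void
mask_plain_prefix(void *spec, const void *mask, size_t plain_len)
{
	uint8_t *s = spec;
	const uint8_t *m = mask;
	size_t i;

	assert(spec != NULL && mask != NULL);
	for (i = 0; i < plain_len; i++)
		s[i] &= m[i];
}
```

With plain_len set to the offset of the pointer field, the pointer bytes are never touched, so a NULL pointer in the mask cannot clear the converted item's pointer.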
Warning: the new test validates only an ETH pattern, so the RAW/FLEX/GENEVE_OPT
path above is untested. A RAW item case would have surfaced the pointer
corruption -- and is what the v5 changelog claims to have added but did not.
Info: the Doxygen block for RTE_FLOW_CONV_OP_PATTERN_MASKED uses @p mask,
@p spec, @p last, but those are item fields, not parameters of the op; the
neighboring enum entries only document the @p src / @p dst types.
^ permalink raw reply
* Re: [PATCH v2] eal: add destructor to unregister tailq on unload
From: Stephen Hemminger @ 2026-06-10 15:57 UTC (permalink / raw)
To: fengchengwen; +Cc: dev, stable, Bruce Richardson, Neil Horman, David Marchand
In-Reply-To: <bfa31475-848f-42af-bfb4-f796433c3073@huawei.com>
On Wed, 10 Jun 2026 09:19:42 +0800
fengchengwen <fengchengwen@huawei.com> wrote:
> >
> > +RTE_EXPORT_SYMBOL(rte_eal_tailq_unregister)
>
> this should be with EXPERIMENTAL
Not possible, this is part of the EAL_REGISTER_TAILQ macro and its usage
is under the covers. If it were marked experimental, it would break
code that does not allow experimental APIs.
>
> > +void
> > +rte_eal_tailq_unregister(struct rte_tailq_elem *t)
> > +{
> > + TAILQ_REMOVE(&rte_tailq_elem_head, t, next);
>
> We first need to make sure the tailq exists in the list, just like the TAILQ_FOREACH in rte_eal_tailq_local_register()
OK, a scan is cheap since this is not in the critical path.
^ permalink raw reply
* Re: [PATCH 0/2] Pflock downgrade & stress tests for pflock/rwlock libraries
From: Stephen Hemminger @ 2026-06-10 15:59 UTC (permalink / raw)
To: Eimear Morrissey; +Cc: dev
In-Reply-To: <20260610091147.88412-1-eimear.morrissey@huawei.com>
On Wed, 10 Jun 2026 10:11:45 +0100
Eimear Morrissey <eimear.morrissey@huawei.com> wrote:
> Add new downgrade option for pflock. Add stress tests for this &
> by extension the rest of the pflock/rwlock libraries.
>
> Eimear Morrissey (1):
> app/test: add stress tests for rwlock and pflock
>
> Konstantin Ananyev (1):
> eal/pflock: add API to downgrade from wr to rd lock
>
> app/test/meson.build | 2 +
> app/test/test_pflock_stress.c | 76 ++++++
> app/test/test_rwlock_stress.c | 59 +++++
> app/test/test_rwlock_stress_impl.h | 393 +++++++++++++++++++++++++++++
> lib/eal/include/rte_pflock.h | 21 ++
> 5 files changed, 551 insertions(+)
> create mode 100644 app/test/test_pflock_stress.c
> create mode 100644 app/test/test_rwlock_stress.c
> create mode 100644 app/test/test_rwlock_stress_impl.h
>
Interesting idea, lots of feedback from AI. Mostly about the test.
Patch 1/2 (eal/pflock: add API to downgrade from wr to rd lock)
Warning: new public API is not marked __rte_experimental.
New APIs (including static inline) should carry the experimental tag
for at least one release per the ABI policy:
__rte_experimental
static inline void
rte_pflock_write_downgrade(rte_pflock_t *pf)
Warning: new EAL API added without a release notes entry.
Please add a note to doc/guides/rel_notes/release_26_07.rst.
Info: the hunk adds a double blank line before the #ifdef __cplusplus,
and the two atomic calls use different continuation indentation
(one tab vs two). checkpatch will complain about the blank lines.
Patch 2/2 (app/test: add stress tests for rwlock and pflock)
Error: read lock is leaked on reader error paths, hanging the test.
handle_error() with write_lock=false does not unlock, and its comment
claims the lock is "already unlocked by the calling function" -- but
handle_reader_work() calls it while still holding the read lock (both
the array-mismatch path and the counter-changed path), and the
DOWNGRADE_TEST failure path in handle_writer_work() likewise calls it
while holding the downgraded read lock. The leaked reader keeps rd.out
from ever matching, so any writer blocks forever in write_lock() and
rte_eal_wait_lcore() never returns: a detected failure becomes a hang
instead of a test failure. Simplest fix is to have callers unlock
before calling handle_error() and drop the unlock from it entirely;
that also fixes the downgrade path incrementing reader_errors for a
writer thread.
Error: stop flag uses volatile instead of atomics.
"volatile bool stop" is written by workers (handle_error) and the main
lcore and polled by all workers. volatile provides no ordering or
atomicity guarantee; use RTE_ATOMIC(bool) with
rte_atomic_load_explicit/rte_atomic_store_explicit as
test_ring_stress_impl.h does for wrk_cmd.
The volatile on counter and counter_array is unnecessary -- they are
only accessed under the lock, which already provides ordering -- and
it defeats compiler optimization of the 1024-element verify loops.
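Something along these lines, sketched with plain C11 stdatomic so it builds standalone; in the test the same shape would use RTE_ATOMIC(bool) with rte_atomic_store_explicit/rte_atomic_load_explicit. The struct and function names are hypothetical, and the release/acquire orders are an assumption about what the test needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control block; atomic_bool stands in for RTE_ATOMIC(bool). */
struct ctl_sketch {
	atomic_bool stop;
};

/* Called by the main lcore or by a worker that detected an error. */
static inline void
request_stop(struct ctl_sketch *c)
{
	atomic_store_explicit(&c->stop, true, memory_order_release);
}

static inline bool
should_stop(struct ctl_sketch *c)
{
	return atomic_load_explicit(&c->stop, memory_order_acquire);
}

/* Worker loop shape: poll the flag once per iteration, leave when it is set. */
static uint64_t
run_until_stopped(struct ctl_sketch *c, uint64_t max_iters)
{
	uint64_t n = 0;

	assert(c != NULL);
	while (!should_stop(c) && n < max_iters)
		n++;
	return n;
}
```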
Warning: DYNAMIC_ROLES does not switch roles.
should_be_writer() is called once in lcore_function() before the loop,
so each thread's role is fixed for the whole run; the flag's stated
purpose ("Threads can switch between reader/writer roles") never
happens. Move the role decision inside the while loop when
DYNAMIC_ROLES is set.
Warning: should_be_writer() assumes the main lcore has index 0.
unsigned int idx = rte_lcore_index(rte_lcore_id()) - 1;
With --main-lcore set to a non-lowest core, a worker can have index 0,
so idx underflows to UINT_MAX and the reader/writer split no longer
matches num_readers/num_writers. Compute the worker's position by
iterating RTE_LCORE_FOREACH_WORKER or skip the main lcore's index
explicitly.
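A small sketch of the "skip the main lcore's index explicitly" variant. The helper names are hypothetical, and the rule that the lowest-ranked workers become writers is only an assumption for illustration; the indices would come from rte_lcore_index() for the worker and for the main lcore.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: rank of a worker among all workers, given the
 * enabled-lcore index of the worker and of the main lcore.
 * Indices below the main lcore keep their value; those above shift down
 * by one, so the subtraction can never wrap around.
 */
static unsigned int
worker_rank(unsigned int lcore_index, unsigned int main_index)
{
	assert(lcore_index != main_index);
	return lcore_index < main_index ? lcore_index : lcore_index - 1;
}

/* Assumed split: the first num_writers workers by rank write, the rest read. */
static bool
is_writer(unsigned int lcore_index, unsigned int main_index,
	  unsigned int num_writers)
{
	return worker_rank(lcore_index, main_index) < num_writers;
}
```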
Warning: trailing alignment attribute placement.
} __rte_cache_aligned;
on struct lcore_stats and struct rwlock_test_shared must be written as
struct __rte_cache_aligned lcore_stats {
checkpatch enforces this (required for MSVC).
Info: "max wait" statistic does not measure lock wait time.
acquire_time spans the entire iteration including the verify loops and
the configured reader/writer delays, so for long_hold it reports the
delay, not contention. Either time only the lock call or rename it.
Info: missing space in the summary printf string concatenation:
"%"PRIu64" writer ops," "total time: ..."
prints "writer ops,total time". Also the first element of
pflock_specific_tests is mis-indented (opening brace at column 0).
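On the printf item, a tiny illustration of why the space goes missing: adjacent string literals are joined with nothing in between, so the separator has to live inside one of them. The message text and function name below are illustrative, not the test's actual format string.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * The first literal ends with ", " so the concatenated format reads
 * "... writer ops, total time: ..." rather than "... writer ops,total time".
 */
static int
format_summary(char *buf, size_t len, uint64_t writer_ops, uint64_t total_us)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len,
			"%" PRIu64 " writer ops, "
			"total time: %" PRIu64 " us",
			writer_ops, total_us);
}
```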
^ permalink raw reply
* Re: [PATCH v8 0/1] net/mana: add device reset support
From: Stephen Hemminger @ 2026-06-10 16:56 UTC (permalink / raw)
To: Wei Hu; +Cc: dev, longli, weh
In-Reply-To: <cover.1781017284.git.weh@linux.microsoft.com>
On Wed, 10 Jun 2026 00:21:21 -0700
Wei Hu <weh@linux.microsoft.com> wrote:
> From: Wei Hu <weh@microsoft.com>
>
> Add support for handling hardware service reset events in the
> MANA driver. When the MANA kernel driver receives a hardware
> service event, it initiates a device reset and notifies userspace
> via IBV_EVENT_DEVICE_FATAL. The MANA PMD handles this by
> performing an automatic teardown and recovery sequence.
>
> The driver uses ethdev recovery events (ERR_RECOVERING,
> RECOVERY_SUCCESS, RECOVERY_FAILED) to notify upper layers of
> the reset lifecycle, and a PCI device removal event callback
> to distinguish hot-remove from service reset.
>
> Changes since v7:
> - Moved heavy teardown (dev_stop, IPC to secondaries, dev_close,
> MR btree free) from mana_reset_enter (EAL interrupt thread)
> to mana_reset_thread (control thread). The interrupt handler
> now only sets state, drains in-flight bursts, and spawns the
> thread. Teardown runs immediately in the control thread before
> the recovery timer wait, avoiding blocking the interrupt thread
> on multi-second IPC timeouts and ibverbs calls. Each function
> now owns its own lock scope with no lock hand-off between
> threads.
> - Fixed self-join deadlock: clear reset_thread_active before
> emitting RECOVERY_SUCCESS/FAILED callbacks from the reset
> thread. Without this, if the callback calls dev_stop/dev_close,
> mana_join_reset_thread attempts to join the current thread.
> - Simplified burst_state from encoding device state in bits 1+
> to a single blocked flag (bit 1). Only one value was ever
> stored, so the multi-state encoding was misleading. Added
> MANA_BURST_BLOCKED constant.
> - Updated mana.rst to reflect that teardown runs on the control
> thread, not the interrupt handler.
>
> Changes since v6:
> - Rebased onto latest upstream for-main
> - Replaced removed RTE_ETH_DEV_TO_PCI macro with
> RTE_CLASS_TO_BUS_DEVICE (upstream commit 4757b8df04
> removed the old bus-specific ethdev convenience macros)
>
> Changes since v5:
> - Replaced RCU QSBR with per-queue atomic burst_state using a
> single-variable CAS design: bit 0 is the in-burst flag, bit 1
> is the blocked flag. The data path uses CAS(0→1) to enter
> burst and fetch_and(~1) to exit. The reset path uses fetch_or
> to set the blocked bit and polls bit 0 to drain in-flight
> bursts. This eliminates the two-variable Dekker pattern and the
> need for sequential consistency (seq_cst) ordering.
> - Removed librte_rcu dependency
> - Removed __rte_no_thread_safety_analysis annotations (no longer
> needed after mutex conversion)
> - Moved ERR_RECOVERING event emission before acquiring
> reset_ops_lock and before mana_reset_enter, so upper layers
> (e.g. netvsc) can switch data path before mana stops queues.
> Emitting outside the lock avoids deadlock if the callback
> calls dev_stop or dev_close.
> - Replaced MANA_OPS_*_LOCK macros with mana_reset_trylock()
> helper function and explicit per-operation wrappers
> - Removed unused rte_alarm.h and rte_lock_annotations.h includes
> - Added RECOVERY_FAILED event when mana_reset_enter fails
> internally, so the application always receives a terminal event
> - Added mana_clear_burst_state() helper to clear per-queue
> burst_state on failure paths (reset_failed, dev_stop_lock,
> dev_close_lock) preventing permanent silent packet drop after
> a failed reset
>
> Changes since v4:
> - Fixed stale rte_spinlock_unlock call in mana_intr_handler that
> was missed during the spinlock-to-mutex conversion, causing a
> -Wincompatible-pointer-types warning
>
> Changes since v3:
> - Converted reset_ops_lock from rte_spinlock_t to pthread_mutex_t
> with PTHREAD_PROCESS_SHARED, since the lock is held across
> blocking IB verbs calls and IPC with 5s timeout
> - Removed rte_dev_event_callback_unregister retry loop to avoid
> deadlock: the callback itself blocks on reset_ops_lock, so
> retrying on -EAGAIN while holding the lock is a deadlock
> - Introduced mana_join_reset_thread() helper using CAS on
> reset_thread_active to prevent double-join undefined behavior
> - Added reset thread join in mana_dev_uninit to prevent thread
> leak on device removal
> - Fixed ibv handle leak: priv->ib_ctx is now only set to NULL
> after ibv_close_device succeeds
> - Fixed misleading "All secondary threads are quiescent" log in
> mana_mp_reset_enter — changed to "Secondary doorbell pages
> unmapped" since actual quiescence is enforced by the primary's
> per-queue atomic flag check before IPC is sent
> - Changed event list in mana.rst to RST definition list style
> - Squashed documentation into the feature patch per convention
>
> Changes since v2:
> - Fixed dev_state_qsv memory leak on device removal
> - Fixed reset thread TCB/stack leak: reset_thread_active is now
> only cleared by the joiner, not the thread itself
> - Fixed second reset crash: removed reset thread join logic from
> mana_dev_close (inner function) to avoid corrupting dev_state
> when called from mana_reset_enter
> - Made reset_thread_active RTE_ATOMIC(bool) with explicit ordering
> - Added retry loop for rte_dev_event_callback_unregister on -EAGAIN
> - Initialized condvar/mutex with PTHREAD_PROCESS_SHARED since priv
> is in hugepage shared memory
> - Added re-check of dev_state after lock acquisition in
> mana_intr_handler to prevent racing with pci_remove_event_cb
> - Replaced (void *)0 with NULL in mp.c
> - Added lock ownership comment block at mana_reset_enter
> - Documented rte_dev_event_monitor_start() requirement
> - Added mana.rst documentation and release note
>
> Changes since v1:
> - Removed net/netvsc patch from this series
> - Simplified reset exit: mana_reset_exit calls
> mana_reset_exit_delay directly instead of spawning a thread
> - Added __rte_no_thread_safety_analysis annotations for clang
> - Switched to rte_thread_create_internal_control
> - Fixed declaration-after-statement style issues
> - Removed unnecessary blank lines and stale comments
>
> Wei Hu (1):
> net/mana: add device reset support
>
> doc/guides/nics/mana.rst | 40 +
> doc/guides/rel_notes/release_26_07.rst | 8 +
> drivers/net/mana/mana.c | 1076 ++++++++++++++++++++++--
> drivers/net/mana/mana.h | 52 +-
> drivers/net/mana/mp.c | 89 +-
> drivers/net/mana/mr.c | 6 +-
> drivers/net/mana/rx.c | 23 +-
> drivers/net/mana/tx.c | 44 +-
> 8 files changed, 1230 insertions(+), 108 deletions(-)
>
AI review thread still sees some issues:
This is close. Teardown is off the interrupt thread now, the cross-thread lock
hand-off is gone, and the burst_state encoding reads cleanly. One real problem
left.
The reset thread leaks on every successful recovery. mana_reset_exit_delay and
the reset_failed path in mana_reset_thread clear reset_thread_active from inside
the thread itself, without joining. The thread is created joinable via
rte_thread_create_internal_control, so terminating it unjoined leaks its
resources, and because the flag is now false none of the join sites
(mana_join_reset_thread, the join-previous block in mana_reset_enter,
mana_dev_uninit) will ever reap it. The PCI-remove abort path leaves the flag
true and is reaped later, which is the inconsistency that exposes this: some
exits expect a join and some do not, and the latter have no reaper.
Simplest fix is to detach the reset thread (rte_thread_detach) and drop the
reset_thread_active / mana_join_reset_thread machinery, using reset_ops_lock and
dev_state for the dev_stop/dev_close sequencing instead. That removes the
self-join hazard too. If you keep the join, don't clear the flag from inside the
thread; have mana_join_reset_thread detect the self case and skip only the join.
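For the keep-the-join option, a minimal sketch of one way the self case could be handled. It uses plain pthreads and C11 atomics rather than the rte_thread API, and struct reset_ctx, start_reset_thread, join_reset_thread and the thread-local self_ctx marker are hypothetical names, not the mana driver's code. It also assumes that "skip only the join" means the flag stays set in the self case so that a later caller reaps the thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver private data; not the mana layout. */
struct reset_ctx {
	pthread_t reset_thread;          /* valid while reset_thread_active is true */
	atomic_bool reset_thread_active; /* set by the creator, cleared only by a joiner */
	int self_join_rc;                /* what the thread saw when it tried to join itself */
};

/* Marks the calling thread as the reset thread of one context. */
static _Thread_local struct reset_ctx *self_ctx;

/*
 * Reap the reset thread at most once. Returns 1 if this call joined it,
 * 0 if there was nothing to join or the caller is the reset thread itself,
 * and -1 if pthread_join failed.
 */
static int
join_reset_thread(struct reset_ctx *ctx)
{
	bool expected = true;

	assert(ctx != NULL);

	/* Self case: skip the join and leave the flag set for a later reaper. */
	if (self_ctx == ctx)
		return 0;

	/* Only one caller wins the exchange, so the thread is never joined twice. */
	if (!atomic_compare_exchange_strong(&ctx->reset_thread_active, &expected, false))
		return 0;

	return pthread_join(ctx->reset_thread, NULL) == 0 ? 1 : -1;
}

/* Example thread body: it exits without touching reset_thread_active. */
static void *
reset_thread_main(void *arg)
{
	struct reset_ctx *ctx = arg;

	self_ctx = ctx;
	ctx->self_join_rc = join_reset_thread(ctx); /* no-op from inside the thread */
	return NULL;
}

static int
start_reset_thread(struct reset_ctx *ctx)
{
	int rc = pthread_create(&ctx->reset_thread, NULL, reset_thread_main, ctx);

	/* Publish the flag only after the thread handle is valid. */
	if (rc == 0)
		atomic_store(&ctx->reset_thread_active, true);
	return rc;
}
```

With this shape every exit path of the thread leaves the flag alone, so whichever join site runs next (the next reset, dev_close, or uninit) reaps it exactly once.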
Minor: the recovery condvar wait in mana_reset_thread is a bare cond_timedwait.
If pci_remove signals before the thread reaches the wait, the wakeup is lost and
removal isn't seen until the 15s timer expires. Use a dev_state predicate loop
under reset_cond_mutex.
* Re: [PATCH v1 00/20] net/sxe2: added Linkdata sxe2 ethernet driver
From: Stephen Hemminger @ 2026-06-10 17:11 UTC (permalink / raw)
To: liujie5; +Cc: dev
In-Reply-To: <20260610013936.3634968-1-liujie5@linkdatatechnology.com>
On Wed, 10 Jun 2026 09:39:16 +0800
liujie5@linkdatatechnology.com wrote:
> From: Jie Liu <liujie5@linkdatatechnology.com>
>
> This patch set implements core functionality for the SXE2 PMD,
> including basic driver framework, data path setup, and advanced
> offload features (VLAN, RSS,TM, PTP etc.).
>
> Jie Liu (20):
> net/sxe2: support AVX512 vectorized path for Rx and Tx
> net/sxe2: add AVX2 vector data path for Rx and Tx
> drivers: add supported packet types get callback
> net/sxe2: support L2 filtering and MAC config
> drivers: support RSS feature
> net/sxe2: support TM hierarchy and shaping
> net/sxe2: support IPsec inline protocol offload
> net/sxe2: support statistics and multi-process
> drivers: interrupt handling
> net/sxe2: add NEON vec Rx/Tx burst functions
> drivers: add support for VF representors
> net/sxe2: add support for custom UDP tunnel ports
> net/sxe2: support firmware version reading
> net/sxe2: implement get monitor address
> common/sxe2: add shared SFP module definitions
> net/sxe2: support SFP module info and EEPROM access
> net/sxe2: implement private dump info
> net/sxe2: add mbuf validation in Tx debug mode
> drivers: add testpmd commands for private features
> net/sxe2: update sxe2 feature matrix docs
>
> doc/guides/nics/features/sxe2.ini | 56 +
> drivers/common/sxe2/sxe2_common.c | 156 ++
> drivers/common/sxe2/sxe2_common.h | 4 +
> drivers/common/sxe2/sxe2_flow_public.h | 633 +++++++
> drivers/common/sxe2/sxe2_ioctl_chnl.c | 178 +-
> drivers/common/sxe2/sxe2_ioctl_chnl_func.h | 18 +
> drivers/common/sxe2/sxe2_msg.h | 118 ++
> drivers/common/sxe2/sxe2_ptype.h | 1793 ++++++++++++++++++
> drivers/net/sxe2/meson.build | 56 +-
> drivers/net/sxe2/sxe2_cmd_chnl.c | 1587 +++++++++++++++-
> drivers/net/sxe2/sxe2_cmd_chnl.h | 139 ++
> drivers/net/sxe2/sxe2_drv_cmd.h | 521 +++++-
> drivers/net/sxe2/sxe2_dump.c | 304 +++
> drivers/net/sxe2/sxe2_dump.h | 12 +
> drivers/net/sxe2/sxe2_ethdev.c | 1531 +++++++++++++++-
> drivers/net/sxe2/sxe2_ethdev.h | 112 +-
> drivers/net/sxe2/sxe2_ethdev_repr.c | 610 ++++++
> drivers/net/sxe2/sxe2_ethdev_repr.h | 32 +
> drivers/net/sxe2/sxe2_filter.c | 895 +++++++++
> drivers/net/sxe2/sxe2_filter.h | 100 +
> drivers/net/sxe2/sxe2_flow.c | 1391 ++++++++++++++
> drivers/net/sxe2/sxe2_flow.h | 30 +
> drivers/net/sxe2/sxe2_flow_define.h | 144 ++
> drivers/net/sxe2/sxe2_flow_parse_action.c | 1182 ++++++++++++
> drivers/net/sxe2/sxe2_flow_parse_action.h | 23 +
> drivers/net/sxe2/sxe2_flow_parse_engine.c | 106 ++
> drivers/net/sxe2/sxe2_flow_parse_engine.h | 13 +
> drivers/net/sxe2/sxe2_flow_parse_pattern.c | 1935 ++++++++++++++++++++
> drivers/net/sxe2/sxe2_flow_parse_pattern.h | 46 +
> drivers/net/sxe2/sxe2_ipsec.c | 1565 ++++++++++++++++
> drivers/net/sxe2/sxe2_ipsec.h | 254 +++
> drivers/net/sxe2/sxe2_irq.c | 1026 +++++++++++
> drivers/net/sxe2/sxe2_irq.h | 25 +
> drivers/net/sxe2/sxe2_mac.c | 535 ++++++
> drivers/net/sxe2/sxe2_mac.h | 84 +
> drivers/net/sxe2/sxe2_mp.c | 414 +++++
> drivers/net/sxe2/sxe2_mp.h | 67 +
> drivers/net/sxe2/sxe2_queue.c | 17 +-
> drivers/net/sxe2/sxe2_rss.c | 584 ++++++
> drivers/net/sxe2/sxe2_rss.h | 81 +
> drivers/net/sxe2/sxe2_rx.c | 38 +
> drivers/net/sxe2/sxe2_rx.h | 2 +
> drivers/net/sxe2/sxe2_security.c | 335 ++++
> drivers/net/sxe2/sxe2_security.h | 77 +
> drivers/net/sxe2/sxe2_stats.c | 591 ++++++
> drivers/net/sxe2/sxe2_stats.h | 39 +
> drivers/net/sxe2/sxe2_switchdev.c | 332 ++++
> drivers/net/sxe2/sxe2_switchdev.h | 33 +
> drivers/net/sxe2/sxe2_testpmd.c | 733 ++++++++
> drivers/net/sxe2/sxe2_testpmd_lib.c | 969 ++++++++++
> drivers/net/sxe2/sxe2_testpmd_lib.h | 142 ++
> drivers/net/sxe2/sxe2_tm.c | 1169 ++++++++++++
> drivers/net/sxe2/sxe2_tm.h | 78 +
> drivers/net/sxe2/sxe2_tx.c | 7 +
> drivers/net/sxe2/sxe2_txrx.c | 176 +-
> drivers/net/sxe2/sxe2_txrx.h | 4 +
> drivers/net/sxe2/sxe2_txrx_check_mbuf.c | 595 ++++++
> drivers/net/sxe2/sxe2_txrx_check_mbuf.h | 38 +
> drivers/net/sxe2/sxe2_txrx_poll.c | 243 ++-
> drivers/net/sxe2/sxe2_txrx_vec.c | 46 +-
> drivers/net/sxe2/sxe2_txrx_vec.h | 38 +-
> drivers/net/sxe2/sxe2_txrx_vec_avx2.c | 776 ++++++++
> drivers/net/sxe2/sxe2_txrx_vec_avx512.c | 897 +++++++++
> drivers/net/sxe2/sxe2_txrx_vec_common.h | 1 +
> drivers/net/sxe2/sxe2_txrx_vec_neon.c | 721 ++++++++
> drivers/net/sxe2/sxe2_vsi.c | 146 ++
> drivers/net/sxe2/sxe2_vsi.h | 12 +-
> drivers/net/sxe2/sxe2vf_regs.h | 85 +
> 68 files changed, 26576 insertions(+), 124 deletions(-)
> create mode 100644 drivers/common/sxe2/sxe2_flow_public.h
> create mode 100644 drivers/common/sxe2/sxe2_msg.h
> create mode 100644 drivers/common/sxe2/sxe2_ptype.h
> create mode 100644 drivers/net/sxe2/sxe2_dump.c
> create mode 100644 drivers/net/sxe2/sxe2_dump.h
> create mode 100644 drivers/net/sxe2/sxe2_ethdev_repr.c
> create mode 100644 drivers/net/sxe2/sxe2_ethdev_repr.h
> create mode 100644 drivers/net/sxe2/sxe2_filter.c
> create mode 100644 drivers/net/sxe2/sxe2_filter.h
> create mode 100644 drivers/net/sxe2/sxe2_flow.c
> create mode 100644 drivers/net/sxe2/sxe2_flow.h
> create mode 100644 drivers/net/sxe2/sxe2_flow_define.h
> create mode 100644 drivers/net/sxe2/sxe2_flow_parse_action.c
> create mode 100644 drivers/net/sxe2/sxe2_flow_parse_action.h
> create mode 100644 drivers/net/sxe2/sxe2_flow_parse_engine.c
> create mode 100644 drivers/net/sxe2/sxe2_flow_parse_engine.h
> create mode 100644 drivers/net/sxe2/sxe2_flow_parse_pattern.c
> create mode 100644 drivers/net/sxe2/sxe2_flow_parse_pattern.h
> create mode 100644 drivers/net/sxe2/sxe2_ipsec.c
> create mode 100644 drivers/net/sxe2/sxe2_ipsec.h
> create mode 100644 drivers/net/sxe2/sxe2_irq.c
> create mode 100644 drivers/net/sxe2/sxe2_mac.c
> create mode 100644 drivers/net/sxe2/sxe2_mac.h
> create mode 100644 drivers/net/sxe2/sxe2_mp.c
> create mode 100644 drivers/net/sxe2/sxe2_mp.h
> create mode 100644 drivers/net/sxe2/sxe2_rss.c
> create mode 100644 drivers/net/sxe2/sxe2_rss.h
> create mode 100644 drivers/net/sxe2/sxe2_security.c
> create mode 100644 drivers/net/sxe2/sxe2_security.h
> create mode 100644 drivers/net/sxe2/sxe2_stats.c
> create mode 100644 drivers/net/sxe2/sxe2_stats.h
> create mode 100644 drivers/net/sxe2/sxe2_switchdev.c
> create mode 100644 drivers/net/sxe2/sxe2_switchdev.h
> create mode 100644 drivers/net/sxe2/sxe2_testpmd.c
> create mode 100644 drivers/net/sxe2/sxe2_testpmd_lib.c
> create mode 100644 drivers/net/sxe2/sxe2_testpmd_lib.h
> create mode 100644 drivers/net/sxe2/sxe2_tm.c
> create mode 100644 drivers/net/sxe2/sxe2_tm.h
> create mode 100644 drivers/net/sxe2/sxe2_txrx_check_mbuf.c
> create mode 100644 drivers/net/sxe2/sxe2_txrx_check_mbuf.h
> create mode 100644 drivers/net/sxe2/sxe2_txrx_vec_avx2.c
> create mode 100644 drivers/net/sxe2/sxe2_txrx_vec_avx512.c
> create mode 100644 drivers/net/sxe2/sxe2_txrx_vec_neon.c
> create mode 100644 drivers/net/sxe2/sxe2vf_regs.h
>
I assume you meant v15 for this. The simple things first.
The code needs a rebase against main, and you need to run, at a minimum, the
full set of build tests: devtools/test-meson-builds.sh
I am also concerned that not all patches will bisect cleanly.
You should run a build at each step; yes, that means 20 builds.
The number of new devargs and testpmd functions concerns me.
Each new option adds another configuration to test and increases the
probability of bugs.
AI code review also found:
Deep-dive review of the latest sxe2 respin (01-20/20), including full
compilation of the assembled tree and per-commit bisect builds.
[09/20] drivers: interrupt handling
Error: patches 09/20 through 18/20 do not link. sxe2_irq.c (added by 09/20)
calls SXE2_DEV_TO_PCI() at lines 430 and 512, but that macro no longer exists:
upstream removed it from sxe2_ethdev.h in early June (it was present in the
May 29 tree, gone by June 8). Nothing in this series defines it. The series
only links from 19/20 onward, where the two calls are converted to
container_of(dev->device, struct rte_pci_device, device).
Verified empirically, not by patch inspection: I applied the series to a
June 8 upstream base (applies cleanly) and built at every one of the 20
commits. Commits 9-18 all fail with "undefined reference to SXE2_DEV_TO_PCI";
commits 19-20 link. Ten consecutive broken commits break git bisect.
This looks like rebase fallout from the upstream macro removal: the previous
revision removed the macro definition in 19/20 (correct when the base still
had it); after upstream deleted it, that hunk was dropped - but the
container_of conversion stayed in 19/20 instead of moving into 09/20 where
the calls are introduced. Fix: have 09/20 use RTE_DEV_TO_PCI(dev->device)
or container_of directly, and drop the conversion hunk from 19/20.
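For reference, a minimal self-contained sketch of the container_of idiom being suggested. struct device and struct pci_device are simplified stand-ins for the DPDK types (the real structs have many more fields), dev_to_pci is a hypothetical helper name, and the macro shown is a portable variant rather than DPDK's own definition.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the generic and bus-specific device structs. */
struct device {
	const char *name;
};

struct pci_device {
	int vendor_id;
	struct device device; /* generic device embedded in the PCI device */
};

/* Step back from a member pointer to the struct that contains it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical helper: recover the PCI device from its embedded generic device. */
static struct pci_device *
dev_to_pci(struct device *dev)
{
	assert(dev != NULL);
	return container_of(dev, struct pci_device, device);
}
```

The conversion is only valid when the generic device really is embedded in a PCI device, which is why it belongs next to the code that knows the bus type.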
Also note the series needs a trivial rebase: 11/20 conflicts with the June 10
sxe2_common.c refactor on current main.
* Re: [PATCH v1 19/20] drivers: add testpmd commands for private features
From: Stephen Hemminger @ 2026-06-10 17:22 UTC (permalink / raw)
To: liujie5; +Cc: dev
In-Reply-To: <20260610013936.3634968-20-liujie5@linkdatatechnology.com>
On Wed, 10 Jun 2026 09:39:35 +0800
liujie5@linkdatatechnology.com wrote:
> From: Jie Liu <liujie5@linkdatatechnology.com>
>
> Introduce private testpmd commands and implementation files to enable
> debugging and testing of sxe2-specific hardware features (such as
> packet scheduling reset, UDP tunnel configuration, and IPsec ingress/
> egress offloads) directly within the testpmd application.
>
> The parameters are parsed using the standard 'rte_kvargs' library during
> the PCI/vdev probing phase. Documentation for these parameters is also
> updated.
>
> During memory hotplug events, the SXE2 driver needs to track memory
> segment layout changes to maintain internal DMA mappings. However,
> existing memseg walk functions (rte_memseg_walk) acquire memory locks
> and cannot be called from within memory event callbacks, leading to
> potential deadlocks.
>
> This commit introduces sxe2_memseg_walk_cb() as a helper that walks
> memory segments using the thread-unsafe variant
> rte_memseg_walk_thread_unsafe(), which is safe to call from
> memory-related callbacks [citation:1][citation:3][citation:5].
>
> The implementation follows the standard rte_memseg_walk_t prototype,
> processing each memseg to update driver-specific data structures.
>
> Signed-off-by: Jie Liu <liujie5@linkdatatechnology.com>
> ---
This memory stuff looks problematic and needs more review.
At a minimum I see a pattern of not handling values from strtoul()
that are out of range.
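A minimal sketch of the kind of checking that is missing, assuming a value that must fit in 16 bits (a UDP port, for example). parse_u16 is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse an unsigned value (decimal, hex, or octal per base 0) that must
 * fit in 16 bits. Returns 0 on success, -EINVAL for malformed input,
 * and -ERANGE when the value does not fit.
 */
static int
parse_u16(const char *str, uint16_t *out)
{
	char *end = NULL;
	unsigned long val;

	assert(out != NULL);

	/* strtoul() silently negates "-1" into a huge value, so reject signs. */
	if (str == NULL || *str == '\0' || strchr(str, '-') != NULL)
		return -EINVAL;

	errno = 0;
	val = strtoul(str, &end, 0);

	/* No digits consumed, or trailing garbage after the number. */
	if (end == str || *end != '\0')
		return -EINVAL;

	/* Overflow of unsigned long, or simply too large for the target type. */
	if (errno == ERANGE || val > UINT16_MAX)
		return -ERANGE;

	*out = (uint16_t)val;
	return 0;
}
```

The output is written only on success, so callers can keep a default in place when parsing fails.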
I asked AI for a more detailed review, and it found the following:
[PATCH 19/20] drivers: add testpmd commands for private features
There is concern about the amount of driver-private testpmd plumbing and
devargs this patch adds. The raw command count (7) is within precedent
(i40e has 29, mlx5 13, ixgbe 11), but the mechanism and content are not.
Error: the command logic is placed in sxe2_testpmd_lib.c, compiled into the
driver library, and exposed through 14 new RTE_EXPORT_EXPERIMENTAL_SYMBOL
entries (sxe2_ipsec_egress_create, sxe2_ipsec_conf_set, sxe2_flow_rule_dump,
sxe2_udp_tunnel_operations, sxe2_stats_info_show, sxe2_testpmd_sched_reset,
etc). No upstream driver exports symbols for its testpmd commands; all six
existing drivers with testpmd integration compile their *_testpmd.c into
testpmd via testpmd_sources and use internal access. These exports are
vendor public API that any application can link against. The driver .so also
gains application state for the commands: g_tx_session[][], g_rx_session[][],
g_esp_header_offset[], g_sess_pool. SA-manager bookkeeping does not belong
in a PMD. Move the logic into sxe2_testpmd.c and drop all 14 exports; at
most RTE_EXPORT_INTERNAL_SYMBOL is appropriate here.
Error: three commands duplicate standard testpmd functionality the driver
already supports. "sxe2 flow rule dump" exists because the driver does not
implement the rte_flow dev_dump op; implement the op and the standard
"flow dump <port> all" works for every application. "sxe2 <port>
udp_tunnel_port add|rm" duplicates "port config <port> udp_tunnel_port
add|rm", which calls the udp_tunnel ops added in patch 12. "sxe2 show stats"
duplicates "show port xstats"; the driver already implements xstats, and
anything missing from xstats should be added there, not shown by a private
formatter.
Warning: the 9-subcommand ipsec suite (egress/ingress add/rm/show,
session-id and esp-hdr-offset set/get, flush, stats) is an SA management
application embedded in the driver. Inline crypto is exercised with
examples/ipsec-secgw, as done for other inline-crypto PMDs. If interactive
SA management in testpmd is needed, propose it as generic testpmd commands
over rte_security so all drivers benefit.
Warning: seven private devargs are added (flow-duplicate-pattern,
function-flow-direct, fnav-stat-type, drv-sw-stats, high-performance-mode,
sched-layer-mode, rx-low-latency) with no documentation: no Runtime
Configuration section in sxe2.rst and no RTE_PMD_REGISTER_PARAM_STRING, so
they are undiscoverable. Beyond documentation: flow-duplicate-pattern makes
rte_flow duplicate-rule semantics vary per boot option, which is not
acceptable for a standard API; fnav-stat-type and drv-sw-stats select stats
sources and belong in xstats; sched-layer-mode configures TM topology that
the rte_tm hierarchy built by the application should determine;
high-performance-mode accepts only the value 1 and is undocumented - if the
mode is safe make it the default, otherwise document the trade-off. Each
surviving devarg needs documentation and a rationale for why no standard
API covers it.