* Re: [PATCH v3 0/6] Make VA reservation limits configurable
From: Thomas Monjalon @ 2026-06-04 8:22 UTC (permalink / raw)
To: Anatoly Burakov; +Cc: dev, Bruce Richardson
In-Reply-To: <cover.1780071269.git.anatoly.burakov@intel.com>
> Anatoly Burakov (6):
> eal: reject non-numeric input in str to size
> eal/memory: remove per-list segment and memory limits
> eal/memory: allocate all VA space in one go
> eal/memory: get rid of global VA space limits
> eal/memory: store default segment limits in config
> eal/memory: add page size VA limits EAL parameter
Thank you for the simplification and the new runtime option.
Applied with release notes added, thanks.
* Re: [PATCH] net/mlx5: improve error when group table type mismatch
From: Raslan Darawsheh @ 2026-06-04 9:26 UTC (permalink / raw)
To: Maayan Kashani, dev
Cc: dsosnowski, Viacheslav Ovsiienko, Bing Zhao, Ori Kam,
Suanming Mou, Matan Azrad
In-Reply-To: <20260325215426.9519-1-mkashani@nvidia.com>
Hi,
On 25/03/2026 11:54 PM, Maayan Kashani wrote:
> HWS fixes a flow group's DR table type (FDB_UNIFIED,
> FDB_RX, FDB_TX, etc.) on first use.
> If a later table uses the same group with a different domain
> (e.g. transfer wire_orig) the driver returns a generic type mismatch.
>
> - Add mlx5dr_table_type_name() to map table type enum to readable strings
> (e.g. FDB_RX(wire_orig), FDB_UNIFIED).
>
> This makes it easier to fix ordering issues
> (e.g. create the wire_orig table before the table that jumps to the group).
>
> Signed-off-by: Maayan Kashani <mkashani@nvidia.com>
> Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
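
For readers following the thread, here is a minimal sketch of the kind of enum-to-string helper the commit message describes. The enum and function names below are hypothetical stand-ins (the real mlx5dr_table_type_name() and its enum values are not shown in this thread); only the printed names FDB_UNIFIED, FDB_RX(wire_orig) and FDB_TX come from the message.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum; the real driver's table type values may differ. */
enum tbl_type { TBL_FDB_UNIFIED, TBL_FDB_RX, TBL_FDB_TX, TBL_MAX };

/* Map a table type to a readable name for use in error messages. */
static const char *tbl_type_name(enum tbl_type type)
{
	switch (type) {
	case TBL_FDB_UNIFIED:
		return "FDB_UNIFIED";
	case TBL_FDB_RX:
		return "FDB_RX(wire_orig)";
	case TBL_FDB_TX:
		return "FDB_TX";
	default:
		return "UNKNOWN";
	}
}
```

An error path could then print both the type already fixed for the group and the type requested by the new table, which is what makes an ordering problem visible.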
* Re: [PATCH 1/1] net/mlx5: fix crash if malloc FCQS fails
From: Raslan Darawsheh @ 2026-06-04 9:27 UTC (permalink / raw)
To: Yunjian Wang, dev
Cc: matan, suanmingm, viacheslavo, dsosnowski, jerry.lilijun, stable
In-Reply-To: <1777027713-91888-1-git-send-email-wangyunjian@huawei.com>
Hi,
On 24/04/2026 1:48 PM, Yunjian Wang wrote:
> A crash is triggered when memory malloc for FCQS fails. Currently,
> the abnormal branch releases the txq, and it removes the 'txq_ctrl->obj'
> from the linked list, but the 'txq_ctrl->obj' has not actually been
> added to the linked list yet. This triggers a null pointer reference
> issue.
>
> The call stack is as follows:
> Program terminated with signal 11, Segmentation fault.
> 1210 LIST_REMOVE(txq_ctrl->obj, next);
> (gdb) bt
> #0 mlx5_txq_release
> #1 mlx5_txq_start
> #2 mlx5_dev_start
> #3 rte_eth_dev_start
> #4 member_start
> #5 bond_ethdev_start
>
> Fixes: f49f44839df3 ("net/mlx5: share Tx control code")
> Cc: stable@dpdk.org
>
> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
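
As an illustration of the failure mode described above (not the applied fix, which is not shown in this thread), the sketch below uses the sys/queue.h LIST macros. LIST_REMOVE() writes through the element's back pointer, so calling it on an element that was never inserted dereferences an invalid pointer. The struct, the "linked" flag and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for the driver's Tx queue object. */
struct txq_obj {
	LIST_ENTRY(txq_obj) next;
	bool linked; /* set only once the object is on the list */
};
LIST_HEAD(txq_obj_list, txq_obj);

static void txq_obj_link(struct txq_obj_list *list, struct txq_obj *obj)
{
	LIST_INSERT_HEAD(list, obj, next);
	obj->linked = true;
}

/* Safe to call on error paths that run before txq_obj_link(). */
static void txq_obj_unlink(struct txq_obj *obj)
{
	if (obj == NULL || !obj->linked)
		return;
	LIST_REMOVE(obj, next);
	obj->linked = false;
}
```

The point of the guard is that the release path can be shared between the normal and the abnormal branch without assuming the object reached the list.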
* Re: [PATCH v2] net/mlx5: query hardware capability for max lro size
From: Raslan Darawsheh @ 2026-06-04 9:28 UTC (permalink / raw)
To: Rayane Boussanni, dsosnowski
Cc: bingz, dev, matan, orika, suanmingm, viacheslavo
In-Reply-To: <20260424142732.1904-1-rboussanni@gmail.com>
Hi,
On 24/04/2026 5:27 PM, Rayane Boussanni wrote:
> Resolve a FIXME in mlx5_dev_infos_get() by dynamically checking the
> lro_allowed flag instead of unconditionally advertising
> MLX5_MAX_LRO_SIZE.
>
> Signed-off-by: Rayane Boussanni <rboussanni@gmail.com>
> ---
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
* Re: [PATCH v3] net/mlx5: add validation for indirect actions
From: Raslan Darawsheh @ 2026-06-04 9:32 UTC (permalink / raw)
To: Rayane Boussanni, dev; +Cc: dsosnowski
In-Reply-To: <20260514193359.195017-1-rboussanni@gmail.com>
Hi,
On 14/05/2026 10:33 PM, Rayane Boussanni wrote:
> This patch implements missing validation logic for RSS and Connection
> Tracking (ConnTrack) indirect actions in the Hardware Steering (HWS)
> flow engine.
>
> Previously, these actions were accepted without being validated
> against hardware capabilities, which could lead to unexpected behavior
> when applying flow rules. The specialist validation functions
> (mlx5_hw_validate_action_rss and mlx5_hw_validate_action_conntrack)
> already existed but were not wired up to the indirect action handler.
>
> The signature of flow_hw_validate_action_indirect was updated to
> include the actions template attributes (attr), allowing it to pass
> the necessary traffic direction context (ingress/egress/transfer)
> to the underlying validation specialists. For indirect RSS, only the
> template attributes are validated, as the RSS configuration itself is
> already validated when the indirect action handle is created.
>
> Reported-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
> Signed-off-by: Rayane Boussanni <rboussanni@gmail.com>
> ---
> v3:
> - Fix segfault reported by Dariusz Sosnowski when an actions template
> references an indirect RSS action. v2 called
> mlx5_hw_validate_action_rss() on the indirect path, which
> dereferences action->conf as struct rte_flow_action_rss. For indirect
> actions action->conf is an opaque action handle, not an RSS config.
> Add bool is_indirect to mlx5_hw_validate_action_rss() so the indirect
> path validates only the template attributes
> (ingress/egress/transfer).
>
> drivers/net/mlx5/mlx5_flow_hw.c | 36 ++++++++++++++++++++++++++++++---
> 1 file changed, 33 insertions(+), 3 deletions(-)
>
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
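
The v3 note above can be sketched as follows. This is a simplified model under stated assumptions, not the driver code: the types are hypothetical, and the rule that RSS is accepted only for ingress templates is an assumption made for the example. The part taken from the message is that on the indirect path conf is an opaque handle and must not be dereferenced as an RSS configuration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; not the rte_flow definitions. */
struct tmpl_attr { bool ingress, egress, transfer; };
struct rss_conf { uint32_t queue_num; };
struct action { const void *conf; };

static int validate_rss(const struct action *act,
			const struct tmpl_attr *attr, bool is_indirect)
{
	const struct rss_conf *rss;

	/* Assumption for this sketch: RSS is valid for ingress only. */
	if (!attr->ingress || attr->egress || attr->transfer)
		return -EINVAL;
	/* Indirect: conf is an opaque handle, validated at handle creation. */
	if (is_indirect)
		return 0;
	rss = act->conf;
	if (rss == NULL || rss->queue_num == 0)
		return -EINVAL;
	return 0;
}
```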
* Re: [PATCH] net/mlx5: redirect LACP traffic for legacy E-Switch
From: Raslan Darawsheh @ 2026-06-04 9:34 UTC (permalink / raw)
To: Dariusz Sosnowski, Viacheslav Ovsiienko, Bing Zhao, Ori Kam,
Suanming Mou, Matan Azrad
Cc: dev, stable
In-Reply-To: <20260515123700.354341-1-dsosnowski@nvidia.com>
Hi,
On 15/05/2026 3:37 PM, Dariusz Sosnowski wrote:
> Offending patch fixed the LACP miss rule logic for NICs where
> switchdev is enabled. In this case, LACP miss rules should be inserted
> if and only if started port is a main port on the embedded switch.
> Side effect of that change was that LACP miss rules are not inserted
> when switchdev is disabled and legacy SR-IOV switch mode is used.
>
> This patch addresses that:
>
> - Fix the LACP rule insertion condition.
> - Move HWS table for LACP rule creation out of FDB rules,
> so they can be created separately.
>
> Fixes: 87e4384d2662 ("net/mlx5: fix condition of LACP miss flow")
> Cc: stable@dpdk.org
>
> Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
* Re: [PATCH] net/mlx5: fix uninitialized skip count
From: Raslan Darawsheh @ 2026-06-04 9:35 UTC (permalink / raw)
To: Dariusz Sosnowski, Viacheslav Ovsiienko, Bing Zhao, Ori Kam,
Suanming Mou, Matan Azrad, Alexander Kozyrev
Cc: dev, Kiran Vedere, stable
In-Reply-To: <20260515123358.354191-1-dsosnowski@nvidia.com>
Hi,
On 15/05/2026 3:33 PM, Dariusz Sosnowski wrote:
> From: Kiran Vedere <kiranv@nvidia.com>
>
> mlx5_rx_poll_len() may return MLX5_ERROR_CQE_MASK when
> mlx5_rx_err_handle() reports MLX5_CQE_STATUS_HW_OWN while the Rx queue
> is in IGNORE error state. In this HW_OWN case mlx5_rx_err_handle()
> does not necessarily write to *skip_cnt, yet the caller (mlx5_rx_burst)
> unconditionally uses skip_cnt to advance rq_ci.
>
> This can cause rq_ci to jump by an undefined value, desynchronizing the
> RQ and CQ rings and leading to persistent bad packet delivery until the
> queue is reset.
>
> Fixes: aa67ed308458 ("net/mlx5: ignore non-critical syndromes for Rx queue")
> Cc: akozyrev@nvidia.com
> Cc: stable@dpdk.org
>
> Signed-off-by: Kiran Vedere <kiranv@nvidia.com>
> Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
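
The pattern described in the commit message is an out-parameter that one return path never writes while the caller consumes it unconditionally. A minimal sketch, with hypothetical names and status codes; initializing the variable is one common way to close this class of bug and may differ from what the applied patch does.

```c
#include <assert.h>
#include <stdint.h>

#define ERR_HW_OWN 1 /* hypothetical: entry still owned by hardware */
#define ERR_SKIP   2 /* hypothetical: some entries must be skipped */

/* Hypothetical handler: writes *skip_cnt only on the ERR_SKIP path. */
static int err_handle(int hw_own, uint16_t *skip_cnt)
{
	if (hw_own)
		return ERR_HW_OWN; /* *skip_cnt is left untouched */
	*skip_cnt = 3;
	return ERR_SKIP;
}

/* The caller advances its ring index by skip_cnt whatever the status. */
static uint32_t advance_rq_ci(uint32_t rq_ci, int hw_own)
{
	uint16_t skip_cnt = 0; /* without this, the HW_OWN path reads garbage */

	(void)err_handle(hw_own, &skip_cnt);
	return rq_ci + skip_cnt;
}
```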
* Re: [PATCH] net/mlx5: remove nonsensical flow action class_id checks
From: Raslan Darawsheh @ 2026-06-04 9:36 UTC (permalink / raw)
To: Adrian Schollmeyer, Dariusz Sosnowski, Viacheslav Ovsiienko,
Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad, Michael Baum
Cc: dev, Michael Pfeiffer, stable
In-Reply-To: <20260520132533.159996-1-a.schollmeyer@syseleven.de>
Hi,
On 20/05/2026 4:25 PM, Adrian Schollmeyer wrote:
> From: Michael Pfeiffer <m.pfeiffer@syseleven.de>
>
> For a MODIFY_FIELD action, flow_hw_validate_action_modify_field() is
> invoked and enforces class_id == 0 in the action's source and
> destination, if the modified field is none of
> RTE_FLOW_FIELD_GENEVE_OPT_*, as the value is used solely for GENEVE
> fields.
>
> However, this check is flawed due to the way rte_flow_field_data is
> initialized. As it consists of unions and anonymous structs as members,
> empty initialization of this struct or initializing just the tag_index
> only guarantees initialization of the first union member, while the
> remaining member's default initialization behavior is unspecified.
> Therefore, depending on the compiler type, version and configuration,
> the remaining members may either be default-initialized as well or
> contain bytes from uninitialized memory. This causes the check to fail
> depending on how the struct is initialized wherever it is used.
>
> For example, rte_flow_configure() sometimes fails on mlx5 under these
> circumstances with an error "destination class id is not supported"
> during creation of representor tagging rules, as these internally use
> MODIFY_FIELD actions in the following call stack:
>
> 1. rte_flow_configure
> 2. mlx5_flow_port_configure
> 3. flow_hw_configure
> 4. __flow_hw_configure
> 5. flow_hw_setup_tx_repr_tagging
> 6. flow_hw_create_tx_repr_tag_jump_acts_tmpl
> --> various rte_flow_action_modify_field are initialized here, but
> class_id remains uninitialized
> 7. __flow_hw_actions_template_create
> 8. mlx5_flow_hw_actions_validate
> 9. flow_hw_validate_action_modify_field
> --> invoked with class_id containing uninitialized bytes and
> non-GENEVE field type
>
> Remove the two checks for class_id in the non-GENEVE case, as this field
> is unused for these actions and avoids additional implicit dependencies
> on the correct ordering of union members.
>
> Fixes: 1caa89ec1891 ("net/mlx5: support GENEVE options modification")
> Cc: stable@dpdk.org
>
> Signed-off-by: Michael Pfeiffer <m.pfeiffer@syseleven.de>
> Signed-off-by: Adrian Schollmeyer <a.schollmeyer@syseleven.de>
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
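
To make the problem above concrete, here is a small model of it. The struct layout, enum and function names are hypothetical and only loosely follow the idea of a union-based field descriptor; the memset() with 0xAA stands in for whatever bytes happen to sit in the members that the initializer did not cover.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

enum field_id { FIELD_TAG, FIELD_GENEVE_OPT_TYPE }; /* hypothetical */

/* Hypothetical layout: two anonymous structs sharing storage. */
struct field_data {
	enum field_id field;
	union {
		struct { uint8_t level; uint8_t tag_index; };
		struct { uint8_t type; uint16_t class_id; };
	};
};

/* Old behaviour: rejects a non-zero class_id even for non-GENEVE fields. */
static int validate_old(const struct field_data *d)
{
	if (d->field != FIELD_GENEVE_OPT_TYPE && d->class_id != 0)
		return -EINVAL;
	return 0;
}

/* New behaviour: class_id is never read when the field does not use it. */
static int validate_new(const struct field_data *d)
{
	(void)d;
	return 0;
}
```

With a TAG field whose class_id bytes were never written, the old check fails on leftover data while the new one does not depend on it.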
* Re: [PATCH v2] net/mlx5: use port index as representor index
From: Raslan Darawsheh @ 2026-06-04 9:36 UTC (permalink / raw)
To: Dariusz Sosnowski, Viacheslav Ovsiienko, Bing Zhao, Ori Kam,
Suanming Mou, Matan Azrad
Cc: dev, stable
In-Reply-To: <20260525091733.558838-1-dsosnowski@nvidia.com>
Hi,
On 25/05/2026 12:17 PM, Dariusz Sosnowski wrote:
> Since the offending commit, mlx5 driver supports probing
> representors on BlueField DPUs with Socket Direct (SD).
> Such card can be connected to 2 different CPUs on the host system.
> On DPU, user would see the following network devices:
>
> - p0 and p1 - physical ports
> - pf0hpf and pf2hpf - PF0 on CPU 0 and CPU 1 respectively
> - pf1hpf and pf3hpf - PF1 on CPU 0 and CPU 1 respectively
>
> mlx5 driver finds the relevant netdev by matching information
> provided in representor devarg to phys_port_name
> reported by Linux kernel.
> For the above interfaces phys_port_name's would be reported
> and probed as:
>
> - p0 -> p0, no need for representor devarg
> - p1 -> p1, with representor=pf1
> - pf0hpf -> c1pf0, with representor=c1pf0vf65535
> - pf1hpf -> c1pf1, with representor=c1pf1vf65535
> - pf2hpf -> c2pf0, with representor=c2pf0vf65535
> - pf3hpf -> c2pf1, with representor=c2pf1vf65535
>
> Although hot-plugging all these representors is successful,
> RTE_ETH_FOREACH_MATCHING_DEV() macro would not find DPDK ports.
> This is caused by missing information reported by mlx5 driver,
> through rte_eth_representor_info_get() API.
> Specifically, mlx5 driver did not report controller index for all
> representor ranges.
>
> Until now mlx5 driver used static encoding for 16-bit representor_id:
>
> - 2 bits for representor type
> - 2 bits for PF index
> - 12 bits for representor index (either VF or SF number)
>
> Controller index was not encoded. This caused the mentioned issue
> and on top of that:
>
> - limits the number of PFs
> - limits the number of SFs
>
> This patch changes the mlx5 driver logic for
> rte_eth_representor_info_get().
> Instead of static encoding:
>
> - representor_id's will be dynamically assigned
> to each probed representor.
> - rte_eth_representor_info_get() will report N ranges:
> - N == number of probed ports on single embedded switch
> - Each range will define single representor_id
> for given controller/PF/VF/SF.
>
> Fixes: 2f7cdd821b1b ("net/mlx5: fix probing to allow BlueField Socket Direct")
> Cc: stable@dpdk.org
>
> Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
> Acked-by: Bing Zhao <bingz@nvidia.com>
> ---
> v2:
> - Added missing "not" in "RTE_ETH_FOREACH_MATCHING_DEV() macro would not find DPDK ports"
> in the commit message.
> - Fixed typo in number of bits for representor index.
> Should be 12, not 2.
>
> drivers/net/mlx5/linux/mlx5_os.c | 6 +-
> drivers/net/mlx5/mlx5.h | 19 +++
> drivers/net/mlx5/mlx5_ethdev.c | 284 +++++++++++++++++++------------
> 3 files changed, 199 insertions(+), 110 deletions(-)
>
--
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
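
The static 16-bit encoding being replaced can be sketched as below. The field widths (2 bits type, 2 bits PF, 12 bits index) are taken from the commit message; the bit positions and function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: [15:14] type, [13:12] PF index, [11:0] VF/SF index. */
static uint16_t repr_id_encode(unsigned int type, unsigned int pf,
			       unsigned int idx)
{
	return (uint16_t)(((type & 0x3u) << 14) | ((pf & 0x3u) << 12) |
			  (idx & 0xFFFu));
}

static unsigned int repr_id_type(uint16_t id)
{
	return (unsigned int)id >> 14;
}

static unsigned int repr_id_pf(uint16_t id)
{
	return ((unsigned int)id >> 12) & 0x3u;
}

static unsigned int repr_id_index(uint16_t id)
{
	return (unsigned int)id & 0xFFFu;
}
```

Nothing in the 16 bits is left for a controller index, so c1pf0 and c2pf0 cannot be told apart, and any index above 4095 is truncated. That is the limitation the dynamic assignment removes.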
* Re: [PATCH v5] net/mlx5: prepend implicit items in sync flow creation path
From: Raslan Darawsheh @ 2026-06-04 9:37 UTC (permalink / raw)
To: Maxime Peim, dev; +Cc: dsosnowski, viacheslavo, bingz, orika, suanmingm, matan
In-Reply-To: <20260527103531.1266488-1-maxime.peim@gmail.com>
Hi,
On 27/05/2026 1:35 PM, Maxime Peim wrote:
> In eSwitch mode, the async (template) flow creation path automatically
> prepends implicit match items to scope flow rules to the correct
> representor port:
> - Ingress: REPRESENTED_PORT item matching dev->data->port_id
> - Egress: REG_C_0 TAG item matching the port's tx tag value
>
> The sync path (flow_hw_list_create) was missing this logic, causing all
> flow rules created via the non-template API to match traffic from all
> ports rather than being scoped to the specific representor.
>
> Add the same implicit item prepending to flow_hw_list_create, right
> after pattern validation and before any branching (sample/RSS/single/
> prefix), mirroring the behavior of flow_hw_pattern_template_create
> and flow_hw_get_rule_items. The ingress case prepends
> REPRESENTED_PORT with the current port_id; the egress case prepends
> MLX5_RTE_FLOW_ITEM_TYPE_TAG with REG_C_0 value/mask (skipped when
> user provides an explicit SQ item).
>
> Also fix a pre-existing bug where 'return split' on metadata split
> failure returned a negative int cast to uintptr_t, which callers
> would treat as a valid flow handle instead of an error.
>
> Fixes: e38776c36c8a ("net/mlx5: introduce HWS for non-template flow API")
> Fixes: 821a6a5cc495 ("net/mlx5: add metadata split for compatibility")
> Signed-off-by: Maxime Peim <maxime.peim@gmail.com>
> ---
> v3:
> - Factor the implicit-item prepend logic out of
> flow_hw_pattern_template_create() into a new helper
> flow_hw_adjust_pattern() and reuse it from flow_hw_list_create(),
> instead of duplicating the prepend logic inline in the sync path.
> - Zero-initialize item_flags in both callers. The validator is
> read-modify-write on item_flags (reads MLX5_FLOW_LAYER_TUNNEL on
> the first iteration), so leaving it uninitialized was UB.
> - Call __flow_hw_pattern_validate() with nt_flow=true from the sync
> path (was effectively nt_flow=false via the wrapper), restoring the
> previous behavior that skips GENEVE_OPT TLV parser validation on
> the non-template path.
> - Document flow_hw_adjust_pattern(): the dual role of the nt_flow
> parameter (template spec-left-zero vs. sync spec-filled + validator
> flag), the three-way return, and the caller's ownership of
> *copied_items across every exit path.
> - Clarify the "omitting implicit REG_C_0 match" debug log now that
> the helper runs on both the template and sync paths.
> - Add Fixes: tags for the two original commits.
>
> v4:
> - Fix items in case split metadata is not needed.
>
> v5:
> - Make flow_hw_prepend_item() return a self-contained array. The
> helper used to shallow-copy the prepended item, leaving its
> .spec/.mask pointing at flow_hw_adjust_pattern()'s stack locals
> (port_spec, tag_v, tag_m); once that frame returned, the
> consumers in flow_hw_list_create() (sample / RSS / single create)
> and the post-extraction template path dereferenced dangling
> pointers. The prepended item is now deep-copied via
> rte_flow_conv(RTE_FLOW_CONV_OP_ITEM, ...) into the tail of the
> same mlx5_malloc() block, so the lifetime of every byte the
> consumer can reach equals the lifetime of the returned array.
> items[] continue to be shallow-copied (their spec/mask blobs are
> application-owned and outlive the call). One alloc, one free; no
> call-site or signature changes.
> - Fix the &item_flags / &orig_item_nb argument order at both
> flow_hw_adjust_pattern() call sites (introduced in v3 by the
> helper extraction): the prior order silently stored the item
> count into it->item_flags / the layer-flag arguments forwarded
> into mlx5_nta_sample_flow_list_create / mlx5_flow_nta_handle_rss /
> mlx5_flow_hw_create_flow, and stored the OR-accumulated layer
> flags into it->orig_item_nb / per-rule item-count uses.
>
> drivers/net/mlx5/mlx5_flow_hw.c | 262 +++++++++++++++++++++++---------
> 1 file changed, 194 insertions(+), 68 deletions(-)
>
--
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
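
The 'return split' bug mentioned in the commit message is easy to reproduce in isolation. In the sketch below the function names and the placeholder handle value are hypothetical, and the convention that a zero handle means failure is an assumption for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Buggy variant: a negative errno value is converted to uintptr_t. */
static uintptr_t create_buggy(int split_ret)
{
	if (split_ret < 0)
		return (uintptr_t)split_ret; /* huge non-zero value */
	return (uintptr_t)0x1000; /* placeholder handle */
}

/* Fixed variant: failure is reported as the zero handle. */
static uintptr_t create_fixed(int split_ret)
{
	if (split_ret < 0)
		return 0;
	return (uintptr_t)0x1000; /* placeholder handle */
}
```

A caller that only tests the handle against zero accepts the result of the buggy variant as a valid flow.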
* Re: [PATCH v2] net/mlx5: promote private API to stable
From: Raslan Darawsheh @ 2026-06-04 9:38 UTC (permalink / raw)
To: Dariusz Sosnowski, Viacheslav Ovsiienko, Bing Zhao, Ori Kam,
Suanming Mou, Matan Azrad
Cc: dev
In-Reply-To: <20260529073255.689808-1-dsosnowski@nvidia.com>
Hi,
On 29/05/2026 10:32 AM, Dariusz Sosnowski wrote:
> Following experimental functions are exposed by mlx5 PMD
> since 25.11 release:
>
> - rte_pmd_mlx5_driver_event_cb_register
> - rte_pmd_mlx5_driver_event_cb_unregister
> - rte_pmd_mlx5_enable_steering
> - rte_pmd_mlx5_disable_steering
>
> First two are used to register callbacks for driver events
> (when Rx/Tx queues are created or destroyed).
> Other two are used to enable/disable flow steering in mlx5 PMD.
> No changes were made and no changes are planned to these symbols.
>
> These are currently used by NVIDIA DOCA SDK since version 3.3,
> which started depending on upstream DPDK releases [1].
> Purpose of their use is to:
>
> - expose HW identifiers of Rx/Tx mlx5 queues managed by DPDK and
> - allow flow steering to happen outside of DPDK.
>
> Also, some of these symbols will be used by netdev-doca backend
> in Open vSwitch [2].
> Whenever a DOCA netdev would be added/removed in Open vSwitch,
> it will have to disable/enable steering for mlx5 driver.
> Stabilizing these symbols is required by current OVS policy
> to remove the need for ALLOW_EXPERIMENTAL_API [3].
>
> This patch promotes aforementioned symbols to stable.
>
> [1]: https://docs.nvidia.com/doca/sdk/customer-affecting-changes/index.html
> [2]: https://patchwork.ozlabs.org/project/openvswitch/list/?series=504726&state=%2A&archive=both
> [3]: https://mail.openvswitch.org/pipermail/ovs-dev/2026-May/432066.html
>
> Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
> ---
> v2:
> - Updated 26.07 release notes.
>
> doc/guides/rel_notes/release_26_07.rst | 9 +++++++++
> drivers/net/mlx5/mlx5_driver_event.c | 4 ++--
> drivers/net/mlx5/mlx5_flow.c | 4 ++--
> drivers/net/mlx5/rte_pmd_mlx5.h | 4 ----
> 4 files changed, 13 insertions(+), 8 deletions(-)
>
--
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
* Re: [PATCH v13 1/5] vhost: add user to mailmap and define to vhost hdr
From: Maxime Coquelin @ 2026-06-04 9:56 UTC (permalink / raw)
To: Bathija, Pravin
Cc: fengchengwen, dev@dpdk.org, stephen@networkplumber.org,
thomas@monjalon.net
In-Reply-To: <DM6PR19MB356483DFA0304202FA06F344F5012@DM6PR19MB3564.namprd19.prod.outlook.com>
On Wed, May 20, 2026 at 4:21 AM Bathija, Pravin <Pravin.Bathija@dell.com>
wrote:
> > -----Original Message-----
> > From: fengchengwen <fengchengwen@huawei.com>
> > Sent: Thursday, May 14, 2026 5:21 PM
> > To: Bathija, Pravin <Pravin.Bathija@dell.com>; dev@dpdk.org;
> > stephen@networkplumber.org; maxime.coquelin@redhat.com
> > Cc: thomas@monjalon.net
> > Subject: Re: [PATCH v13 1/5] vhost: add user to mailmap and define to vhost
> > hdr
> >
> >
> > On 5/15/2026 6:46 AM, pravin.bathija@dell.com wrote:
> > > From: Pravin M Bathija <pravin.bathija@dell.com>
> > >
> > > - add user to mailmap file.
> > > - define a bit-field called VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
> > > that depicts if the feature/capability to add/remove memory regions
> > > is supported. This is a part of the overall support for add/remove
> > > memory region feature in this patchset.
> > >
> > > Signed-off-by: Pravin M Bathija <pravin.bathija@dell.com>
> >
> > Please attach reviewed/acked-by of this commit in new version
>
> done
>
I haven't added Acked-by or Reviewed-by tags to all the patches, so why did
you add them in my name on every patch?
I haven't checked whether this was the case for the other people.
* Re: [PATCH v14 1/5] vhost: add user to mailmap and define to vhost hdr
From: Maxime Coquelin @ 2026-06-04 9:58 UTC (permalink / raw)
To: pravin.bathija; +Cc: dev, fengchengwen, stephen, thomas, Stephen Hemminger
In-Reply-To: <20260520022012.243619-2-pravin.bathija@dell.com>
On Wed, May 20, 2026 at 4:20 AM <pravin.bathija@dell.com> wrote:
> From: Pravin M Bathija <pravin.bathija@dell.com>
>
> - add user to mailmap file.
> - define a bit-field called VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
> that depicts if the feature/capability to add/remove memory regions
> is supported. This is a part of the overall support for add/remove
> memory region feature in this patchset.
>
> Signed-off-by: Pravin M Bathija <pravin.bathija@dell.com>
> Acked-by: Fengchengwen <fengchengwen@huawei.com>
> Reviewed-by: Stephen Hemminger <stephen@networkplumber.com>
> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>
It was a:
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
Not an ack.
> ---
> .mailmap | 1 +
> lib/vhost/rte_vhost.h | 4 ++++
> 2 files changed, 5 insertions(+)
>
> diff --git a/.mailmap b/.mailmap
> index 0e0d83e1c6..cc44e27036 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -1295,6 +1295,7 @@ Prateek Agarwal <prateekag@cse.iitb.ac.in>
> Prathisna Padmasanan <prathisna.padmasanan@intel.com>
> Praveen Kaligineedi <pkaligineedi@google.com>
> Praveen Shetty <praveen.shetty@intel.com>
> +Pravin M Bathija <pravin.bathija@dell.com>
> Pravin Pathak <pravin.pathak.dev@gmail.com> <pravin.pathak@intel.com>
> Prince Takkar <ptakkar@marvell.com>
> Priyalee Kushwaha <priyalee.kushwaha@intel.com>
> diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
> index 2f7c4c0080..a7f9700538 100644
> --- a/lib/vhost/rte_vhost.h
> +++ b/lib/vhost/rte_vhost.h
> @@ -109,6 +109,10 @@ extern "C" {
> #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
> #endif
>
> +#ifndef VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS
> +#define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15
> +#endif
> +
> #ifndef VHOST_USER_PROTOCOL_F_STATUS
> #define VHOST_USER_PROTOCOL_F_STATUS 16
> #endif
> --
> 2.43.0
* Re: [PATCH v14 2/5] vhost: header defines for add/rem mem region
From: Maxime Coquelin @ 2026-06-04 9:59 UTC (permalink / raw)
To: pravin.bathija; +Cc: dev, fengchengwen, stephen, thomas, Stephen Hemminger
In-Reply-To: <20260520022012.243619-3-pravin.bathija@dell.com>
On Wed, May 20, 2026 at 4:20 AM <pravin.bathija@dell.com> wrote:
> From: Pravin M Bathija <pravin.bathija@dell.com>
>
> The changes in this file cover the enum message requests for
> supporting add/remove memory regions. The front-end vhost-user
> client sends messages like get max memory slots, add memory region
> and remove memory region which are covered in these changes which
> are on the vhost-user back-end. The changes also include data structure
> definition of memory region to be added/removed. The data structure
> VhostUserMsg has been changed to include the memory region.
>
> Signed-off-by: Pravin M Bathija <pravin.bathija@dell.com>
> Acked-by: Fengchengwen <fengchengwen@huawei.com>
> Reviewed-by: Stephen Hemminger <stephen@networkplumber.com>
> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>
Same here:
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
> lib/vhost/vhost_user.h | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/lib/vhost/vhost_user.h b/lib/vhost/vhost_user.h
> index ef486545ba..6435816534 100644
> --- a/lib/vhost/vhost_user.h
> +++ b/lib/vhost/vhost_user.h
> @@ -67,6 +67,9 @@ typedef enum VhostUserRequest {
> VHOST_USER_POSTCOPY_END = 30,
> VHOST_USER_GET_INFLIGHT_FD = 31,
> VHOST_USER_SET_INFLIGHT_FD = 32,
> + VHOST_USER_GET_MAX_MEM_SLOTS = 36,
> + VHOST_USER_ADD_MEM_REG = 37,
> + VHOST_USER_REM_MEM_REG = 38,
> VHOST_USER_SET_STATUS = 39,
> VHOST_USER_GET_STATUS = 40,
> } VhostUserRequest;
> @@ -91,6 +94,11 @@ typedef struct VhostUserMemory {
> VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS];
> } VhostUserMemory;
>
> +typedef struct VhostUserMemRegMsg {
> + uint64_t padding;
> + VhostUserMemoryRegion region;
> +} VhostUserMemRegMsg;
> +
> typedef struct VhostUserLog {
> uint64_t mmap_size;
> uint64_t mmap_offset;
> @@ -186,6 +194,7 @@ typedef struct __rte_packed_begin VhostUserMsg {
> struct vhost_vring_state state;
> struct vhost_vring_addr addr;
> VhostUserMemory memory;
> + VhostUserMemRegMsg memreg;
> VhostUserLog log;
> struct vhost_iotlb_msg iotlb;
> VhostUserCryptoSessionParam crypto_session;
> --
> 2.43.0
* Re: [PATCH v14 3/5] vhost: refactor memory helper functions
From: Maxime Coquelin @ 2026-06-04 10:08 UTC (permalink / raw)
To: pravin.bathija; +Cc: dev, fengchengwen, stephen, thomas, Stephen Hemminger
In-Reply-To: <20260520022012.243619-4-pravin.bathija@dell.com>
On Wed, May 20, 2026 at 4:20 AM <pravin.bathija@dell.com> wrote:
> From: Pravin M Bathija <pravin.bathija@dell.com>
>
> - Extract reusable helper routines for vhost-user backend memory
> operations.
> - Split DMA map/unmap into per-region logic.
> - Decouple and rework memory region free routines.
> - Iterate over VHOST_MEMORY_MAX_NREGIONS uniformly
> across related functions to simplify code reuse
>
> Signed-off-by: Pravin M Bathija <pravin.bathija@dell.com>
> Acked-by: Fengchengwen <fengchengwen@huawei.com>
> Reviewed-by: Stephen Hemminger <stephen@networkplumber.com>
> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
>
I did not Ack this one AFAICT.
> ---
> lib/vhost/vhost_user.c | 172 ++++++++++++++++++++++++++---------------
> 1 file changed, 110 insertions(+), 62 deletions(-)
>
>
>
The change looks good though:
Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
* Re: [PATCH v14 4/5] vhost: add mem region add/remove handlers
From: Maxime Coquelin @ 2026-06-04 10:26 UTC (permalink / raw)
To: pravin.bathija; +Cc: dev, fengchengwen, stephen, thomas, Stephen Hemminger
In-Reply-To: <20260520022012.243619-5-pravin.bathija@dell.com>
On Wed, May 20, 2026 at 4:20 AM <pravin.bathija@dell.com> wrote:
> From: Pravin M Bathija <pravin.bathija@dell.com>
>
> Add support for VHOST_USER_ADD_MEM_REG, VHOST_USER_REM_MEM_REG and
> VHOST_USER_GET_MAX_MEM_SLOTS. Refactor memory initialization into
> common helper and add supporting functions for dynamic memory management.
>
> Signed-off-by: Pravin M Bathija <pravin.bathija@dell.com>
> Acked-by: Fengchengwen <fengchengwen@huawei.com>
> Reviewed-by: Stephen Hemminger <stephen@networkplumber.com>
> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
> lib/vhost/vhost_user.c | 255 +++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 255 insertions(+)
>
> diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
> index 94fca8b589..522ba1db82 100644
> --- a/lib/vhost/vhost_user.c
> +++ b/lib/vhost/vhost_user.c
> @@ -71,6 +71,9 @@ VHOST_MESSAGE_HANDLER(VHOST_USER_SET_FEATURES, vhost_user_set_features, false, t
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_OWNER, vhost_user_set_owner, false, true) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_RESET_OWNER, vhost_user_reset_owner, false, false) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_MEM_TABLE, vhost_user_set_mem_table, true, true) \
> +VHOST_MESSAGE_HANDLER(VHOST_USER_GET_MAX_MEM_SLOTS, vhost_user_get_max_mem_slots, false, false) \
> +VHOST_MESSAGE_HANDLER(VHOST_USER_ADD_MEM_REG, vhost_user_add_mem_reg, true, true) \
> +VHOST_MESSAGE_HANDLER(VHOST_USER_REM_MEM_REG, vhost_user_rem_mem_reg, true, true) \
>
The removal request does not expect FDs in ancillary data.
It should be:
VHOST_MESSAGE_HANDLER(VHOST_USER_REM_MEM_REG, vhost_user_rem_mem_reg, false, true)
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_LOG_BASE, vhost_user_set_log_base, true, true) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_LOG_FD, vhost_user_set_log_fd, true, true) \
> VHOST_MESSAGE_HANDLER(VHOST_USER_SET_VRING_NUM, vhost_user_set_vring_num, false, true) \
> @@ -1167,6 +1170,24 @@ add_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg,
> return 0;
> }
>
> +static void
> +remove_guest_pages(struct virtio_net *dev, struct rte_vhost_mem_region *reg)
> +{
> + uint64_t reg_start = reg->host_user_addr;
> + uint64_t reg_end = reg_start + reg->size;
> + uint32_t i, j = 0;
> +
> + for (i = 0; i < dev->nr_guest_pages; i++) {
> + if (dev->guest_pages[i].host_user_addr >= reg_start &&
> + dev->guest_pages[i].host_user_addr < reg_end)
> + continue;
> + if (j != i)
> + dev->guest_pages[j] = dev->guest_pages[i];
> + j++;
> + }
> + dev->nr_guest_pages = j;
> +}
> +
> #ifdef RTE_LIBRTE_VHOST_DEBUG
> /* TODO: enable it only in debug mode? */
> static void
> @@ -1591,6 +1612,240 @@ vhost_user_set_mem_table(struct virtio_net **pdev,
> return RTE_VHOST_MSG_RESULT_ERR;
> }
>
> +
> +static int
> +vhost_user_get_max_mem_slots(struct virtio_net **pdev __rte_unused,
> + struct vhu_msg_context *ctx,
> + int main_fd __rte_unused)
> +{
> + uint32_t max_mem_slots = VHOST_MEMORY_MAX_NREGIONS;
> +
> + ctx->msg.payload.u64 = max_mem_slots;
> + ctx->msg.size = sizeof(ctx->msg.payload.u64);
> + ctx->fd_num = 0;
> +
> + return RTE_VHOST_MSG_RESULT_REPLY;
> +}
> +
> +/*
> + * Invalidate and re-translate all vring addresses after the memory table
> + * has been modified (add/remove region).
> + *
> + * translate_ring_addresses() may call numa_realloc(), which can reallocate
> + * the device structure. The updated pointer is written back through *pdev
> + * so callers must refresh their local "dev" afterwards: dev = *pdev.
> + */
> +static void
> +vhost_user_invalidate_vrings(struct virtio_net **pdev)
> +{
> + struct virtio_net *dev = *pdev;
> + uint32_t i;
> +
> + for (i = 0; i < dev->nr_vring; i++) {
> + struct vhost_virtqueue *vq = dev->virtqueue[i];
> +
> + if (!vq)
> + continue;
> +
> + if (vq->desc || vq->avail || vq->used) {
> + vq_assert_lock(dev, vq);
> +
> + vring_invalidate(dev, vq);
> +
> + translate_ring_addresses(&dev, &vq);
> + }
> + }
> +
> + *pdev = dev;
> +}
> +
> +/*
> + * Macro wrapper that performs the compile-time lock assertion with the
> + * correct message ID at the call site, then calls the implementation.
> + */
> +#define dev_invalidate_vrings(pdev, id) do { \
> + static_assert(id ## _LOCK_ALL_QPS, \
> + #id " handler is not declared as locking all queue
> pairs"); \
> + vhost_user_invalidate_vrings(pdev); \
> +} while (0)
> +
> +static int
> +vhost_user_add_mem_reg(struct virtio_net **pdev,
> + struct vhu_msg_context *ctx,
> + int main_fd __rte_unused)
> +{
> + struct VhostUserMemoryRegion *region =
> &ctx->msg.payload.memreg.region;
> + struct virtio_net *dev = *pdev;
> + uint32_t i;
> +
> + /* convert first region add to normal memory table set */
> + if (dev->mem == NULL) {
> + if (vhost_user_initialize_memory(pdev) < 0)
> + goto close_msg_fds;
> + }
> +
> + /* make sure new region will fit */
> + if (dev->mem->nregions >= VHOST_MEMORY_MAX_NREGIONS) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "too many memory
> regions already (%u)",
> +
> dev->mem->nregions);
> + goto close_msg_fds;
> + }
> +
> + /* make sure supplied memory fd present */
> + if (ctx->fd_num != 1) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "fd count makes no
> sense (%u)", ctx->fd_num);
> + goto close_msg_fds;
> + }
> +
> + /* Make sure no overlap in guest virtual address space */
> + for (i = 0; i < dev->mem->nregions; i++) {
> + struct rte_vhost_mem_region *cur = &dev->mem->regions[i];
> + uint64_t cur_start = cur->guest_user_addr;
> + uint64_t cur_end = cur_start + cur->size - 1;
> + uint64_t new_start = region->userspace_addr;
> + uint64_t new_end = new_start + region->memory_size - 1;
> +
> + if (new_end >= cur_start && new_start <= cur_end) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "requested memory region overlaps with
> another region");
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "\tRequested region address:0x%" PRIx64,
> + region->userspace_addr);
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "\tRequested region size:0x%" PRIx64,
> + region->memory_size);
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "\tOverlapping region address:0x%" PRIx64,
> + cur->guest_user_addr);
> + VHOST_CONFIG_LOG(dev->ifname, ERR,
> + "\tOverlapping region size:0x%" PRIx64,
> + cur->size);
> + goto close_msg_fds;
> + }
> + }
> +
> + /* New region goes at the end of the contiguous array */
> + struct rte_vhost_mem_region *reg =
> &dev->mem->regions[dev->mem->nregions];
> +
> + reg->guest_phys_addr = region->guest_phys_addr;
> + reg->guest_user_addr = region->userspace_addr;
> + reg->size = region->memory_size;
> + reg->fd = ctx->fds[0];
> + ctx->fds[0] = -1;
> +
> + if (vhost_user_mmap_region(dev, reg, region->mmap_offset) < 0) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "failed to mmap
> region");
> + if (reg->mmap_addr) {
> + /* mmap succeeded but a later step (e.g.
> add_guest_pages)
> + * failed; undo the mapping and any guest-page
> entries.
> + */
> + remove_guest_pages(dev, reg);
> + free_mem_region(reg);
> + } else {
> + close(reg->fd);
> + reg->fd = -1;
> + }
> + goto close_msg_fds;
> + }
> +
> + dev->mem->nregions++;
> +
> + if (dev->async_copy && rte_vfio_is_enabled("vfio")) {
> + if (async_dma_map_region(dev, reg, true) < 0)
> + goto free_new_region_no_dma;
> + }
> +
> + if (dev->postcopy_listening) {
> + /*
> + * Cannot use vhost_user_postcopy_register() here because
> it
> + * reads ctx->msg.payload.memory (SET_MEM_TABLE layout),
> but
> + * ADD_MEM_REG uses the memreg payload. Register the
> + * single new region directly instead.
> + */
> + if (vhost_user_postcopy_region_register(dev, reg) < 0)
> + goto free_new_region;
> + }
> +
> + dev_invalidate_vrings(pdev, VHOST_USER_ADD_MEM_REG);
> + dev = *pdev;
> + dump_guest_pages(dev);
> +
> + /* Reply with the back-end's mapping address per vhost-user spec */
> + ctx->msg.payload.memreg.region.userspace_addr =
> reg->host_user_addr;
> + ctx->msg.size = sizeof(ctx->msg.payload.memreg);
> + ctx->fd_num = 0;
> +
> + return RTE_VHOST_MSG_RESULT_REPLY;
> +
> +free_new_region:
> + if (dev->async_copy && rte_vfio_is_enabled("vfio"))
> + async_dma_map_region(dev, reg, false);
> +free_new_region_no_dma:
> + remove_guest_pages(dev, reg);
> + free_mem_region(reg);
> + dev->mem->nregions--;
> +close_msg_fds:
> + close_msg_fds(ctx);
> + return RTE_VHOST_MSG_RESULT_ERR;
> +}
> +
> +static int
> +vhost_user_rem_mem_reg(struct virtio_net **pdev,
> + struct vhu_msg_context *ctx,
> + int main_fd __rte_unused)
> +{
> + struct VhostUserMemoryRegion *region =
> &ctx->msg.payload.memreg.region;
> + struct virtio_net *dev = *pdev;
> + uint32_t i;
> +
> + if (dev->mem == NULL || dev->mem->nregions == 0) {
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "no memory regions to
> remove");
> + close_msg_fds(ctx);
>
Not needed if properly declared.
> + return RTE_VHOST_MSG_RESULT_ERR;
> + }
> +
> + if (validate_msg_fds(dev, ctx, 0) != 0)
> + return RTE_VHOST_MSG_RESULT_ERR;
>
With proper declaration, we can remove this check, as it is done in a
generic way.
> +
> + for (i = 0; i < dev->mem->nregions; i++) {
> + struct rte_vhost_mem_region *current_region =
> &dev->mem->regions[i];
> +
> + /*
> + * According to the vhost-user specification:
> + * The memory region to be removed is identified by its
> GPA,
> + * user address and size. The mmap offset is ignored.
> + */
> + if (region->userspace_addr ==
> current_region->guest_user_addr
> + && region->guest_phys_addr ==
> current_region->guest_phys_addr
> + && region->memory_size == current_region->size) {
> + if (dev->async_copy && rte_vfio_is_enabled("vfio"))
> + async_dma_map_region(dev, current_region,
> false);
> + remove_guest_pages(dev, current_region);
>
You are missing the step to clear the IOTLB cache entries matching with
this removed region.
In vhost_user_set_mem_table(), a vhost_user_iotlb_flush_all() call is made,
but that would be kind of a nuclear option for memory hotplug.
I suggest removing only the entries matching the removed area, something
like this:
    if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
        vhost_user_iotlb_cache_remove(dev, current_region->guest_phys_addr,
                                      current_region->size);
> + free_mem_region(current_region);
> +
> + /* Compact the regions array to keep it contiguous
> */
> + if (i < dev->mem->nregions - 1) {
> + memmove(&dev->mem->regions[i],
> + &dev->mem->regions[i + 1],
> + (dev->mem->nregions - 1 - i) *
> + sizeof(struct
> rte_vhost_mem_region));
> +
> memset(&dev->mem->regions[dev->mem->nregions - 1],
> + 0, sizeof(struct
> rte_vhost_mem_region));
> + }
> +
> + dev->mem->nregions--;
> + dev_invalidate_vrings(pdev,
> VHOST_USER_REM_MEM_REG);
> + dev = *pdev;
> + close_msg_fds(ctx);
>
And no need to close FDs, as we are now sure none were provided.
> + return RTE_VHOST_MSG_RESULT_OK;
> + }
> + }
> +
> + VHOST_CONFIG_LOG(dev->ifname, ERR, "failed to find region");
> + close_msg_fds(ctx);
>
Same, no longer needed.
> + return RTE_VHOST_MSG_RESULT_ERR;
> +}
> +
> static bool
> vq_is_ready(struct virtio_net *dev, struct vhost_virtqueue *vq)
> {
> --
> 2.43.0
>
>
^ permalink raw reply
* Re: [PATCH] net/mlx5: fix the eCPRI match on HWS root table
From: Raslan Darawsheh @ 2026-06-04 10:54 UTC (permalink / raw)
To: Bing Zhao, viacheslavo, dev; +Cc: orika, dsosnowski, suanmingm, matan, thomas
In-Reply-To: <20260604035222.3415-1-bingz@nvidia.com>
Hi,
On 04/06/2026 6:52 AM, Bing Zhao wrote:
> When inserting a rule on the root table in HWS mode, the DV
> interfaces are still being used for the matcher and value
> translation.
>
> For item eCPRI in HWS mode, the DWs of message header and body may be
> 0 for a valid rule. So we need to use some flags on root table to
> decide if the sample ID and matching value should be set for the
> rule. Since the table and rule may be created in turn for
> asynchronous flow API, the flags should be per table to avoid being
> overwritten. In SWS mode, nothing would be done separately for the
> matcher itself.
>
> On the non-root table in HWS, the callbacks to fill the WQE will
> handle such case automatically.
>
> Fixes: 93c7d4c22628 ("net/mlx5: support eCPRI with HWS")
>
> Signed-off-by: Bing Zhao <bingz@nvidia.com>
> Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Patch applied to next-net-mlx,
Kindest regards
Raslan Darawsheh
^ permalink raw reply
* [PATCH v8] mempool: improve cache behaviour and performance
From: Morten Brørup @ 2026-06-04 11:48 UTC (permalink / raw)
To: dev, Thomas Monjalon; +Cc: Morten Brørup, Andrew Rybchenko
In-Reply-To: <20260408141315.904381-1-mb@smartsharesystems.com>
This patch refactors the mempool cache to eliminate some unexpected
behaviour and reduce the mempool cache miss rate.
1.
The actual cache size was 1.5 times the cache size specified at run-time
mempool creation.
This was obviously not expected by application developers.
2.
In get operations, the check for when to use the cache as bounce buffer
did not respect the run-time configured cache size,
but compared to the build time maximum possible cache size
(RTE_MEMPOOL_CACHE_MAX_SIZE, default 512).
E.g. with a configured cache size of 32 objects, getting 256 objects
would first fetch 32 + 256 = 288 objects into the cache,
and then move the 256 objects from the cache to the destination memory,
instead of fetching the 256 objects directly to the destination memory.
This had a performance cost.
However, this is unlikely to occur in real applications, so it is not
important in itself.
3.
When putting objects into a mempool, and the mempool cache did not have
free space for so many objects,
the cache was flushed completely, and the new objects were then put into
the cache.
I.e. the cache drain level was zero.
This (complete cache flush) meant that a subsequent get operation (with
the same number of objects) completely emptied the cache,
so another subsequent get operation required replenishing the cache.
Similarly,
when getting objects from a mempool, and the mempool cache did not hold so
many objects,
the cache was replenished to cache->size + remaining objects,
and then (the remaining part of) the requested objects were fetched via
the cache,
which left the cache filled (to cache->size) at completion.
I.e. the cache refill level was cache->size (plus some, depending on
request size).
(1) was improved by generally comparing to cache->size instead of
cache->flushthresh, when considering the capacity of the cache.
The cache->flushthresh field is kept for API/ABI compatibility purposes,
and initialized to cache->size instead of cache->size * 1.5.
(2) was improved by generally comparing to cache->size / 2 instead of
RTE_MEMPOOL_CACHE_MAX_SIZE, when checking the bounce buffer limit.
(3) was improved by flushing and replenishing the cache by half its size,
so a flush/refill can be followed randomly by get or put requests.
This also reduced the number of objects in each flush/refill operation.
As a consequence of these changes, the size of the array holding the
objects in the cache (cache->objs[]) no longer needs to be
2 * RTE_MEMPOOL_CACHE_MAX_SIZE, and can be reduced to
RTE_MEMPOOL_CACHE_MAX_SIZE at an API/ABI breaking release.
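The flush/refill arithmetic described in (2) and (3) can be sketched as a count-only model. This is a simplified illustration, not the library code: the `toy_cache` structure and the `toy_put()`/`toy_get()` helpers are hypothetical names, only object counts are tracked (no pointers are copied), and the backend is assumed to always succeed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical count-only model of a mempool cache; not the rte_mempool API. */
struct toy_cache {
	uint32_t size;        /* configured cache size */
	uint32_t len;         /* objects currently held in the cache */
	uint32_t backend_ops; /* backend enqueue/dequeue calls so far */
};

/* Model a put of n objects. */
static void
toy_put(struct toy_cache *c, uint32_t n)
{
	uint32_t half = c->size / 2;

	if (c->len + n <= c->size) {
		/* Sufficient room in the cache. */
		c->len += n;
	} else if (n <= half) {
		/*
		 * Within the bounce buffer limit, but no room:
		 * flush size / 2 objects, keep the rest, then add n.
		 */
		assert(c->len > half);
		c->backend_ops++;
		c->len = c->len - half + n;
	} else {
		/* Request too big for the cache: goes straight to the backend. */
		c->backend_ops++;
	}
}

/* Model a get of n objects (backend assumed to always succeed). */
static void
toy_get(struct toy_cache *c, uint32_t n)
{
	uint32_t half = c->size / 2;
	uint32_t remaining;

	if (n <= c->len) {
		/* Entire request served from the cache. */
		c->len -= n;
		return;
	}

	/* The cache is drained first; the rest comes from the backend. */
	remaining = n - c->len;
	c->backend_ops++;
	if (remaining > half)
		c->len = 0; /* fetched directly, cache left empty */
	else
		c->len = half - remaining; /* refill size / 2, then serve */
}
```

With a cache size of 32 and requests of 8 objects, the fifth consecutive put flushes 16 objects and leaves 24 in the cache, so the following gets or puts of 8 objects need no backend access in either direction.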
Performance data:
With a real WAN Optimization application, where the number of allocated
packets varies (as they are held in e.g. shaper queues), the mempool
cache miss rate dropped from ca. 1/20 objects to ca. 1/48 objects.
This was deployed in production at an ISP, and using an effective cache
size of 384 objects.
Bugzilla ID: 1027
Fixes: ea5dd2744b90 ("mempool: cache optimisations")
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
v8:
* Rebased.
CI cannot apply the patch when the release notes have changed at the tip
of main. Hoping to win the race this time.
v7:
* Rebased. Dependency no longer required. (Thomas)
v6:
* Moved driver changes out as separate patches, for easier review. (Bruce)
Tests using the Intel idpf PMD in AVX512 mode may fail with this patch.
* Reverted a small code comment change. The original was better. (Bruce)
* Reverted rte_mempool_create() description requiring the cache_size to be
an even number. There is no such requirement.
v5:
* Flush the cache from the bottom, where objects are colder, and move down
the remaining objects, which are hotter.
* In the Intel idpf PMD, move up the hot objects in the cache and refill
with cold objects at the bottom.
v4:
* Added Bugzilla ID.
* Added Fixes tag. For reference only.
* Moved fast-free related update of Intel common driver out as a separate
patch, and depend on that patch.
* Omitted unrelated changes to the Intel idpf AVX512 driver, specifically
fixing an indentation and adding mbuf instrumentation.
* Omitted unrelated changes to the mempool library, specifically adding
__rte_restrict and changing a couple of comments to proper sentences.
* Please checkpatches by swapping operators in a couple of comparisons.
v3:
* Fixed my copy-paste bug in idpf_splitq_rearm().
v2:
* Fixed issue found by abidiff:
Reverted cache objects array size reduction. Added a note instead.
* Added missing mbuf instrumentation to the Intel idpf AVX512 driver.
* Updated idpf_splitq_rearm() like idpf_singleq_rearm().
* Added a few more __rte_assume(). (Inspired by AI review)
* Updated NXP dpaa and dpaa2 mempool drivers to not set mempool cache
flush threshold.
* Added release notes.
* Added deprecation notes.
---
doc/guides/rel_notes/deprecation.rst | 7 +++
doc/guides/rel_notes/release_26_07.rst | 11 +++++
lib/mempool/rte_mempool.c | 14 +-----
lib/mempool/rte_mempool.h | 66 ++++++++++++++++----------
4 files changed, 61 insertions(+), 37 deletions(-)
diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 17f90a6352..1b6cc181fb 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -158,3 +158,10 @@ Deprecation Notices
* net/iavf: The dynamic mbuf field used to detect LLDP packets on the
transmit path in the iavf PMD will be removed in a future release.
After removal, only packet type-based detection will be supported.
+
+* mempool: The ``flushthresh`` field in ``struct rte_mempool_cache``
+ is obsolete, and will be removed in DPDK 26.11.
+
+* mempool: The object array in ``struct rte_mempool_cache`` is oversize by
+ factor two, and will be reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE`` in
+ DPDK 26.11.
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index d2563ac503..b1010d238f 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -69,6 +69,17 @@ New Features
``rte_eal_init`` and the application is responsible for probing each device,
* ``--auto-probing`` enables the initial bus probing, which is the current default behavior.
+* **Changed effective size of mempool cache.**
+
+ * The effective size of a mempool cache was changed to match the specified size at mempool creation; the effective size was previously 50 % larger than requested.
+ * The ``flushthresh`` field of the ``struct rte_mempool_cache`` became obsolete, but was kept for API/ABI compatibility purposes.
+ * The effective size of the ``objs`` array in the ``struct rte_mempool_cache`` was reduced to ``RTE_MEMPOOL_CACHE_MAX_SIZE``, but its size was kept for API/ABI compatibility purposes.
+
+* **Improved mempool cache flush/refill algorithm.**
+
+ The mempool cache flush/refill algorithm was improved, to reduce the mempool cache miss rate for most application types.
+ Applications where each lcore only puts or gets to a mempool, e.g. pipelined applications where ethdev Rx and Tx run on separate lcores, should adapt to the new algorithm by doubling their configured mempool cache size, to avoid doubling their mempool cache miss rate.
+
* **Added RISC-V vector paths.**
* Increased the default SIMD bitwidth to allow using the vector extension.
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 3ddf7b9c11..817e2b8dc1 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -52,11 +52,6 @@ static void
mempool_event_callback_invoke(enum rte_mempool_event event,
struct rte_mempool *mp);
-/* Note: avoid using floating point since that compiler
- * may not think that is constant.
- */
-#define CALC_CACHE_FLUSHTHRESH(c) (((c) * 3) / 2)
-
#if defined(RTE_ARCH_X86)
/*
* return the greatest common divisor between a and b (fast algorithm)
@@ -757,13 +752,8 @@ rte_mempool_free(struct rte_mempool *mp)
static void
mempool_cache_init(struct rte_mempool_cache *cache, uint32_t size)
{
- /* Check that cache have enough space for flush threshold */
- RTE_BUILD_BUG_ON(CALC_CACHE_FLUSHTHRESH(RTE_MEMPOOL_CACHE_MAX_SIZE) >
- RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs) /
- RTE_SIZEOF_FIELD(struct rte_mempool_cache, objs[0]));
-
cache->size = size;
- cache->flushthresh = CALC_CACHE_FLUSHTHRESH(size);
+ cache->flushthresh = size; /* Obsolete; for API/ABI compatibility purposes only */
cache->len = 0;
}
@@ -850,7 +840,7 @@ rte_mempool_create_empty(const char *name, unsigned n, unsigned elt_size,
/* asked cache too big */
if (cache_size > RTE_MEMPOOL_CACHE_MAX_SIZE ||
- CALC_CACHE_FLUSHTHRESH(cache_size) > n) {
+ cache_size > n) {
rte_errno = EINVAL;
return NULL;
}
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index d3f969847b..50d958c7c6 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -89,7 +89,7 @@ struct __rte_cache_aligned rte_mempool_debug_stats {
*/
struct __rte_cache_aligned rte_mempool_cache {
uint32_t size; /**< Size of the cache */
- uint32_t flushthresh; /**< Threshold before we flush excess elements */
+ uint32_t flushthresh; /**< Obsolete; for API/ABI compatibility purposes only */
uint32_t len; /**< Current cache count */
#ifdef RTE_LIBRTE_MEMPOOL_STATS
uint32_t unused;
@@ -107,8 +107,10 @@ struct __rte_cache_aligned rte_mempool_cache {
/**
* Cache objects
*
- * Cache is allocated to this size to allow it to overflow in certain
- * cases to avoid needless emptying of cache.
+ * Note:
+ * Cache is allocated at double size for API/ABI compatibility purposes only.
+ * When reducing its size at an API/ABI breaking release,
+ * remember to add a cache guard after it.
*/
alignas(RTE_CACHE_LINE_SIZE) void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2];
};
@@ -1047,11 +1049,16 @@ rte_mempool_free(struct rte_mempool *mp);
* If cache_size is non-zero, the rte_mempool library will try to
* limit the accesses to the common lockless pool, by maintaining a
* per-lcore object cache. This argument must be lower or equal to
- * RTE_MEMPOOL_CACHE_MAX_SIZE and n / 1.5.
+ * RTE_MEMPOOL_CACHE_MAX_SIZE and n.
* The access to the per-lcore table is of course
* faster than the multi-producer/consumer pool. The cache can be
* disabled if the cache_size argument is set to 0; it can be useful to
* avoid losing objects in cache.
+ * Note:
+ * Mempool put/get requests of more than cache_size / 2 objects may be
+ * partially or fully served directly by the multi-producer/consumer
+ * pool, to avoid the overhead of copying the objects twice (instead of
+ * once) when using the cache as a bounce buffer.
* @param private_data_size
* The size of the private data appended after the mempool
* structure. This is useful for storing some private data after the
@@ -1419,22 +1426,30 @@ rte_mempool_do_generic_put(struct rte_mempool *mp, void * const *obj_table,
RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_bulk, 1);
RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, n);
- __rte_assume(cache->flushthresh <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
- __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
- __rte_assume(cache->len <= cache->flushthresh);
- if (likely(cache->len + n <= cache->flushthresh)) {
+ __rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+ __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+ __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
+ __rte_assume(cache->len <= cache->size);
+ if (likely(cache->len + n <= cache->size)) {
/* Sufficient room in the cache for the objects. */
cache_objs = &cache->objs[cache->len];
cache->len += n;
- } else if (n <= cache->flushthresh) {
+ } else if (n <= cache->size / 2) {
/*
- * The cache is big enough for the objects, but - as detected by
- * the comparison above - has insufficient room for them.
- * Flush the cache to make room for the objects.
+ * The number of objects is within the cache bounce buffer limit,
+ * but - as detected by the comparison above - the cache has
+ * insufficient room for them.
+ * Flush the cache to the backend to make room for the objects;
+ * flush (size / 2) objects from the bottom of the cache, where
+ * objects are less hot, and move down the remaining objects, which
+ * are more hot, from the upper half of the cache.
*/
- cache_objs = &cache->objs[0];
- rte_mempool_ops_enqueue_bulk(mp, cache_objs, cache->len);
- cache->len = n;
+ __rte_assume(cache->len > cache->size / 2);
+ rte_mempool_ops_enqueue_bulk(mp, &cache->objs[0], cache->size / 2);
+ rte_memcpy(&cache->objs[0], &cache->objs[cache->size / 2],
+ sizeof(void *) * (cache->len - cache->size / 2));
+ cache_objs = &cache->objs[cache->len - cache->size / 2];
+ cache->len = cache->len - cache->size / 2 + n;
} else {
/* The request itself is too big for the cache. */
goto driver_enqueue_stats_incremented;
@@ -1553,7 +1568,7 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
/* The cache is a stack, so copy will be in reverse order. */
cache_objs = &cache->objs[cache->len];
- __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE * 2);
+ __rte_assume(cache->len <= RTE_MEMPOOL_CACHE_MAX_SIZE);
if (likely(n <= cache->len)) {
/* The entire request can be satisfied from the cache. */
RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
@@ -1577,13 +1592,13 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
for (index = 0; index < len; index++)
*obj_table++ = *--cache_objs;
- /* Dequeue below would overflow mem allocated for cache? */
- if (unlikely(remaining > RTE_MEMPOOL_CACHE_MAX_SIZE))
+ /* Dequeue below would exceed the cache bounce buffer limit? */
+ __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+ if (unlikely(remaining > cache->size / 2))
goto driver_dequeue;
- /* Fill the cache from the backend; fetch size + remaining objects. */
- ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs,
- cache->size + remaining);
+ /* Fill the cache from the backend; fetch (size / 2) objects. */
+ ret = rte_mempool_ops_dequeue_bulk(mp, cache->objs, cache->size / 2);
if (unlikely(ret < 0)) {
/*
* We are buffer constrained, and not able to fetch all that.
@@ -1597,10 +1612,11 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1);
RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n);
- __rte_assume(cache->size <= RTE_MEMPOOL_CACHE_MAX_SIZE);
- __rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE);
- cache_objs = &cache->objs[cache->size + remaining];
- cache->len = cache->size;
+ __rte_assume(cache->size / 2 <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+ __rte_assume(remaining <= RTE_MEMPOOL_CACHE_MAX_SIZE / 2);
+ __rte_assume(remaining <= cache->size / 2);
+ cache_objs = &cache->objs[cache->size / 2];
+ cache->len = cache->size / 2 - remaining;
for (index = 0; index < remaining; index++)
*obj_table++ = *--cache_objs;
--
2.43.0
^ permalink raw reply related
* Re: [PATCH v14 5/5] vhost: enable configure memory slots
From: Maxime Coquelin @ 2026-06-04 11:52 UTC (permalink / raw)
To: pravin.bathija; +Cc: dev, fengchengwen, stephen, thomas, Stephen Hemminger
In-Reply-To: <20260520022012.243619-6-pravin.bathija@dell.com>
On Wed, May 20, 2026 at 4:20 AM <pravin.bathija@dell.com> wrote:
> From: Pravin M Bathija <pravin.bathija@dell.com>
>
> This patch enables configure memory slots in the header define
> VHOST_USER_PROTOCOL_FEATURES.
>
> Signed-off-by: Pravin M Bathija <pravin.bathija@dell.com>
> Acked-by: Fengchengwen <fengchengwen@huawei.com>
> Reviewed-by: Stephen Hemminger <stephen@networkplumber.com>
> Acked-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> ---
> lib/vhost/vhost_user.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/lib/vhost/vhost_user.h b/lib/vhost/vhost_user.h
> index 6435816534..732aa4dc02 100644
> --- a/lib/vhost/vhost_user.h
> +++ b/lib/vhost/vhost_user.h
> @@ -32,6 +32,7 @@
> (1ULL <<
> VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD) | \
> (1ULL <<
> VHOST_USER_PROTOCOL_F_HOST_NOTIFIER) | \
> (1ULL <<
> VHOST_USER_PROTOCOL_F_PAGEFAULT) | \
> + (1ULL <<
> VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS) | \
> (1ULL <<
> VHOST_USER_PROTOCOL_F_STATUS))
>
> typedef enum VhostUserRequest {
> --
> 2.43.0
>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
^ permalink raw reply
* [PATCH] net/mlx5: fix counter TAILQ race between free and query callback
From: Laaahu @ 2026-06-04 10:11 UTC (permalink / raw)
To: dev; +Cc: stable, lilinhu
From: lilinhu <lilinhu618@gmail.com>
flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on
different lists. That holds only if the free path always observes
the up-to-date query_gen. It can be violated:
1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
concurrently with step 4 on another core.
Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no
synchronization and corrupt it: the pool-local list ends up with a
NULL head but a dangling tqh_last, and the global free list tail no
longer points to the real tail. The just-freed counter and every
counter inserted afterwards become unreachable and are leaked.
Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time,
so the window is reachable in practice.
Fix it by taking pool->csl in the query completion callback before
operating on pool->counters[query_gen], serializing the CONCAT with
any concurrent INSERT. The lock is taken once per pool per query
completion in the eal-intr-thread context, not on the datapath, so
the cost is negligible. Lock order is pool->csl then cmng->csl,
matching all other sites.
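The serialization described above can be sketched in isolation as follows. This is a minimal model under stated assumptions, not the driver code: the `toy_*` types and functions are hypothetical, a C11 `atomic_flag` spinlock stands in for `rte_spinlock_t`, and the platform's `sys/queue.h` is assumed to provide `TAILQ_CONCAT` and `TAILQ_FOREACH` (as glibc and the BSDs do).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/queue.h>

/* Toy spinlock standing in for rte_spinlock_t (assumption, not the DPDK type). */
typedef struct { atomic_flag f; } toy_spinlock;

static void toy_lock(toy_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
		; /* spin until the flag is released */
}

static void toy_unlock(toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->f, memory_order_release);
}

struct toy_counter {
	int id;
	TAILQ_ENTRY(toy_counter) next;
};
TAILQ_HEAD(toy_list, toy_counter);

struct toy_pool {
	toy_spinlock csl;             /* models pool->csl */
	struct toy_list counters[2];  /* one list per query generation */
	unsigned int query_gen;
};

struct toy_mng {
	toy_spinlock csl;             /* models cmng->csl[cnt_type] */
	struct toy_list counters;     /* global free list */
};

static void toy_init(struct toy_pool *pool, struct toy_mng *mng)
{
	atomic_flag_clear(&pool->csl.f);
	atomic_flag_clear(&mng->csl.f);
	TAILQ_INIT(&pool->counters[0]);
	TAILQ_INIT(&pool->counters[1]);
	TAILQ_INIT(&mng->counters);
	pool->query_gen = 0;
}

/* Free path: insert into the current-generation list under the pool lock. */
static void toy_counter_free(struct toy_pool *pool, struct toy_counter *cnt)
{
	toy_lock(&pool->csl);
	TAILQ_INSERT_TAIL(&pool->counters[pool->query_gen], cnt, next);
	toy_unlock(&pool->csl);
}

/*
 * Query completion: take the pool lock first, then the manager lock, so the
 * CONCAT can never overlap an INSERT on the same pool-local list.
 */
static void toy_query_done(struct toy_pool *pool, struct toy_mng *mng,
			   unsigned int gen)
{
	toy_lock(&pool->csl);
	toy_lock(&mng->csl);
	TAILQ_CONCAT(&mng->counters, &pool->counters[gen], next);
	toy_unlock(&mng->csl);
	toy_unlock(&pool->csl);
}

/* Count the entries of a list (helper for checking the result). */
static unsigned int toy_count(struct toy_list *l)
{
	struct toy_counter *c;
	unsigned int n = 0;

	TAILQ_FOREACH(c, l, next)
		n++;
	return n;
}
```

In this model, an insert that started before the generation flip either completes before the concatenation or runs after it; in the latter case the counter simply stays on the pool-local list until a later query round instead of corrupting it.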
Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent
query failures.
Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org
Signed-off-by: lilinhu <lilinhu618@gmail.com>
---
drivers/net/mlx5/mlx5_flow.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 915ea29a5a..20aad87f5d 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9904,6 +9904,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
if (unlikely(status)) {
raw_to_free = pool->raw_hw;
+ /*
+ * The query failed, so the freed counters accumulated in the
+ * old-gen list during this round would otherwise be stranded.
+ * Move them back to the global free list to avoid a leak when
+ * queries fail persistently.
+ */
+ if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
+ rte_spinlock_lock(&cmng->csl[cnt_type]);
+ TAILQ_CONCAT(&cmng->counters[cnt_type],
+ &pool->counters[query_gen], next);
+ rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
+ }
} else {
raw_to_free = pool->raw;
if (pool->is_aged)
@@ -9913,11 +9927,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
rte_spinlock_unlock(&pool->sl);
/* Be sure the new raw counters data is updated in memory. */
rte_io_wmb();
+ /*
+ * A counter free thread may have read a stale query_gen
+ * before the generation was flipped and could still be
+ * inserting into this same old-gen list. Hold pool->csl to
+ * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+ * avoid corrupting the list.
+ */
if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
rte_spinlock_lock(&cmng->csl[cnt_type]);
TAILQ_CONCAT(&cmng->counters[cnt_type],
&pool->counters[query_gen], next);
rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
}
}
LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
--
2.39.3 (Apple Git-146)
^ permalink raw reply related
* DTS meeting / status
From: Lincoln Lavoie @ 2026-06-04 13:18 UTC (permalink / raw)
To: dev
Hello All,
It seems like we're not really getting critical mass from the community on
the DTS meetings, while a few key participants are on leave. I wanted to
provide a status update on a number of patch series, so we can make sure we
are staying on target for the July release.
Below is the current list of DTS patches, many of which are planned for the
July release. If anyone is able to help with reviews, that would be
appreciated.
- report dut/NIC info during DTS run
  - Needs review
- add code coverage reporting to DTS
  - Needs review
- add retry loop to trex traffic generation
  - Needs review
- add directory for test resources
  - Needs review
- clean cryptodev environment after a test run
  - Needs review
- update parsing for cryptodev latency
  - Needs review
- update test suite names to be clear and consistent
  - Needs review
- add ipgre test suite
  - Needs review
- fix topology capability comparison
  - Needs review
- move exception module from framework to API
  - Needs review
- fix port info getter when port is not found
  - Reviewed by Luca, no comments, April 23
- ConfigurationError not thrown due to lack of next default value
  - Needs to be marked as not active, replaced by “fix port info getter
    when port is not found”
- update modprobe command to be privileged
  - Reviewed by Luca, no comments, April 9, 2026
- update verbose output regex for VXLAN packets
  - Still pending review
- port speed capabilities test suite to next DTS
  - Reviewed by Patrick, no comments, May 30
- add default values for queue info
  - Reviewed by Patrick, no comments, May 28.
- refactor flow suite with generator pattern
  - Still pending review
- add support for no link topology
  - Comments from Patrick on May 30 need answer
- add PVVP case to virtio suite
  - Still pending reviews, marked ready on April 10, 2026
Cheers,
Lincoln
--
*Lincoln Lavoie*
Principal Engineer, Broadband Technologies
21 Madbury Rd., Ste. 100, Durham, NH 03824
lylavoie@iol.unh.edu
https://www.iol.unh.edu
+1-603-674-2755 (m)
^ permalink raw reply
* RE: [PATCH v8] mempool: improve cache behaviour and performance
From: Morten Brørup @ 2026-06-04 13:57 UTC (permalink / raw)
To: dev
In-Reply-To: <20260604114851.12586-1-mb@smartsharesystems.com>
Recheck-request: iol-intel-Performance
^ permalink raw reply
* Re: [PATCH v5 00/25] Consolidate bus driver infrastructure
From: David Marchand @ 2026-06-04 14:54 UTC (permalink / raw)
To: David Marchand; +Cc: dev, thomas, stephen, bruce.richardson, Chengwen Feng
In-Reply-To: <20260530075201.869606-1-david.marchand@redhat.com>
On Sat, 30 May 2026 at 09:52, David Marchand <david.marchand@redhat.com> wrote:
>
> This is a continuation of the work I started on the bus infrastructure,
> but this time, a lot of the changes were done by an AI "friend".
> It is still an unfinished topic as the current series focuses on probing
> only. The detaching/cleanup aspect is postponed to another release/time.
>
> My AI "friend" really *sucked* at git and at separating unrelated changes,
> so it required quite a lot of massage/polishing afterwards.
> But it seems good enough now for upstream submission.
>
> I would like to see this series merged in 26.07, so that we have enough
> time to stabilize it before the next LTS.
> And seeing how it affects drivers, it is probably better to merge it
> as soon as possible (so Thomas does not have to solve too many conflicts
> when pulling next-* subtrees after, especially wrt the last patch).
>
>
> This series refactors the DPDK bus infrastructure to consolidate common
> operations and reduce code duplication across all bus drivers.
> Currently, each bus implements its own specific device/driver lists,
> probe logic, and lookup functions.
> This series moves these common patterns into the EAL bus layer,
> providing generic helpers that all buses can use.
>
> The refactoring removes approximately 1,400 lines of duplicated code across
> the codebase while maintaining full functional equivalence.
>
> Key changes:
> - Factorize device and driver lists into struct rte_bus
> - Implement generic probe, device/driver lookup, and iteration helpers in EAL
> - Introduce conversion macros (RTE_BUS_DEVICE, RTE_BUS_DRIVER, RTE_CLASS_TO_BUS_DEVICE)
> to safely convert between generic and bus-specific types
> - Remove bus-specific device/driver types from most driver code
> - Move probe logic from individual buses to rte_bus_generic_probe()
> - Separate NXP-specific metadata from generic bus structures
>
> Benefits:
> - Significant code reduction (~1,400 lines removed)
> - Consistent behavior across all bus types
> - Simplified bus driver implementation
> - Easier maintenance and future enhancements
>
> The series is structured as a progressive refactoring:
> - Remove redundant checks and helpers, fix existing bugs (patches 1-7)
> - Add conversion macros and factorize lists (patches 8-10)
> - Consolidate device/driver lookup and iteration (patches 11-13)
> - Refactor probe logic (patches 14-17)
> - Remove bus-specific types from drivers (patches 18-25)
>
> Note on ABI:
> This series breaks the ABI for drivers (changes to rte_pci_device,
> rte_pci_driver, and similar structures for other buses). However, the DPDK
> ABI policy does not provide guarantees for driver-level interfaces.
Thomas pointed out off-list a leak in the bus cleanup updates (patch
6): devices that were never probed are not freed when the bus cleanup
operation is called.
I will fix this in a future series (hopefully for rc2), after
refactoring the cleanup/unplug code flows.
For now, series applied for rc1.
Thanks for the reviews!
--
David Marchand
^ permalink raw reply
* RE: [PATCH 1/5] ring: split single thread vs multi-thread cases
From: Konstantin Ananyev @ 2026-06-04 15:09 UTC (permalink / raw)
To: Stephen Hemminger, dev@dpdk.org; +Cc: Wathsala Vithanage
In-Reply-To: <20260602171552.686349-2-stephen@networkplumber.org>
> The move head function has optimization for updating when
> being used on single threaded ring. Code is cleaner if the two
> cases are split into separate functions.
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
> lib/ring/rte_ring_c11_pvt.h | 100 +++++++++++++++++++++++++-------
> lib/ring/rte_ring_elem_pvt.h | 16 +++--
> lib/ring/rte_ring_generic_pvt.h | 77 ++++++++++++++++++++----
> lib/ring/soring.c | 24 +++++---
> 4 files changed, 171 insertions(+), 46 deletions(-)
>
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index 07b6efc416..5afc14dec9 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -46,6 +46,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
>
> /**
> * @internal This is a helper function that moves the producer/consumer head
> + * optimized for single threaded case
> *
> * @param d
> * A pointer to the headtail structure with head value to be moved
> @@ -54,8 +55,6 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
> * function only reads tail value from it
> * @param capacity
> * Either ring capacity value (for producer), or zero (for consumer)
> - * @param is_st
> - * Indicates whether multi-thread safe path is needed or not
> * @param n
> * The number of elements we want to move head value on
> * @param behavior
> @@ -72,14 +71,77 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
> * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only
> */
> static __rte_always_inline unsigned int
> -__rte_ring_headtail_move_head(struct rte_ring_headtail *d,
> +__rte_ring_headtail_move_head_st(struct rte_ring_headtail *d,
> const struct rte_ring_headtail *s, uint32_t capacity,
> - unsigned int is_st, unsigned int n,
> + unsigned int n,
> enum rte_ring_queue_behavior behavior,
> uint32_t *old_head, uint32_t *new_head, uint32_t *entries)
> {
> uint32_t stail;
> - int success;
> +
> + /* Single producer: only this thread writes d->head,
> + * so a relaxed load is sufficient.
> + */
> + *old_head = rte_atomic_load_explicit(&d->head,
> rte_memory_order_relaxed);
> +
> + /* Acquire pairs with the consumer's release-store of tail in
> __rte_ring_update_tail,
> + * ensuring the consumer's ring-element reads are complete before
> + * we observe the updated tail.
> + */
> + stail = rte_atomic_load_explicit(&s->tail, rte_memory_order_acquire);
> +
> + /* Unsigned subtraction is modulo 2^32, so entries is always in
> + * [0, capacity) even if old_head > stail.
> + */
> + *entries = capacity + stail - *old_head;
> +
> + /* check that we have enough room in ring */
> + if (unlikely(n > *entries))
> + n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
> +
> + if (n > 0) {
> + *new_head = *old_head + n;
> + rte_atomic_store_explicit(&d->head, *new_head,
> rte_memory_order_relaxed);
> + }
> +
> + return n;
> +}
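As a side note on the unsigned arithmetic in the comment above, here is a minimal standalone sketch showing that `capacity + tail - head` still gives the right count after the 32-bit indices wrap. It is not DPDK code: `free_entries`, `wraparound_demo` and the parameter names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of free slots seen by a producer.
 * All operands are uint32_t, so the result is computed modulo 2^32.
 */
static uint32_t
free_entries(uint32_t capacity, uint32_t cons_tail, uint32_t prod_head)
{
	return capacity + cons_tail - prod_head;
}

/* Small demonstration of the helper, including the wrapped case. */
void
wraparound_demo(void)
{
	/* No wrap: 5 slots in use out of 8, so 3 are free. */
	assert(free_entries(8, 0, 5) == 3);

	/*
	 * Wrapped: the producer head has passed 2^32 and is numerically
	 * smaller than the consumer tail. 4 slots are in use, 4 are free.
	 */
	assert(free_entries(8, 0xFFFFFFFEu, 0x00000002u) == 4);
}
```

With capacity 8, a consumer tail of 0xFFFFFFFE and a producer head of 0x00000002 (four slots in use across the wrap), the helper returns 4.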
> +
> +/**
> + * @internal This is a helper function that moves the producer/consumer head
> + * for use in multi-thread safe path
> + *
> + * @param d
> + * A pointer to the headtail structure with head value to be moved
> + * @param s
> + * A pointer to the counter-part headtail structure. Note that this
> + * function only reads tail value from it
> + * @param capacity
> + * Either ring capacity value (for producer), or zero (for consumer)
> + * @param n
> + * The number of elements we want to move head value on
> + * @param behavior
> + * RTE_RING_QUEUE_FIXED: Move on a fixed number of items
> + * RTE_RING_QUEUE_VARIABLE: Move on as many items as possible
> + * @param old_head
> + * Returns head value as it was before the move
> + * @param new_head
> + * Returns the new head value
> + * @param entries
> + * Returns the number of ring entries available BEFORE head was moved
> + * @return
> + * Actual number of objects the head was moved on
> + * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_headtail_move_head_mt(struct rte_ring_headtail *d,
> + const struct rte_ring_headtail *s, uint32_t capacity,
> + unsigned int n,
> + enum rte_ring_queue_behavior behavior,
> + uint32_t *old_head, uint32_t *new_head, uint32_t *entries)
> +{
> + uint32_t stail;
> + bool success;
> unsigned int max = n;
>
> /*
> @@ -120,25 +182,21 @@ __rte_ring_headtail_move_head(struct
> rte_ring_headtail *d,
> return 0;
>
> *new_head = *old_head + n;
> - if (is_st) {
> - d->head = *new_head;
> - success = 1;
> - } else
> - /* on failure, *old_head is updated */
> - /*
> - * R1/A2.
> - * R1: Establishes a synchronizing edge with A0 of a
> - * different thread.
> - * A2: Establishes a synchronizing edge with R1 of a
> - * different thread to observe same value for stail
> - * observed by that thread on CAS failure (to retry
> - * with an updated *old_head).
> - */
> - success =
> rte_atomic_compare_exchange_strong_explicit(
> + /* on failure, *old_head is updated */
> + /*
> + * R1/A2.
> + * R1: Establishes a synchronizing edge with A0 of a
> + * different thread.
> + * A2: Establishes a synchronizing edge with R1 of a
> + * different thread to observe same value for stail
> + * observed by that thread on CAS failure (to retry
> + * with an updated *old_head).
> + */
> + success = rte_atomic_compare_exchange_strong_explicit(
> &d->head, old_head, *new_head,
> rte_memory_order_release,
> rte_memory_order_acquire);
> - } while (unlikely(success == 0));
> + } while (unlikely(!success));
> return n;
> }
>
> diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
> index 6eafae121f..a0fdec9812 100644
> --- a/lib/ring/rte_ring_elem_pvt.h
> +++ b/lib/ring/rte_ring_elem_pvt.h
> @@ -341,8 +341,12 @@ __rte_ring_move_prod_head(struct rte_ring *r,
> unsigned int is_sp,
> uint32_t *old_head, uint32_t *new_head,
> uint32_t *free_entries)
> {
> - return __rte_ring_headtail_move_head(&r->prod, &r->cons, r->capacity,
> - is_sp, n, behavior, old_head, new_head, free_entries);
> + if (is_sp)
> + return __rte_ring_headtail_move_head_st(&r->prod, &r->cons,
> r->capacity,
> + n, behavior, old_head, new_head, free_entries);
> + else
> + return __rte_ring_headtail_move_head_mt(&r->prod, &r->cons,
> r->capacity,
> + n, behavior, old_head, new_head, free_entries);
> }
>
> /**
> @@ -374,8 +378,12 @@ __rte_ring_move_cons_head(struct rte_ring *r,
> unsigned int is_sc,
> uint32_t *old_head, uint32_t *new_head,
> uint32_t *entries)
> {
> - return __rte_ring_headtail_move_head(&r->cons, &r->prod, 0,
> - is_sc, n, behavior, old_head, new_head, entries);
> + if (is_sc)
> + return __rte_ring_headtail_move_head_st(&r->cons, &r->prod,
> 0,
> + n, behavior, old_head, new_head, entries);
> + else
> + return __rte_ring_headtail_move_head_mt(&r->cons, &r->prod,
> 0,
> + n, behavior, old_head, new_head, entries);
> }
>
> /**
> diff --git a/lib/ring/rte_ring_generic_pvt.h b/lib/ring/rte_ring_generic_pvt.h
> index affd2d5ba7..c044b0824f 100644
> --- a/lib/ring/rte_ring_generic_pvt.h
> +++ b/lib/ring/rte_ring_generic_pvt.h
> @@ -42,6 +42,7 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
>
> /**
> * @internal This is a helper function that moves the producer/consumer head
> + * for use in multi-thread safe path
> *
> * @param d
> * A pointer to the headtail structure with head value to be moved
> @@ -50,8 +51,6 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
> * function only reads tail value from it
> * @param capacity
> * Either ring capacity value (for producer), or zero (for consumer)
> - * @param is_st
> - * Indicates whether multi-thread safe path is needed or not
> * @param n
> * The number of elements we want to move head value on
> * @param behavior
> @@ -68,10 +67,9 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
> * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only
> */
> static __rte_always_inline unsigned int
> -__rte_ring_headtail_move_head(struct rte_ring_headtail *d,
> +__rte_ring_headtail_move_head_mt(struct rte_ring_headtail *d,
> const struct rte_ring_headtail *s, uint32_t capacity,
> - unsigned int is_st, unsigned int n,
> - enum rte_ring_queue_behavior behavior,
> + unsigned int n, enum rte_ring_queue_behavior behavior,
> uint32_t *old_head, uint32_t *new_head, uint32_t *entries)
> {
> unsigned int max = n;
> @@ -105,15 +103,70 @@ __rte_ring_headtail_move_head(struct
> rte_ring_headtail *d,
> return 0;
>
> *new_head = *old_head + n;
> - if (is_st) {
> - d->head = *new_head;
> - success = 1;
> - } else
> - success = rte_atomic32_cmpset(
> - (uint32_t *)(uintptr_t)&d->head,
> - *old_head, *new_head);
> + success = rte_atomic32_cmpset(
> + (uint32_t *)(uintptr_t)&d->head,
> + *old_head, *new_head);
> } while (unlikely(success == 0));
> return n;
> }
>
> +/**
> + * @internal This is a helper function that moves the producer/consumer head
> + * optimized for single threaded case
> + *
> + * @param d
> + * A pointer to the headtail structure with head value to be moved
> + * @param s
> + * A pointer to the counter-part headtail structure. Note that this
> + * function only reads tail value from it
> + * @param capacity
> + * Either ring capacity value (for producer), or zero (for consumer)
> + * @param n
> + * The number of elements we want to move head value on
> + * @param behavior
> + * RTE_RING_QUEUE_FIXED: Move on a fixed number of items
> + * RTE_RING_QUEUE_VARIABLE: Move on as many items as possible
> + * @param old_head
> + * Returns head value as it was before the move
> + * @param new_head
> + * Returns the new head value
> + * @param entries
> + * Returns the number of ring entries available BEFORE head was moved
> + * @return
> + * Actual number of objects the head was moved on
> + * If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_headtail_move_head_st(struct rte_ring_headtail *d,
> + const struct rte_ring_headtail *s, uint32_t capacity,
> + unsigned int n,
> + enum rte_ring_queue_behavior behavior,
> + uint32_t *old_head, uint32_t *new_head, uint32_t *entries)
> +{
> + *old_head = d->head;
> +
> + /* add rmb barrier to avoid load/load reorder in weak
> + * memory model. It is noop on x86
> + */
> + rte_smp_rmb();
> +
> + /*
> + * The subtraction is done between two unsigned 32bits value
> + * (the result is always modulo 32 bits even if we have
> + * *old_head > s->tail). So 'entries' is always between 0
> + * and capacity (which is < size).
> + */
> + *entries = (capacity + s->tail - *old_head);
> +
> + /* check that we have enough room in ring */
> + if (unlikely(n > *entries))
> + n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
> +
> + if (likely(n > 0)) {
> + *new_head = *old_head + n;
> + d->head = *new_head;
> + }
> + return n;
> +}
> +
> #endif /* _RTE_RING_GENERIC_PVT_H_ */
> diff --git a/lib/ring/soring.c b/lib/ring/soring.c
> index e9c75619fe..22f9c60e9c 100644
> --- a/lib/ring/soring.c
> +++ b/lib/ring/soring.c
> @@ -135,9 +135,12 @@ __rte_soring_move_prod_head(struct rte_soring *r,
> uint32_t num,
>
> switch (st) {
> case RTE_RING_SYNC_ST:
> + n = __rte_ring_headtail_move_head_st(&r->prod.ht, &r-
> >cons.ht,
> + r->capacity, num, behavior, head, next, free);
> + break;
> case RTE_RING_SYNC_MT:
> - n = __rte_ring_headtail_move_head(&r->prod.ht, &r->cons.ht,
> - r->capacity, st, num, behavior, head, next, free);
> + n = __rte_ring_headtail_move_head_mt(&r->prod.ht, &r-
> >cons.ht,
> + r->capacity, num, behavior, head, next, free);
> break;
> case RTE_RING_SYNC_MT_RTS:
> n = __rte_ring_rts_move_head(&r->prod.rts, &r->cons.ht,
> @@ -168,9 +171,13 @@ __rte_soring_move_cons_head(struct rte_soring *r,
> uint32_t stage, uint32_t num,
>
> switch (st) {
> case RTE_RING_SYNC_ST:
> + n = __rte_ring_headtail_move_head_st(&r->cons.ht,
> + &r->stage[stage].ht, 0, num, behavior,
> + head, next, avail);
> + break;
> case RTE_RING_SYNC_MT:
> - n = __rte_ring_headtail_move_head(&r->cons.ht,
> - &r->stage[stage].ht, 0, st, num, behavior,
> + n = __rte_ring_headtail_move_head_mt(&r->cons.ht,
> + &r->stage[stage].ht, 0, num, behavior,
> head, next, avail);
> break;
> case RTE_RING_SYNC_MT_RTS:
> @@ -309,9 +316,8 @@ soring_enqueue_start(struct rte_soring *r, uint32_t num,
>
> switch (st) {
> case RTE_RING_SYNC_ST:
> - n = __rte_ring_headtail_move_head(&r->prod.ht, &r->cons.ht,
> - r->capacity, RTE_RING_SYNC_ST, num, behavior,
> - &head, &next, &free);
> + n = __rte_ring_headtail_move_head_st(&r->prod.ht, &r-
> >cons.ht,
> + r->capacity, num, behavior, &head, &next, &free);
> break;
> case RTE_RING_SYNC_MT_HTS:
> n = __rte_ring_hts_move_head(&r->prod.hts, &r->cons.ht,
> @@ -419,8 +425,8 @@ soring_dequeue_start(struct rte_soring *r, void *objs,
> void *meta,
>
> switch (st) {
> case RTE_RING_SYNC_ST:
> - n = __rte_ring_headtail_move_head(&r->cons.ht, &r-
> >stage[ns].ht,
> - 0, RTE_RING_SYNC_ST, num, behavior, &head, &next,
> + n = __rte_ring_headtail_move_head_st(&r->cons.ht, &r-
> >stage[ns].ht,
> + 0, num, behavior, &head, &next,
> &avail);
> break;
> case RTE_RING_SYNC_MT_HTS:
> --
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Tested-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> 2.53.0
^ permalink raw reply
* RE: [PATCH 2/5] ring: use GCC builtin as alternative to rte_atomic32
From: Konstantin Ananyev @ 2026-06-04 15:11 UTC (permalink / raw)
To: Stephen Hemminger, dev@dpdk.org; +Cc: Wathsala Vithanage
In-Reply-To: <20260602171552.686349-3-stephen@networkplumber.org>
> This patch replaces use of the deprecated rte_atomic32 code with
> GCC builtin atomic operations.
>
> Although it would be preferable to use C11 version on all architectures,
> there is a performance loss if we do it that way:
>
> Measured on i9-13900H, two physical cores MP/MC bulk n=128, 10 runs:
> with C11 builtin: 5.86 cycles/elem
> with __sync builtin: 5.36 cycles/elem (-9.4%)
>
> The C11 __atomic_compare_exchange_n builtin writes the actual value back
> to its expected pointer on failure. On x86 this forces GCC
> to emit extra instructions on the critical path between the CAS
> and the success-test.
>
> __sync_bool_compare_and_swap returns a plain bool with no pointer
> writeback, allowing GCC to emit tighter code.
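To make the difference concrete, a small sketch using the two GCC builtins named above. The wrapper names `cas_c11`, `cas_sync` and `cas_demo` are hypothetical, and the memory orders chosen here are only illustrative, not necessarily those used in the ring code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* C11-style builtin: on failure, the current value is stored in *expected. */
static bool
cas_c11(uint32_t *ptr, uint32_t *expected, uint32_t desired)
{
	return __atomic_compare_exchange_n(ptr, expected, desired, false,
			__ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

/* Legacy builtin: returns a plain bool, 'expected' is passed by value. */
static bool
cas_sync(uint32_t *ptr, uint32_t expected, uint32_t desired)
{
	return __sync_bool_compare_and_swap(ptr, expected, desired);
}

/* Shows what each variant leaves behind after a failed attempt. */
void
cas_demo(void)
{
	uint32_t head = 10;
	uint32_t expected = 7;

	/* Fails (head is 10, not 7) and overwrites 'expected' with 10. */
	assert(!cas_c11(&head, &expected, 11));
	assert(expected == 10);

	/* Retry with the refreshed 'expected' now succeeds. */
	assert(cas_c11(&head, &expected, 11));
	assert(head == 11);

	/* Fails as well, but the caller's copy of 'expected' is untouched. */
	expected = 7;
	assert(!cas_sync(&head, expected, 12));
	assert(expected == 7 && head == 11);
}
```

The caller of the `__sync` variant has to reload the head itself before retrying, which is what the retry loop in the generic implementation already does at the top of each iteration.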
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
> lib/ring/meson.build | 2 +-
> lib/ring/rte_ring_c11_pvt.h | 3 +-
> lib/ring/rte_ring_elem_pvt.h | 2 +-
> ..._ring_generic_pvt.h => rte_ring_gcc_pvt.h} | 37 +++++++++++--------
> 4 files changed, 24 insertions(+), 20 deletions(-)
> rename lib/ring/{rte_ring_generic_pvt.h => rte_ring_gcc_pvt.h} (87%)
>
> diff --git a/lib/ring/meson.build b/lib/ring/meson.build
> index 21f2c12989..2ba160b178 100644
> --- a/lib/ring/meson.build
> +++ b/lib/ring/meson.build
> @@ -9,7 +9,7 @@ indirect_headers += files (
> 'rte_ring_elem.h',
> 'rte_ring_elem_pvt.h',
> 'rte_ring_c11_pvt.h',
> - 'rte_ring_generic_pvt.h',
> + 'rte_ring_gcc_pvt.h',
> 'rte_ring_hts.h',
> 'rte_ring_hts_elem_pvt.h',
> 'rte_ring_peek.h',
> diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
> index 5afc14dec9..8358b0f21f 100644
> --- a/lib/ring/rte_ring_c11_pvt.h
> +++ b/lib/ring/rte_ring_c11_pvt.h
> @@ -43,7 +43,6 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht,
> uint32_t old_val,
> */
> rte_atomic_store_explicit(&ht->tail, new_val,
> rte_memory_order_release);
> }
> -
> /**
> * @internal This is a helper function that moves the producer/consumer head
> * optimized for single threaded case
> @@ -82,7 +81,7 @@ __rte_ring_headtail_move_head_st(struct rte_ring_headtail
> *d,
> /* Single producer: only this thread writes d->head,
> * so a relaxed load is sufficient.
> */
> - *old_head = rte_atomic_load_explicit(&d->head,
> rte_memory_order_relaxed);
> + *old_head = rte_atomic_load_explicit(&d->head,
> rte_memory_order_acquire);
Not sure why this changed to 'acquire' here.
It looks like just a patch-splitting mistake, no?
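For comparison, this is roughly the ordering that patch 1/5 had for the single-threaded path, written as a standalone sketch with plain C11 atomics. `struct headtail`, `move_head_st` and `move_head_demo` are simplified stand-ins, not the DPDK definitions, and the `fixed` flag only mimics RTE_RING_QUEUE_FIXED.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct rte_ring_headtail. */
struct headtail {
	_Atomic uint32_t head;
	_Atomic uint32_t tail;
};

/* Reserve up to n slots on d, given the opposite side s. */
static unsigned int
move_head_st(struct headtail *d, struct headtail *s, uint32_t capacity,
	     unsigned int n, int fixed, uint32_t *old_head, uint32_t *new_head)
{
	uint32_t stail, entries;

	/* Only this thread ever stores d->head, so a relaxed load is used. */
	*old_head = atomic_load_explicit(&d->head, memory_order_relaxed);

	/* Acquire pairs with the other side's release-store of its tail. */
	stail = atomic_load_explicit(&s->tail, memory_order_acquire);

	entries = capacity + stail - *old_head;
	if (n > entries)
		n = fixed ? 0 : entries;

	if (n > 0) {
		*new_head = *old_head + n;
		atomic_store_explicit(&d->head, *new_head,
				memory_order_relaxed);
	}
	return n;
}

/* Exercises the fixed and variable behaviours on an 8-entry ring. */
void
move_head_demo(void)
{
	struct headtail prod, cons;
	uint32_t oh = 0, nh = 0;

	atomic_init(&prod.head, 0);
	atomic_init(&prod.tail, 0);
	atomic_init(&cons.head, 0);
	atomic_init(&cons.tail, 0);

	assert(move_head_st(&prod, &cons, 8, 5, 1, &oh, &nh) == 5);
	/* Only 3 slots left: fixed mode reserves nothing. */
	assert(move_head_st(&prod, &cons, 8, 5, 1, &oh, &nh) == 0);
	/* Variable mode reserves the 3 remaining slots. */
	assert(move_head_st(&prod, &cons, 8, 5, 0, &oh, &nh) == 3);
	assert(oh == 5 && nh == 8);
}
```

In this sketch the only acquire is on the opposite side's tail; the load of the thread's own head stays relaxed, as it was in patch 1/5.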
>
> /* Acquire pairs with the consumer's release-store of tail in
> __rte_ring_update_tail,
> * ensuring the consumer's ring-element reads are complete before
^ permalink raw reply