* Re: [PATCH v8 13/21] net/txgbe: fix link stability for 40G NIC
From: Stephen Hemminger @ 2026-06-17 15:55 UTC (permalink / raw)
To: Zaiyu Wang; +Cc: dev, stable, Jiawen Wu
In-Reply-To: <20260617081309.19124-14-zaiyuwang@trustnetic.com>
On Wed, 17 Jun 2026 16:13:00 +0800
Zaiyu Wang <zaiyuwang@trustnetic.com> wrote:
> + /* CMS Config Master */
> + addr = E56G_CMS_ANA_OVRDVAL_7_ADDR;
> + rdata = rd32_ephy(hw, addr);
> + ((E56G_CMS_ANA_OVRDVAL_7 *)&rdata)->ana_lcpll_lf_vco_swing_ctrl_i = 0xf;
> + wr32_ephy(hw, addr, rdata);
> +
DPDK has replaced all use of the terms master/slave with alternative words.
This gets flagged by checkpatch. Please consider alternatives.
WARNING:TYPO_SPELLING: 'Master' may be misspelled - perhaps 'Primary'?
#210: FILE: drivers/net/txgbe/base/txgbe_e56.c:181:
+ /* CMS Config Master */
^^^^^^
WARNING:TYPO_SPELLING: 'Master' may be misspelled - perhaps 'Primary'?
#231: FILE: drivers/net/txgbe/base/txgbe_e56.c:202:
+ /* TXS Config Master */
^^^^^^
WARNING:TYPO_SPELLING: 'master' may be misspelled - perhaps 'primary'?
#267: FILE: drivers/net/txgbe/base/txgbe_e56.c:238:
+ /* RXS Config master */
^^^^^^
WARNING:TYPO_SPELLING: 'master' may be misspelled - perhaps 'primary'?
#484: FILE: drivers/net/txgbe/base/txgbe_e56.c:455:
+ /* PDIG Config master */
^^^^^^
^ permalink raw reply
* Re: [PATCH v8 12/21] net/txgbe: fix link stability for 25G NIC
From: Stephen Hemminger @ 2026-06-17 15:53 UTC (permalink / raw)
To: Zaiyu Wang; +Cc: dev, stable, Jiawen Wu
In-Reply-To: <20260617081309.19124-13-zaiyuwang@trustnetic.com>
On Wed, 17 Jun 2026 16:12:59 +0800
Zaiyu Wang <zaiyuwang@trustnetic.com> wrote:
> +void
> +set_fields_e56(unsigned int *src_data, unsigned int bit_high,
> + unsigned int bit_low, unsigned int set_value)
> +{
Function could be static here?
> +#define CL74_KRTR_TRAINNING_TIMEOUT 6000 /* 3000ms c74 trainning timeout */
> +#define AN_TRAINNING_MODE 0 /* 0: not dis an 1: dis an */
Checkpatch flags this as a spelling error; did you mean "training"?
Another typo here.
WARNING: [TYPO_SPELLING] 'sequeence' may be misspelled - perhaps 'sequence'?
# drivers/net/txgbe/base/txgbe_e56_bp.c:2465:
+ /* 19. rxs calibration and adaotation sequeence */
^ permalink raw reply
* [PATCH v5] dts: add retry loop to trex traffic generation
From: Andrew Bailey @ 2026-06-17 15:57 UTC (permalink / raw)
To: luca.vizzarro, patrickrobb1997; +Cc: lylavoie, knimoji, dev, Andrew Bailey
In-Reply-To: <20260518175424.253600-1-abailey@iol.unh.edu>
There was an issue where the single core forward test would report zero
MPPS intermittently. This was due to TREX reporting that the link was
down when the client was called to start generating traffic. The links
were being reported down by TREX on the tg even when testpmd was
reporting them up on the SUT. Adding a retry loop to the generate
traffic method of TREX gives the tg enough time to set up and send
traffic.
Bugzilla ID: 1946
Fixes: d77d7f04f24c ("dts: add single-core performance test suite")
Signed-off-by: Andrew Bailey <abailey@iol.unh.edu>
---
.../testbed_model/traffic_generator/trex.py | 38 ++++++++++++++-----
1 file changed, 29 insertions(+), 9 deletions(-)
diff --git a/dts/framework/testbed_model/traffic_generator/trex.py b/dts/framework/testbed_model/traffic_generator/trex.py
index 22cd20dea9..2abfbfc5ab 100644
--- a/dts/framework/testbed_model/traffic_generator/trex.py
+++ b/dts/framework/testbed_model/traffic_generator/trex.py
@@ -13,6 +13,7 @@
from framework.config.node import OS, NodeConfiguration
from framework.config.test_run import TrexTrafficGeneratorConfig
+from framework.exception import SSHTimeoutError
from framework.parser import TextParser
from framework.remote_session.blocking_app import BlockingApp
from framework.remote_session.python_shell import PythonShell
@@ -220,7 +221,9 @@ def _create_packet_stream(self, packet: Packet) -> None:
]
self._shell.send_command("\n".join(packet_stream))
- def _send_traffic_and_get_stats(self, duration: float, send_mpps: float | None = None) -> str:
+ def _send_traffic_and_get_stats(
+ self, duration: float, send_mpps: float | None = None, retry_attempts: int = 5
+ ) -> str:
"""Send traffic and get TG Rx stats.
Sends traffic from the TRex client's ports for the given duration.
@@ -230,15 +233,32 @@ def _send_traffic_and_get_stats(self, duration: float, send_mpps: float | None =
Args:
duration: The traffic generation duration.
send_mpps: The millions of packets per second for TRex to send from each port.
+ retry_attempts: The number of times to retry this command on failure.
+
+ Raises:
+ SSHTimeoutError: If TRex fails to send traffic in the allotted attempts.
"""
- if send_mpps:
- self._shell.send_command(f"""{self.stl_client_name}.start(ports=[0, 1],
- mult = '{send_mpps}mpps',
- duration = {duration})""")
- else:
- self._shell.send_command(f"""{self.stl_client_name}.start(ports=[0, 1],
- mult = '100%',
- duration = {duration})""")
+ link_down = True
+ attempt = 0
+
+ while link_down and attempt < retry_attempts:
+ if send_mpps:
+ result = self._shell.send_command(f"""{self.stl_client_name}.start(ports=[0, 1],
+ mult = '{send_mpps}mpps',
+ duration = {duration})""")
+ else:
+ result = self._shell.send_command(f"""{self.stl_client_name}.start(ports=[0, 1],
+ mult = '100%',
+ duration = {duration})""")
+ link_down = "link is DOWN" in result
+ if link_down:
+ self._logger.info(
+ f"Generate traffic command failed (attempt {attempt + 1} of {retry_attempts})"
+ )
+ time.sleep(0.25)
+ attempt += 1
+ if link_down:
+ raise SSHTimeoutError("Link failed to come up, Traffic could not be generated.")
time.sleep(duration)
--
2.54.0
^ permalink raw reply related
* Re: [PATCH v8 00/21] Wangxun Fixes
From: Stephen Hemminger @ 2026-06-17 15:57 UTC (permalink / raw)
To: Zaiyu Wang; +Cc: dev
In-Reply-To: <20260617081309.19124-1-zaiyuwang@trustnetic.com>
On Wed, 17 Jun 2026 16:12:47 +0800
Zaiyu Wang <zaiyuwang@trustnetic.com> wrote:
> This series fixes several issues found on Wangxun Emerald, Sapphire and
> Amber-lite NICs, with a focus on link-related problems.
> ---
> v8:
> - Fixed compilation error by replacing RTE_ETH_DEV_TO_PCI with RTE_CLASS_TO_BUS_DEVICE
> ---
> v7:
> - Fixed inverted semantics of is_flat_mem to match SFF8636
> ---
> v6:
> - Fixed more issues identified by AI review
> ---
> v5:
> - Fixed issues identified by AI review
> ---
> v4:
> - Fixed issues identified by devtools scripts
> ---
> v3:
> - Addressed Stephen's comments
> ---
> v2:
> - Fixed compilation error and code style issues
> ---
>
> Zaiyu Wang (21):
> net/txgbe: remove duplicate xstats counters
> net/ngbe: remove duplicate xstats counters
> net/ngbe: add missing CDR config for YT PHY
> net/ngbe: fix VF promiscuous and allmulticast
> net/txgbe: fix inaccuracy in Tx rate limiting
> net/txgbe: fix link status check condition
> net/txgbe: fix Tx desc free logic
> net/txgbe: fix link flow control registers for Amber-Lite
> net/txgbe: fix link flow control config for Sapphire
> net/txgbe: fix a mass of unknown interrupts
> net/txgbe: fix traffic class priority configuration
> net/txgbe: fix link stability for 25G NIC
> net/txgbe: fix link stability for 40G NIC
> net/txgbe: fix link stability for Amber-Lite backplane mode
> net/txgbe: fix FEC mode configuration on 25G NIC
> net/txgbe: fix SFP module identification
> net/txgbe: fix get module info operation
> net/txgbe: fix get EEPROM operation
> net/txgbe: fix to reset Tx write-back pointer
> net/txgbe: fix to enable Tx desc check
> net/txgbe: fix temperature track for AML NIC
>
> drivers/net/ngbe/base/ngbe_phy_yt.c | 3 +
> drivers/net/ngbe/ngbe_ethdev.c | 5 -
> drivers/net/ngbe/ngbe_ethdev_vf.c | 11 +-
> drivers/net/txgbe/base/meson.build | 2 +
> drivers/net/txgbe/base/txgbe.h | 2 +
> drivers/net/txgbe/base/txgbe_aml.c | 185 +-
> drivers/net/txgbe/base/txgbe_aml.h | 6 +-
> drivers/net/txgbe/base/txgbe_aml40.c | 114 +-
> drivers/net/txgbe/base/txgbe_aml40.h | 6 +-
> drivers/net/txgbe/base/txgbe_dcb_hw.c | 2 +-
> drivers/net/txgbe/base/txgbe_e56.c | 3773 +++++++++++++++++++++
> drivers/net/txgbe/base/txgbe_e56.h | 1753 ++++++++++
> drivers/net/txgbe/base/txgbe_e56_bp.c | 2597 ++++++++++++++
> drivers/net/txgbe/base/txgbe_e56_bp.h | 282 ++
> drivers/net/txgbe/base/txgbe_hw.c | 54 +-
> drivers/net/txgbe/base/txgbe_hw.h | 4 +-
> drivers/net/txgbe/base/txgbe_osdep.h | 4 +
> drivers/net/txgbe/base/txgbe_phy.c | 362 +-
> drivers/net/txgbe/base/txgbe_phy.h | 46 +-
> drivers/net/txgbe/base/txgbe_regs.h | 13 +-
> drivers/net/txgbe/base/txgbe_type.h | 43 +-
> drivers/net/txgbe/txgbe_ethdev.c | 472 ++-
> drivers/net/txgbe/txgbe_ethdev.h | 7 +-
> drivers/net/txgbe/txgbe_rxtx.c | 109 +-
> drivers/net/txgbe/txgbe_rxtx.h | 36 +
> drivers/net/txgbe/txgbe_rxtx_vec_common.h | 17 +-
> 26 files changed, 9464 insertions(+), 444 deletions(-)
> create mode 100644 drivers/net/txgbe/base/txgbe_e56.c
> create mode 100644 drivers/net/txgbe/base/txgbe_e56.h
> create mode 100644 drivers/net/txgbe/base/txgbe_e56_bp.c
> create mode 100644 drivers/net/txgbe/base/txgbe_e56_bp.h
>
Please fix several checkpatch errors and resubmit.
Although this is a false positive:
WARNING:TYPO_SPELLING: 'tread' may be misspelled - perhaps 'thread'?
#2647: FILE: drivers/net/txgbe/base/txgbe_e56_bp.c:2303:
+ BP_LOG("\tread 7001b data %0x\n", rd32_epcs(hw, 0x7001b));
It is still a bad idea to put control characters like tab in the log.
The log is often going via syslog/journal on production systems and
tabs don't do what you expect there.
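The simplest fix is to drop the "\t" from the format string. Where a
message is built from data that may contain control characters, a small
filter could be applied before logging. The sketch below is only an
illustration; log_strip_ctrl is a hypothetical helper, not an existing
DPDK or txgbe function.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: overwrite control characters (tab, newline, ...)
 * in a NUL-terminated message with spaces, in place.
 * Returns the number of characters replaced.
 */
static size_t
log_strip_ctrl(char *msg)
{
	size_t replaced = 0;

	assert(msg != NULL);

	for (; *msg != '\0'; msg++) {
		if (iscntrl((unsigned char)*msg)) {
			*msg = ' ';
			replaced++;
		}
	}
	return replaced;
}
```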
^ permalink raw reply
* Re: [PATCH v3] app/testpmd: add VLAN priority insert support
From: Stephen Hemminger @ 2026-06-17 15:46 UTC (permalink / raw)
To: Xingui Yang
Cc: dev, david.marchand, aman.deep.singh, fengchengwen, yangshuaisong,
lihuisong, liuyonglong, kangfenglong
In-Reply-To: <20260617085247.2204050-1-yangxingui@huawei.com>
On Wed, 17 Jun 2026 16:52:47 +0800
Xingui Yang <yangxingui@huawei.com> wrote:
> The tx_vlan set and tx_qinq set commands now accept full 16-bit VLAN TCI
> (Tag Control Information) instead of only 12-bit VLAN ID. This allows
> users to set 802.1p priority and CFI/DEI bits for hardware VLAN insertion.
>
> ---
> v3:
> - Remove TX path validation to accept full 16-bit TCI values
> - Rename parameter from vlan_id to vlan_tci in code and documentation
> - Rename struct fields tx_vlan_id to tx_vlan_tci for consistency
> - Rename token variables cmd_tx_vlan_set_vlanid to cmd_tx_vlan_set_vlantci
> - Update cmdline.c structure fields, TOKEN definitions, and help strings
> - Add documentation with TCI bit layout and calculation examples
>
> Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
> Suggested-by: Chengwen Feng <fengchengwen@huawei.com>
> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
Added to net-next but had to fix one thing.
Because you put the Suggested-by/Signed-off-by tags below the "---" cut
line, git trimmed them away and I had to add them back manually.
^ permalink raw reply
* Re: [PATCH 0/4] Wangxun new feature
From: Stephen Hemminger @ 2026-06-17 15:36 UTC (permalink / raw)
To: Zaiyu Wang; +Cc: dev
In-Reply-To: <20260617105959.10764-1-zaiyuwang@trustnetic.com>
On Wed, 17 Jun 2026 18:59:55 +0800
Zaiyu Wang <zaiyuwang@trustnetic.com> wrote:
> This patchset introduces three new features and critical fixes for our
> recent release cycle.
>
> Patch 1/2 adds support for UDP Segmentation Offload (USO) to improve
> large-packet transmission performance for UDP workloads.
>
> Patch 3 enables VFs to sense PF ifconfig down/up events, allowing
> better fault tolerance and fast recovery in virtualized environments.
>
> Patch 4 adds the missing VF support for the Amber-Lite 40G NICs, which
> was previously omitted in the initial integration.
>
> Zaiyu Wang (4):
> net/ngbe: add USO support
> net/txgbe: add USO support
> net/txgbe: add support for VF sensing PF down
> net/txgbe: add VF support for Amber-Lite 40G NIC
>
> drivers/net/ngbe/ngbe_rxtx.c | 13 +++---
> drivers/net/txgbe/base/txgbe_devids.h | 2 +
> drivers/net/txgbe/base/txgbe_hw.c | 7 ++++
> drivers/net/txgbe/base/txgbe_regs.h | 7 +++-
> drivers/net/txgbe/base/txgbe_type.h | 2 +
> drivers/net/txgbe/base/txgbe_vf.c | 7 ++--
> drivers/net/txgbe/txgbe_ethdev.c | 4 +-
> drivers/net/txgbe/txgbe_ethdev_vf.c | 60 +++++++++++++++++++++++----
> drivers/net/txgbe/txgbe_rxtx.c | 13 +++---
> 9 files changed, 92 insertions(+), 23 deletions(-)
>
Preliminary AI review found some bugs.
Review of [PATCH v2 0/4] net/{txgbe,ngbe}: USO, VF PF-down sensing,
AML 40G VF support
Patch 3/4 (net/txgbe: add support for VF sensing PF down)
Error: dead store in txgbevf_get_pf_link_status() leaves the link
struct status field unmodified when the PF is down.
+ if (!hw->pf_running) {
+ link_up = false;
+ link_speed = TXGBE_LINK_SPEED_UNKNOWN;
+ link.link_duplex = RTE_ETH_LINK_HALF_DUPLEX;
+ return rte_eth_linkstatus_set(dev, &link);
+ }
The local variables link_up and link_speed are written here and never
read before the function returns, so they have no effect. The actual
fields that need to change are link.link_status and link.link_speed,
neither of which is touched -- the function then publishes a link
struct that still has the previous link.link_status (whatever
rte_eth_linkstatus_get returned a few lines earlier) and only the
duplex updated.
Compare with the parallel code added in the same patch in
txgbevf_check_link_for_intr(), which gets it right:
+ new_link.link_status = RTE_ETH_LINK_DOWN;
+ new_link.link_speed = RTE_ETH_SPEED_NUM_NONE;
+ new_link.link_duplex = RTE_ETH_LINK_HALF_DUPLEX;
In txgbevf_get_pf_link_status, assign directly to the link fields:
if (!hw->pf_running) {
link.link_status = RTE_ETH_LINK_DOWN;
link.link_speed = RTE_ETH_SPEED_NUM_NONE;
link.link_duplex = RTE_ETH_LINK_HALF_DUPLEX;
return rte_eth_linkstatus_set(dev, &link);
}
Warning: the commit message describes a flag that is not what the code
checks. The message says "Detect PF ifconfig down when
TXGBE_VT_MSGTYPE_SPEC is present in mailbox commands", but the
implementation in txgbevf_mbx_process does:
+ if (!(in_msg & TXGBE_VT_MSGTYPE_CTS)) {
TXGBE_VT_MSGTYPE_SPEC is defined in base/txgbe_mbx.h with a different
bit (0x10000000) and is only used in the 5-tuple-filter request path.
The detection here is "CTS is absent", not "SPEC is present".
Please rewrite the commit message to match the code.
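To make the difference concrete, here is a small sketch of the two
predicates. The SPEC value is the one quoted above; the CTS value and all
SKETCH_/pf_down_ names are assumptions made for illustration, not taken
from the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_VT_MSGTYPE_SPEC UINT32_C(0x10000000) /* value quoted in the review */
#define SKETCH_VT_MSGTYPE_CTS  UINT32_C(0x20000000) /* assumed value, illustration only */

static_assert((SKETCH_VT_MSGTYPE_SPEC & SKETCH_VT_MSGTYPE_CTS) == 0,
	      "the two flags must be distinct bits");

/* What the code does: the PF is treated as down when CTS is absent. */
static bool
pf_down_by_cts_absent(uint32_t in_msg)
{
	return !(in_msg & SKETCH_VT_MSGTYPE_CTS);
}

/* What the commit message describes: SPEC is present. */
static bool
pf_down_by_spec_present(uint32_t in_msg)
{
	return (in_msg & SKETCH_VT_MSGTYPE_SPEC) != 0;
}
```

A message with neither bit set is "down" under the first predicate and
"up" under the second, so the two descriptions are not interchangeable.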
Warning: no release notes entry. This is new VF behavior visible to
applications (a new INTR_RESET callback is dispatched when the PF
comes back up); doc/guides/rel_notes/release_26_07.rst should
mention it.
Info: msgbuf is declared as
u32 msgbuf[TXGBE_VF_PERMADDR_MSG_LEN] = {0}; but only msgbuf[0] is
ever used. Either size it to one element or drop the zero-init since
only the first element is touched before write_posted.
Patch 4/4 (net/txgbe: add VF support for Amber-Lite 40G NIC)
Warning: stray blank line between the if() and its body in
txgbe_reset_hw_vf:
+ if (hw->mac.type == txgbe_mac_aml_vf ||
+ hw->mac.type == txgbe_mac_aml40_vf)
+
wr32(hw, TXGBE_BME_AML, 0x1);
The blank line is a no-op for C but reads as a typo and checkpatch will
flag it. Remove the blank '+' line.
Warning: no release notes entry. New device IDs are user-visible (the
VF binds devices that previously did not work); please add a brief
note to release_26_07.rst.
Info: in txgbe_check_mac_link_vf the patch drops the explicit
sp_vf/aml_vf check before the wait loop:
- if ((mac->type == txgbe_mac_sp_vf ||
- mac->type == txgbe_mac_aml_vf) && wait_to_complete) {
+ if (wait_to_complete) {
The simplification is fine since txgbe_check_mac_link_vf is only ever
called with a VF mac type, but it is an unrelated cleanup that could be
mentioned in the commit message (or split into its own patch).
Patches 1/4 and 2/4 (net/{ngbe,txgbe}: add USO support)
Info: it would help readers if the commit message noted that
RTE_ETH_TX_OFFLOAD_UDP_TSO was already advertised in tx_offload_capa
(see ngbe_get_tx_port_offloads / txgbe_get_tx_port_offloads), so this
patch fills in the data path for a capability that was previously
claimed but unimplemented. Without that context the change looks like
a new feature addition; with it, it is clearly a fix and could carry a
Fixes: tag plus Cc: stable.
Info: a brief features-list or feature-matrix note in release_26_07.rst
would still be welcome since UDP_SEG offload now actually works.
Stephen Hemminger <stephen@networkplumber.org>
^ permalink raw reply
* Re: [PATCH v6 0/4] net/zxdh: optimize Rx/Tx path performance
From: Stephen Hemminger @ 2026-06-17 15:21 UTC (permalink / raw)
To: Junlong Wang; +Cc: dev
In-Reply-To: <20260617082828.1058127-1-wang.junlong1@zte.com.cn>
On Wed, 17 Jun 2026 16:28:23 +0800
Junlong Wang <wang.junlong1@zte.com.cn> wrote:
> v6:
> - Remove unnecessary error checking code in submit_to_backend_simple() and
> pkt_padding(). Since as the max dl_net_hdr_len is always less than
> RTE_PKTMBUF_HEADROOM, rte_pktmbuf_prepend() cannot fail in the
> simple path (single-segment mbufs).
>
> v5:
> - Reorganize patch series, placing interrupt fix as the first patch
> and fix condition check to properly enable interrupts.
> - Fix zxdh_recv_single_pkts() not compacting rcv_pkts[] on failure,
> which could cause use-after-free and mbuf leak.
> - Fix tx_bunch() and tx1() missing store barrier before setting AVAIL flag,
> preventing data race on weakly-ordered architectures.
> - Fix submit_to_backend_simple() writing descriptors for packets that
> failed pkt_padding(), causing mbuf leak.
>
> v4:
> - fix some AI review issues.
> - fix queue enable intr bug.
>
> v3:
> - remove unnecessary NULL check in zxdh_init_queue.
> - Split Ring: Bit[31] is unused and reserved, zxdh_queue_notify(): removing the
> zxdh_pci_with_feature(hw, ZXDH_F_RING_PACKED) check;
> - remove unnecessary double-free in in zxdh_recv_single_pkts();
> - used rte_pktmbuf_mtod();
> - remove rxq_get_vq(q) macro, use q->vq and apply it consistently;
> - Refactoring scatter and mtu check logic in zxdh_dev_mtu_set();
> - set txdp->id = avail_idx + i in tx_bunch/tx1.
> - add comment documenting zxdh_xmit_enqueue_append() now sets dxp->cookie = NULL for
> the head slot and stores cookies per descriptor via dep[idx].cookie.
> - add one-line comment noting tx_bunch() is the simple path handles single-segment.
> - remove unnecessary Extra initialization and the uint32_t cast.
>
> v2:
> - zxdh_rxtx.c, pkt_padding(): modifyed the return value of pkt_padding();
> - zxdh_rxtx.c, zxdh_recv_single_pkts(): modifyed When zxdh_init_mbuf() fails
> the loop does "continue" and free mbufs;
> - zxdh_rxtx.c, refill_desc_unwrap(): Add rte_io_wmb() before writing flags
> in the refill_que_descs();
> - zxdh_queue.h, zxdh_queue_enable_intr(): Remove unnecessary function of zxdh_queue_enable_intr;
> - zxdh_ethdev.c, zxdh_init_queue(): changed the hdr_mz NULL check logic;
>
> - zxdh_rxtx.c, zxdh_xmit_pkts_simple()、zxdh_recv_single_pkts(): add stats.bytes count;
> - zxdh_rxtx.c, zxdh_init_mbuf():remove rte_pktmbuf_dump(stdout, rxm, 40);
> - zxdh_ethdev.c, zxdh_dev_free_mbufs(): using rte_pktmbuf_free() to free mbufs;
> - Splitting into separate patches, structure reorganization and sw_ring removal、
> RX recv optimize、Tx xmit optimize、Tx;
>
> v1:
> This patch optimizes the ZXDH PMD's receive and transmit path for better
> performance through several improvements:
>
> - Add simple TX/RX burst functions (zxdh_xmit_pkts_simple and
> zxdh_recv_single_pkts) for single-segment packet scenarios.
> - Remove RX software ring (sw_ring) to reduce memory allocation and
> copy.
> - Optimize descriptor management with prefetching and simplified
> cleanup.
> - Reorganize structure fields for better cache locality.
>
> These changes reduce CPU cycles and memory bandwidth consumption,
> resulting in improved packet processing throughput.
>
> Junlong Wang (4):
> net/zxdh: fix queue enable intr issues
> net/zxdh: optimize queue structure to improve performance
> net/zxdh: optimize Rx recv pkts performance
> net/zxdh: optimize Tx xmit pkts performance
>
> drivers/net/zxdh/zxdh_ethdev.c | 81 ++---
> drivers/net/zxdh/zxdh_ethdev_ops.c | 23 +-
> drivers/net/zxdh/zxdh_ethdev_ops.h | 4 +
> drivers/net/zxdh/zxdh_pci.c | 2 +-
> drivers/net/zxdh/zxdh_queue.c | 11 +-
> drivers/net/zxdh/zxdh_queue.h | 122 ++++---
> drivers/net/zxdh/zxdh_rxtx.c | 529 ++++++++++++++++++++++-------
> drivers/net/zxdh/zxdh_rxtx.h | 27 +-
> 8 files changed, 539 insertions(+), 260 deletions(-)
>
I thought this was better, but AI found one new bug.
Also, since the driver depends on mbuf headroom, and headroom is typically
set by rte_pktmbuf_alloc(), a compile-time check would be good defense in
depth. Something like:
static_assert(RTE_PKTMBUF_HEADROOM >= ZXDH_DL_NET_HDR_SIZE,
"RTE_PKTMBUF_HEADROOM too small for zxdh Tx downlink header");
You still need the check in transmit, but this would protect
against user misconfiguration.
Not sure why you needed to open-code rte_pktmbuf_prepend(), which
has the check that AI wants you to have.
Series review: net/zxdh Rx/Tx optimization (v6)
Patches 1-3 are unchanged from v5 and remain good. The v5 issue in the
simple Tx path (leaked mbuf and unpublished descriptor on the
pkt_padding failure branch) is resolved: pkt_padding() no longer
returns a failure, so there is no skip/continue and no ring gap. The
way it was made infallible introduces a new problem.
[PATCH v6 4/4] net/zxdh: optimize Tx xmit pkts performance
Error: pkt_padding() writes the downlink header in place without
checking that the mbuf has hdr_len bytes of headroom, so a packet with
short headroom causes an out-of-bounds write and a data_off underflow.
hdr = rte_pktmbuf_mtod_offset(cookie, struct zxdh_net_hdr_dl *, -hdr_len);
memcpy(hdr, net_hdr_dl, hdr_len);
/* Update mbuf to reflect the prepended header */
cookie->data_off -= hdr_len;
cookie->data_len += hdr_len;
cookie->pkt_len += hdr_len;
This open-codes rte_pktmbuf_prepend() but drops its headroom check. If
cookie->data_off < hdr_len, the mtod_offset pointer lands before
buf_addr and the memcpy scribbles over the mbuf's private area or
the preceding memory; cookie->data_off (uint16_t) then underflows to a
large value. v4/v5 used rte_pktmbuf_prepend() here, which returns NULL
when headroom is short, so this is a regression: a checked-but-mishandled
case became an unchecked memory corruption.
Nothing upstream of the simple path guarantees the headroom. The simple
Tx burst is selected solely on RTE_ETH_TX_OFFLOAD_MULTI_SEGS being
disabled, which is independent of headroom, and tx_prepare()'s
dl_net_hdr_check() only validates offload capability, not headroom. The
driver itself shows the headroom cannot be assumed: the packed path
gates the in-place prepend on
txm->data_off >= ZXDH_DL_NET_HDR_SIZE
(zxdh_xmit_pkts_packed, the can_push test) and falls back to
zxdh_xmit_enqueue_append(), which copies the header into the reserved
txr region instead of prepending. The simple path has no equivalent
guard.
Restore a headroom check before the in-place prepend. For a
short-headroom packet, fall back to the append path that copies the
header into the reserved region (as the packed path does) rather than
writing before the buffer.
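A minimal sketch of the guarded prepend follows. It uses a stand-in
struct with only the fields that matter here, not the real rte_mbuf, and
sketch_prepend_hdr is a hypothetical name. The fallback itself (copying
the header into the reserved region) is left to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the few mbuf fields used below (not the DPDK struct). */
struct sketch_mbuf {
	uint8_t *buf_addr; /* start of the buffer */
	uint16_t data_off; /* offset of packet data from buf_addr (headroom) */
	uint16_t data_len;
	uint32_t pkt_len;
};

/*
 * Prepend hdr in place only when the headroom is large enough.
 * Returns 0 on success, or -1 when the caller must take the fallback
 * path instead of writing before the buffer.
 */
static int
sketch_prepend_hdr(struct sketch_mbuf *m, const void *hdr, uint16_t hdr_len)
{
	assert(m != NULL && hdr != NULL);

	if (m->data_off < hdr_len)
		return -1; /* short headroom: leave the mbuf untouched */

	m->data_off -= hdr_len;
	m->data_len += hdr_len;
	m->pkt_len += hdr_len;
	memcpy(m->buf_addr + m->data_off, hdr, hdr_len);
	return 0;
}
```

The comparison happens before data_off is modified, so the uint16_t
cannot underflow and the copy never starts before buf_addr.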
^ permalink raw reply
* Re: [PATCH] common/cnxk: add bulk Rx queue enable/disable
From: Stephen Hemminger @ 2026-06-17 14:52 UTC (permalink / raw)
To: Rakesh Kudurumalla
Cc: Nithin Dabilpuram, Kiran Kumar K, Sunil Kumar Kori, Satha Rao,
Harman Kalra, dev, jerinj
In-Reply-To: <20260617054943.287815-1-rkudurumalla@marvell.com>
On Wed, 17 Jun 2026 11:19:43 +0530
Rakesh Kudurumalla <rkudurumalla@marvell.com> wrote:
>
> +#define NIX_RQ_BULK_ENA_DIS_LOOP(REQ_TYPE, ALLOC_FN) \
> + do { \
> + for (i = 0; i < nb_rx_queues; i++) { \
> + REQ_TYPE *aq; \
> + if (rqs[i].qid == UINT16_MAX) \
> + continue; \
> + struct roc_nix_rq *rq = &rqs[i]; \
> + aq = ALLOC_FN(mbox); \
> + if (!aq) { \
> + rc = mbox_process(mbox); \
> + if (rc) \
> + goto exit; \
> + aq = ALLOC_FN(mbox); \
> + if (!aq) { \
> + rc = -ENOSPC; \
> + goto exit; \
> + } \
> + } \
> + aq->qidx = rq->qid; \
> + aq->ctype = NIX_AQ_CTYPE_RQ; \
> + aq->op = NIX_AQ_INSTOP_WRITE; \
> + aq->rq.ena = enable; \
> + aq->rq_mask.ena = ~(aq->rq_mask.ena); \
> + } \
> + } while (0)
> +
Why can't this be an inline function? Large macros are a bad idea.
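As a rough sketch of what that could look like: the loop below takes the
allocator as a function pointer instead of a macro argument. Everything
here is a toy model. The sketch_ types, the two-slot mailbox and the
single request type are assumptions for illustration; the real code has
several request types, which would need per-type wrappers or a fill
callback.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MBOX_SLOTS 2

struct sketch_req { uint16_t qidx; int ena; };
struct sketch_rq { uint16_t qid; };

/* Toy mailbox: holds a few requests until they are "processed". */
struct sketch_mbox {
	struct sketch_req slots[SKETCH_MBOX_SLOTS];
	unsigned int used;
	unsigned int processed; /* total requests flushed so far */
};

typedef struct sketch_req *(*sketch_alloc_fn)(struct sketch_mbox *mbox);

/* Returns NULL when the mailbox is full. */
static struct sketch_req *
sketch_alloc(struct sketch_mbox *mbox)
{
	if (mbox->used == SKETCH_MBOX_SLOTS)
		return NULL;
	return &mbox->slots[mbox->used++];
}

/* Flush all queued requests; always succeeds in this toy model. */
static int
sketch_process(struct sketch_mbox *mbox)
{
	mbox->processed += mbox->used;
	mbox->used = 0;
	return 0;
}

/* Queue one enable/disable request per valid RQ; the final flush is left to the caller. */
static inline int
sketch_rq_bulk_ena_dis(struct sketch_mbox *mbox, sketch_alloc_fn alloc_fn,
		       const struct sketch_rq *rqs, uint16_t nb_rx_queues,
		       int enable)
{
	uint16_t i;
	int rc;

	assert(mbox != NULL && alloc_fn != NULL && rqs != NULL);

	for (i = 0; i < nb_rx_queues; i++) {
		struct sketch_req *aq;

		if (rqs[i].qid == UINT16_MAX)
			continue; /* unused slot */

		aq = alloc_fn(mbox);
		if (aq == NULL) {
			/* Mailbox full: flush it, then retry once. */
			rc = sketch_process(mbox);
			if (rc != 0)
				return rc;
			aq = alloc_fn(mbox);
			if (aq == NULL)
				return -ENOSPC;
		}
		aq->qidx = rqs[i].qid;
		aq->ena = enable;
	}
	return 0;
}
```

Compared with the macro, errors are returned rather than handled with a
goto into the caller, and the loop variables are local to the function.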
^ permalink raw reply
* [DPDK/ethdev Bug 1958] Broadcom bnxt RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE always on
From: bugzilla @ 2026-06-17 14:51 UTC (permalink / raw)
To: dev
http://bugs.dpdk.org/show_bug.cgi?id=1958
Bug ID: 1958
Summary: Broadcom bnxt RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE always
on
Product: DPDK
Version: 26.03
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: normal
Priority: Normal
Component: ethdev
Assignee: dev@dpdk.org
Reporter: riaan.wiid@vastech.co.za
Target Milestone: ---
Environment
===========
dpdk-26.03 (tag v26.03)
OS: Rocky Linux (9.3 (Blue Onyx))
HW: Broadcom BCM57414 NetXtreme-E 10Gb/25Gb
Compiler: gcc 14.2.1 20240801 (Red Hat 14.2.1-1)
Description of Problem
======================
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is always enabled unless PTP support is
enabled on a NIC port. In our case we do not enable PTP support, which means
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE is enabled by default on all queues. This is
undesired behaviour when making use of mbuf refcnt in our designs.
This behaviour seems to have started after this commit:
https://github.com/DPDK/dpdk/commit/5722ed1aafd2fc5ad44e432a4f6c7bd337c673d8
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE cannot be disabled through the offloads
specified in either rte_eth_dev_configure() or rte_eth_tx_queue_setup().
--
You are receiving this mail because:
You are the assignee for the bug.
^ permalink raw reply
* [PATCH v2] drivers: update relaxed ordering policy for mlx5 mkeys
From: Maayan Kashani @ 2026-06-17 14:27 UTC (permalink / raw)
To: dev
Cc: mkashani, rasland, Viacheslav Ovsiienko, Dariusz Sosnowski,
Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad
In-Reply-To: <20260616122355.39114-1-mkashani@nvidia.com>
New adapters expose additional ordering capabilities.
Query the new caps and apply them when creating DevX mkeys via
mlx5_devx_mkey_attr_set_ordering(), which sets PCI relaxed ordering
and RAW=RO when relaxed order is supported.
Use this helper on Windows (still gated by Haswell/Broadwell) and for
Linux wrapped mkeys and crypto/regex/vdpa indirect mkeys when
relaxed order only flag is set.
Linux wrapped mkeys continue to use the legacy Haswell/Broadwell rule for
IBV_ACCESS_RELAXED_ORDERING on the verbs MR.
Upcoming FW will require setting the correct ordering attributes,
otherwise it fails to create the memory key.
Signed-off-by: Maayan Kashani <mkashani@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
drivers/common/mlx5/linux/mlx5_common_os.c | 8 +++++
drivers/common/mlx5/mlx5_devx_cmds.c | 31 ++++++++++++++++++++
drivers/common/mlx5/mlx5_devx_cmds.h | 9 ++++++
drivers/common/mlx5/mlx5_prm.h | 18 ++++++++++--
drivers/common/mlx5/windows/mlx5_common_os.c | 8 ++---
drivers/crypto/mlx5/mlx5_crypto.c | 4 +++
drivers/regex/mlx5/mlx5_regex_fastpath.c | 5 ++++
drivers/regex/mlx5/mlx5_rxp.c | 4 +++
drivers/vdpa/mlx5/mlx5_vdpa_mem.c | 4 +++
9 files changed, 83 insertions(+), 8 deletions(-)
diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index e3db6c41245..36b7874ce77 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -997,6 +997,7 @@ int
mlx5_os_wrapped_mkey_create(void *ctx, void *pd, uint32_t pdn, void *addr,
size_t length, struct mlx5_pmd_wrapped_mr *pmd_mr)
{
+ struct mlx5_hca_attr hca_attr = { 0 };
struct mlx5_klm klm = {
.byte_count = length,
.address = (uintptr_t)addr,
@@ -1019,6 +1020,13 @@ mlx5_os_wrapped_mkey_create(void *ctx, void *pd, uint32_t pdn, void *addr,
klm.mkey = ibv_mr->lkey;
mkey_attr.addr = (uintptr_t)addr;
mkey_attr.size = length;
+ if (mlx5_devx_cmd_query_hca_attr(ctx, &hca_attr)) {
+ claim_zero(mlx5_glue->dereg_mr(ibv_mr));
+ return -1;
+ }
+ /* If only relaxed order is allowed. */
+ if (hca_attr.mkc_order_write_after_write_ro_only)
+ mlx5_devx_mkey_attr_set_ordering(&mkey_attr, &hca_attr);
mkey = mlx5_devx_cmd_mkey_create(ctx, &mkey_attr);
if (!mkey) {
claim_zero(mlx5_glue->dereg_mr(ibv_mr));
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index c4ac2aaceed..140b057ab47 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -331,6 +331,29 @@ mlx5_devx_cmd_flow_counter_query(struct mlx5_devx_obj *dcs,
return 0;
}
+/**
+ * Apply PCI relaxed-ordering and read-after-write ordering to mkey attributes.
+ *
+ * @param[in, out] mkey_attr
+ * Mkey attributes to update.
+ * @param[in] hca_attr
+ * HCA capabilities from mlx5_devx_cmd_query_hca_attr().
+ */
+RTE_EXPORT_INTERNAL_SYMBOL(mlx5_devx_mkey_attr_set_ordering)
+void
+mlx5_devx_mkey_attr_set_ordering(struct mlx5_devx_mkey_attr *mkey_attr,
+ const struct mlx5_hca_attr *hca_attr)
+{
+ if (!mkey_attr || !hca_attr)
+ return;
+
+ mkey_attr->relaxed_ordering_write = hca_attr->relaxed_ordering_write;
+ mkey_attr->relaxed_ordering_read =
+ hca_attr->relaxed_ordering_read || hca_attr->pci_relaxed_ordered_read;
+ if (hca_attr->mkc_order_read_after_write)
+ mkey_attr->read_after_write_ordering = MLX5_MKC_RAW_ORDERING_RO;
+}
+
/**
* Create a new mkey.
*
@@ -417,6 +440,8 @@ mlx5_devx_cmd_mkey_create(void *ctx,
MLX5_SET(mkc, mkc, relaxed_ordering_write,
attr->relaxed_ordering_write);
MLX5_SET(mkc, mkc, relaxed_ordering_read, attr->relaxed_ordering_read);
+ MLX5_SET(mkc, mkc, order_read_after_write,
+ attr->read_after_write_ordering);
MLX5_SET64(mkc, mkc, start_addr, attr->addr);
MLX5_SET64(mkc, mkc, len, attr->size);
MLX5_SET(mkc, mkc, crypto_en, attr->crypto_en);
@@ -1003,6 +1028,12 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
relaxed_ordering_write);
attr->relaxed_ordering_read = MLX5_GET(cmd_hca_cap, hcattr,
relaxed_ordering_read);
+ attr->pci_relaxed_ordered_read = MLX5_GET(cmd_hca_cap, hcattr,
+ pci_relaxed_ordered_read);
+ attr->mkc_order_read_after_write = MLX5_GET(cmd_hca_cap, hcattr,
+ mkc_order_read_after_write);
+ attr->mkc_order_write_after_write_ro_only = MLX5_GET(cmd_hca_cap, hcattr,
+ mkc_order_write_after_write_ro_only);
attr->access_register_user = MLX5_GET(cmd_hca_cap, hcattr,
access_register_user);
attr->eth_net_offloads = MLX5_GET(cmd_hca_cap, hcattr,
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index 82d949972bb..90beb2e9e6c 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -34,6 +34,7 @@ struct mlx5_devx_mkey_attr {
uint32_t pg_access:1;
uint32_t relaxed_ordering_write:1;
uint32_t relaxed_ordering_read:1;
+ uint32_t read_after_write_ordering:2;
uint32_t umr_en:1;
uint32_t crypto_en:2;
uint32_t set_remote_rw:1;
@@ -237,6 +238,9 @@ struct mlx5_hca_attr {
uint32_t vhca_id:16;
uint32_t relaxed_ordering_write:1;
uint32_t relaxed_ordering_read:1;
+ uint32_t pci_relaxed_ordered_read:1;
+ uint32_t mkc_order_read_after_write:1;
+ uint32_t mkc_order_write_after_write_ro_only:1;
uint32_t access_register_user:1;
uint32_t wqe_index_ignore:1;
uint32_t cross_channel:1;
@@ -748,6 +752,11 @@ int mlx5_devx_cmd_query_hca_attr(void *ctx,
__rte_internal
struct mlx5_devx_obj *mlx5_devx_cmd_mkey_create(void *ctx,
struct mlx5_devx_mkey_attr *attr);
+
+__rte_internal
+void
+mlx5_devx_mkey_attr_set_ordering(struct mlx5_devx_mkey_attr *mkey_attr,
+ const struct mlx5_hca_attr *hca_attr);
__rte_internal
int mlx5_devx_get_out_command_status(void *out);
__rte_internal
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 3bb072a7fec..c2810194f8e 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -1463,7 +1463,9 @@ struct mlx5_ifc_mkc_bits {
u8 bsf_octword_size[0x20];
u8 reserved_at_120[0x80];
u8 translations_octword_size[0x20];
- u8 reserved_at_1c0[0x19];
+ u8 reserved_at_1c0[0x16];
+ u8 order_read_after_write[0x2];
+ u8 reserved_at_1d8[0x1];
u8 relaxed_ordering_read[0x1];
u8 reserved_at_1da[0x1];
u8 log_page_size[0x5];
@@ -1478,6 +1480,13 @@ enum {
MLX5_MKEY_CRYPTO_ENABLED = 0x1,
};
+/* MKC read_after_write_ordering field (2-bit, dword 0x38 bits 9:8). */
+enum mlx5_mkc_raw_ordering {
+ MLX5_MKC_RAW_ORDERING_SO = 0x0,
+ MLX5_MKC_RAW_ORDERING_SAO = 0x1,
+ MLX5_MKC_RAW_ORDERING_RO = 0x2,
+};
+
struct mlx5_ifc_create_mkey_out_bits {
u8 status[0x8];
u8 reserved_at_8[0x18];
@@ -1827,7 +1836,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 log_max_mcg[0x8];
u8 reserved_at_320[0x3];
u8 log_max_transport_domain[0x5];
- u8 reserved_at_328[0x3];
+ u8 reserved_at_328[0x2];
+ u8 pci_relaxed_ordered_read[0x1];
u8 log_max_pd[0x5];
u8 reserved_at_330[0xb];
u8 log_max_xrcd[0x5];
@@ -1860,7 +1870,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 ext_stride_num_range[0x1];
u8 reserved_at_3a1[0x2];
u8 log_max_stride_sz_rq[0x5];
- u8 reserved_at_3a8[0x3];
+ u8 mkc_order_read_after_write[0x1];
+ u8 mkc_order_write_after_write_ro_only[0x1];
+ u8 reserved_at_3aa[0x1];
u8 log_min_stride_sz_rq[0x5];
u8 reserved_at_3b0[0x3];
u8 log_max_stride_sz_sq[0x5];
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.c b/drivers/common/mlx5/windows/mlx5_common_os.c
index c790c9a4aeb..bdafb95df98 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.c
+++ b/drivers/common/mlx5/windows/mlx5_common_os.c
@@ -384,7 +384,7 @@ mlx5_os_reg_mr(void *pd,
{
struct mlx5_devx_mkey_attr mkey_attr;
struct mlx5_pd *mlx5_pd = (struct mlx5_pd *)pd;
- struct mlx5_hca_attr attr;
+ struct mlx5_hca_attr attr = { 0 };
struct mlx5_devx_obj *mkey;
void *obj;
@@ -403,10 +403,8 @@ mlx5_os_reg_mr(void *pd,
mkey_attr.size = length;
mkey_attr.umem_id = ((struct mlx5_devx_umem *)(obj))->umem_id;
mkey_attr.pd = mlx5_pd->pdn;
- if (!mlx5_haswell_broadwell_cpu) {
- mkey_attr.relaxed_ordering_write = attr.relaxed_ordering_write;
- mkey_attr.relaxed_ordering_read = attr.relaxed_ordering_read;
- }
+ if (!mlx5_haswell_broadwell_cpu)
+ mlx5_devx_mkey_attr_set_ordering(&mkey_attr, &attr);
mkey = mlx5_devx_cmd_mkey_create(mlx5_pd->devx_ctx, &mkey_attr);
if (!mkey) {
claim_zero(mlx5_os_umem_dereg(obj));
diff --git a/drivers/crypto/mlx5/mlx5_crypto.c b/drivers/crypto/mlx5/mlx5_crypto.c
index dd0aabb6d75..448dd0c5a4e 100644
--- a/drivers/crypto/mlx5/mlx5_crypto.c
+++ b/drivers/crypto/mlx5/mlx5_crypto.c
@@ -97,7 +97,11 @@ mlx5_crypto_indirect_mkeys_prepare(struct mlx5_crypto_priv *priv,
mlx5_crypto_mkey_update_t update_cb)
{
uint32_t i;
+ struct mlx5_hca_attr *hca_attr = &priv->cdev->config.hca_attr;
+ /* If only relaxed order is allowed. */
+ if (hca_attr->mkc_order_write_after_write_ro_only)
+ mlx5_devx_mkey_attr_set_ordering(attr, hca_attr);
for (i = 0; i < qp->entries_n; i++) {
attr->klm_array = update_cb(priv, qp, i);
qp->mkey[i] = mlx5_devx_cmd_mkey_create(priv->cdev->ctx, attr);
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index 3207bcbc603..55f7411593a 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -755,9 +755,14 @@ mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
setup_qps(priv, qp);
if (priv->has_umr) {
+ struct mlx5_hca_attr *hca_attr = &priv->cdev->config.hca_attr;
+
#ifdef HAVE_IBV_FLOW_DV_SUPPORT
attr.pd = priv->cdev->pdn;
#endif
+ /* If only relaxed order is allowed. */
+ if (hca_attr->mkc_order_write_after_write_ro_only)
+ mlx5_devx_mkey_attr_set_ordering(&attr, hca_attr);
for (i = 0; i < qp->nb_desc; i++) {
attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
attr.klm_array = qp->jobs[i].imkey_array;
diff --git a/drivers/regex/mlx5/mlx5_rxp.c b/drivers/regex/mlx5/mlx5_rxp.c
index dda4a7fdb0b..b865c08b53c 100644
--- a/drivers/regex/mlx5/mlx5_rxp.c
+++ b/drivers/regex/mlx5/mlx5_rxp.c
@@ -54,6 +54,7 @@ rxp_create_mkey(struct mlx5_regex_priv *priv, void *ptr, size_t size,
uint32_t access, struct mlx5_regex_mkey *mkey)
{
struct mlx5_devx_mkey_attr mkey_attr;
+ struct mlx5_hca_attr *hca_attr = &priv->cdev->config.hca_attr;
/* Register the memory. */
mkey->umem = mlx5_glue->devx_umem_reg(priv->cdev->ctx, ptr, size, access);
@@ -72,6 +73,9 @@ rxp_create_mkey(struct mlx5_regex_priv *priv, void *ptr, size_t size,
#ifdef HAVE_IBV_FLOW_DV_SUPPORT
mkey_attr.pd = priv->cdev->pdn;
#endif
+ /* If only relaxed order is allowed. */
+ if (hca_attr->mkc_order_write_after_write_ro_only)
+ mlx5_devx_mkey_attr_set_ordering(&mkey_attr, hca_attr);
mkey->mkey = mlx5_devx_cmd_mkey_create(priv->cdev->ctx, &mkey_attr);
if (!mkey->mkey) {
DRV_LOG(ERR, "Failed to create direct mkey!");
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_mem.c b/drivers/vdpa/mlx5/mlx5_vdpa_mem.c
index 4dfe800b8fc..8c9d169d2a8 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_mem.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_mem.c
@@ -179,6 +179,7 @@ static int
mlx5_vdpa_create_indirect_mkey(struct mlx5_vdpa_priv *priv)
{
struct mlx5_devx_mkey_attr mkey_attr;
+ struct mlx5_hca_attr *hca_attr = &priv->cdev->config.hca_attr;
struct mlx5_vdpa_query_mr *mrs =
(struct mlx5_vdpa_query_mr *)priv->mrs;
struct mlx5_vdpa_query_mr *entry;
@@ -242,6 +243,9 @@ mlx5_vdpa_create_indirect_mkey(struct mlx5_vdpa_priv *priv)
mkey_attr.pg_access = 0;
mkey_attr.klm_array = klm_array;
mkey_attr.klm_num = klm_index;
+ /* If only relaxed order is allowed. */
+ if (hca_attr->mkc_order_write_after_write_ro_only)
+ mlx5_devx_mkey_attr_set_ordering(&mkey_attr, hca_attr);
entry = &mrs[mem->nregions];
entry->mkey = mlx5_devx_cmd_mkey_create(priv->cdev->ctx, &mkey_attr);
if (!entry->mkey) {
--
2.21.0
^ permalink raw reply related
* [PATCH v2] eal: fix core_index for non-EAL registered threads
From: Maxime Peim @ 2026-06-17 8:09 UTC (permalink / raw)
To: dev; +Cc: david.marchand
Threads registered via rte_thread_register() are assigned a valid
lcore_id by eal_lcore_non_eal_allocate(), but their core_index in
lcore_config is left at -1. This value was set during rte_eal_cpu_init()
for lcores with ROLE_OFF (undetected CPUs) and is never updated when the
lcore is later allocated to a non-EAL thread.
As a result, rte_lcore_index() returns -1 for registered non-EAL
threads. Libraries that use rte_lcore_index() to select per-lcore
caches fall back to a shared global path when it returns -1, causing
severe contention under concurrent access from multiple registered
threads.
A concrete example is the mlx5 indexed memory pool (mlx5_ipool), which
uses rte_lcore_index() in mlx5_ipool_malloc_cache() to select a per-core
cache slot. When core_index is -1, all registered threads are funneled
into a single shared slot protected by a spinlock. In testing with VPP
(which registers worker threads via rte_thread_register()), this caused
async flow rule insertion throughput to drop from ~6.4M rules/sec to
~1.2M rules/sec with 4 workers -- a 5x regression attributable entirely
to spinlock contention in the ipool allocator.
Fix by tracking currently allocated core_index values in a bitset and
assigning non-EAL threads the first free index. Keep the bitset in sync
when initial EAL lcores are configured, when EAL core-list parsing
remaps lcores, and when non-EAL registration fails or releases an lcore.
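The allocation scheme described above can be sketched in plain C. This
is only an illustration, not the patch itself: MAX_SLOTS, struct
index_set, index_alloc() and index_free() are hypothetical names, and a
hand-rolled bitmap stands in for rte_bitset.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLOTS 128 /* hypothetical stand-in for RTE_MAX_LCORE */

/* One bit per index: set means the index is currently in use. */
struct index_set {
	uint64_t words[MAX_SLOTS / 64];
};

/* Return the lowest free index and mark it used, or -1 if none is left. */
static int
index_alloc(struct index_set *s)
{
	int i;

	for (i = 0; i < MAX_SLOTS; i++) {
		uint64_t bit = UINT64_C(1) << (i % 64);

		if ((s->words[i / 64] & bit) == 0) {
			s->words[i / 64] |= bit;
			return i;
		}
	}
	return -1;
}

/* Give an index back so that a later allocation can reuse it. */
static void
index_free(struct index_set *s, int idx)
{
	assert(idx >= 0 && idx < MAX_SLOTS);
	s->words[idx / 64] &= ~(UINT64_C(1) << (idx % 64));
}
```

With indices 0, 1 and 2 allocated and index 1 then released, the next
allocation returns 1 again. An index derived from a running count would
hand out 2 a second time, which is the duplicate the v2 notes below
mention.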
Fixes: 5c307ba2a5b1 ("eal: register non-EAL threads as lcores")
Signed-off-by: Maxime Peim <maxime.peim@gmail.com>
---
v2:
- Track allocated core_index values with a bitset instead of deriving
the next non-EAL index from lcore_count, avoiding duplicate indices
after non-EAL lcore release.
- Keep the bitset in sync when default EAL lcores are discovered, when
EAL lcore options remap the active set, and when non-EAL lcore
registration rolls back or releases an lcore.
lib/eal/common/eal_common_lcore.c | 19 ++++++++++++++++++-
lib/eal/common/eal_common_options.c | 5 +++++
lib/eal/common/eal_private.h | 3 +++
3 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/lib/eal/common/eal_common_lcore.c b/lib/eal/common/eal_common_lcore.c
index 39411f9370..b5f59a6380 100644
--- a/lib/eal/common/eal_common_lcore.c
+++ b/lib/eal/common/eal_common_lcore.c
@@ -6,8 +6,9 @@
#include <stdlib.h>
#include <string.h>
-#include <rte_common.h>
+#include <rte_bitset.h>
#include <rte_branch_prediction.h>
+#include <rte_common.h>
#include <rte_errno.h>
#include <rte_lcore.h>
#include <rte_log.h>
@@ -184,6 +185,9 @@ rte_eal_cpu_init(void)
/* By default, lcore 1:1 map to cpu id */
CPU_SET(lcore_id, &lcore_config[lcore_id].cpuset);
+ /* This is the first time we discover the lcores, so the bitset should be zeroed */
+ rte_bitset_set(config->core_indices, count);
+
/* By default, each detected core is enabled */
config->lcore_role[lcore_id] = ROLE_RTE;
lcore_config[lcore_id].core_role = ROLE_RTE;
@@ -373,11 +377,20 @@ eal_lcore_non_eal_allocate(void)
struct lcore_callback *callback;
struct lcore_callback *prev;
unsigned int lcore_id;
+ int core_index = -1;
rte_rwlock_write_lock(&lcore_lock);
+ core_index = rte_bitset_find_first_clear(cfg->core_indices, RTE_MAX_LCORE);
+ if (core_index == -1) {
+ EAL_LOG(DEBUG, "No core_index available.");
+ lcore_id = RTE_MAX_LCORE;
+ goto out;
+ }
for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
if (cfg->lcore_role[lcore_id] != ROLE_OFF)
continue;
+ rte_bitset_set(cfg->core_indices, core_index);
+ lcore_config[lcore_id].core_index = core_index;
cfg->lcore_role[lcore_id] = ROLE_NON_EAL;
cfg->lcore_count++;
break;
@@ -399,6 +412,8 @@ eal_lcore_non_eal_allocate(void)
}
EAL_LOG(DEBUG, "Initialization refused for lcore %u.",
lcore_id);
+ rte_bitset_clear(cfg->core_indices, lcore_config[lcore_id].core_index);
+ lcore_config[lcore_id].core_index = -1;
cfg->lcore_role[lcore_id] = ROLE_OFF;
cfg->lcore_count--;
lcore_id = RTE_MAX_LCORE;
@@ -420,6 +435,8 @@ eal_lcore_non_eal_release(unsigned int lcore_id)
goto out;
TAILQ_FOREACH(callback, &lcore_callbacks, next)
callback_uninit(callback, lcore_id);
+ rte_bitset_clear(cfg->core_indices, lcore_config[lcore_id].core_index);
+ lcore_config[lcore_id].core_index = -1;
cfg->lcore_role[lcore_id] = ROLE_OFF;
cfg->lcore_count--;
out:
diff --git a/lib/eal/common/eal_common_options.c b/lib/eal/common/eal_common_options.c
index 290386dc63..8f35fb376d 100644
--- a/lib/eal/common/eal_common_options.c
+++ b/lib/eal/common/eal_common_options.c
@@ -891,6 +891,7 @@ eal_parse_service_coremask(const char *coremask)
if (coremask[i] != '0')
return -1;
+ rte_bitset_clear_all(cfg->core_indices, RTE_MAX_LCORE);
for (; idx < RTE_MAX_LCORE; idx++)
lcore_config[idx].core_index = -1;
@@ -917,6 +918,7 @@ update_lcore_config(const rte_cpuset_t *cpuset, bool remap, uint16_t remap_base)
int ret = 0;
/* set everything to disabled first, then set up values */
+ rte_bitset_clear_all(cfg->core_indices, RTE_MAX_LCORE);
for (i = 0; i < RTE_MAX_LCORE; i++) {
cfg->lcore_role[i] = ROLE_OFF;
lcore_config[i].core_index = -1;
@@ -946,6 +948,7 @@ update_lcore_config(const rte_cpuset_t *cpuset, bool remap, uint16_t remap_base)
continue;
}
+ rte_bitset_set(cfg->core_indices, count);
cfg->lcore_role[lcore_id] = ROLE_RTE;
lcore_config[lcore_id].core_index = count;
CPU_ZERO(&lcore_config[lcore_id].cpuset);
@@ -1368,6 +1371,7 @@ eal_parse_lcores(const char *lcores)
CPU_ZERO(&cpuset);
/* Reset lcore config */
+ rte_bitset_clear_all(cfg->core_indices, RTE_MAX_LCORE);
for (idx = 0; idx < RTE_MAX_LCORE; idx++) {
cfg->lcore_role[idx] = ROLE_OFF;
lcore_config[idx].core_index = -1;
@@ -1432,6 +1436,7 @@ eal_parse_lcores(const char *lcores)
set_count--;
if (cfg->lcore_role[idx] != ROLE_RTE) {
+ rte_bitset_set(cfg->core_indices, count);
lcore_config[idx].core_index = count;
cfg->lcore_role[idx] = ROLE_RTE;
count++;
diff --git a/lib/eal/common/eal_private.h b/lib/eal/common/eal_private.h
index e032dd10c9..2b7a4fddbc 100644
--- a/lib/eal/common/eal_private.h
+++ b/lib/eal/common/eal_private.h
@@ -11,6 +11,7 @@
#include <sys/queue.h>
#include <dev_driver.h>
+#include <rte_bitset.h>
#include <rte_lcore.h>
#include <rte_log.h>
#include <rte_memory.h>
@@ -44,6 +45,8 @@ extern struct lcore_config lcore_config[RTE_MAX_LCORE];
* The global RTE configuration structure.
*/
struct rte_config {
+ RTE_BITSET_DECLARE(core_indices,
+ RTE_MAX_LCORE); /**< bitset of currently allocated core_indices */
uint32_t main_lcore; /**< Id of the main lcore */
uint32_t lcore_count; /**< Number of available logical cores. */
uint32_t numa_node_count; /**< Number of detected NUMA nodes. */
--
2.43.0
^ permalink raw reply related
* [v4] net/cksum: compute raw cksum for several segments
From: Su Sai @ 2026-06-16 12:30 UTC (permalink / raw)
To: stephen; +Cc: dev, spiderdetective.ss
In-Reply-To: <20260608100202.0deac83d@phoenix.local>
rte_raw_cksum_mbuf() computes the raw checksum of a packet.
If the requested range spans more than one mbuf segment, the
function takes the hard case. There, 'tmp' holds an unreduced
32-bit sum, so rte_bswap16((uint16_t)tmp) drops its high 16 bits.
In addition, 'sum' is a uint32_t, so 'sum += tmp' drops the
carry on overflow.
Either loss makes the checksum incorrect.
Fix both: swap with rte_bswap32() and accumulate into a 64-bit sum
that is reduced once at the end.
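Both losses can be reproduced outside DPDK. The sketch below is an
illustration only: fold32(), fold64(), swap16() and swap32() are
hypothetical stand-ins for __rte_raw_cksum_reduce(), the new 64-bit
reduce helper, rte_bswap16() and rte_bswap32(), written in portable C.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t
fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Fold a 64-bit accumulator: reduce each half, then reduce their sum. */
static uint16_t
fold64(uint64_t sum)
{
	uint32_t tmp;

	tmp = fold32((uint32_t)sum);
	tmp += fold32((uint32_t)(sum >> 32));
	return fold32(tmp);
}

/* Portable byte swaps. */
static uint16_t
swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

static uint32_t
swap32(uint32_t v)
{
	return (v << 24) | ((v & 0xff00u) << 8) |
	       ((v >> 8) & 0xff00u) | (v >> 24);
}
```

Adding the partial sums 0xFFFF0000 and 0x00020003 in 32 bits wraps to
0x00010003 and folds to 0x0004, while the same addition in 64 bits folds
to the correct 0x0005. Likewise, for the partial sum 0x00012345,
fold32(swap32(x)) gives 0x4623, the byte-swapped form of the folded
value 0x2346, whereas swap16((uint16_t)x) gives 0x4523 because the
carry held in the upper half is discarded before the swap.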
Signed-off-by: Su Sai <spiderdetective.ss@gmail.com>
---
.mailmap | 1 +
app/test/test_cksum.c | 102 ++++++++++++++++++++++++++++++++++++++++++
lib/net/rte_cksum.h | 27 +++++++++--
3 files changed, 126 insertions(+), 4 deletions(-)
diff --git a/.mailmap b/.mailmap
index 4001e5fb0e..bcf73cb902 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1630,6 +1630,7 @@ Sylvia Grundwürmer <sylvia.grundwuermer@b-plus.com>
Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Sylwia Wnuczko <sylwia.wnuczko@intel.com>
Szymon Sliwa <szs@semihalf.com>
+Su Sai <spiderdetective.ss@gmail.com> <susai.ss@bytedance.com>
Szymon T Cudzilo <szymon.t.cudzilo@intel.com>
Tadhg Kearney <tadhg.kearney@intel.com>
Taekyung Kim <kim.tae.kyung@navercorp.com>
diff --git a/app/test/test_cksum.c b/app/test/test_cksum.c
index ea443382a1..5bd9723fbd 100644
--- a/app/test/test_cksum.c
+++ b/app/test/test_cksum.c
@@ -85,6 +85,42 @@ static const char test_cksum_ipv4_opts_udp[] = {
0x00, 0x35, 0x00, 0x09, 0x89, 0x6f, 0x78,
};
+/*
+ * generated in scapy with
+ * Ether()/IP()/TCP(options=[NOP,NOP,Timestamps])/os.urandom(113))
+ */
+static const char test_cksum_ipv4_tcp_multi_segs[] = {
+ 0x00, 0x16, 0x3e, 0x0b, 0x6b, 0xd2, 0xee, 0xff,
+ 0xff, 0xff, 0xff, 0xff, 0x08, 0x00, 0x45, 0x00,
+ 0x00, 0xa5, 0x46, 0x10, 0x40, 0x00, 0x40, 0x06,
+ 0x80, 0xb5, 0xc0, 0xa8, 0xf9, 0x1d, 0xc0, 0xa8,
+ 0xf9, 0x1e, 0xdc, 0xa2, 0x14, 0x51, 0xbb, 0x8f,
+ 0xa0, 0x00, 0xe4, 0x7c, 0xe4, 0xb8, 0x80, 0x10,
+ 0x02, 0x00, 0x4b, 0xc1, 0x00, 0x00, 0x01, 0x01,
+ 0x08, 0x0a, 0x90, 0x60, 0xf4, 0xff, 0x03, 0xc5,
+ 0xb4, 0x19, 0x77, 0x34, 0xd4, 0xdc, 0x84, 0x86,
+ 0xff, 0x44, 0x09, 0x63, 0x36, 0x2e, 0x26, 0x9b,
+ 0x90, 0x70, 0xf2, 0xed, 0xc8, 0x5b, 0x87, 0xaa,
+ 0xb4, 0x67, 0x6b, 0x32, 0x3d, 0xc4, 0xbf, 0x15,
+ 0xa9, 0x16, 0x6c, 0x2a, 0x9d, 0xb2, 0xb7, 0x6b,
+ 0x58, 0x44, 0x58, 0x12, 0x4b, 0x8f, 0xe5, 0x12,
+ 0x11, 0x90, 0x94, 0x68, 0x37, 0xad, 0x0a, 0x9b,
+ 0xd6, 0x79, 0xf2, 0xb7, 0x31, 0xcf, 0x44, 0x22,
+ 0xc8, 0x99, 0x3f, 0xe5, 0xe7, 0xac, 0xc7, 0x0b,
+ 0x86, 0xdf, 0xda, 0xed, 0x0a, 0x0f, 0x86, 0xd7,
+ 0x48, 0xe2, 0xf1, 0xc2, 0x43, 0xed, 0x47, 0x3a,
+ 0xea, 0x25, 0x2d, 0xd6, 0x60, 0x38, 0x30, 0x07,
+ 0x28, 0xdd, 0x1f, 0x0c, 0xdd, 0x7b, 0x7c, 0xd9,
+ 0x35, 0x9d, 0x14, 0xaa, 0xc6, 0x35, 0xd1, 0x03,
+ 0x38, 0xb1, 0xf5,
+};
+
+static const uint8_t test_cksum_ipv4_tcp_multi_segs_len[] = {
+ 66, /* the first seg contains all headers, including L2 to L4 */
+ 61, /* the second seg length is odd, test byte order independent */
+ 52, /* three segs are sufficient to test the most complex scenarios */
+};
+
/* test l3/l4 checksum api */
static int
test_l4_cksum(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len)
@@ -223,6 +259,66 @@ test_l4_cksum(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len)
return -1;
}
+/* test l4 checksum api for a packet with multiple mbufs */
+static int
+test_l4_cksum_multi_mbufs(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len,
+ const uint8_t *segs, size_t segs_len)
+{
+ struct rte_mbuf *m[NB_MBUF] = {0};
+ struct rte_mbuf *m_hdr = NULL;
+ struct rte_net_hdr_lens hdr_lens;
+ size_t i, off = 0;
+ uint32_t packet_type, l3;
+ void *l3_hdr;
+ char *data;
+
+ for (i = 0; i < segs_len; i++) {
+ m[i] = rte_pktmbuf_alloc(pktmbuf_pool);
+ if (m[i] == NULL)
+ GOTO_FAIL("Cannot allocate mbuf");
+
+ data = rte_pktmbuf_append(m[i], segs[i]);
+ if (data == NULL)
+ GOTO_FAIL("Cannot append data");
+
+ memcpy(data, pktdata + off, segs[i]);
+ off += segs[i];
+
+ if (m_hdr) {
+ if (rte_pktmbuf_chain(m_hdr, m[i]))
+ GOTO_FAIL("Cannot chain mbuf");
+ } else {
+ m_hdr = m[i];
+ }
+ }
+
+ if (off != len)
+ GOTO_FAIL("Invalid segs");
+
+ packet_type = rte_net_get_ptype(m_hdr, &hdr_lens, RTE_PTYPE_ALL_MASK);
+ l3 = packet_type & RTE_PTYPE_L3_MASK;
+
+ l3_hdr = rte_pktmbuf_mtod_offset(m_hdr, void *, hdr_lens.l2_len);
+ off = hdr_lens.l2_len + hdr_lens.l3_len;
+
+ if (l3 == RTE_PTYPE_L3_IPV4 || l3 == RTE_PTYPE_L3_IPV4_EXT) {
+ if (rte_ipv4_udptcp_cksum_mbuf_verify(m_hdr, l3_hdr, off) != 0)
+ GOTO_FAIL("Invalid L4 checksum verification for multiple mbufs");
+ } else if (l3 == RTE_PTYPE_L3_IPV6 || l3 == RTE_PTYPE_L3_IPV6_EXT) {
+ if (rte_ipv6_udptcp_cksum_mbuf_verify(m_hdr, l3_hdr, off) != 0)
+ GOTO_FAIL("Invalid L4 checksum verification for multiple mbufs");
+ }
+
+ rte_pktmbuf_free_bulk(m, segs_len);
+
+ return 0;
+
+fail:
+ rte_pktmbuf_free_bulk(m, segs_len);
+
+ return -1;
+}
+
static int
test_cksum(void)
{
@@ -256,6 +352,12 @@ test_cksum(void)
sizeof(test_cksum_ipv4_opts_udp)) < 0)
GOTO_FAIL("checksum error on ipv4_opts_udp");
+ if (test_l4_cksum_multi_mbufs(pktmbuf_pool, test_cksum_ipv4_tcp_multi_segs,
+ sizeof(test_cksum_ipv4_tcp_multi_segs),
+ test_cksum_ipv4_tcp_multi_segs_len,
+ sizeof(test_cksum_ipv4_tcp_multi_segs_len)) < 0)
+ GOTO_FAIL("checksum error on multi mbufs check");
+
rte_mempool_free(pktmbuf_pool);
return 0;
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..679ba82eb6 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -80,6 +80,25 @@ __rte_raw_cksum_reduce(uint32_t sum)
return (uint16_t)sum;
}
+/**
+ * @internal Reduce a sum to the non-complemented checksum.
+ * Helper routine for the rte_raw_cksum_mbuf().
+ *
+ * @param sum
+ * Value of the sum.
+ * @return
+ * The non-complemented checksum.
+ */
+static inline uint16_t
+__rte_raw_cksum_reduce_u64(uint64_t sum)
+{
+ uint32_t tmp;
+
+ tmp = __rte_raw_cksum_reduce((uint32_t)sum);
+ tmp += __rte_raw_cksum_reduce((uint32_t)(sum >> 32));
+ return __rte_raw_cksum_reduce(tmp);
+}
+
/**
* Process the non-complemented checksum of a buffer.
*
@@ -119,8 +138,8 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
{
const struct rte_mbuf *seg;
const char *buf;
- uint32_t sum, tmp;
- uint32_t seglen, done;
+ uint32_t seglen, done, tmp;
+ uint64_t sum;
/* easy case: all data in the first segment */
if (off + len <= rte_pktmbuf_data_len(m)) {
@@ -157,7 +176,7 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
for (;;) {
tmp = __rte_raw_cksum(buf, seglen, 0);
if (done & 1)
- tmp = rte_bswap16((uint16_t)tmp);
+ tmp = rte_bswap32(tmp);
sum += tmp;
done += seglen;
if (done == len)
@@ -169,7 +188,7 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
seglen = len - done;
}
- *cksum = __rte_raw_cksum_reduce(sum);
+ *cksum = __rte_raw_cksum_reduce_u64(sum);
return 0;
}
--
2.20.1
^ permalink raw reply related
* [PATCH v5] net/mlx5: fix counter TAILQ race between free and query callback
From: Linhu Li @ 2026-06-16 8:03 UTC (permalink / raw)
To: dev; +Cc: stable, dsosnowski, Linhu Li
In-Reply-To: <20260604101112.72177-1-lilinhu618@gmail.com>
flow_dv_counter_free() inserts counters into
pool->counters[pool->query_gen] under pool->csl. Meanwhile,
mlx5_flow_async_pool_query_handle() moves counters from
pool->counters[query_gen ^ 1] to the global free list via
TAILQ_CONCAT while holding only cmng->csl, not pool->csl.
The comment in flow_dv_counter_free() claims the lock is not needed
because the query callback and the release function operate on
different lists. That holds only if the free path always observes
the up-to-date query_gen. It can be violated:
1. A counter free thread (non-PMD, e.g. OVS offload thread) reads
pool->query_gen == 0 and is about to insert into counters[0].
2. The free thread is preempted by the OS scheduler; it is a regular
pthread, not pinned to a core.
3. The eal-intr-thread alarm fires: query_gen++ (now 1) and the async
query is sent.
4. Hardware completes the query and the callback runs TAILQ_CONCAT on
counters[0] (= query_gen ^ 1).
5. The free thread resumes and runs TAILQ_INSERT_TAIL on counters[0]
concurrently with step 4 on another core.
Because the two paths take different locks, TAILQ_INSERT_TAIL and
TAILQ_CONCAT run concurrently on the same list with no synchronization
and corrupt it: the pool-local list ends up with a NULL head but a
dangling tqh_last, and the global free list tail no longer points to
the real tail. The just-freed counter and every counter inserted
afterwards become unreachable and are leaked.
Non-PMD threads can be preempted for hundreds of microseconds under
CPU pressure, which is well within the async query round-trip time,
so the window is reachable in practice.
Fix it by taking pool->csl in the query completion callback before
operating on pool->counters[query_gen], serializing the CONCAT with
any concurrent INSERT. The lock is taken once per pool per query
completion in the eal-intr-thread context, not on the datapath, so
the cost is negligible. Lock order is pool->csl then cmng->csl,
matching all other sites.
Also handle the error path: previously the counters accumulated in
pool->counters[query_gen] were abandoned when a query failed. Move
them back to the global free list to avoid a leak on persistent
query failures.
Additionally, fix a second independent race in flow_dv_counter_free():
TAILQ_INSERT_TAIL is passed &pool->counters[pool->query_gen] directly,
but the macro evaluates its head argument multiple times. Since
pool->query_gen is a volatile bit-field, if mlx5_flow_query_alarm()
increments query_gen between two evaluations of the macro, the same
insertion can operate on two different lists: the earlier steps update
counters[0] while the later steps update counters[1], leaving both
lists with inconsistent metadata and leaking the counter. Fix by
caching pool->query_gen into a local variable before calling the macro.
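The second race can be shown with a toy model. This is a sketch under
stated assumptions, not driver code: LIST_APPEND is a hypothetical macro
that, like TAILQ_INSERT_TAIL, expands its head argument more than once,
and read_gen() is a hypothetical stand-in for reading pool->query_gen
that returns the other generation on every read, as if the alarm fired
between two evaluations.

```c
#include <assert.h>

struct list {
	int len;
	int last;
};

/* Toy macro: `head` appears twice, so its expression is evaluated twice. */
#define LIST_APPEND(head, v) do { \
	(head)->last = (v); \
	(head)->len++; \
} while (0)

/* Number of reads so far; the low bit plays the role of query_gen. */
static unsigned int gen_reads;

/* Each call observes a flipped generation (simulated concurrent update). */
static unsigned int
read_gen(void)
{
	return gen_reads++ & 1;
}

/* Buggy form: the index expression is re-evaluated inside the macro. */
static void
append_uncached(struct list lists[2], int v)
{
	LIST_APPEND(&lists[read_gen()], v);
}

/* Fixed form: read the generation once, then use the cached value. */
static void
append_cached(struct list lists[2], int v)
{
	unsigned int g = read_gen();

	LIST_APPEND(&lists[g], v);
}
```

Starting from gen_reads = 0 and two empty lists, append_uncached() stores
the value in lists[0] but bumps the length of lists[1], leaving both
inconsistent. append_cached() updates only lists[0].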
Fixes: ac79183dc6f7 ("net/mlx5: optimize free counter lookup")
Cc: stable@dpdk.org
Signed-off-by: Linhu Li <lilinhu618@gmail.com>
Acked-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
---
v5:
- Added fix for Race 2: cache pool->query_gen into a local variable
before TAILQ_INSERT_TAIL to prevent the macro from evaluating the
volatile bit-field multiple times and crossing generation lists.
- Updated release notes: moved the fix entry under "Updated NVIDIA mlx5
driver" in New Features instead of using a separate "Fixed Issues" section.
v4:
- Fixed commit log line length over 75 characters.
v3:
- Added release notes entry.
- Added function comment in mlx5_flow_async_pool_query_handle().
- Clarified error path comment to note it is safe for transient failures.
v2:
- Fixed Signed-off-by to use full name.
doc/guides/rel_notes/release_26_07.rst | 6 +++++
drivers/net/mlx5/mlx5_flow.c | 31 ++++++++++++++++++++++++++
drivers/net/mlx5/mlx5_flow_dv.c | 12 +++++-----
3 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index b8a3e2ced9..6c6f0aa696 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -90,6 +90,11 @@ New Features
* Added support for transmitting LLDP packets based on mbuf packet type.
* Implemented AVX2 context descriptor transmit paths.
+* **Updated NVIDIA mlx5 driver.**
+
+ Fixed counter free list corruption when counter free operations race with
+ asynchronous query completions.
+
* **Updated PCAP ethernet driver.**
* Added support for VLAN insertion and stripping.
@@ -153,6 +158,7 @@ ABI Changes
* No ABI change that would break compatibility with 25.11.
+
Known Issues
------------
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 915ea29a5a..2f785d58ec 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -9893,6 +9893,13 @@ void
mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
uint64_t async_id, int status)
{
+ /*
+ * Handle async counter pool query completion.
+ * query_gen is flipped each round: freed counters go into [query_gen],
+ * while this callback moves [query_gen ^ 1] to the global free list.
+ * pool->csl must be held when operating on pool->counters[] to serialize
+ * with concurrent free-path insertions.
+ */
struct mlx5_flow_counter_pool *pool =
(struct mlx5_flow_counter_pool *)(uintptr_t)async_id;
struct mlx5_counter_stats_raw *raw_to_free;
@@ -9904,6 +9911,21 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
if (unlikely(status)) {
raw_to_free = pool->raw_hw;
+ /*
+ * The query failed, so the freed counters accumulated
+ * in the old-gen list would otherwise be stranded.
+ * Move them back to the global free list. This is safe
+ * for both transient and persistent failures: the
+ * counters are still valid and can be reused.
+ */
+ if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
+ rte_spinlock_lock(&cmng->csl[cnt_type]);
+ TAILQ_CONCAT(&cmng->counters[cnt_type],
+ &pool->counters[query_gen], next);
+ rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
+ }
} else {
raw_to_free = pool->raw;
if (pool->is_aged)
@@ -9913,11 +9935,20 @@ mlx5_flow_async_pool_query_handle(struct mlx5_dev_ctx_shared *sh,
rte_spinlock_unlock(&pool->sl);
/* Be sure the new raw counters data is updated in memory. */
rte_io_wmb();
+ /*
+ * A counter free thread may have read a stale query_gen
+ * before the generation was flipped and could still be
+ * inserting into this same old-gen list. Hold pool->csl to
+ * serialize TAILQ_CONCAT with that TAILQ_INSERT_TAIL and
+ * avoid corrupting the list.
+ */
if (!TAILQ_EMPTY(&pool->counters[query_gen])) {
+ rte_spinlock_lock(&pool->csl);
rte_spinlock_lock(&cmng->csl[cnt_type]);
TAILQ_CONCAT(&cmng->counters[cnt_type],
&pool->counters[query_gen], next);
rte_spinlock_unlock(&cmng->csl[cnt_type]);
+ rte_spinlock_unlock(&pool->csl);
}
}
LIST_INSERT_HEAD(&sh->sws_cmng.free_stat_raws, raw_to_free, next);
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index c2a2874913..060ccdeec2 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -7129,6 +7129,7 @@ flow_dv_counter_free(struct rte_eth_dev *dev, uint32_t counter)
struct mlx5_flow_counter_pool *pool = NULL;
struct mlx5_flow_counter *cnt;
enum mlx5_counter_type cnt_type;
+ uint32_t query_gen;
if (!counter)
return;
@@ -7153,16 +7154,17 @@ flow_dv_counter_free(struct rte_eth_dev *dev, uint32_t counter)
cnt->pool = pool;
/*
* Put the counter back to list to be updated in none fallback mode.
- * Currently, we are using two list alternately, while one is in query,
+ * Currently, we are using two lists alternately, while one is in query,
* add the freed counter to the other list based on the pool query_gen
* value. After query finishes, add counter the list to the global
- * container counter list. The list changes while query starts. In
- * this case, lock will not be needed as query callback and release
- * function both operate with the different list.
+ * container counter list. Cache query_gen into a local variable before
+ * TAILQ_INSERT_TAIL, since the macro evaluates its head argument
+ * multiple times and pool->query_gen is a volatile bit-field.
*/
if (!priv->sh->sws_cmng.counter_fallback) {
rte_spinlock_lock(&pool->csl);
- TAILQ_INSERT_TAIL(&pool->counters[pool->query_gen], cnt, next);
+ query_gen = pool->query_gen;
+ TAILQ_INSERT_TAIL(&pool->counters[query_gen], cnt, next);
rte_spinlock_unlock(&pool->csl);
} else {
cnt->dcs_when_free = cnt->dcs_when_active;
--
2.39.3 (Apple Git-146)
^ permalink raw reply related
* [PATCH] net/mlx5: fix double free in vectorized Rx recovery
From: Borys Tsyrulnikov @ 2026-06-17 13:43 UTC (permalink / raw)
To: Thomas Monjalon, Dariusz Sosnowski, Viacheslav Ovsiienko,
Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad, Alexander Kozyrev
Cc: dev, stable, Borys Tsyrulnikov
During Rx queue error recovery, the vectorized path in
mlx5_rx_err_handle() reallocates an mbuf for every queue element. When
rte_mbuf_raw_alloc() fails (for example, the mempool is exhausted), the
rollback loop frees the mbufs allocated so far, but masks the element
ring index with "& elts_n" instead of "& (elts_n - 1)".
elts_n is a power-of-two element count, so "x & elts_n" isolates a
single bit and can only evaluate to 0 or elts_n, regardless of the loop
counter. The rollback therefore never frees the mbufs just allocated in
this pass (they are leaked); instead it repeatedly frees elts[0], a live
mbuf still posted to the NIC (use-after-free / double free), and
elts[elts_n], the fake_mbuf padding entry used by the vector datapath.
Mask with the existing e_mask (elts_n - 1), as already done in the
matching forward allocation loop just above.
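The difference between the two masks is easy to check in isolation. The
helpers below are hypothetical and only illustrate the arithmetic; n
plays the role of elts_n and ci the role of elts_ci.

```c
#include <assert.h>
#include <stdint.h>

/* Correct ring index: wrap with the (n - 1) mask, n a power of two. */
static uint32_t
ring_index(uint32_t ci, uint32_t i, uint32_t n)
{
	assert(n != 0 && (n & (n - 1)) == 0);
	return (ci + i) & (n - 1);
}

/* Buggy form: masking with n keeps one bit, so the result is 0 or n. */
static uint32_t
ring_index_buggy(uint32_t ci, uint32_t i, uint32_t n)
{
	return (ci + i) & n;
}
```

For n = 8 and ci = 5, offsets 0 to 4 map to 5, 6, 7, 0, 1 with the
correct mask, but to 0, 0, 0, 8, 8 with the buggy one: only slot 0 and
the padding slot at index n are ever touched.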
Fixes: 0f20acbf5eda ("net/mlx5: implement vectorized MPRQ burst")
Cc: stable@dpdk.org
Signed-off-by: Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
---
.mailmap | 1 +
drivers/net/mlx5/mlx5_rx.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/.mailmap b/.mailmap
index 4001e5fb0e..0b09243c45 100644
--- a/.mailmap
+++ b/.mailmap
@@ -222,6 +222,7 @@ Boleslav Stankevich <boleslav.stankevich@oktetlabs.ru>
Boon Ang <boon.ang@broadcom.com> <bang@vmware.com>
Boris Ouretskey <borisusun@gmail.com>
Boris Pismenny <borisp@mellanox.com>
+Borys Tsyrulnikov <tsyrulnikov.borys@gmail.com>
Brad Larson <bradley.larson@amd.com>
Brandon Lo <blo@iol.unh.edu>
Brendan Ryan <brendan.ryan@intel.com>
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index ce50087b70..c0ad8d6701 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -662,7 +662,7 @@ mlx5_rx_err_handle(struct mlx5_rxq_data *rxq, uint8_t vec,
if (!*elt) {
for (i--; i >= 0; --i) {
elt_idx = (elts_ci +
- i) & elts_n;
+ i) & e_mask;
elt = &(*rxq->elts)
[elt_idx];
rte_pktmbuf_free_seg
--
2.34.1
^ permalink raw reply related
* Re: ARM v8 rte_power_pause
From: Wathsala Vithanage @ 2026-06-17 11:57 UTC (permalink / raw)
To: Hemant Agrawal, Morten Brørup
Cc: dev@dpdk.org, Maxime Leroy, Gagandeep Singh
In-Reply-To: <GV1PR04MB10750E293DBDCCDF9FEEFE00289182@GV1PR04MB10750.eurprd04.prod.outlook.com>
Hi Morten and Hemant,
YIELD is a NOP on non-SMT CPUs, such as Neoverse.
WFE is universally available on AArch64, but it comes with a caveat: the
CPU can remain in a low-power state indefinitely unless an event is
triggered. That event can be generated explicitly via SEV/SEVL by a
different CPU, or implicitly through address monitoring (LDAXR).
WFET is the safer variant because it includes a timeout, so explicit or
implicit event-register manipulation is not required.
--wathsala
On 6/12/26 01:11, Hemant Agrawal wrote:
> Hi Morten,
> On Cortex‑A72 (ARMv8), the only architectural primitives available are YIELD, WFE, and WFI:
>
> YIELD is the only deterministic, low-overhead option (pure CPU relax, no entry into low-power state)
> WFE can be used as a low-power idle hint, but it is event-driven and not time-based (it may return immediately)
> WFI depends on interrupt wakeup and is therefore not suitable for tight latency loops
>
> For ~1 µs latency targets, the practical approach is a hybrid strategy:
>
> Short waits → spin using YIELD
> Slightly longer waits → opportunistically use WFE for power reduction
>
> A simple implementation could look like (not tested):
>
> static inline void rte_armv8_pause(unsigned int iters)
> {
> if (iters < 64) {
> for (unsigned int i = 0; i < iters; i++)
> asm volatile("yield");
> } else {
> asm volatile("sevl");
> asm volatile("wfe");
> }
> }
>
> @Wathsala Vithanage — would appreciate your thoughts, especially if there are any micro-architectural nuances we should consider.
>
> Regards,
> Hemant
>
>> -----Original Message-----
>> From: Morten Brørup <mb@smartsharesystems.com>
>> Sent: 03 June 2026 17:26
>> To: Wathsala Vithanage <wathsala.vithanage@arm.com>; Hemant Agrawal
>> <hemant.agrawal@nxp.com>; Sachin Saxena (OSS)
>> <sachin.saxena@oss.nxp.com>
>> Cc: dev@dpdk.org; Maxime Leroy <maxime@leroys.fr>
>> Subject: ARM v8 rte_power_pause
>> Importance: High
>>
>> Hi Wathsala, Hemant and Sachin,
>>
>> Over at the Grout project, we are discussing power management in the
>> context of 100 Gbit/s latency deadlines [1].
>>
>> rte_power_pause() is not implemented for ARM v8 / Cortex-A72.
>> Syscalls such as nanosleep() have too much overhead, and cannot be used.
>>
>> Any suggestions for a power-reducing method to make a CPU core "sleep" (i.e.
>> do nothing) for durations in the order of 1 microsecond?
>>
>> [1]:
>> https://github.com/DPDK/grout/pull/624#issuecomment-4602036364
>>
>> -Morten
^ permalink raw reply
* RE: [PATCH v1 0/5] prefix lcore role enum values
From: Morten Brørup @ 2026-06-17 11:48 UTC (permalink / raw)
To: Huisong Li, thomas; +Cc: andrew.rybchenko, dev, zhanjie9
In-Reply-To: <20260617102834.2343356-1-lihuisong@huawei.com>
> From: Huisong Li [mailto:lihuisong@huawei.com]
> Sent: Wednesday, 17 June 2026 12.28
>
> Add the RTE_LCORE_ prefix to the lcore role enum values in
> rte_lcore_role_t
> to follow DPDK naming conventions.
>
> - ROLE_RTE -> RTE_LCORE_ROLE_RTE
> - ROLE_OFF -> RTE_LCORE_ROLE_OFF
> - ROLE_SERVICE -> RTE_LCORE_ROLE_SERVICE
> - ROLE_NON_EAL -> RTE_LCORE_ROLE_NON_EAL
>
> Old names are kept as macros aliasing to the new names to preserve
> backward compatibility.
>
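The aliasing scheme described in the cover letter can be sketched as follows. This is only an illustration of the idea, assuming the enum keeps its existing tag; the actual patch may differ in layout, and legacy_is_service is a hypothetical caller invented for the example.

```c
/* New, prefixed names as listed in the cover letter. */
enum rte_lcore_role_t {
	RTE_LCORE_ROLE_RTE,
	RTE_LCORE_ROLE_OFF,
	RTE_LCORE_ROLE_SERVICE,
	RTE_LCORE_ROLE_NON_EAL,
};

/* Old names kept as aliases so existing code keeps compiling. */
#define ROLE_RTE     RTE_LCORE_ROLE_RTE
#define ROLE_OFF     RTE_LCORE_ROLE_OFF
#define ROLE_SERVICE RTE_LCORE_ROLE_SERVICE
#define ROLE_NON_EAL RTE_LCORE_ROLE_NON_EAL

/* Hypothetical legacy caller that still uses an old name. */
static inline int
legacy_is_service(enum rte_lcore_role_t role)
{
	return role == ROLE_SERVICE;
}
```

Because the old names expand to the new enumerators, both spellings compare equal and can be mixed during a transition.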
Series-Acked-by: Morten Brørup <mb@smartsharesystems.com>
^ permalink raw reply
* [PATCH v2 4/4] net/txgbe: add VF support for Amber-Lite 40G NIC
From: Zaiyu Wang @ 2026-06-17 11:33 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang, Jiawen Wu
In-Reply-To: <20260617113335.15648-1-zaiyuwang@trustnetic.com>
VF support for the 40G NIC was previously omitted; only the 25G VF was
added. Add 40G VF support based on the existing 25G VF implementation,
with no major changes beyond device ID adaptation.
Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
drivers/net/txgbe/base/txgbe_devids.h | 2 ++
drivers/net/txgbe/base/txgbe_hw.c | 7 +++++++
drivers/net/txgbe/base/txgbe_regs.h | 7 +++++--
drivers/net/txgbe/base/txgbe_type.h | 1 +
drivers/net/txgbe/base/txgbe_vf.c | 7 ++++---
drivers/net/txgbe/txgbe_ethdev.c | 1 +
drivers/net/txgbe/txgbe_ethdev_vf.c | 2 ++
7 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/drivers/net/txgbe/base/txgbe_devids.h b/drivers/net/txgbe/base/txgbe_devids.h
index b7133c7d54..f5454ffbb1 100644
--- a/drivers/net/txgbe/base/txgbe_devids.h
+++ b/drivers/net/txgbe/base/txgbe_devids.h
@@ -28,6 +28,8 @@
#define TXGBE_DEV_ID_AML_VF 0x5001
#define TXGBE_DEV_ID_AML5024_VF 0x5024
#define TXGBE_DEV_ID_AML5124_VF 0x5124
+#define TXGBE_DEV_ID_AML503F_VF 0x503f
+#define TXGBE_DEV_ID_AML513F_VF 0x513f
/*
* Subsystem IDs
diff --git a/drivers/net/txgbe/base/txgbe_hw.c b/drivers/net/txgbe/base/txgbe_hw.c
index 0f3db3a1ad..21465d68ff 100644
--- a/drivers/net/txgbe/base/txgbe_hw.c
+++ b/drivers/net/txgbe/base/txgbe_hw.c
@@ -2543,6 +2543,7 @@ s32 txgbe_init_shared_code(struct txgbe_hw *hw)
break;
case txgbe_mac_sp_vf:
case txgbe_mac_aml_vf:
+ case txgbe_mac_aml40_vf:
status = txgbe_init_ops_vf(hw);
break;
default:
@@ -2573,6 +2574,7 @@ bool txgbe_is_vf(struct txgbe_hw *hw)
switch (hw->mac.type) {
case txgbe_mac_sp_vf:
case txgbe_mac_aml_vf:
+ case txgbe_mac_aml40_vf:
return true;
default:
return false;
@@ -2620,6 +2622,11 @@ s32 txgbe_set_mac_type(struct txgbe_hw *hw)
hw->phy.media_type = txgbe_media_type_virtual;
hw->mac.type = txgbe_mac_aml_vf;
break;
+ case TXGBE_DEV_ID_AML503F_VF:
+ case TXGBE_DEV_ID_AML513F_VF:
+ hw->phy.media_type = txgbe_media_type_virtual;
+ hw->mac.type = txgbe_mac_aml40_vf;
+ break;
default:
err = TXGBE_ERR_DEVICE_NOT_SUPPORTED;
DEBUGOUT("Unsupported device id: %x", hw->device_id);
diff --git a/drivers/net/txgbe/base/txgbe_regs.h b/drivers/net/txgbe/base/txgbe_regs.h
index 95c585a025..5eb92c54b6 100644
--- a/drivers/net/txgbe/base/txgbe_regs.h
+++ b/drivers/net/txgbe/base/txgbe_regs.h
@@ -1824,12 +1824,14 @@ txgbe_map_reg(struct txgbe_hw *hw, u32 reg)
switch (reg) {
case TXGBE_REG_RSSTBL:
if (hw->mac.type == txgbe_mac_sp_vf ||
- hw->mac.type == txgbe_mac_aml_vf)
+ hw->mac.type == txgbe_mac_aml_vf ||
+ hw->mac.type == txgbe_mac_aml40_vf)
reg = TXGBE_VFRSSTBL(0);
break;
case TXGBE_REG_RSSKEY:
if (hw->mac.type == txgbe_mac_sp_vf ||
- hw->mac.type == txgbe_mac_aml_vf)
+ hw->mac.type == txgbe_mac_aml_vf ||
+ hw->mac.type == txgbe_mac_aml40_vf)
reg = TXGBE_VFRSSKEY(0);
break;
default:
@@ -2012,6 +2014,7 @@ static inline void txgbe_flush(struct txgbe_hw *hw)
break;
case txgbe_mac_sp_vf:
case txgbe_mac_aml_vf:
+ case txgbe_mac_aml40_vf:
rd32(hw, TXGBE_VFSTATUS);
break;
default:
diff --git a/drivers/net/txgbe/base/txgbe_type.h b/drivers/net/txgbe/base/txgbe_type.h
index 956080c702..132d5c4eff 100644
--- a/drivers/net/txgbe/base/txgbe_type.h
+++ b/drivers/net/txgbe/base/txgbe_type.h
@@ -171,6 +171,7 @@ enum txgbe_mac_type {
txgbe_mac_aml40,
txgbe_mac_sp_vf,
txgbe_mac_aml_vf,
+ txgbe_mac_aml40_vf,
txgbe_num_macs
};
diff --git a/drivers/net/txgbe/base/txgbe_vf.c b/drivers/net/txgbe/base/txgbe_vf.c
index 1a8a20f104..4412006f1f 100644
--- a/drivers/net/txgbe/base/txgbe_vf.c
+++ b/drivers/net/txgbe/base/txgbe_vf.c
@@ -134,7 +134,9 @@ s32 txgbe_reset_hw_vf(struct txgbe_hw *hw)
}
/* amlite: bme */
- if (hw->mac.type == txgbe_mac_aml_vf)
+ if (hw->mac.type == txgbe_mac_aml_vf ||
+ hw->mac.type == txgbe_mac_aml40_vf)
+
wr32(hw, TXGBE_BME_AML, 0x1);
if (!timeout)
@@ -493,8 +495,7 @@ s32 txgbe_check_mac_link_vf(struct txgbe_hw *hw, u32 *speed,
/* for SFP+ modules and DA cables it can take up to 500usecs
* before the link status is correct
*/
- if ((mac->type == txgbe_mac_sp_vf ||
- mac->type == txgbe_mac_aml_vf) && wait_to_complete) {
+ if (wait_to_complete) {
if (po32m(hw, TXGBE_VFSTATUS, TXGBE_VFSTATUS_UP,
0, NULL, 5, 100))
goto out;
diff --git a/drivers/net/txgbe/txgbe_ethdev.c b/drivers/net/txgbe/txgbe_ethdev.c
index 003a24141c..63b967d71a 100644
--- a/drivers/net/txgbe/txgbe_ethdev.c
+++ b/drivers/net/txgbe/txgbe_ethdev.c
@@ -5228,6 +5228,7 @@ txgbe_rss_update(enum txgbe_mac_type mac_type)
case txgbe_mac_aml:
case txgbe_mac_aml40:
case txgbe_mac_aml_vf:
+ case txgbe_mac_aml40_vf:
return 1;
default:
return 0;
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index 7ec1e009ed..655ccc622f 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -77,6 +77,8 @@ static const struct rte_pci_id pci_id_txgbevf_map[] = {
{ RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML_VF) },
{ RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML5024_VF) },
{ RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML5124_VF) },
+ { RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML503F_VF) },
+ { RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML513F_VF) },
{ .vendor_id = 0, /* sentinel */ },
};
--
2.21.0.windows.1
^ permalink raw reply related
* [PATCH v2 3/4] net/txgbe: add support for VF sensing PF down
From: Zaiyu Wang @ 2026-06-17 11:33 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang, Jiawen Wu
In-Reply-To: <20260617113335.15648-1-zaiyuwang@trustnetic.com>
VFs should continue normal packet Rx/Tx after PF ifconfig down/up.
To achieve this, cooperate with mailbox commands added in our Linux
kernel driver txgbe-2.2.0. Detect PF ifconfig down when
TXGBE_VT_MSGTYPE_SPEC is present in mailbox commands. Detect PF ifconfig
up when mailbox commands lack TXGBE_VT_MSGTYPE_CTS. Upon detecting PF
up, the VF needs to be reset; the driver invokes the
RTE_ETH_EVENT_INTR_RESET callback to prompt users to reset the VF.
Additionally, hw->rx_loaded and hw->offset_loaded must be reset after
PF ifconfig up; otherwise, because hardware counter registers are cleared
during PF reset, the VF's software counters will overflow to 0xFFFFFFFF.
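The counter problem can be illustrated with a small standalone model. This is a sketch under assumptions, not the driver's code: struct vf_stat, its fields, and vf_stat_update are hypothetical, and the loaded flag is a stand-in that does not claim to mirror the exact semantics of hw->rx_loaded or hw->offset_loaded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; names are not taken from the driver. */
struct vf_stat {
	uint32_t last;   /* last raw 32-bit register value seen */
	uint64_t total;  /* accumulated software counter */
	bool loaded;     /* false: next read only re-baselines */
};

static inline void
vf_stat_update(struct vf_stat *s, uint32_t hw_now)
{
	if (!s->loaded) {
		/* Take a fresh baseline instead of computing a delta. */
		s->last = hw_now;
		s->loaded = true;
		return;
	}
	/* Unsigned subtraction handles normal 32-bit wraparound. */
	s->total += (uint32_t)(hw_now - s->last);
	s->last = hw_now;
}
```

If the register is cleared behind the model's back (last = 1200, next read = 5), the delta becomes (uint32_t)(5 - 1200) = 4294966101, which is the near-0xFFFFFFFF jump described above. Clearing loaded before the next read avoids it.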
Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
drivers/net/txgbe/base/txgbe_type.h | 1 +
drivers/net/txgbe/txgbe_ethdev.c | 3 +-
drivers/net/txgbe/txgbe_ethdev_vf.c | 58 +++++++++++++++++++++++++----
3 files changed, 54 insertions(+), 8 deletions(-)
diff --git a/drivers/net/txgbe/base/txgbe_type.h b/drivers/net/txgbe/base/txgbe_type.h
index ede780321f..956080c702 100644
--- a/drivers/net/txgbe/base/txgbe_type.h
+++ b/drivers/net/txgbe/base/txgbe_type.h
@@ -883,6 +883,7 @@ struct txgbe_hw {
rte_atomic32_t swfw_busy;
u32 fec_mode;
u32 cur_fec_link;
+ bool pf_running;
};
struct txgbe_backplane_ability {
diff --git a/drivers/net/txgbe/txgbe_ethdev.c b/drivers/net/txgbe/txgbe_ethdev.c
index 0f484dfe91..003a24141c 100644
--- a/drivers/net/txgbe/txgbe_ethdev.c
+++ b/drivers/net/txgbe/txgbe_ethdev.c
@@ -3150,7 +3150,8 @@ txgbe_dev_link_update_share(struct rte_eth_dev *dev,
hw->mac.get_link_status = true;
- if (intr->flags & TXGBE_FLAG_NEED_LINK_CONFIG)
+ if (intr->flags & TXGBE_FLAG_NEED_LINK_CONFIG ||
+ (txgbe_is_vf(hw) && !hw->pf_running))
return rte_eth_linkstatus_set(dev, &link);
/* check if it needs to wait to complete, if lsc interrupt is enabled */
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index 7a50c7a855..7ec1e009ed 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -281,6 +281,7 @@ eth_txgbevf_dev_init(struct rte_eth_dev *eth_dev)
hw->subsystem_device_id = pci_dev->id.subsystem_device_id;
hw->subsystem_vendor_id = pci_dev->id.subsystem_vendor_id;
hw->hw_addr = (void *)pci_dev->mem_resource[0].addr;
+ hw->pf_running = true;
/* initialize the vfta */
memset(shadow_vfta, 0, sizeof(*shadow_vfta));
@@ -1405,8 +1406,18 @@ static s32 txgbevf_get_pf_link_status(struct rte_eth_dev *dev)
if (retval)
return 0;
+ if (!(msgbuf[0] & TXGBE_NOFITY_VF_LINK_STATUS))
+ return 0;
+
rte_eth_linkstatus_get(dev, &link);
+ if (!hw->pf_running) {
+ link_up = false;
+ link_speed = TXGBE_LINK_SPEED_UNKNOWN;
+ link.link_duplex = RTE_ETH_LINK_HALF_DUPLEX;
+ return rte_eth_linkstatus_set(dev, &link);
+ }
+
link_up = msgbuf[1] & TXGBE_VFSTATUS_UP;
link_speed = (msgbuf[1] & 0xFFF0) >> 1;
@@ -1434,10 +1445,22 @@ static s32 txgbevf_get_pf_link_status(struct rte_eth_dev *dev)
static void txgbevf_check_link_for_intr(struct rte_eth_dev *dev)
{
struct rte_eth_link orig_link, new_link;
+ struct txgbe_hw *hw = TXGBE_DEV_HW(dev);
rte_eth_linkstatus_get(dev, &orig_link);
- txgbevf_dev_link_update(dev, 0);
- rte_eth_linkstatus_get(dev, &new_link);
+
+ if (hw->pf_running) {
+ txgbevf_dev_link_update(dev, 0);
+ rte_eth_linkstatus_get(dev, &new_link);
+ } else {
+ DEBUGOUT("PF ifconfig down, so VF link down");
+ new_link.link_status = RTE_ETH_LINK_DOWN;
+ new_link.link_speed = RTE_ETH_SPEED_NUM_NONE;
+ new_link.link_duplex = RTE_ETH_LINK_HALF_DUPLEX;
+ new_link.link_autoneg = !(dev->data->dev_conf.link_speeds &
+ RTE_ETH_LINK_SPEED_FIXED);
+ rte_eth_linkstatus_set(dev, &new_link);
+ }
PMD_DRV_LOG(INFO, "orig_link: %d, new_link: %d",
orig_link.link_status, new_link.link_status);
@@ -1450,6 +1473,8 @@ static void txgbevf_check_link_for_intr(struct rte_eth_dev *dev)
static void txgbevf_mbx_process(struct rte_eth_dev *dev)
{
struct txgbe_hw *hw = TXGBE_DEV_HW(dev);
+ struct txgbe_mbx_info *mbx = &hw->mbx;
+ u32 msgbuf[TXGBE_VF_PERMADDR_MSG_LEN] = {0};
u32 in_msg = 0;
/* peek the message first */
@@ -1457,14 +1482,33 @@ static void txgbevf_mbx_process(struct rte_eth_dev *dev)
/* PF reset VF event */
if (in_msg & TXGBE_PF_CONTROL_MSG) {
- if (in_msg & TXGBE_NOFITY_VF_LINK_STATUS) {
+ /* msg is not CTS, we need to do reset */
+ if (!(in_msg & TXGBE_VT_MSGTYPE_CTS)) {
+ /* send reset to PF to reconfig CTS flag */
+ int err = 0;
+
+ msgbuf[0] = TXGBE_VF_RESET;
+ err = mbx->write_posted(hw, msgbuf, 1, 0);
+ if (err) {
+ hw->pf_running = false;
+ txgbevf_check_link_for_intr(dev);
+ } else {
+ hw->pf_running = true;
+ rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_INTR_RESET,
+ NULL);
+ }
+ }
+
+ if (in_msg & TXGBE_NOFITY_VF_LINK_STATUS)
txgbevf_get_pf_link_status(dev);
- } else {
- /* dummy mbx read to ack pf */
- txgbe_read_mbx(hw, &in_msg, 1, 0);
+ else
/* check link status if pf ping vf */
txgbevf_check_link_for_intr(dev);
- }
+ }
+
+ if (!hw->pf_running) {
+ hw->rx_loaded = true;
+ hw->offset_loaded = true;
}
}
--
2.21.0.windows.1
^ permalink raw reply related
* [PATCH v2 2/4] net/txgbe: add USO support
From: Zaiyu Wang @ 2026-06-17 11:33 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang, Jiawen Wu
In-Reply-To: <20260617113335.15648-1-zaiyuwang@trustnetic.com>
USO (UDP Segmentation Offload), also known as UFO (UDP Fragmentation
Offload), is a hardware offload rarely seen in DPDK. Its implementation
is similar to TSO (TCP Segmentation Offload), so the driver enables
USO based on existing TSO support.
Note:
USO segments UDP packets, requiring hardware to recalculate both IP
and UDP checksums due to length change. Thus, USO implicitly requires
IP and UDP checksum offloads, same as TSO.
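Why the lengths, and therefore the checksums, change per segment can be shown with a small standalone calculation. This is a sketch under stated assumptions (IPv4 without options, no tunnel encapsulation); uso_split_lengths, struct seg_len, and the seg_payload parameter are hypothetical names, not driver or DPDK identifiers.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_IPV4_HDR_LEN 20u /* assumes no IPv4 options */
#define EX_UDP_HDR_LEN   8u

struct seg_len {
	uint16_t ip_total_len; /* IPv4 "total length" of this segment */
	uint16_t udp_len;      /* UDP "length" of this segment */
};

/*
 * Split payload_len bytes into segments of at most seg_payload bytes and
 * record the per-segment header length fields. Returns the number of
 * segments, or 0 on invalid input or if out_cap is too small.
 */
static size_t
uso_split_lengths(uint32_t payload_len, uint16_t seg_payload,
		  struct seg_len *out, size_t out_cap)
{
	const uint16_t hdrs = EX_IPV4_HDR_LEN + EX_UDP_HDR_LEN;
	size_t n = 0;

	if (seg_payload == 0 || seg_payload > UINT16_MAX - hdrs)
		return 0;

	while (payload_len > 0) {
		uint16_t chunk = payload_len > seg_payload ?
				 seg_payload : (uint16_t)payload_len;

		if (n == out_cap)
			return 0;
		out[n].udp_len = (uint16_t)(EX_UDP_HDR_LEN + chunk);
		out[n].ip_total_len = (uint16_t)(hdrs + chunk);
		n++;
		payload_len -= chunk;
	}
	return n;
}
```

For a 3000-byte payload with 1472-byte segments, the last segment carries 56 bytes, so its length fields differ from the first two and both checksums must be recomputed for it.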
Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
drivers/net/txgbe/txgbe_rxtx.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/net/txgbe/txgbe_rxtx.c b/drivers/net/txgbe/txgbe_rxtx.c
index e2cd9b8841..c4cbdbc2b4 100644
--- a/drivers/net/txgbe/txgbe_rxtx.c
+++ b/drivers/net/txgbe/txgbe_rxtx.c
@@ -58,6 +58,7 @@ static const u64 TXGBE_TX_OFFLOAD_MASK = (RTE_MBUF_F_TX_IP_CKSUM |
RTE_MBUF_F_TX_VLAN |
RTE_MBUF_F_TX_L4_MASK |
RTE_MBUF_F_TX_TCP_SEG |
+ RTE_MBUF_F_TX_UDP_SEG |
RTE_MBUF_F_TX_TUNNEL_MASK |
RTE_MBUF_F_TX_OUTER_IP_CKSUM |
RTE_MBUF_F_TX_OUTER_UDP_CKSUM |
@@ -367,7 +368,7 @@ txgbe_set_xmit_ctx(struct txgbe_tx_queue *txq,
type_tucmd_mlhl |= TXGBE_TXD_PTID(tx_offload.ptid);
/* check if TCP segmentation required for this packet */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
tx_offload_mask.l2_len |= ~0;
tx_offload_mask.l3_len |= ~0;
tx_offload_mask.l4_len |= ~0;
@@ -517,7 +518,7 @@ tx_desc_cksum_flags_to_olinfo(uint64_t ol_flags)
tmp |= TXGBE_TXD_CC;
tmp |= TXGBE_TXD_EIPCS;
}
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
tmp |= TXGBE_TXD_CC;
/* implies IPv4 cksum */
if (ol_flags & RTE_MBUF_F_TX_IPV4)
@@ -537,7 +538,7 @@ tx_desc_ol_flags_to_cmdtype(uint64_t ol_flags)
if (ol_flags & RTE_MBUF_F_TX_VLAN)
cmdtype |= TXGBE_TXD_VLE;
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))
cmdtype |= TXGBE_TXD_TSE;
if (ol_flags & RTE_MBUF_F_TX_MACSEC)
cmdtype |= TXGBE_TXD_LINKSEC;
@@ -587,6 +588,8 @@ tx_desc_ol_flags_to_ptype(uint64_t oflags)
if (oflags & RTE_MBUF_F_TX_TCP_SEG)
ptype |= (tun ? RTE_PTYPE_INNER_L4_TCP : RTE_PTYPE_L4_TCP);
+ else if (oflags & RTE_MBUF_F_TX_UDP_SEG)
+ ptype |= (tun ? RTE_PTYPE_INNER_L4_UDP : RTE_PTYPE_L4_UDP);
/* Tunnel */
switch (oflags & RTE_MBUF_F_TX_TUNNEL_MASK) {
@@ -1071,7 +1074,7 @@ txgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
olinfo_status = 0;
if (tx_ol_req) {
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
/* when TSO is on, paylen in descriptor is the
* not the packet len but the tcp payload len
*/
@@ -2389,7 +2392,7 @@ txgbe_get_tx_port_offloads(struct rte_eth_dev *dev)
RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
RTE_ETH_TX_OFFLOAD_SCTP_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_TSO |
- RTE_ETH_TX_OFFLOAD_UDP_TSO |
+ RTE_ETH_TX_OFFLOAD_UDP_TSO |
RTE_ETH_TX_OFFLOAD_UDP_TNL_TSO |
RTE_ETH_TX_OFFLOAD_IP_TNL_TSO |
RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO |
--
2.21.0.windows.1
^ permalink raw reply related
* [PATCH v2 1/4] net/ngbe: add USO support
From: Zaiyu Wang @ 2026-06-17 11:33 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang, Jiawen Wu
In-Reply-To: <20260617113335.15648-1-zaiyuwang@trustnetic.com>
USO (UDP Segmentation Offload), also known as UFO (UDP Fragmentation
Offload), is a hardware offload rarely seen in DPDK. Its implementation
is similar to TSO (TCP Segmentation Offload), so the driver enables
USO based on existing TSO support.
Note:
USO segments UDP packets, requiring hardware to recalculate both IP
and UDP checksums due to length change. Thus, USO implicitly requires
IP and UDP checksum offloads, same as TSO.
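Reusing the TSO path for USO comes down to testing both segmentation flags with one mask, then picking the L4 type from whichever flag is set. The sketch below uses stand-in bit values and hypothetical names (EX_TX_TCP_SEG, EX_TX_UDP_SEG, ex_needs_segmentation, ex_seg_l4); they are not the real RTE_MBUF_F_TX_* definitions.

```c
#include <stdint.h>

/* Stand-in bit values for illustration; NOT the real mbuf flag values. */
#define EX_TX_TCP_SEG (UINT64_C(1) << 0)
#define EX_TX_UDP_SEG (UINT64_C(1) << 1)

enum ex_l4 { EX_L4_NONE, EX_L4_TCP, EX_L4_UDP };

/* One mask test covers both TSO and USO requests. */
static inline int
ex_needs_segmentation(uint64_t ol_flags)
{
	return (ol_flags & (EX_TX_TCP_SEG | EX_TX_UDP_SEG)) != 0;
}

/* The L4 type follows whichever segmentation flag is set. */
static inline enum ex_l4
ex_seg_l4(uint64_t ol_flags)
{
	if (ol_flags & EX_TX_TCP_SEG)
		return EX_L4_TCP;
	else if (ol_flags & EX_TX_UDP_SEG)
		return EX_L4_UDP;
	return EX_L4_NONE;
}
```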
Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
drivers/net/ngbe/ngbe_rxtx.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ngbe/ngbe_rxtx.c b/drivers/net/ngbe/ngbe_rxtx.c
index 91e215694c..a1389de9c0 100644
--- a/drivers/net/ngbe/ngbe_rxtx.c
+++ b/drivers/net/ngbe/ngbe_rxtx.c
@@ -30,6 +30,7 @@ static const u64 NGBE_TX_OFFLOAD_MASK = (RTE_MBUF_F_TX_IP_CKSUM |
RTE_MBUF_F_TX_VLAN |
RTE_MBUF_F_TX_L4_MASK |
RTE_MBUF_F_TX_TCP_SEG |
+ RTE_MBUF_F_TX_UDP_SEG |
NGBE_TX_IEEE1588_TMST);
#define NGBE_TX_OFFLOAD_NOTSUP_MASK \
@@ -317,7 +318,7 @@ ngbe_set_xmit_ctx(struct ngbe_tx_queue *txq,
type_tucmd_mlhl |= NGBE_TXD_PTID(tx_offload.ptid);
/* check if TCP segmentation required for this packet */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
tx_offload_mask.l2_len |= ~0;
tx_offload_mask.l3_len |= ~0;
tx_offload_mask.l4_len |= ~0;
@@ -427,7 +428,7 @@ tx_desc_cksum_flags_to_olinfo(uint64_t ol_flags)
tmp |= NGBE_TXD_CC;
tmp |= NGBE_TXD_EIPCS;
}
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
tmp |= NGBE_TXD_CC;
/* implies IPv4 cksum */
if (ol_flags & RTE_MBUF_F_TX_IPV4)
@@ -447,7 +448,7 @@ tx_desc_ol_flags_to_cmdtype(uint64_t ol_flags)
if (ol_flags & RTE_MBUF_F_TX_VLAN)
cmdtype |= NGBE_TXD_VLE;
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))
cmdtype |= NGBE_TXD_TSE;
return cmdtype;
}
@@ -483,6 +484,8 @@ tx_desc_ol_flags_to_ptype(uint64_t oflags)
if (oflags & RTE_MBUF_F_TX_TCP_SEG)
ptype |= RTE_PTYPE_L4_TCP;
+ else if (oflags & RTE_MBUF_F_TX_UDP_SEG)
+ ptype |= RTE_PTYPE_L4_UDP;
return ptype;
}
@@ -764,7 +767,7 @@ ngbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
olinfo_status = 0;
if (tx_ol_req) {
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
/* when TSO is on, paylen in descriptor is the
* not the packet len but the tcp payload len
*/
@@ -1991,7 +1994,7 @@ ngbe_get_tx_port_offloads(struct rte_eth_dev *dev)
RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
RTE_ETH_TX_OFFLOAD_SCTP_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_TSO |
- RTE_ETH_TX_OFFLOAD_UDP_TSO |
+ RTE_ETH_TX_OFFLOAD_UDP_TSO |
RTE_ETH_TX_OFFLOAD_MULTI_SEGS;
if (hw->is_pf)
--
2.21.0.windows.1
^ permalink raw reply related
* [PATCH v2 0/4] Wangxun new feature
From: Zaiyu Wang @ 2026-06-17 11:33 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang
In-Reply-To: <20260617105959.10764-1-zaiyuwang@trustnetic.com>
This patchset introduces three new features and critical fixes for our
recent release cycle.
Patches 1-2 add support for UDP Segmentation Offload (USO) to improve
large-packet transmission performance for UDP workloads.
Patch 3 enables VFs to sense PF ifconfig down/up events, allowing
better fault tolerance and fast recovery in virtualized environments.
Patch 4 adds the missing VF support for the Amber-Lite 40G NICs, which
was previously omitted in the initial integration.
---
v2:
- Rebased on top of commit 72fdcb7bd19d to resolve conflict in
drivers/net/txgbe/base/txgbe_type.h.
- No code changes compared to v1.
---
Zaiyu Wang (4):
net/ngbe: add USO support
net/txgbe: add USO support
net/txgbe: add support for VF sensing PF down
net/txgbe: add VF support for Amber-Lite 40G NIC
drivers/net/ngbe/ngbe_rxtx.c | 13 +++---
drivers/net/txgbe/base/txgbe_devids.h | 2 +
drivers/net/txgbe/base/txgbe_hw.c | 7 ++++
drivers/net/txgbe/base/txgbe_regs.h | 7 +++-
drivers/net/txgbe/base/txgbe_type.h | 2 +
drivers/net/txgbe/base/txgbe_vf.c | 7 ++--
drivers/net/txgbe/txgbe_ethdev.c | 4 +-
drivers/net/txgbe/txgbe_ethdev_vf.c | 60 +++++++++++++++++++++++----
drivers/net/txgbe/txgbe_rxtx.c | 13 +++---
9 files changed, 92 insertions(+), 23 deletions(-)
--
2.21.0.windows.1
^ permalink raw reply
* [PATCH 3/4] net/txgbe: add support for VF sensing PF down
From: Zaiyu Wang @ 2026-06-17 10:59 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang, Jiawen Wu
In-Reply-To: <20260617105959.10764-1-zaiyuwang@trustnetic.com>
VFs should continue normal packet Rx/Tx after PF ifconfig down/up.
To achieve this, cooperate with mailbox commands added in our Linux
kernel driver txgbe-2.2.0. Detect PF ifconfig down when
TXGBE_VT_MSGTYPE_SPEC is present in mailbox commands. Detect PF ifconfig
up when mailbox commands lack TXGBE_VT_MSGTYPE_CTS. Upon detecting PF
up, the VF needs to be reset; the driver invokes the
RTE_ETH_EVENT_INTR_RESET callback to prompt users to reset the VF.
Additionally, hw->rx_loaded and hw->offset_loaded must be reset after
PF ifconfig up; otherwise, because hardware counter registers are cleared
during PF reset, the VF's software counters will overflow to 0xFFFFFFFF.
Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
drivers/net/txgbe/base/txgbe_type.h | 1 +
drivers/net/txgbe/txgbe_ethdev.c | 3 +-
drivers/net/txgbe/txgbe_ethdev_vf.c | 58 +++++++++++++++++++++++++----
3 files changed, 54 insertions(+), 8 deletions(-)
diff --git a/drivers/net/txgbe/base/txgbe_type.h b/drivers/net/txgbe/base/txgbe_type.h
index 2e2d79e0e1..d9d08b79a2 100644
--- a/drivers/net/txgbe/base/txgbe_type.h
+++ b/drivers/net/txgbe/base/txgbe_type.h
@@ -909,6 +909,7 @@ struct txgbe_hw {
bool an_done;
u32 fsm;
u64 bp_event_interval;
+ bool pf_running;
};
typedef enum {
diff --git a/drivers/net/txgbe/txgbe_ethdev.c b/drivers/net/txgbe/txgbe_ethdev.c
index 2ed9a8c179..6e7ac1320f 100644
--- a/drivers/net/txgbe/txgbe_ethdev.c
+++ b/drivers/net/txgbe/txgbe_ethdev.c
@@ -3387,7 +3387,8 @@ txgbe_dev_link_update_share(struct rte_eth_dev *dev,
hw->mac.get_link_status = true;
- if (intr->flags & TXGBE_FLAG_NEED_LINK_CONFIG)
+ if (intr->flags & TXGBE_FLAG_NEED_LINK_CONFIG ||
+ (txgbe_is_vf(hw) && !hw->pf_running))
return rte_eth_linkstatus_set(dev, &link);
/* check if it needs to wait to complete, if lsc interrupt is enabled */
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index 7a50c7a855..7ec1e009ed 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -281,6 +281,7 @@ eth_txgbevf_dev_init(struct rte_eth_dev *eth_dev)
hw->subsystem_device_id = pci_dev->id.subsystem_device_id;
hw->subsystem_vendor_id = pci_dev->id.subsystem_vendor_id;
hw->hw_addr = (void *)pci_dev->mem_resource[0].addr;
+ hw->pf_running = true;
/* initialize the vfta */
memset(shadow_vfta, 0, sizeof(*shadow_vfta));
@@ -1405,8 +1406,18 @@ static s32 txgbevf_get_pf_link_status(struct rte_eth_dev *dev)
if (retval)
return 0;
+ if (!(msgbuf[0] & TXGBE_NOFITY_VF_LINK_STATUS))
+ return 0;
+
rte_eth_linkstatus_get(dev, &link);
+ if (!hw->pf_running) {
+ link_up = false;
+ link_speed = TXGBE_LINK_SPEED_UNKNOWN;
+ link.link_duplex = RTE_ETH_LINK_HALF_DUPLEX;
+ return rte_eth_linkstatus_set(dev, &link);
+ }
+
link_up = msgbuf[1] & TXGBE_VFSTATUS_UP;
link_speed = (msgbuf[1] & 0xFFF0) >> 1;
@@ -1434,10 +1445,22 @@ static s32 txgbevf_get_pf_link_status(struct rte_eth_dev *dev)
static void txgbevf_check_link_for_intr(struct rte_eth_dev *dev)
{
struct rte_eth_link orig_link, new_link;
+ struct txgbe_hw *hw = TXGBE_DEV_HW(dev);
rte_eth_linkstatus_get(dev, &orig_link);
- txgbevf_dev_link_update(dev, 0);
- rte_eth_linkstatus_get(dev, &new_link);
+
+ if (hw->pf_running) {
+ txgbevf_dev_link_update(dev, 0);
+ rte_eth_linkstatus_get(dev, &new_link);
+ } else {
+ DEBUGOUT("PF ifconfig down, so VF link down");
+ new_link.link_status = RTE_ETH_LINK_DOWN;
+ new_link.link_speed = RTE_ETH_SPEED_NUM_NONE;
+ new_link.link_duplex = RTE_ETH_LINK_HALF_DUPLEX;
+ new_link.link_autoneg = !(dev->data->dev_conf.link_speeds &
+ RTE_ETH_LINK_SPEED_FIXED);
+ rte_eth_linkstatus_set(dev, &new_link);
+ }
PMD_DRV_LOG(INFO, "orig_link: %d, new_link: %d",
orig_link.link_status, new_link.link_status);
@@ -1450,6 +1473,8 @@ static void txgbevf_check_link_for_intr(struct rte_eth_dev *dev)
static void txgbevf_mbx_process(struct rte_eth_dev *dev)
{
struct txgbe_hw *hw = TXGBE_DEV_HW(dev);
+ struct txgbe_mbx_info *mbx = &hw->mbx;
+ u32 msgbuf[TXGBE_VF_PERMADDR_MSG_LEN] = {0};
u32 in_msg = 0;
/* peek the message first */
@@ -1457,14 +1482,33 @@ static void txgbevf_mbx_process(struct rte_eth_dev *dev)
/* PF reset VF event */
if (in_msg & TXGBE_PF_CONTROL_MSG) {
- if (in_msg & TXGBE_NOFITY_VF_LINK_STATUS) {
+ /* msg is not CTS, we need to do reset */
+ if (!(in_msg & TXGBE_VT_MSGTYPE_CTS)) {
+ /* send reset to PF to reconfig CTS flag */
+ int err = 0;
+
+ msgbuf[0] = TXGBE_VF_RESET;
+ err = mbx->write_posted(hw, msgbuf, 1, 0);
+ if (err) {
+ hw->pf_running = false;
+ txgbevf_check_link_for_intr(dev);
+ } else {
+ hw->pf_running = true;
+ rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_INTR_RESET,
+ NULL);
+ }
+ }
+
+ if (in_msg & TXGBE_NOFITY_VF_LINK_STATUS)
txgbevf_get_pf_link_status(dev);
- } else {
- /* dummy mbx read to ack pf */
- txgbe_read_mbx(hw, &in_msg, 1, 0);
+ else
/* check link status if pf ping vf */
txgbevf_check_link_for_intr(dev);
- }
+ }
+
+ if (!hw->pf_running) {
+ hw->rx_loaded = true;
+ hw->offset_loaded = true;
}
}
--
2.21.0.windows.1
^ permalink raw reply related
* [PATCH 4/4] net/txgbe: add VF support for Amber-Lite 40G NIC
From: Zaiyu Wang @ 2026-06-17 10:59 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang, Jiawen Wu
In-Reply-To: <20260617105959.10764-1-zaiyuwang@trustnetic.com>
VF support for the 40G NIC was previously omitted; only the 25G VF was
added. Add 40G VF support based on the existing 25G VF implementation,
with no major changes beyond device ID adaptation.
Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
drivers/net/txgbe/base/txgbe_devids.h | 2 ++
drivers/net/txgbe/base/txgbe_hw.c | 7 +++++++
drivers/net/txgbe/base/txgbe_regs.h | 7 +++++--
drivers/net/txgbe/base/txgbe_type.h | 1 +
drivers/net/txgbe/base/txgbe_vf.c | 7 ++++---
drivers/net/txgbe/txgbe_ethdev.c | 1 +
drivers/net/txgbe/txgbe_ethdev_vf.c | 2 ++
7 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/drivers/net/txgbe/base/txgbe_devids.h b/drivers/net/txgbe/base/txgbe_devids.h
index b7133c7d54..f5454ffbb1 100644
--- a/drivers/net/txgbe/base/txgbe_devids.h
+++ b/drivers/net/txgbe/base/txgbe_devids.h
@@ -28,6 +28,8 @@
#define TXGBE_DEV_ID_AML_VF 0x5001
#define TXGBE_DEV_ID_AML5024_VF 0x5024
#define TXGBE_DEV_ID_AML5124_VF 0x5124
+#define TXGBE_DEV_ID_AML503F_VF 0x503f
+#define TXGBE_DEV_ID_AML513F_VF 0x513f
/*
* Subsystem IDs
diff --git a/drivers/net/txgbe/base/txgbe_hw.c b/drivers/net/txgbe/base/txgbe_hw.c
index c84656e206..2650b8b7f1 100644
--- a/drivers/net/txgbe/base/txgbe_hw.c
+++ b/drivers/net/txgbe/base/txgbe_hw.c
@@ -2552,6 +2552,7 @@ s32 txgbe_init_shared_code(struct txgbe_hw *hw)
break;
case txgbe_mac_sp_vf:
case txgbe_mac_aml_vf:
+ case txgbe_mac_aml40_vf:
status = txgbe_init_ops_vf(hw);
break;
default:
@@ -2582,6 +2583,7 @@ bool txgbe_is_vf(struct txgbe_hw *hw)
switch (hw->mac.type) {
case txgbe_mac_sp_vf:
case txgbe_mac_aml_vf:
+ case txgbe_mac_aml40_vf:
return true;
default:
return false;
@@ -2629,6 +2631,11 @@ s32 txgbe_set_mac_type(struct txgbe_hw *hw)
hw->phy.media_type = txgbe_media_type_virtual;
hw->mac.type = txgbe_mac_aml_vf;
break;
+ case TXGBE_DEV_ID_AML503F_VF:
+ case TXGBE_DEV_ID_AML513F_VF:
+ hw->phy.media_type = txgbe_media_type_virtual;
+ hw->mac.type = txgbe_mac_aml40_vf;
+ break;
default:
err = TXGBE_ERR_DEVICE_NOT_SUPPORTED;
DEBUGOUT("Unsupported device id: %x", hw->device_id);
diff --git a/drivers/net/txgbe/base/txgbe_regs.h b/drivers/net/txgbe/base/txgbe_regs.h
index 3c4c696c00..bf46a80862 100644
--- a/drivers/net/txgbe/base/txgbe_regs.h
+++ b/drivers/net/txgbe/base/txgbe_regs.h
@@ -1829,12 +1829,14 @@ txgbe_map_reg(struct txgbe_hw *hw, u32 reg)
switch (reg) {
case TXGBE_REG_RSSTBL:
if (hw->mac.type == txgbe_mac_sp_vf ||
- hw->mac.type == txgbe_mac_aml_vf)
+ hw->mac.type == txgbe_mac_aml_vf ||
+ hw->mac.type == txgbe_mac_aml40_vf)
reg = TXGBE_VFRSSTBL(0);
break;
case TXGBE_REG_RSSKEY:
if (hw->mac.type == txgbe_mac_sp_vf ||
- hw->mac.type == txgbe_mac_aml_vf)
+ hw->mac.type == txgbe_mac_aml_vf ||
+ hw->mac.type == txgbe_mac_aml40_vf)
reg = TXGBE_VFRSSKEY(0);
break;
default:
@@ -2017,6 +2019,7 @@ static inline void txgbe_flush(struct txgbe_hw *hw)
break;
case txgbe_mac_sp_vf:
case txgbe_mac_aml_vf:
+ case txgbe_mac_aml40_vf:
rd32(hw, TXGBE_VFSTATUS);
break;
default:
diff --git a/drivers/net/txgbe/base/txgbe_type.h b/drivers/net/txgbe/base/txgbe_type.h
index d9d08b79a2..4b5ff7da17 100644
--- a/drivers/net/txgbe/base/txgbe_type.h
+++ b/drivers/net/txgbe/base/txgbe_type.h
@@ -174,6 +174,7 @@ enum txgbe_mac_type {
txgbe_mac_aml40,
txgbe_mac_sp_vf,
txgbe_mac_aml_vf,
+ txgbe_mac_aml40_vf,
txgbe_num_macs
};
diff --git a/drivers/net/txgbe/base/txgbe_vf.c b/drivers/net/txgbe/base/txgbe_vf.c
index 1a8a20f104..4412006f1f 100644
--- a/drivers/net/txgbe/base/txgbe_vf.c
+++ b/drivers/net/txgbe/base/txgbe_vf.c
@@ -134,7 +134,9 @@ s32 txgbe_reset_hw_vf(struct txgbe_hw *hw)
}
/* amlite: bme */
- if (hw->mac.type == txgbe_mac_aml_vf)
+ if (hw->mac.type == txgbe_mac_aml_vf ||
+ hw->mac.type == txgbe_mac_aml40_vf)
+
wr32(hw, TXGBE_BME_AML, 0x1);
if (!timeout)
@@ -493,8 +495,7 @@ s32 txgbe_check_mac_link_vf(struct txgbe_hw *hw, u32 *speed,
/* for SFP+ modules and DA cables it can take up to 500usecs
* before the link status is correct
*/
- if ((mac->type == txgbe_mac_sp_vf ||
- mac->type == txgbe_mac_aml_vf) && wait_to_complete) {
+ if (wait_to_complete) {
if (po32m(hw, TXGBE_VFSTATUS, TXGBE_VFSTATUS_UP,
0, NULL, 5, 100))
goto out;
diff --git a/drivers/net/txgbe/txgbe_ethdev.c b/drivers/net/txgbe/txgbe_ethdev.c
index 6e7ac1320f..16a548e6d0 100644
--- a/drivers/net/txgbe/txgbe_ethdev.c
+++ b/drivers/net/txgbe/txgbe_ethdev.c
@@ -5602,6 +5602,7 @@ txgbe_rss_update(enum txgbe_mac_type mac_type)
case txgbe_mac_aml:
case txgbe_mac_aml40:
case txgbe_mac_aml_vf:
+ case txgbe_mac_aml40_vf:
return 1;
default:
return 0;
diff --git a/drivers/net/txgbe/txgbe_ethdev_vf.c b/drivers/net/txgbe/txgbe_ethdev_vf.c
index 7ec1e009ed..655ccc622f 100644
--- a/drivers/net/txgbe/txgbe_ethdev_vf.c
+++ b/drivers/net/txgbe/txgbe_ethdev_vf.c
@@ -77,6 +77,8 @@ static const struct rte_pci_id pci_id_txgbevf_map[] = {
{ RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML_VF) },
{ RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML5024_VF) },
{ RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML5124_VF) },
+ { RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML503F_VF) },
+ { RTE_PCI_DEVICE(PCI_VENDOR_ID_WANGXUN, TXGBE_DEV_ID_AML513F_VF) },
{ .vendor_id = 0, /* sentinel */ },
};
--
2.21.0.windows.1
^ permalink raw reply related
* [PATCH 2/4] net/txgbe: add USO support
From: Zaiyu Wang @ 2026-06-17 10:59 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang, Jiawen Wu
In-Reply-To: <20260617105959.10764-1-zaiyuwang@trustnetic.com>
USO (UDP Segmentation Offload), also known as UFO (UDP Fragmentation
Offload), is a hardware offload rarely seen in DPDK. Its implementation
is similar to TSO (TCP Segmentation Offload), so the driver enables
USO based on existing TSO support.
Note:
USO splits one UDP packet into several segments, so the length fields
change and hardware must recalculate both the IP and UDP checksums.
USO therefore implicitly requires IP and UDP checksum offloads, in the
same way as TSO.
Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
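Illustration only, not part of the patch: a minimal standalone sketch of
the flag handling this change introduces, assuming stand-in constants.
The SKETCH_* names and their bit values are hypothetical and are not the
real RTE_MBUF_F_TX_* or RTE_PTYPE_* definitions. It shows the combined
TCP/UDP segmentation test and the else-if packet type selection, where
the TCP flag takes precedence if both are set.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the mbuf Tx flags; NOT the real DPDK values. */
#define SKETCH_TX_TCP_SEG (UINT64_C(1) << 0)
#define SKETCH_TX_UDP_SEG (UINT64_C(1) << 1)

/* Hypothetical stand-ins for the L4 packet type bits. */
#define SKETCH_PTYPE_L4_TCP       0x1u
#define SKETCH_PTYPE_L4_UDP       0x2u
#define SKETCH_PTYPE_INNER_L4_TCP 0x4u
#define SKETCH_PTYPE_INNER_L4_UDP 0x8u

/* Nonzero when either TCP or UDP segmentation is requested. */
static int
sketch_seg_requested(uint64_t ol_flags)
{
	return (ol_flags & (SKETCH_TX_TCP_SEG | SKETCH_TX_UDP_SEG)) != 0;
}

/*
 * Select the L4 packet type for a segmentation request.
 * TCP is checked first, so it wins if both flags are set.
 * Returns 0 when no segmentation flag is present.
 */
static uint32_t
sketch_seg_ptype(uint64_t ol_flags, int tun)
{
	if (ol_flags & SKETCH_TX_TCP_SEG)
		return tun ? SKETCH_PTYPE_INNER_L4_TCP : SKETCH_PTYPE_L4_TCP;
	else if (ol_flags & SKETCH_TX_UDP_SEG)
		return tun ? SKETCH_PTYPE_INNER_L4_UDP : SKETCH_PTYPE_L4_UDP;
	return 0;
}
```

With these stand-ins, a UDP-only request on a tunneled packet selects the
inner UDP type, which mirrors the new branch in tx_desc_ol_flags_to_ptype().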
drivers/net/txgbe/txgbe_rxtx.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/net/txgbe/txgbe_rxtx.c b/drivers/net/txgbe/txgbe_rxtx.c
index f51c6193a9..77ec9f1e39 100644
--- a/drivers/net/txgbe/txgbe_rxtx.c
+++ b/drivers/net/txgbe/txgbe_rxtx.c
@@ -58,6 +58,7 @@ static const u64 TXGBE_TX_OFFLOAD_MASK = (RTE_MBUF_F_TX_IP_CKSUM |
RTE_MBUF_F_TX_VLAN |
RTE_MBUF_F_TX_L4_MASK |
RTE_MBUF_F_TX_TCP_SEG |
+ RTE_MBUF_F_TX_UDP_SEG |
RTE_MBUF_F_TX_TUNNEL_MASK |
RTE_MBUF_F_TX_OUTER_IP_CKSUM |
RTE_MBUF_F_TX_OUTER_UDP_CKSUM |
@@ -366,7 +367,7 @@ txgbe_set_xmit_ctx(struct txgbe_tx_queue *txq,
type_tucmd_mlhl |= TXGBE_TXD_PTID(tx_offload.ptid);
/* check if TCP segmentation required for this packet */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
tx_offload_mask.l2_len |= ~0;
tx_offload_mask.l3_len |= ~0;
tx_offload_mask.l4_len |= ~0;
@@ -516,7 +517,7 @@ tx_desc_cksum_flags_to_olinfo(uint64_t ol_flags)
tmp |= TXGBE_TXD_CC;
tmp |= TXGBE_TXD_EIPCS;
}
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
tmp |= TXGBE_TXD_CC;
/* implies IPv4 cksum */
if (ol_flags & RTE_MBUF_F_TX_IPV4)
@@ -536,7 +537,7 @@ tx_desc_ol_flags_to_cmdtype(uint64_t ol_flags)
if (ol_flags & RTE_MBUF_F_TX_VLAN)
cmdtype |= TXGBE_TXD_VLE;
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))
cmdtype |= TXGBE_TXD_TSE;
if (ol_flags & RTE_MBUF_F_TX_MACSEC)
cmdtype |= TXGBE_TXD_LINKSEC;
@@ -586,6 +587,8 @@ tx_desc_ol_flags_to_ptype(uint64_t oflags)
if (oflags & RTE_MBUF_F_TX_TCP_SEG)
ptype |= (tun ? RTE_PTYPE_INNER_L4_TCP : RTE_PTYPE_L4_TCP);
+ else if (oflags & RTE_MBUF_F_TX_UDP_SEG)
+ ptype |= (tun ? RTE_PTYPE_INNER_L4_UDP : RTE_PTYPE_L4_UDP);
/* Tunnel */
switch (oflags & RTE_MBUF_F_TX_TUNNEL_MASK) {
@@ -1071,7 +1074,7 @@ txgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
olinfo_status = 0;
if (tx_ol_req) {
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
/* when TSO is on, paylen in descriptor is the
* not the packet len but the tcp payload len
*/
@@ -2395,7 +2398,7 @@ txgbe_get_tx_port_offloads(struct rte_eth_dev *dev)
RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
RTE_ETH_TX_OFFLOAD_SCTP_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_TSO |
- RTE_ETH_TX_OFFLOAD_UDP_TSO |
+ RTE_ETH_TX_OFFLOAD_UDP_TSO |
RTE_ETH_TX_OFFLOAD_UDP_TNL_TSO |
RTE_ETH_TX_OFFLOAD_IP_TNL_TSO |
RTE_ETH_TX_OFFLOAD_VXLAN_TNL_TSO |
--
2.21.0.windows.1
^ permalink raw reply related
* [PATCH 1/4] net/ngbe: add USO support
From: Zaiyu Wang @ 2026-06-17 10:59 UTC (permalink / raw)
To: dev; +Cc: Zaiyu Wang, Jiawen Wu
In-Reply-To: <20260617105959.10764-1-zaiyuwang@trustnetic.com>
USO (UDP Segmentation Offload), also known as UFO (UDP Fragmentation
Offload), is a hardware offload rarely seen in DPDK. Its implementation
is similar to TSO (TCP Segmentation Offload), so the driver enables
USO based on existing TSO support.
Note:
USO splits one UDP packet into several segments, so the length fields
change and hardware must recalculate both the IP and UDP checksums.
USO therefore implicitly requires IP and UDP checksum offloads, in the
same way as TSO.
Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
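Illustration only, not part of the patch: a worked sketch of the length
arithmetic behind segmentation, assuming the descriptor payload length is
the packet length minus the L2, L3 and L4 headers (as the "paylen" comment
in ngbe_xmit_pkts() suggests) and that each segment carries at most segsz
payload bytes. Both helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: L4 payload bytes, i.e. packet length minus the
 * L2 + L3 + L4 header lengths. Returns 0 if the headers exceed pkt_len.
 */
static uint32_t
sketch_seg_paylen(uint32_t pkt_len, uint32_t l2_len, uint32_t l3_len,
		uint32_t l4_len)
{
	uint32_t hdr_len = l2_len + l3_len + l4_len;

	return pkt_len > hdr_len ? pkt_len - hdr_len : 0;
}

/*
 * Hypothetical helper: number of segments needed for paylen bytes when
 * each segment holds at most segsz bytes, i.e. ceil(paylen / segsz).
 * Returns 0 when segsz is 0 to avoid dividing by zero.
 */
static uint32_t
sketch_seg_count(uint32_t paylen, uint32_t segsz)
{
	if (segsz == 0)
		return 0;
	return paylen / segsz + (paylen % segsz != 0);
}
```

For example, a 4042-byte frame with a 14-byte Ethernet header, a 20-byte
IPv4 header and an 8-byte UDP header carries 4000 payload bytes, which a
segment size of 1472 splits into three segments under these assumptions.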
drivers/net/ngbe/ngbe_rxtx.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ngbe/ngbe_rxtx.c b/drivers/net/ngbe/ngbe_rxtx.c
index 91e215694c..a1389de9c0 100644
--- a/drivers/net/ngbe/ngbe_rxtx.c
+++ b/drivers/net/ngbe/ngbe_rxtx.c
@@ -30,6 +30,7 @@ static const u64 NGBE_TX_OFFLOAD_MASK = (RTE_MBUF_F_TX_IP_CKSUM |
RTE_MBUF_F_TX_VLAN |
RTE_MBUF_F_TX_L4_MASK |
RTE_MBUF_F_TX_TCP_SEG |
+ RTE_MBUF_F_TX_UDP_SEG |
NGBE_TX_IEEE1588_TMST);
#define NGBE_TX_OFFLOAD_NOTSUP_MASK \
@@ -317,7 +318,7 @@ ngbe_set_xmit_ctx(struct ngbe_tx_queue *txq,
type_tucmd_mlhl |= NGBE_TXD_PTID(tx_offload.ptid);
/* check if TCP segmentation required for this packet */
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
tx_offload_mask.l2_len |= ~0;
tx_offload_mask.l3_len |= ~0;
tx_offload_mask.l4_len |= ~0;
@@ -427,7 +428,7 @@ tx_desc_cksum_flags_to_olinfo(uint64_t ol_flags)
tmp |= NGBE_TXD_CC;
tmp |= NGBE_TXD_EIPCS;
}
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
tmp |= NGBE_TXD_CC;
/* implies IPv4 cksum */
if (ol_flags & RTE_MBUF_F_TX_IPV4)
@@ -447,7 +448,7 @@ tx_desc_ol_flags_to_cmdtype(uint64_t ol_flags)
if (ol_flags & RTE_MBUF_F_TX_VLAN)
cmdtype |= NGBE_TXD_VLE;
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG)
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))
cmdtype |= NGBE_TXD_TSE;
return cmdtype;
}
@@ -483,6 +484,8 @@ tx_desc_ol_flags_to_ptype(uint64_t oflags)
if (oflags & RTE_MBUF_F_TX_TCP_SEG)
ptype |= RTE_PTYPE_L4_TCP;
+ else if (oflags & RTE_MBUF_F_TX_UDP_SEG)
+ ptype |= RTE_PTYPE_L4_UDP;
return ptype;
}
@@ -764,7 +767,7 @@ ngbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
olinfo_status = 0;
if (tx_ol_req) {
- if (ol_flags & RTE_MBUF_F_TX_TCP_SEG) {
+ if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) {
/* when TSO is on, paylen in descriptor is the
* not the packet len but the tcp payload len
*/
@@ -1991,7 +1994,7 @@ ngbe_get_tx_port_offloads(struct rte_eth_dev *dev)
RTE_ETH_TX_OFFLOAD_TCP_CKSUM |
RTE_ETH_TX_OFFLOAD_SCTP_CKSUM |
RTE_ETH_TX_OFFLOAD_TCP_TSO |
- RTE_ETH_TX_OFFLOAD_UDP_TSO |
+ RTE_ETH_TX_OFFLOAD_UDP_TSO |
RTE_ETH_TX_OFFLOAD_MULTI_SEGS;
if (hw->is_pf)
--
2.21.0.windows.1
^ permalink raw reply related