DPDK-dev Archive on lore.kernel.org


* [PATCH 5/6] ip_frag: reject oversized reassembled datagrams
From: Stephen Hemminger @ 2026-06-16 21:05 UTC
  To: dev; +Cc: Stephen Hemminger, stable, Konstantin Ananyev
In-Reply-To: <20260616210656.464062-1-stephen@networkplumber.org>

The reassembled total length of a packet must not exceed 65535 bytes.
A fragment with a high offset could drive the sum past that,
causing silent truncation, since the IP payload_len/total_length field
is 16 bits.

A valid datagram never exceeds 65535 bytes, so reject any fragment
whose resulting length would exceed that.
Fold the test into the existing zero-length check.

Fixes: cc8f4d020c0b ("examples/ip_reassembly: initial import")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/ip_frag/rte_ipv4_reassembly.c | 9 +++++++--
 lib/ip_frag/rte_ipv6_reassembly.c | 9 +++++++--
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/lib/ip_frag/rte_ipv4_reassembly.c b/lib/ip_frag/rte_ipv4_reassembly.c
index 980f7a3b77..727fc58243 100644
--- a/lib/ip_frag/rte_ipv4_reassembly.c
+++ b/lib/ip_frag/rte_ipv4_reassembly.c
@@ -136,8 +136,13 @@ rte_ipv4_frag_reassemble_packet(struct rte_ip_frag_tbl *tbl,
 		tbl, tbl->max_cycles, tbl->entry_mask, tbl->max_entries,
 		tbl->use_entries);
 
-	/* check that fragment length is greater then zero. */
-	if (ip_len <= 0) {
+	/*
+	 * Drop fragments with no payload, and any fragment whose end would
+	 * make the reassembled datagram exceed the maximum IPv4 size. The
+	 * total_length field is 16 bits, so otherwise it is silently
+	 * truncated while the mbuf still holds the full length.
+	 */
+	if (ip_len <= 0 || ip_ofs + ip_len + mb->l3_len > UINT16_MAX) {
 		IP_FRAG_MBUF2DR(dr, mb);
 		return NULL;
 	}
diff --git a/lib/ip_frag/rte_ipv6_reassembly.c b/lib/ip_frag/rte_ipv6_reassembly.c
index 7c1659002b..0b44275b37 100644
--- a/lib/ip_frag/rte_ipv6_reassembly.c
+++ b/lib/ip_frag/rte_ipv6_reassembly.c
@@ -174,8 +174,13 @@ rte_ipv6_frag_reassemble_packet(struct rte_ip_frag_tbl *tbl,
 		tbl, tbl->max_cycles, tbl->entry_mask, tbl->max_entries,
 		tbl->use_entries);
 
-	/* check that fragment length is greater then zero. */
-	if (ip_len <= 0) {
+	/*
+	 * Drop fragments with no payload, and any fragment whose end would
+	 * make the reassembled payload exceed 65535 bytes. The payload_len
+	 * field is 16 bits, so otherwise it is silently truncated while the
+	 * mbuf still holds the full length.
+	 */
+	if (ip_len <= 0 || ip_ofs + ip_len > UINT16_MAX) {
 		IP_FRAG_MBUF2DR(dr, mb);
 		return NULL;
 	}
-- 
2.53.0
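
The bound checks in the patch above can be sketched in isolation as
follows. This is a minimal illustration, not the library code: the helper
names (ipv4_frag_too_big, ipv6_frag_too_big, truncate_to_u16) are
hypothetical, and the arguments are widened to uint32_t so that the sum
cannot wrap before the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the IPv4 test in the patch: fragment
 * offset plus fragment payload plus IP header length must fit in the
 * 16-bit total_length field.
 */
static bool
ipv4_frag_too_big(uint32_t ip_ofs, uint32_t ip_len, uint32_t l3_len)
{
	return ip_ofs + ip_len + l3_len > UINT16_MAX;
}

/*
 * Hypothetical helper mirroring the IPv6 test: payload_len excludes the
 * fixed IPv6 header, so only offset plus fragment payload is checked.
 */
static bool
ipv6_frag_too_big(uint32_t ip_ofs, uint32_t ip_len)
{
	return ip_ofs + ip_len > UINT16_MAX;
}

/* What a 16-bit length field would hold if the sum were stored unchecked. */
static uint16_t
truncate_to_u16(uint32_t total)
{
	return (uint16_t)total;
}
```

For example, an IPv4 fragment at the maximum offset (8191 * 8 = 65528)
carrying 1480 bytes behind a 20-byte header gives a sum of 67028, which a
16-bit field would store as 1492.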



* [PATCH 4/6] ip_frag: drop IPv6 fragments with unexpected headers
From: Stephen Hemminger @ 2026-06-16 21:05 UTC
  To: dev
  Cc: Stephen Hemminger, stable, Konstantin Ananyev, Anatoly Burakov,
	Thomas Monjalon
In-Reply-To: <20260616210656.464062-1-stephen@networkplumber.org>

The DPDK IPv6 reassembly code only handles a fragment header placed
directly after the IPv6 header. With other extension headers in the
unfragmentable part, ipv6_frag_reassemble() patches the wrong
next-header field, miscomputes the payload length, and shifts the
wrong bytes, corrupting the result.

Drop the fragment when l3_len covers more than the IPv6 and fragment
headers. RFC 8200 allows a receiver to discard packets whose extension
headers are not in the recommended order, and RFC 9099 recommends
dropping non-conforming fragmented IPv6 packets, so dropping here is
permitted rather than a deviation.

Fixes: 4f1a8f633862 ("ip_frag: add IPv6 reassembly")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/ip_frag/rte_ipv6_reassembly.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/lib/ip_frag/rte_ipv6_reassembly.c b/lib/ip_frag/rte_ipv6_reassembly.c
index 0e809a01e5..7c1659002b 100644
--- a/lib/ip_frag/rte_ipv6_reassembly.c
+++ b/lib/ip_frag/rte_ipv6_reassembly.c
@@ -180,6 +180,19 @@ rte_ipv6_frag_reassemble_packet(struct rte_ip_frag_tbl *tbl,
 		return NULL;
 	}
 
+	/*
+	 * Only a fragment header directly following the IPv6 header is
+	 * supported. Other extension headers in the unfragmentable part are
+	 * not handled: ipv6_frag_reassemble() assumes l3_len covers exactly
+	 * the IPv6 and fragment headers when it patches the next-header field
+	 * and removes the fragment header. Drop the fragment rather than
+	 * produce a corrupt datagram.
+	 */
+	if (mb->l3_len != sizeof(struct rte_ipv6_hdr) + sizeof(*frag_hdr)) {
+		IP_FRAG_MBUF2DR(dr, mb);
+		return NULL;
+	}
+
 	if (unlikely(trim > 0))
 		rte_pktmbuf_trim(mb, trim);
 
-- 
2.53.0
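
The header-length test in the patch above can be sketched on its own as
follows. This is an illustration under stated assumptions: the two size
constants stand in for sizeof(struct rte_ipv6_hdr) and sizeof(*frag_hdr)
and take their values from the RFC 8200 header formats (40-byte fixed
header, 8-byte Fragment header); the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sizes per RFC 8200; stand-ins for the DPDK struct sizes. */
#define IPV6_FIXED_HDR_LEN 40u
#define IPV6_FRAG_HDR_LEN   8u

/*
 * Hypothetical helper: accept a fragment only when l3_len covers exactly
 * the fixed IPv6 header followed directly by the Fragment header.
 */
static bool
ipv6_frag_l3_len_supported(uint32_t l3_len)
{
	return l3_len == IPV6_FIXED_HDR_LEN + IPV6_FRAG_HDR_LEN;
}
```

With an 8-byte Hop-by-Hop Options header ahead of the Fragment header,
l3_len would be 56 rather than 48, so such a fragment would be dropped.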



* [PATCH 3/6] ip_frag: include protocol in IPv4 reassembly key
From: Stephen Hemminger @ 2026-06-16 21:05 UTC
  To: dev; +Cc: Stephen Hemminger, stable, Konstantin Ananyev
In-Reply-To: <20260616210656.464062-1-stephen@networkplumber.org>

The DPDK IPv4 reassembly code did not follow RFC 791 section 3.2,
which says:
    The internet identification field (ID) is used together with the
    source and destination address, and the protocol fields, to identify
    datagram fragments for reassembly.

Omitting the protocol means two datagrams between the
same pair of hosts that share an IP id but carry different protocols
(for example UDP and ICMP) are merged into a single reassembly context,
producing a corrupted datagram.

Fold the protocol into the unused upper bits of the 32-bit id field
of the key. The IPv4 identification is 16 bits and occupies the low
half, so the protocol can be carried in the upper bits without changing
the key layout, the key comparison or the hash.

Fixes: cc8f4d020c0b ("examples/ip_reassembly: initial import")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/ip_frag/rte_ipv4_reassembly.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/ip_frag/rte_ipv4_reassembly.c b/lib/ip_frag/rte_ipv4_reassembly.c
index 3c8ae113ba..980f7a3b77 100644
--- a/lib/ip_frag/rte_ipv4_reassembly.c
+++ b/lib/ip_frag/rte_ipv4_reassembly.c
@@ -111,9 +111,15 @@ rte_ipv4_frag_reassemble_packet(struct rte_ip_frag_tbl *tbl,
 	ip_ofs = (uint16_t)(flag_offset & RTE_IPV4_HDR_OFFSET_MASK);
 	ip_flag = (uint16_t)(flag_offset & RTE_IPV4_HDR_MF_FLAG);
 
+	/*
+	 * RFC 791 requires using: source, destination, identifier field and protocol
+	 */
+
 	/* use first 8 bytes only */
 	memcpy(&key.src_dst[0], &ip_hdr->src_addr, 8);
-	key.id = ip_hdr->packet_id;
+
+	/* packet_id is 16 bits and proto id is 8 bits */
+	key.id = ((uint32_t) ip_hdr->next_proto_id << 16) | ip_hdr->packet_id;
 	key.key_len = IPV4_KEYLEN;
 
 	ip_ofs *= RTE_IPV4_HDR_OFFSET_UNITS;
-- 
2.53.0
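
The key packing described above can be sketched as a standalone function.
The function name is hypothetical; the expression is the one used in the
patch, with the 8-bit protocol placed above the 16-bit identification.

```c
#include <stdint.h>

/*
 * Hypothetical helper: fold the IPv4 protocol number into the upper bits
 * of the 32-bit id so that datagrams sharing an IP id but carrying
 * different protocols get distinct reassembly keys.
 */
static uint32_t
ipv4_frag_key_id(uint8_t proto, uint16_t packet_id)
{
	return ((uint32_t)proto << 16) | packet_id;
}
```

With the same identification 0x1234, UDP (protocol 17) yields 0x00111234
and ICMP (protocol 1) yields 0x00011234, so the two no longer share a
reassembly context.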



* [PATCH 2/6] ip_frag: discard datagrams with overlapping fragments
From: Stephen Hemminger @ 2026-06-16 21:05 UTC
  To: dev; +Cc: Stephen Hemminger, stable, Konstantin Ananyev
In-Reply-To: <20260616210656.464062-1-stephen@networkplumber.org>

Existing code does not handle overlapping fragments.

RFC 8200 (IPv6) requires that on overlap all reassembly is abandoned
and all received fragments are dropped. RFC 791 (IPv4) originally called
for trimming and rewriting, but Linux discards for IPv4 as well, since
overlap has no legitimate use and is a known attack vector.

Depends on the duplicate-tolerance change so that an exact duplicate is
dropped on its own rather than discarding the whole datagram.

Fixes: cc8f4d020c0b ("examples/ip_reassembly: initial import")
Cc: stable@dpdk.org

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/ip_frag/ip_frag_internal.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c
index 9a03ef995a..2505314a29 100644
--- a/lib/ip_frag/ip_frag_internal.c
+++ b/lib/ip_frag/ip_frag_internal.c
@@ -92,16 +92,34 @@ ip_frag_process(struct ip_frag_pkt *fp, struct rte_ip_frag_death_row *dr,
 	uint32_t i, idx;
 
 	/*
-	 * Discard an exact duplicate fragment. If a previously stored fragment
-	 * already covers the same offset and length, this fragment carries no
-	 * new data. Reassembly is tolerant of duplicates (RFC 791), so drop
-	 * only this mbuf and keep the reassembly entry intact rather than
-	 * treating it as an error. Fragments overlapping an existing one with
-	 * different bounds are not handled here.
+	 * Scan the fragments already collected for this datagram before
+	 * storing the new one. The stored set is kept free of duplicates and
+	 * overlaps, so a single pass is sufficient.
 	 */
 	for (i = 0; i != fp->last_idx; i++) {
-		if (fp->frags[i].mb != NULL && fp->frags[i].ofs == ofs &&
-				fp->frags[i].len == len) {
+		if (fp->frags[i].mb == NULL)
+			continue;
+
+		/*
+		 * Exact duplicate: carries no new data. Reassembly tolerates
+		 * duplicates (RFC 791), so drop only this mbuf and keep the
+		 * entry.
+		 */
+		if (fp->frags[i].ofs == ofs && fp->frags[i].len == len) {
+			IP_FRAG_MBUF2DR(dr, mb);
+			return NULL;
+		}
+
+		/*
+		 * Overlap with an existing fragment. Per RFC 8200 section 4.5
+		 * (and RFC 5722) the datagram must be discarded; the same is
+		 * applied to IPv4. Free all collected fragments, drop this one,
+		 * and invalidate the entry.
+		 */
+		if (ofs < fp->frags[i].ofs + fp->frags[i].len &&
+				fp->frags[i].ofs < ofs + len) {
+			ip_frag_free(fp, dr);
+			ip_frag_key_invalidate(&fp->key);
 			IP_FRAG_MBUF2DR(dr, mb);
 			return NULL;
 		}
-- 
2.53.0
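
The per-fragment decision made inside the loop above can be sketched as a
pure function. The enum and function names are hypothetical; the two tests
are the ones from the patch, applied in the same order (exact duplicate
first, then the half-open interval overlap test).

```c
#include <stdint.h>

/* Hypothetical classification of a new fragment against a stored one. */
enum frag_rel {
	FRAG_DISJOINT,   /* no shared bytes: store the new fragment */
	FRAG_DUPLICATE,  /* same offset and length: drop only the new mbuf */
	FRAG_OVERLAP,    /* partial overlap: discard the whole datagram */
};

static enum frag_rel
frag_classify(uint32_t old_ofs, uint32_t old_len, uint32_t ofs, uint32_t len)
{
	if (old_ofs == ofs && old_len == len)
		return FRAG_DUPLICATE;

	/* [ofs, ofs + len) intersects [old_ofs, old_ofs + old_len) */
	if (ofs < old_ofs + old_len && old_ofs < ofs + len)
		return FRAG_OVERLAP;

	return FRAG_DISJOINT;
}
```

Checking for the duplicate before the overlap matters: an exact duplicate
also satisfies the interval test, so reversing the order would discard the
whole datagram on a harmless retransmission.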



* [PATCH 1/6] ip_frag: tolerate duplicate fragments
From: Stephen Hemminger @ 2026-06-16 21:05 UTC
  To: dev; +Cc: Stephen Hemminger, stable, Samyak Jain, Konstantin Ananyev
In-Reply-To: <20260616210656.464062-1-stephen@networkplumber.org>

The reassembly code tracked only a running byte total and reserved slots
for the first and last fragments, with no check for a fragment
duplicating data already received. A single duplicate could destroy a
recoverable datagram:
 - a duplicate first or last fragment collided with the reserved slot and
   sent the whole entry down the error path, freeing every collected
   fragment;
 - a duplicate intermediate fragment was appended to a new slot, inflating
   frag_size past total_size so reassembly never completed.

RFC 791 reassembly tolerates duplicates: a fragment covering bytes
already present carries no new information. Check for an exact duplicate
(stored fragment with the same offset and length) and drop only that
mbuf, before frag_size is updated, leaving the entry's accounting
unchanged.

Overlapping fragments with differing bounds are a separate issue
addressed in the next patch.

Fixes: cc8f4d020c0b ("examples/ip_reassembly: initial import")
Cc: stable@dpdk.org
Reported-by: Samyak Jain <samyak.jain@amantyatech.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 lib/ip_frag/ip_frag_internal.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c
index 382f42d0e1..9a03ef995a 100644
--- a/lib/ip_frag/ip_frag_internal.c
+++ b/lib/ip_frag/ip_frag_internal.c
@@ -89,7 +89,23 @@ struct rte_mbuf *
 ip_frag_process(struct ip_frag_pkt *fp, struct rte_ip_frag_death_row *dr,
 	struct rte_mbuf *mb, uint16_t ofs, uint16_t len, uint16_t more_frags)
 {
-	uint32_t idx;
+	uint32_t i, idx;
+
+	/*
+	 * Discard an exact duplicate fragment. If a previously stored fragment
+	 * already covers the same offset and length, this fragment carries no
+	 * new data. Reassembly is tolerant of duplicates (RFC 791), so drop
+	 * only this mbuf and keep the reassembly entry intact rather than
+	 * treating it as an error. Fragments overlapping an existing one with
+	 * different bounds are not handled here.
+	 */
+	for (i = 0; i != fp->last_idx; i++) {
+		if (fp->frags[i].mb != NULL && fp->frags[i].ofs == ofs &&
+				fp->frags[i].len == len) {
+			IP_FRAG_MBUF2DR(dr, mb);
+			return NULL;
+		}
+	}
 
 	fp->frag_size += len;
 
-- 
2.53.0



* [PATCH 0/6] ip_frag: fix reassembly defects and add test
From: Stephen Hemminger @ 2026-06-16 21:05 UTC
  To: dev; +Cc: Stephen Hemminger

The IP reassembly library tracks only a running byte total and reserved
slots for the first and last fragments, with no coverage map. As a result
it mishandles duplicate, overlapping, oversized, and misheadered
fragments, and the IPv4 key is missing a field RFC 791 requires. There
was also no functional test to catch any of it.

These came out of reviewing a duplicate-fragment report on the list.

Patches 1 and 2 are interdependent: the overlap discard relies on the
duplicate handling so an exact duplicate is dropped on its own rather
than discarding the whole datagram. The rest are independent.

Patch 6 adds a functional test modeled on the Linux selftest ip_defrag.c.
It passes on this series; with any single fix reverted the matching case
fails.

Stephen Hemminger (6):
  ip_frag: tolerate duplicate fragments
  ip_frag: discard datagrams with overlapping fragments
  ip_frag: include protocol in IPv4 reassembly key
  ip_frag: drop IPv6 fragments with unexpected headers
  ip_frag: reject oversized reassembled datagrams
  app/test: add test for IP reassembly

 app/test/meson.build              |   1 +
 app/test/test_reassembly.c        | 644 ++++++++++++++++++++++++++++++
 lib/ip_frag/ip_frag_internal.c    |  36 +-
 lib/ip_frag/rte_ipv4_reassembly.c |  17 +-
 lib/ip_frag/rte_ipv6_reassembly.c |  22 +-
 5 files changed, 714 insertions(+), 6 deletions(-)
 create mode 100644 app/test/test_reassembly.c

-- 
2.53.0


* Re: DPDK release candidate 26.07-rc1
From: Riley Fletcher @ 2026-06-16 15:48 UTC
  To: Thomas Monjalon, dev
In-Reply-To: <EE7NFX47QzqKRvGYzs3QUA@monjalon.net>

IBM - Power Systems Testing
DPDK v26.07-rc1

* Build CI on Fedora 40, 41, 42 and 43 container images for ppc64le
* Basic PF on Mellanox: No issues found
* Performance: TestPMD throughput single core tests
* OS:- RHEL 10.2  kernel: 6.12.0-211.7.3.el10_2.ppc64le
         with gcc version 14.3.1 20251022 (Red Hat 14.3.1-4)

Systems tested:
  - LPARs on IBM Power11 CHRP IBM, 9105-22A
     NICs:
     - Mellanox Technologies MT4129 Family [ConnectX-7 100 GbE 2-Port]
     - Firmware Version: 28.48.1000 (MT_0000000834)
     - OFED 26.04-0.8.6

Regards,

Riley Fletcher

On Thu, 2026-06-11 at 04:52 +0200, Thomas Monjalon wrote:
> A new DPDK release candidate is ready for testing:
> 	https://git.dpdk.org/dpdk/tag/?id=v26.07-rc1
> 
> There are 432 new patches in this snapshot.
> 
> Release notes:
> 	https://doc.dpdk.org/guides/rel_notes/release_26_07.html
> 
> Highlights of 26.07-rc1:
> 	- option --pagesz-mem for per-page-size maximum
> 	- option --no-auto-probing for no device at init
> 	- less mempool cache misses in run-to-completion
> 	- peek style API for staged-ordered ring
> 	- RISC-V vector paths
> 	- PTP helpers and clock relay example
> 	- selective Rx API to save PCI bandwidth
> 	- vhost memory hotplug
> 	- pcap driver enhanced
> 	- LinkData sxe2 NIC driver
> 	- AI review helpers
> 
> Important note:
> Pipelined applications, where ethdev Rx and Tx run on separate
> lcores,
> should adapt to the new algorithm by doubling their configured
> mempool
> cache size, to avoid doubling their mempool cache miss rate.
> 
> Please test and report issues on bugs.dpdk.org.
> 
> We plan to release -rc2 in 2 weeks.
> 
> Thank you everyone
> 


* RE: [PATCH v2 6/6] net/dpaa2: drop the fake software VLAN strip offload
From: Hemant Agrawal @ 2026-06-16 15:40 UTC
  To: Stephen Hemminger, Maxime Leroy; +Cc: dev@dpdk.org, Sachin Saxena
In-Reply-To: <20260616071110.1baf1beb@phoenix.local>



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: 16 June 2026 19:41
> To: Maxime Leroy <maxime@leroys.fr>
> Cc: dev@dpdk.org; Hemant Agrawal <hemant.agrawal@nxp.com>; Sachin
> Saxena <sachin.saxena@nxp.com>
> Subject: Re: [PATCH v2 6/6] net/dpaa2: drop the fake software VLAN strip
> offload
> Importance: High
> 
> On Tue, 16 Jun 2026 12:27:26 +0200
> Maxime Leroy <maxime@leroys.fr> wrote:
> 
> > RTE_ETH_RX_OFFLOAD_VLAN_STRIP is advertised, but no hardware VLAN
> > strip backs it: when enabled, the Rx burst calls rte_vlan_strip() on
> > every frame, a software op masquerading as a hardware offload.
> >
> > It saves a forwarding application nothing: the datapath reads the L2
> > header anyway to classify or strip. The offload does not remove that
> > read, it relocates it into the driver Rx burst, where it is far more
> > expensive.
> >
> > The cost is a matter of timing. rte_vlan_strip() reaches the L2 header
> > through rte_pktmbuf_mtod(), which dereferences mbuf->buf_addr. On a
> > freshly recycled buffer that mbuf cacheline is cold. eth_fd_to_mbuf()
> > has just written other fields of it (data_off, ol_flags), but buf_addr
> > is a persistent field it does not rewrite. A write does not stall: it
> > posts to the store buffer while the line fills in the background, and
> > the rewritten fields are forwarded straight from there. buf_addr has
> > nothing to forward, so it must be read from the line, whose fill is
> > still in flight, and the read stalls. The ethertype read that follows,
> > on the cold payload line, stalls again. Read later by the application,
> > when the fill has completed, the same read hits. The offload just
> > performs it at the worst possible moment.
> >
> > Measured on a single-core port-to-port forwarding test over two 10G
> > ports (one core at 2 GHz, 64-byte untagged frames):
> >
> >   - throughput 4.22 -> 5.00 Mpps (+18 percent)
> >   - IPC 0.93 -> 1.25: the cost was memory stall, not compute
> >   - L3/DRAM-bound L2 refills 319M -> 200M over 10s (-37 percent)
> >
> > perf confirms it: with the offload, the buf_addr load (the cold mbuf
> > field) and the payload load account for about 84 percent of the Rx
> > burst's L2 refills; removing it, those vanish and only the inherent
> > DQRR dequeue misses remain.
> >
> > Stop advertising VLAN_STRIP and remove the rte_vlan_strip() calls from
> > every Rx path. This is a behavioural change: the tag is left in the
> > frame, so an application must strip it itself, on the L2 header it
> > already reads.
> >
> > Signed-off-by: Maxime Leroy <maxime@leroys.fr>
> 
> Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: Hemant Agrawal <hemant.agrawal@nxp.com>


* Re: [RFC 0/4] alternative capture mechanism
From: Stephen Hemminger @ 2026-06-16 14:28 UTC
  To: Bruce Richardson; +Cc: dev
In-Reply-To: <ajFDjhDc5hXhE8km@bricha3-mobl1.ger.corp.intel.com>

On Tue, 16 Jun 2026 13:37:34 +0100
Bruce Richardson <bruce.richardson@intel.com> wrote:

> On Tue, Jun 09, 2026 at 02:02:01PM -0700, Stephen Hemminger wrote:
> > This is an RFC for an alternative way to capture packets from a DPDK
> > application. I did a brief demo of a similar mechanism at the DPDK summit,
> > but this is more complete. Capture runs in the primary process and is driven
> > entirely over telemetry; no secondary process is involved.
> > 
> > A client asks the application to start capturing and passes it a file
> > descriptor to write to. The application writes pcapng to that descriptor.
> > A Wireshark extcap script is the intended front end, but the control path
> > is just telemetry and the output is just a pipe, so other front ends are
> > possible.
> > 
> >   1/4  telemetry: let a command receive file descriptors from the client
> >   2/4  capture: the library
> >   3/4  test: functional test
> >   4/4  app: the Wireshark extcap script and its documentation
> > 
> > Setup and usage are in doc/guides/tools/wireshark_extcap.rst.
> > 
> > Primary process only for now; secondary-process capture is possible as
> > a follow-on. Posting as RFC to get feedback on the approach.
> > 
> > The extcap script is dual licensed (BSD-3-Clause OR GPL-2.0-or-later) as
> > it may be more useful in the Wireshark tree.
> >   
> 
> One concern I have though - does this cause system calls to be made in the
> fast path because we are writing to a passed-in FD? For performance
> reasons, would it not be better to use a memory buffer for this, thereby
> avoiding syscalls? For example, rather than passing in an FD to telemetry,
> we could pass in a key to be passed to shmget (going old-school!), or
> name parameter for shm_open. Thereafter with the memory buffer we can use a
> circular ring or similar to pass the data from app to client.
> 
> /Bruce

The system calls are contained inside the thread spawned when capture starts.
The flow is:
         callback -> ring -> capture thread -> FIFO -> wireshark


* Re: [RFC 1/4] telemetry: allow commands to receive file descriptors
From: Stephen Hemminger @ 2026-06-16 14:26 UTC
  To: Bruce Richardson; +Cc: dev
In-Reply-To: <ajFCXCZfikPJTrLH@bricha3-mobl1.ger.corp.intel.com>

On Tue, 16 Jun 2026 13:32:28 +0100
Bruce Richardson <bruce.richardson@intel.com> wrote:

> On Tue, Jun 09, 2026 at 02:02:02PM -0700, Stephen Hemminger wrote:
> > Add rte_telemetry_register_cmd_fd_arg() to register a command whose
> > callback also receives file descriptors passed by the client as
> > SCM_RIGHTS ancillary data. The callback owns the descriptors and must
> > close them.
> > 
> > This lets a client open a file itself and hand the descriptor to the
> > primary process, so DPDK never opens the path. That avoids path and
> > permission problems and works across container filesystem namespaces.
> > 
> > Existing commands and clients are unaffected. If an unsolicited file
> > descriptor is passed, it is closed.
> >   
> 
> This scheme seems reasonable in general. My only concern is whether the
> lack of potential Windows support is an issue. For regular telemetry, there
> was always the option of a Windows implementation using regular
> TCP/UDP/SCTP sockets bound to localhost. However, AFAIK there is no Windows
> implementation of anything that supports passing file descriptors or handles
> between processes.
> 
> Some other pieces of feedback inline below.
> 
> /Bruce

I have a new version (in testing) that passes the filename as a parameter.
That should work without the fd passing.
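
For readers unfamiliar with the mechanism under discussion, descriptor
passing over a Unix domain socket with SCM_RIGHTS can be sketched as
below. This is a generic POSIX/Linux illustration, not the proposed
telemetry API: send_fd() and recv_fd() are hypothetical helper names, and
one dummy data byte is sent because the ancillary data needs a message to
ride on.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
	struct cmsghdr align;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one descriptor plus a single dummy byte. */
static int
send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one descriptor, or -1 if none arrived.
 * The caller owns the returned descriptor and must close it.
 */
static int
recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
			cmsg->cmsg_type != SCM_RIGHTS ||
			cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor refers to the same open file as the sender's, which
is why the receiving process never has to resolve a path itself.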


* Re: [PATCH v2] app/testpmd: add VLAN priority insert support
From: Stephen Hemminger @ 2026-06-16 14:23 UTC
  To: Xingui Yang
  Cc: dev, david.marchand, aman.deep.singh, fengchengwen, yangshuaisong,
	lihuisong, liuyonglong, kangfenglong
In-Reply-To: <20260616131001.2955655-1-yangxingui@huawei.com>

On Tue, 16 Jun 2026 21:10:01 +0800
Xingui Yang <yangxingui@huawei.com> wrote:

> The tx_vlan set and tx_qinq set commands only accepted VLAN ID in range
> [0, 4095]. This prevented users from setting 802.1p priority and CFI
> bits when using hardware VLAN insertion.
> 
> Since mbuf vlan_tci field already supports full 16-bit VLAN Tag Control
> Information (TCI), relax the validation for TX paths to allow priority
> and CFI bits. The vlan_id parameter now accepts:
>   - Bits 0-11:  VLAN ID (0-4095)
>   - Bit 12:    CFI (Canonical Format Indicator)
>   - Bits 13-15: Priority (0-7, 802.1p CoS)
> 
> Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
> Suggested-by: fengchengwen <fengchengwen@huawei.com>
> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
> ---
> v2:
> - Removed --enable-vlan-priority option and global variable as suggested
>   by Stephen Hemminger. The feature is now always enabled for TX paths
> - RX VLAN filter continues to enforce strict VLAN ID validation as
>   suggested by fengchengwen
> - Added documentation updates for testpmd_funcs.rst and release notes
> 
>  app/test-pmd/config.c                       | 13 ++++++++-----
>  doc/guides/rel_notes/release_26_07.rst      |  7 +++++++
>  doc/guides/testpmd_app_ug/testpmd_funcs.rst | 17 ++++++++++++++---
>  3 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> index 9d457ca88e..38758f9c05 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
> @@ -1241,8 +1241,11 @@ void print_valid_ports(void)
>  }
>  
>  static int
> -vlan_id_is_invalid(uint16_t vlan_id)
> +vlan_id_is_invalid(uint16_t vlan_id, bool is_tx)
>  {
> +	if (is_tx)
> +		return 0;
> +
>  	if (vlan_id < 4096)
>  		return 0;
>  	fprintf(stderr, "Invalid vlan_id %d (must be < 4096)\n", vlan_id);
> @@ -6876,7 +6879,7 @@ rx_vft_set(portid_t port_id, uint16_t vlan_id, int on)
>  
>  	if (port_id_is_invalid(port_id, ENABLED_WARN))
>  		return 1;
> -	if (vlan_id_is_invalid(vlan_id))
> +	if (vlan_id_is_invalid(vlan_id, false))
>  		return 1;
>  	diag = rte_eth_dev_vlan_filter(port_id, vlan_id, on);
>  	if (diag == 0)
> @@ -6923,7 +6926,7 @@ tx_vlan_set(portid_t port_id, uint16_t vlan_id)
>  	struct rte_eth_dev_info dev_info;
>  	int ret;
>  
> -	if (vlan_id_is_invalid(vlan_id))
> +	if (vlan_id_is_invalid(vlan_id, true))
>  		return;

Why have the is_tx flag if it is always passed as a constant?
Just remove the whole vlan_id_is_invalid() branch test in the transmit path.
Maybe add a comment that any VLAN is allowed on transmit?

Or make a new function, since a VLAN of 0xffff is reserved. Though you might
want to allow it, since testpmd is for testing even invalid packets.
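
The 16-bit TCI layout quoted in the patch description (bits 0-11 VLAN ID,
bit 12 CFI, bits 13-15 priority) can be sketched as below. The helper
names are hypothetical and are not part of testpmd or the mbuf API; they
only show how a value passed to "tx_vlan set" would decompose.

```c
#include <stdint.h>

/* Hypothetical helper: build a TCI from VLAN ID, CFI and 802.1p priority. */
static uint16_t
vlan_tci_make(uint16_t vid, uint8_t cfi, uint8_t prio)
{
	return (uint16_t)(((prio & 0x7u) << 13) |
			((cfi & 0x1u) << 12) |
			(vid & 0x0fffu));
}

/* Hypothetical helper: extract the 12-bit VLAN ID. */
static uint16_t
vlan_tci_vid(uint16_t tci)
{
	return tci & 0x0fffu;
}

/* Hypothetical helper: extract the 3-bit priority. */
static uint8_t
vlan_tci_prio(uint16_t tci)
{
	return (uint8_t)(tci >> 13);
}
```

For example, VLAN ID 100 with priority 5 and CFI 0 encodes as 0xA064,
which is above 4095 and is therefore rejected by the existing
"< 4096" check.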



* Re: [PATCH v2 6/6] net/dpaa2: drop the fake software VLAN strip offload
From: Stephen Hemminger @ 2026-06-16 14:11 UTC
  To: Maxime Leroy; +Cc: dev, Hemant Agrawal, Sachin Saxena
In-Reply-To: <20260616102727.708948-7-maxime@leroys.fr>

On Tue, 16 Jun 2026 12:27:26 +0200
Maxime Leroy <maxime@leroys.fr> wrote:

> RTE_ETH_RX_OFFLOAD_VLAN_STRIP is advertised, but no hardware VLAN strip
> backs it: when enabled, the Rx burst calls rte_vlan_strip() on every
> frame, a software op masquerading as a hardware offload.
> 
> It saves a forwarding application nothing: the datapath reads the L2
> header anyway to classify or strip. The offload does not remove that
> read, it relocates it into the driver Rx burst, where it is far more
> expensive.
> 
> The cost is a matter of timing. rte_vlan_strip() reaches the L2 header
> through rte_pktmbuf_mtod(), which dereferences mbuf->buf_addr. On a
> freshly recycled buffer that mbuf cacheline is cold. eth_fd_to_mbuf()
> has just written other fields of it (data_off, ol_flags), but buf_addr
> is a persistent field it does not rewrite. A write does not stall: it
> posts to the store buffer while the line fills in the background, and
> the rewritten fields are forwarded straight from there. buf_addr has
> nothing to forward, so it must be read from the line, whose fill is
> still in flight, and the read stalls. The ethertype read that follows,
> on the cold payload line, stalls again. Read later by the application,
> when the fill has completed, the same read hits. The offload just
> performs it at the worst possible moment.
> 
> Measured on a single-core port-to-port forwarding test over two 10G
> ports (one core at 2 GHz, 64-byte untagged frames):
> 
>   - throughput 4.22 -> 5.00 Mpps (+18 percent)
>   - IPC 0.93 -> 1.25: the cost was memory stall, not compute
>   - L3/DRAM-bound L2 refills 319M -> 200M over 10s (-37 percent)
> 
> perf confirms it: with the offload, the buf_addr load (the cold mbuf
> field) and the payload load account for about 84 percent of the Rx
> burst's L2 refills; removing it, those vanish and only the inherent DQRR
> dequeue misses remain.
> 
> Stop advertising VLAN_STRIP and remove the rte_vlan_strip() calls from
> every Rx path. This is a behavioural change: the tag is left in the
> frame, so an application must strip it itself, on the L2 header it
> already reads.
> 
> Signed-off-by: Maxime Leroy <maxime@leroys.fr>

Acked-by: Stephen Hemminger <stephen@networkplumber.org>


* Re: [PATCH 2/2] ethdev: return 0 from dummy queue count
From: Stephen Hemminger @ 2026-06-16 14:07 UTC
  To: Maxime Leroy
  Cc: dev, stable, Thomas Monjalon, Andrew Rybchenko, Sunil Kumar Kori,
	Morten Brørup
In-Reply-To: <20260616094259.686555-3-maxime@leroys.fr>

On Tue, 16 Jun 2026 11:42:58 +0200
Maxime Leroy <maxime@leroys.fr> wrote:

> The dummy rx_queue_count/tx_queue_count callback returned -ENOTSUP. On a
> port that is not started (freshly allocated, or stopped once the fast-path
> ops are reset to dummies) there are no packets queued, so the truthful
> answer is 0, not an error: querying the count is not an unsupported
> operation. This also matches the dummy Rx/Tx burst, which reports 0
> packets.
> 
> A poll-mode worker checking rte_eth_rx_queue_count() across a concurrent
> port stop then sees an empty queue instead of a negative error.
> 
> Fixes: 066f3d9cc21c ("ethdev: remove callback checks from fast path")
> Cc: stable@dpdk.org
> 
> Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Maxime Leroy <maxime@leroys.fr>

Acked-by: Stephen Hemminger <stephen@networkplumber.org>


* Re: [PATCH] app/testpmd: add VLAN priority insert support
From: yangxingui @ 2026-06-16 13:19 UTC
  To: Stephen Hemminger
  Cc: dev, david.marchand, aman.deep.singh, fengchengwen, yangshuaisong,
	lihuisong, liuyonglong, kangfenglong
In-Reply-To: <20260615121214.6fb7d8b7@phoenix.local>



On 2026/6/16 3:12, Stephen Hemminger wrote:
> On Fri, 12 Jun 2026 16:14:11 +0800
> Xingui Yang <yangxingui@huawei.com> wrote:
> 
>> The tx_vlan set command currently only accepts a VLAN ID in range
>> [0, 4095].  This patch adds support for an extended format that includes
>> 802.1p priority and CFI bits, allowing users to set the VLAN priority
>> tag when inserting VLAN headers in TX packets.
>>
>> The extended format is:
>>    bit 0-11:  VLAN ID (0-4095)
>>    bit 12:    CFI (Canonical Format Indicator)
>>    bit 13-15: Priority (0-7, 802.1p CoS)
>>
>> This is consistent with the VLAN tag structure used by
>> rte_eth_dev_set_vlan_pvid() where the PVID field encodes VLAN ID, CFI
>> and priority in the same format.
>>
>> A new command line option --enable-vlan-priority is added to enable this
>> feature. By default, the feature is disabled to maintain backward
>> compatibility with existing users. When enabled, the
>> vlan_id_is_invalid() function allows any 16-bit value to pass, while the
>> full 16-bit value (including CFI and priority bits) is passed to the
>> driver for hardware VLAN insertion.
>>
>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>> ---
> 
> 
> Having the ability to set priority bits is good, and testpmd should allow it.
> The mbuf vlan_tci is already a full 16-bit TCI (priority/CFI/VID), and
> the TX insert path copies tx_vlan_id straight into it.  So priority
> insert already works; the only thing in the way is the < 4096 check.
> 
> Do you actually need a new option for this?  Both of_push_vlan +
> of_set_vlan_pcp (rte_flow) and "tx_vlan set pvid" already let you set
> the priority bits today, with no new code.
> 
> If you still want "tx_vlan set" itself to carry priority, I'd suggest
> a smaller change: relax only the TX insert validators and drop the
> option and the global.  Don't touch rx_vft_set -- it feeds the VLAN
> filter, which only takes a VLAN ID and rejects > 4095 anyway, so the
> flag just turns a clear error into a confusing one.
> 
> Either way, if the option stays, please document it and add a release note.
> The commit message should explain why the existing paths aren't enough.

Thank you for the suggestion. I have implemented the simpler approach in v2.

Thanks,
Xingui

^ permalink raw reply

* Re: [PATCH] app/testpmd: add VLAN priority insert support
From: yangxingui @ 2026-06-16 13:17 UTC (permalink / raw)
  To: fengchengwen, dev
  Cc: stephen, david.marchand, aman.deep.singh, yangshuaisong,
	lihuisong, liuyonglong, kangfenglong
In-Reply-To: <7ead3834-afaa-4319-83bf-faf19b3ea3ef@huawei.com>



On 2026/6/15 17:46, fengchengwen wrote:
> On 6/12/2026 4:14 PM, Xingui Yang wrote:
>> The tx_vlan set command currently only accepts a VLAN ID in range
>> [0, 4095].  This patch adds support for an extended format that includes
>> 802.1p priority and CFI bits, allowing users to set the VLAN priority
>> tag when inserting VLAN headers in TX packets.
>>
>> The extended format is:
>>    bit 0-11:  VLAN ID (0-4095)
>>    bit 12:    CFI (Canonical Format Indicator)
>>    bit 13-15: Priority (0-7, 802.1p CoS)
>>
>> This is consistent with the VLAN tag structure used by
>> rte_eth_dev_set_vlan_pvid() where the PVID field encodes VLAN ID, CFI
>> and priority in the same format.
>>
>> A new command line option --enable-vlan-priority is added to enable this
>> feature. By default, the feature is disabled to maintain backward
>> compatibility with existing users. When enabled, the
>> vlan_id_is_invalid() function allows any 16-bit value to pass, while the
>> full 16-bit value (including CFI and priority bits) is passed to the
>> driver for hardware VLAN insertion.
>>
>> Signed-off-by: Xingui Yang <yangxingui@huawei.com>
>> ---
>>   app/test-pmd/config.c     | 24 +++++++++++++++---------
>>   app/test-pmd/parameters.c |  6 ++++++
>>   app/test-pmd/testpmd.c    |  5 +++++
>>   app/test-pmd/testpmd.h    |  2 ++
>>   4 files changed, 28 insertions(+), 9 deletions(-)
>>
>> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
>> index 36b9b023e2..80cde109e6 100644
>> --- a/app/test-pmd/config.c
>> +++ b/app/test-pmd/config.c
>> @@ -1241,12 +1241,18 @@ void print_valid_ports(void)
>>   }
>>   
>>   static int
>> -vlan_id_is_invalid(uint16_t vlan_id)
>> +vlan_id_is_invalid(uint16_t vlan_id, int vlan_priority_ena)
>>   {
>> -	if (vlan_id < 4096)
>> -		return 0;
>> -	fprintf(stderr, "Invalid vlan_id %d (must be < 4096)\n", vlan_id);
>> -	return 1;
>> +	if (!vlan_priority_ena && vlan_id >= 4096) {
>> +		fprintf(stderr, "Invalid vlan_id %d (must be < 4096)\n", vlan_id);
>> +		return 1;
>> +	}
>> +
>> +	/*
>> +	 * When vlan_priority_ena is enabled, allow any 16-bit value
>> +	 * to pass priority and CFI bits to the driver.
>> +	 */
>> +	return 0;
>>   }
>>   
>>   static uint32_t
>> @@ -6876,7 +6882,7 @@ rx_vft_set(portid_t port_id, uint16_t vlan_id, int on)
>>   
>>   	if (port_id_is_invalid(port_id, ENABLED_WARN))
>>   		return 1;
>> -	if (vlan_id_is_invalid(vlan_id))
>> +	if (vlan_id_is_invalid(vlan_id, vlan_priority_insert_ena))
> 
> Just use vlan_id_is_invalid(vlan_id, false) here, because the Rx path does not need to implement this.
> 
>>   		return 1;
>>   	diag = rte_eth_dev_vlan_filter(port_id, vlan_id, on);
>>   	if (diag == 0)
>> @@ -6923,7 +6929,7 @@ tx_vlan_set(portid_t port_id, uint16_t vlan_id)
>>   	struct rte_eth_dev_info dev_info;
>>   	int ret;
>>   
>> -	if (vlan_id_is_invalid(vlan_id))
>> +	if (vlan_id_is_invalid(vlan_id, vlan_priority_insert_ena))
>>   		return;
>>   
>>   	if (ports[port_id].dev_conf.txmode.offloads &
>> @@ -6954,9 +6960,9 @@ tx_qinq_set(portid_t port_id, uint16_t vlan_id, uint16_t vlan_id_outer)
>>   	struct rte_eth_dev_info dev_info;
>>   	int ret;
>>   
>> -	if (vlan_id_is_invalid(vlan_id))
>> +	if (vlan_id_is_invalid(vlan_id, vlan_priority_insert_ena))
>>   		return;
>> -	if (vlan_id_is_invalid(vlan_id_outer))
>> +	if (vlan_id_is_invalid(vlan_id_outer, vlan_priority_insert_ena))
>>   		return;
>>   
>>   	ret = eth_dev_info_get_print_err(port_id, &dev_info);
>> diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
>> index 8c3b1244e7..3f37498d3b 100644
>> --- a/app/test-pmd/parameters.c
>> +++ b/app/test-pmd/parameters.c
>> @@ -117,6 +117,8 @@ enum {
>>   	TESTPMD_OPT_ENABLE_HW_VLAN_EXTEND_NUM,
>>   #define TESTPMD_OPT_ENABLE_HW_QINQ_STRIP "enable-hw-qinq-strip"
>>   	TESTPMD_OPT_ENABLE_HW_QINQ_STRIP_NUM,
>> +#define TESTPMD_OPT_ENABLE_VLAN_PRIORITY "enable-vlan-priority"
>> +	TESTPMD_OPT_ENABLE_VLAN_PRIORITY_NUM,
> 
> How about TESTPMD_OPT_ENABLE_VLAN_INSERT_PRI "enable-vlan-insert-pri"

I have adopted the simpler approach as suggested by Stephen.

>>   #define TESTPMD_OPT_ENABLE_DROP_EN "enable-drop-en"
>>   	TESTPMD_OPT_ENABLE_DROP_EN_NUM,
>>   #define TESTPMD_OPT_DISABLE_RSS "disable-rss"
>> @@ -461,6 +463,7 @@ usage(char* progname)
>>   	printf("  --enable-hw-vlan-strip: enable hardware vlan strip.\n");
>>   	printf("  --enable-hw-vlan-extend: enable hardware vlan extend.\n");
>>   	printf("  --enable-hw-qinq-strip: enable hardware qinq strip.\n");
>> +	printf("  --enable-vlan-priority: enable vlan priority insert.\n");
>>   	printf("  --enable-drop-en: enable per queue packet drop.\n");
>>   	printf("  --disable-rss: disable rss.\n");
>>   	printf("  --enable-rss: Force rss even for single-queue operation.\n");
>> @@ -1259,6 +1262,9 @@ launch_args_parse(int argc, char** argv)
>>   		case TESTPMD_OPT_ENABLE_HW_QINQ_STRIP_NUM:
>>   			rx_offloads |= RTE_ETH_RX_OFFLOAD_QINQ_STRIP;
>>   			break;
>> +		case TESTPMD_OPT_ENABLE_VLAN_PRIORITY_NUM:
>> +			vlan_priority_insert_ena = 1;
> 
> How about tx_insert_vlan_pri_en
> 
>> +			break;

I have adopted the simpler approach as suggested by Stephen, which 
eliminates the need for a new command-line option and global variable.

> 
> We also need to update the testpmd documentation.

OK.

Thanks,
Xingui



^ permalink raw reply

* [PATCH v2] app/testpmd: add VLAN priority insert support
From: Xingui Yang @ 2026-06-16 13:10 UTC (permalink / raw)
  To: dev
  Cc: stephen, david.marchand, aman.deep.singh, fengchengwen,
	yangshuaisong, lihuisong, liuyonglong, kangfenglong
In-Reply-To: <20260612081411.2798403-1-yangxingui@huawei.com>

The tx_vlan set and tx_qinq set commands accepted only a VLAN ID in the
range [0, 4095]. This prevented users from setting the 802.1p priority and
CFI bits when using hardware VLAN insertion.

Since the mbuf vlan_tci field already holds the full 16-bit VLAN Tag Control
Information (TCI), relax the validation for TX paths to allow priority
and CFI bits. The vlan_id parameter now accepts:
  - Bits 0-11:  VLAN ID (0-4095)
  - Bit 12:    CFI (Canonical Format Indicator)
  - Bits 13-15: Priority (0-7, 802.1p CoS)

Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Suggested-by: fengchengwen <fengchengwen@huawei.com>
Signed-off-by: Xingui Yang <yangxingui@huawei.com>
---
v2:
- Removed --enable-vlan-priority option and global variable as suggested
  by Stephen Hemminger. The feature is now always enabled for TX paths
- RX VLAN filter continues to enforce strict VLAN ID validation as
  suggested by fengchengwen
- Added documentation updates for testpmd_funcs.rst and release notes

 app/test-pmd/config.c                       | 13 ++++++++-----
 doc/guides/rel_notes/release_26_07.rst      |  7 +++++++
 doc/guides/testpmd_app_ug/testpmd_funcs.rst | 17 ++++++++++++++---
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 9d457ca88e..38758f9c05 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1241,8 +1241,11 @@ void print_valid_ports(void)
 }
 
 static int
-vlan_id_is_invalid(uint16_t vlan_id)
+vlan_id_is_invalid(uint16_t vlan_id, bool is_tx)
 {
+	if (is_tx)
+		return 0;
+
 	if (vlan_id < 4096)
 		return 0;
 	fprintf(stderr, "Invalid vlan_id %d (must be < 4096)\n", vlan_id);
@@ -6876,7 +6879,7 @@ rx_vft_set(portid_t port_id, uint16_t vlan_id, int on)
 
 	if (port_id_is_invalid(port_id, ENABLED_WARN))
 		return 1;
-	if (vlan_id_is_invalid(vlan_id))
+	if (vlan_id_is_invalid(vlan_id, false))
 		return 1;
 	diag = rte_eth_dev_vlan_filter(port_id, vlan_id, on);
 	if (diag == 0)
@@ -6923,7 +6926,7 @@ tx_vlan_set(portid_t port_id, uint16_t vlan_id)
 	struct rte_eth_dev_info dev_info;
 	int ret;
 
-	if (vlan_id_is_invalid(vlan_id))
+	if (vlan_id_is_invalid(vlan_id, true))
 		return;
 
 	if (ports[port_id].dev_conf.txmode.offloads &
@@ -6954,9 +6957,9 @@ tx_qinq_set(portid_t port_id, uint16_t vlan_id, uint16_t vlan_id_outer)
 	struct rte_eth_dev_info dev_info;
 	int ret;
 
-	if (vlan_id_is_invalid(vlan_id))
+	if (vlan_id_is_invalid(vlan_id, true))
 		return;
-	if (vlan_id_is_invalid(vlan_id_outer))
+	if (vlan_id_is_invalid(vlan_id_outer, true))
 		return;
 
 	ret = eth_dev_info_get_print_err(port_id, &dev_info);
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index 5d7aa8d1bf..e382c7f407 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -150,6 +150,13 @@ New Features
   * Added ``eof`` devarg to use link state to signal end of receive file input.
   * Added unit test suite.
 
+* **Added VLAN priority support in testpmd.**
+
+  Added support for setting VLAN priority and CFI bits in ``tx_vlan set``
+  and ``tx_qinq set`` commands. The ``vlan_tci`` parameter now accepts the
+  full 16-bit VLAN Tag Control Information (TCI) format, which includes
+  priority (bits 13-15), CFI (bit 12), and VLAN ID (bits 0-11).
+
 * **Added AI review helpers.**
 
   Added AGENTS.md file for AI review
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index f0f2b0758b..b967810b10 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -1120,15 +1120,26 @@ tx_vlan set
 
 Set hardware insertion of VLAN IDs in packets sent on a port::
 
-   testpmd> tx_vlan set (port_id) vlan_id[, vlan_id_outer]
+   testpmd> tx_vlan set (port_id) vlan_tci[, vlan_tci_outer]
+
+The ``vlan_tci`` parameter accepts the full 16-bit VLAN Tag Control Information (TCI)
+format, which includes:
+
+* Bits 0-11:  VLAN ID (0-4095)
+* Bit 12:    CFI (Canonical Format Indicator)
+* Bits 13-15: Priority (0-7, 802.1p CoS)
 
 For example, set a single VLAN ID (5) insertion on port 0::
 
-   tx_vlan set 0 5
+   testpmd> tx_vlan set 0 5
+
+Or, set a VLAN ID with priority (priority=3, VLAN ID=6) insertion on port 0::
+
+   testpmd> tx_vlan set 0 0x6006
 
 Or, set double VLAN ID (inner: 2, outer: 3) insertion on port 1::
 
-   tx_vlan set 1 2 3
+   testpmd> tx_vlan set 1 2 3
 
 
 tx_vlan set pvid
-- 
2.43.0
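
For readers following the bit layout described in the patch above, here is a minimal sketch of how a 16-bit TCI value can be composed and split. The helper names (tci_make, tci_vid, tci_cfi, tci_pcp) and the macros are hypothetical and are not part of testpmd or DPDK; only the field positions (VID in bits 0-11, CFI in bit 12, priority in bits 13-15) come from the patch.

```c
#include <stdint.h>

/* Field layout of the 16-bit VLAN Tag Control Information word. */
#define TCI_VID_MASK  0x0fffu  /* bits 0-11: VLAN ID */
#define TCI_CFI_SHIFT 12       /* bit 12: CFI */
#define TCI_PCP_SHIFT 13       /* bits 13-15: 802.1p priority */

/* Hypothetical helper: pack priority, CFI and VLAN ID into one TCI. */
static inline uint16_t
tci_make(unsigned int pcp, unsigned int cfi, unsigned int vid)
{
	return (uint16_t)(((pcp & 0x7u) << TCI_PCP_SHIFT) |
			  ((cfi & 0x1u) << TCI_CFI_SHIFT) |
			  (vid & TCI_VID_MASK));
}

/* Hypothetical helpers: extract the individual fields again. */
static inline unsigned int
tci_vid(uint16_t tci)
{
	return tci & TCI_VID_MASK;
}

static inline unsigned int
tci_cfi(uint16_t tci)
{
	return (tci >> TCI_CFI_SHIFT) & 0x1u;
}

static inline unsigned int
tci_pcp(uint16_t tci)
{
	return (tci >> TCI_PCP_SHIFT) & 0x7u;
}
```

With these helpers, tci_make(3, 0, 6) gives 0x6006, the value used in the documentation example "tx_vlan set 0 0x6006".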


^ permalink raw reply related

* Re: [v3] net/cksum: compute raw cksum for several segments
From: su sai @ 2026-06-16 12:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev
In-Reply-To: <20260608100202.0deac83d@phoenix.local>

Hi Stephen, I've revised the patch per your feedback and sent out v4:
net/cksum: compute raw cksum for several segments. Your review is
appreciated.

On Tue, Jun 9, 2026 at 1:02 AM Stephen Hemminger <stephen@networkplumber.org>
wrote:

> On Mon,  4 Aug 2025 11:54:30 +0800
> Su Sai <spiderdetective.ss@gmail.com> wrote:
>
> > The rte_raw_cksum_mbuf function is used to compute
> > the raw checksum of a packet.
> > If the packet payload stored in multi mbuf, the function
> > will goto the hard case. In hard case,
> > the variable 'tmp' is a type of uint32_t,
> > so rte_bswap16 will drop high 16 bit.
> > Meanwhile, the variable 'sum' is a type of uint32_t,
> > so 'sum += tmp' will drop the carry when overflow.
> > Both drop will make cksum incorrect.
> > This commit fixes the above bug.
> >
> > Signed-off-by: Su Sai <spiderdetective.ss@gmail.com>
> > ---
> >  .mailmap              |   1 +
> >  app/test/test_cksum.c | 106 ++++++++++++++++++++++++++++++++++++++++++
> >  lib/net/rte_cksum.h   |  27 +++++++++--
> >  3 files changed, 130 insertions(+), 4 deletions(-)
> >
> > diff --git a/.mailmap b/.mailmap
> > index 34a99f93a1..1da1d9f8e1 100644
> > --- a/.mailmap
> > +++ b/.mailmap
> > @@ -1552,6 +1552,7 @@ Sunil Kumar Kori <skori@marvell.com> <sunil.kori@nxp.com>
> >  Sunil Pai G <sunil.pai.g@intel.com>
> >  Sunil Uttarwar <sunilprakashrao.uttarwar@amd.com>
> >  Sun Jiajia <sunx.jiajia@intel.com>
> > +Su Sai <spiderdetective.ss@gmail.com> <susai.ss@bytedance.com>
> >  Sunyang Wu <sunyang.wu@jaguarmicro.com>
> >  Surabhi Boob <surabhi.boob@intel.com>
> >  Suyang Ju <sju@paloaltonetworks.com>
> > diff --git a/app/test/test_cksum.c b/app/test/test_cksum.c
> > index f2ab5af5a7..fb2e3cf9e6 100644
> > --- a/app/test/test_cksum.c
> > +++ b/app/test/test_cksum.c
> > @@ -85,6 +85,42 @@ static const char test_cksum_ipv4_opts_udp[] = {
> >       0x00, 0x35, 0x00, 0x09, 0x89, 0x6f, 0x78,
> >  };
> >
> > +/*
> > + * generated in scapy with
> > + * Ether()/IP()/TCP(options=[NOP,NOP,Timestamps])/os.urandom(113))
> > + */
> > +static const char test_cksum_ipv4_tcp_multi_segs[] = {
> > +     0x00, 0x16, 0x3e, 0x0b, 0x6b, 0xd2, 0xee, 0xff,
> > +     0xff, 0xff, 0xff, 0xff, 0x08, 0x00, 0x45, 0x00,
> > +     0x00, 0xa5, 0x46, 0x10, 0x40, 0x00, 0x40, 0x06,
> > +     0x80, 0xb5, 0xc0, 0xa8, 0xf9, 0x1d, 0xc0, 0xa8,
> > +     0xf9, 0x1e, 0xdc, 0xa2, 0x14, 0x51, 0xbb, 0x8f,
> > +     0xa0, 0x00, 0xe4, 0x7c, 0xe4, 0xb8, 0x80, 0x10,
> > +     0x02, 0x00, 0x4b, 0xc1, 0x00, 0x00, 0x01, 0x01,
> > +     0x08, 0x0a, 0x90, 0x60, 0xf4, 0xff, 0x03, 0xc5,
> > +     0xb4, 0x19, 0x77, 0x34, 0xd4, 0xdc, 0x84, 0x86,
> > +     0xff, 0x44, 0x09, 0x63, 0x36, 0x2e, 0x26, 0x9b,
> > +     0x90, 0x70, 0xf2, 0xed, 0xc8, 0x5b, 0x87, 0xaa,
> > +     0xb4, 0x67, 0x6b, 0x32, 0x3d, 0xc4, 0xbf, 0x15,
> > +     0xa9, 0x16, 0x6c, 0x2a, 0x9d, 0xb2, 0xb7, 0x6b,
> > +     0x58, 0x44, 0x58, 0x12, 0x4b, 0x8f, 0xe5, 0x12,
> > +     0x11, 0x90, 0x94, 0x68, 0x37, 0xad, 0x0a, 0x9b,
> > +     0xd6, 0x79, 0xf2, 0xb7, 0x31, 0xcf, 0x44, 0x22,
> > +     0xc8, 0x99, 0x3f, 0xe5, 0xe7, 0xac, 0xc7, 0x0b,
> > +     0x86, 0xdf, 0xda, 0xed, 0x0a, 0x0f, 0x86, 0xd7,
> > +     0x48, 0xe2, 0xf1, 0xc2, 0x43, 0xed, 0x47, 0x3a,
> > +     0xea, 0x25, 0x2d, 0xd6, 0x60, 0x38, 0x30, 0x07,
> > +     0x28, 0xdd, 0x1f, 0x0c, 0xdd, 0x7b, 0x7c, 0xd9,
> > +     0x35, 0x9d, 0x14, 0xaa, 0xc6, 0x35, 0xd1, 0x03,
> > +     0x38, 0xb1, 0xf5,
> > +};
> > +
> > +static const uint8_t test_cksum_ipv4_tcp_multi_segs_len[] = {
> > +     66,  /* the first seg contains all headers, including L2 to L4 */
> > +     61,  /* the second seg length is odd, test byte order independent */
> > +     52,  /* three segs are sufficient to test the most complex scenarios */
> > +};
> > +
> >  /* test l3/l4 checksum api */
> >  static int
> >  test_l4_cksum(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len)
> > @@ -223,6 +259,70 @@ test_l4_cksum(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len)
> >       return -1;
> >  }
> >
> > +/* test l4 checksum api for a packet with multiple mbufs */
> > +static int
> > +test_l4_cksum_multi_mbufs(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len,
> > +                          const uint8_t *segs, size_t segs_len)
> > +{
> > +     struct rte_mbuf *m[NB_MBUF] = {0};
> > +     struct rte_mbuf *m_hdr = NULL;
> > +     struct rte_net_hdr_lens hdr_lens;
> > +     size_t i, off = 0;
> > +     uint32_t packet_type, l3;
> > +     void *l3_hdr;
> > +     char *data;
> > +
> > +     for (i = 0; i < segs_len; i++) {
> > +             m[i] = rte_pktmbuf_alloc(pktmbuf_pool);
> > +             if (m[i] == NULL)
> > +                     GOTO_FAIL("Cannot allocate mbuf");
> > +
> > +             data = rte_pktmbuf_append(m[i], segs[i]);
> > +             if (data == NULL)
> > +                     GOTO_FAIL("Cannot append data");
> > +
> > +             rte_memcpy(data, pktdata + off, segs[i]);
>
> Tests (except rte_memcpy test) should not use rte_memcpy, instead use
> regular memcpy which has better coverage from analyzers.
>
> > +             off += segs[i];
> > +
> > +             if (m_hdr) {
> > +                     if (rte_pktmbuf_chain(m_hdr, m[i]))
> > +                             GOTO_FAIL("Cannot chain mbuf");
> > +             } else {
> > +                     m_hdr = m[i];
> > +             }
> > +     }
> > +
> > +     if (off != len)
> > +             GOTO_FAIL("Invalid segs");
> > +
> > +     packet_type = rte_net_get_ptype(m_hdr, &hdr_lens, RTE_PTYPE_ALL_MASK);
> > +     l3 = packet_type & RTE_PTYPE_L3_MASK;
> > +
> > +     l3_hdr = rte_pktmbuf_mtod_offset(m_hdr, void *, hdr_lens.l2_len);
> > +     off = hdr_lens.l2_len + hdr_lens.l3_len;
> > +
> > +     if (l3 == RTE_PTYPE_L3_IPV4 || l3 == RTE_PTYPE_L3_IPV4_EXT) {
> > +             if (rte_ipv4_udptcp_cksum_mbuf_verify(m_hdr, l3_hdr, off) != 0)
> > +                     GOTO_FAIL("Invalid L4 checksum verification for multiple mbufs");
> > +     } else if (l3 == RTE_PTYPE_L3_IPV6 || l3 == RTE_PTYPE_L3_IPV6_EXT) {
> > +             if (rte_ipv6_udptcp_cksum_mbuf_verify(m_hdr, l3_hdr, off) != 0)
> > +                     GOTO_FAIL("Invalid L4 checksum verification for multiple mbufs");
> > +     }
> > +
> > +     for (i = 0; i < segs_len; i++)
> > +             rte_pktmbuf_free(m[i]);
>
> Can avoid the loop here and elsewhere by using rte_pktmbuf_free_bulk()
>
> > +     return 0;
> > +
> > +fail:
> > +     for (i = 0; i < segs_len; i++) {
> > +             if (m[i])
> > +                     rte_pktmbuf_free(m[i]);
> > +     }
>
> Freebulk will work here
>
> > +     return -1;
> > +}
> > +
> >  static int
> >  test_cksum(void)
> >  {
> > @@ -256,6 +356,12 @@ test_cksum(void)
> >                         sizeof(test_cksum_ipv4_opts_udp)) < 0)
> >               GOTO_FAIL("checksum error on ipv4_opts_udp");
> >
> > +     if (test_l4_cksum_multi_mbufs(pktmbuf_pool, test_cksum_ipv4_tcp_multi_segs,
> > +                       sizeof(test_cksum_ipv4_tcp_multi_segs),
> > +                       test_cksum_ipv4_tcp_multi_segs_len,
> > +                       sizeof(test_cksum_ipv4_tcp_multi_segs_len)) < 0)
> > +             GOTO_FAIL("checksum error on multi mbufs check");
> > +
> >       rte_mempool_free(pktmbuf_pool);
> >
> >       return 0;
> > diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
> > index a8e8927952..679ba82eb6 100644
> > --- a/lib/net/rte_cksum.h
> > +++ b/lib/net/rte_cksum.h
> > @@ -80,6 +80,25 @@ __rte_raw_cksum_reduce(uint32_t sum)
> >       return (uint16_t)sum;
> >  }
> >
> > +/**
> > + * @internal Reduce a sum to the non-complemented checksum.
> > + * Helper routine for the rte_raw_cksum_mbuf().
> > + *
> > + * @param sum
> > + *   Value of the sum.
> > + * @return
> > + *   The non-complemented checksum.
> > + */
> > +static inline uint16_t
> > +__rte_raw_cksum_reduce_u64(uint64_t sum)
> > +{
> > +     uint32_t tmp;
> > +
> > +     tmp = __rte_raw_cksum_reduce((uint32_t)sum);
> > +     tmp += __rte_raw_cksum_reduce((uint32_t)(sum >> 32));
> > +     return __rte_raw_cksum_reduce(tmp);
> > +}
> > +
> >  /**
> >   * Process the non-complemented checksum of a buffer.
> >   *
> > @@ -119,8 +138,8 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
> >  {
> >       const struct rte_mbuf *seg;
> >       const char *buf;
> > -     uint32_t sum, tmp;
> > -     uint32_t seglen, done;
> > +     uint32_t seglen, done, tmp;
> > +     uint64_t sum;
> >
> >       /* easy case: all data in the first segment */
> >       if (off + len <= rte_pktmbuf_data_len(m)) {
> > @@ -157,7 +176,7 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
> >       for (;;) {
> >               tmp = __rte_raw_cksum(buf, seglen, 0);
> >               if (done & 1)
> > -                     tmp = rte_bswap16((uint16_t)tmp);
> > +                     tmp = rte_bswap32(tmp);
> >               sum += tmp;
> >               done += seglen;
> >               if (done == len)
> > @@ -169,7 +188,7 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
> >                       seglen = len - done;
> >       }
> >
> > -     *cksum = __rte_raw_cksum_reduce(sum);
> > +     *cksum = __rte_raw_cksum_reduce_u64(sum);
> >       return 0;
> >  }
> >
>
>

^ permalink raw reply

* [v4] net/cksum: compute raw cksum for several segments
From: Su Sai @ 2026-06-16 12:38 UTC (permalink / raw)
  To: stephen; +Cc: dev, spiderdetective.ss
In-Reply-To: <20260608100202.0deac83d@phoenix.local>


rte_raw_cksum_mbuf() computes the raw checksum of a packet.
If the packet data spans multiple mbufs, the function takes
the hard case. There, 'tmp' is a uint32_t, so passing it
through rte_bswap16() drops its upper 16 bits.
In addition, 'sum' is also a uint32_t,
so 'sum += tmp' drops the carry on overflow.
Either loss makes the checksum incorrect.
Fix both by byte-swapping the full 32-bit value
and accumulating into a 64-bit sum.

Signed-off-by: Su Sai <spiderdetective.ss@gmail.com>
---
 .mailmap              |   1 +
 app/test/test_cksum.c | 102 ++++++++++++++++++++++++++++++++++++++++++
 lib/net/rte_cksum.h   |  27 +++++++++--
 3 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4001e5fb0e..bcf73cb902 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1630,6 +1630,7 @@ Sylvia Grundwürmer <sylvia.grundwuermer@b-plus.com>
 Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
 Sylwia Wnuczko <sylwia.wnuczko@intel.com>
 Szymon Sliwa <szs@semihalf.com>
+Su Sai <spiderdetective.ss@gmail.com> <susai.ss@bytedance.com>
 Szymon T Cudzilo <szymon.t.cudzilo@intel.com>
 Tadhg Kearney <tadhg.kearney@intel.com>
 Taekyung Kim <kim.tae.kyung@navercorp.com>
diff --git a/app/test/test_cksum.c b/app/test/test_cksum.c
index ea443382a1..5bd9723fbd 100644
--- a/app/test/test_cksum.c
+++ b/app/test/test_cksum.c
@@ -85,6 +85,42 @@ static const char test_cksum_ipv4_opts_udp[] = {
 	0x00, 0x35, 0x00, 0x09, 0x89, 0x6f, 0x78,
 };
 
+/*
+ * generated in scapy with
+ * Ether()/IP()/TCP(options=[NOP,NOP,Timestamps])/os.urandom(113))
+ */
+static const char test_cksum_ipv4_tcp_multi_segs[] = {
+	0x00, 0x16, 0x3e, 0x0b, 0x6b, 0xd2, 0xee, 0xff,
+	0xff, 0xff, 0xff, 0xff, 0x08, 0x00, 0x45, 0x00,
+	0x00, 0xa5, 0x46, 0x10, 0x40, 0x00, 0x40, 0x06,
+	0x80, 0xb5, 0xc0, 0xa8, 0xf9, 0x1d, 0xc0, 0xa8,
+	0xf9, 0x1e, 0xdc, 0xa2, 0x14, 0x51, 0xbb, 0x8f,
+	0xa0, 0x00, 0xe4, 0x7c, 0xe4, 0xb8, 0x80, 0x10,
+	0x02, 0x00, 0x4b, 0xc1, 0x00, 0x00, 0x01, 0x01,
+	0x08, 0x0a, 0x90, 0x60, 0xf4, 0xff, 0x03, 0xc5,
+	0xb4, 0x19, 0x77, 0x34, 0xd4, 0xdc, 0x84, 0x86,
+	0xff, 0x44, 0x09, 0x63, 0x36, 0x2e, 0x26, 0x9b,
+	0x90, 0x70, 0xf2, 0xed, 0xc8, 0x5b, 0x87, 0xaa,
+	0xb4, 0x67, 0x6b, 0x32, 0x3d, 0xc4, 0xbf, 0x15,
+	0xa9, 0x16, 0x6c, 0x2a, 0x9d, 0xb2, 0xb7, 0x6b,
+	0x58, 0x44, 0x58, 0x12, 0x4b, 0x8f, 0xe5, 0x12,
+	0x11, 0x90, 0x94, 0x68, 0x37, 0xad, 0x0a, 0x9b,
+	0xd6, 0x79, 0xf2, 0xb7, 0x31, 0xcf, 0x44, 0x22,
+	0xc8, 0x99, 0x3f, 0xe5, 0xe7, 0xac, 0xc7, 0x0b,
+	0x86, 0xdf, 0xda, 0xed, 0x0a, 0x0f, 0x86, 0xd7,
+	0x48, 0xe2, 0xf1, 0xc2, 0x43, 0xed, 0x47, 0x3a,
+	0xea, 0x25, 0x2d, 0xd6, 0x60, 0x38, 0x30, 0x07,
+	0x28, 0xdd, 0x1f, 0x0c, 0xdd, 0x7b, 0x7c, 0xd9,
+	0x35, 0x9d, 0x14, 0xaa, 0xc6, 0x35, 0xd1, 0x03,
+	0x38, 0xb1, 0xf5,
+};
+
+static const uint8_t test_cksum_ipv4_tcp_multi_segs_len[] = {
+	66,  /* the first seg contains all headers, including L2 to L4 */
+	61,  /* the second seg length is odd, test byte order independent */
+	52,  /* three segs are sufficient to test the most complex scenarios */
+};
+
 /* test l3/l4 checksum api */
 static int
 test_l4_cksum(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len)
@@ -223,6 +259,66 @@ test_l4_cksum(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len)
 	return -1;
 }
 
+/* test l4 checksum api for a packet with multiple mbufs */
+static int
+test_l4_cksum_multi_mbufs(struct rte_mempool *pktmbuf_pool, const char *pktdata, size_t len,
+			     const uint8_t *segs, size_t segs_len)
+{
+	struct rte_mbuf *m[NB_MBUF] = {0};
+	struct rte_mbuf *m_hdr = NULL;
+	struct rte_net_hdr_lens hdr_lens;
+	size_t i, off = 0;
+	uint32_t packet_type, l3;
+	void *l3_hdr;
+	char *data;
+
+	for (i = 0; i < segs_len; i++) {
+		m[i] = rte_pktmbuf_alloc(pktmbuf_pool);
+		if (m[i] == NULL)
+			GOTO_FAIL("Cannot allocate mbuf");
+
+		data = rte_pktmbuf_append(m[i], segs[i]);
+		if (data == NULL)
+			GOTO_FAIL("Cannot append data");
+
+		memcpy(data, pktdata + off, segs[i]);
+		off += segs[i];
+
+		if (m_hdr) {
+			if (rte_pktmbuf_chain(m_hdr, m[i]))
+				GOTO_FAIL("Cannot chain mbuf");
+		} else {
+			m_hdr = m[i];
+		}
+	}
+
+	if (off != len)
+		GOTO_FAIL("Invalid segs");
+
+	packet_type = rte_net_get_ptype(m_hdr, &hdr_lens, RTE_PTYPE_ALL_MASK);
+	l3 = packet_type & RTE_PTYPE_L3_MASK;
+
+	l3_hdr = rte_pktmbuf_mtod_offset(m_hdr, void *, hdr_lens.l2_len);
+	off = hdr_lens.l2_len + hdr_lens.l3_len;
+
+	if (l3 == RTE_PTYPE_L3_IPV4 || l3 == RTE_PTYPE_L3_IPV4_EXT) {
+		if (rte_ipv4_udptcp_cksum_mbuf_verify(m_hdr, l3_hdr, off) != 0)
+			GOTO_FAIL("Invalid L4 checksum verification for multiple mbufs");
+	} else if (l3 == RTE_PTYPE_L3_IPV6 || l3 == RTE_PTYPE_L3_IPV6_EXT) {
+		if (rte_ipv6_udptcp_cksum_mbuf_verify(m_hdr, l3_hdr, off) != 0)
+			GOTO_FAIL("Invalid L4 checksum verification for multiple mbufs");
+	}
+
+	rte_pktmbuf_free_bulk(m, segs_len);
+
+	return 0;
+
+fail:
+	rte_pktmbuf_free_bulk(m, segs_len);
+
+	return -1;
+}
+
 static int
 test_cksum(void)
 {
@@ -256,6 +352,12 @@ test_cksum(void)
 			  sizeof(test_cksum_ipv4_opts_udp)) < 0)
 		GOTO_FAIL("checksum error on ipv4_opts_udp");
 
+	if (test_l4_cksum_multi_mbufs(pktmbuf_pool, test_cksum_ipv4_tcp_multi_segs,
+			  sizeof(test_cksum_ipv4_tcp_multi_segs),
+			  test_cksum_ipv4_tcp_multi_segs_len,
+			  sizeof(test_cksum_ipv4_tcp_multi_segs_len)) < 0)
+		GOTO_FAIL("checksum error on multi mbufs check");
+
 	rte_mempool_free(pktmbuf_pool);
 
 	return 0;
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..679ba82eb6 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -80,6 +80,25 @@ __rte_raw_cksum_reduce(uint32_t sum)
 	return (uint16_t)sum;
 }
 
+/**
+ * @internal Reduce a sum to the non-complemented checksum.
+ * Helper routine for the rte_raw_cksum_mbuf().
+ *
+ * @param sum
+ *   Value of the sum.
+ * @return
+ *   The non-complemented checksum.
+ */
+static inline uint16_t
+__rte_raw_cksum_reduce_u64(uint64_t sum)
+{
+	uint32_t tmp;
+
+	tmp = __rte_raw_cksum_reduce((uint32_t)sum);
+	tmp += __rte_raw_cksum_reduce((uint32_t)(sum >> 32));
+	return __rte_raw_cksum_reduce(tmp);
+}
+
 /**
  * Process the non-complemented checksum of a buffer.
  *
@@ -119,8 +138,8 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
 {
 	const struct rte_mbuf *seg;
 	const char *buf;
-	uint32_t sum, tmp;
-	uint32_t seglen, done;
+	uint32_t seglen, done, tmp;
+	uint64_t sum;
 
 	/* easy case: all data in the first segment */
 	if (off + len <= rte_pktmbuf_data_len(m)) {
@@ -157,7 +176,7 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
 	for (;;) {
 		tmp = __rte_raw_cksum(buf, seglen, 0);
 		if (done & 1)
-			tmp = rte_bswap16((uint16_t)tmp);
+			tmp = rte_bswap32(tmp);
 		sum += tmp;
 		done += seglen;
 		if (done == len)
@@ -169,7 +188,7 @@ rte_raw_cksum_mbuf(const struct rte_mbuf *m, uint32_t off, uint32_t len,
 			seglen = len - done;
 	}
 
-	*cksum = __rte_raw_cksum_reduce(sum);
+	*cksum = __rte_raw_cksum_reduce_u64(sum);
 	return 0;
 }
 
-- 
2.20.1
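
The two losses the commit message describes can be reproduced outside DPDK. The following is a minimal standalone sketch, not the DPDK implementation: fold32, fold64, swap16 and swap32 are hypothetical stand-ins for __rte_raw_cksum_reduce(), the new __rte_raw_cksum_reduce_u64(), rte_bswap16() and rte_bswap32(), written here only to illustrate the arithmetic.

```c
#include <stdint.h>

/* Fold a 32-bit one's-complement partial sum down to 16 bits. */
static uint16_t
fold32(uint32_t sum)
{
	sum = (sum & 0xffffu) + (sum >> 16);
	sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Fold a 64-bit accumulator: fold each half, add, fold again. */
static uint16_t
fold64(uint64_t sum)
{
	uint32_t tmp;

	tmp = fold32((uint32_t)sum);
	tmp += fold32((uint32_t)(sum >> 32));
	return fold32(tmp);
}

/* Portable byte swaps, standing in for the DPDK helpers. */
static uint16_t
swap16(uint16_t x)
{
	return (uint16_t)((x >> 8) | (x << 8));
}

static uint32_t
swap32(uint32_t x)
{
	return (x >> 24) | ((x >> 8) & 0xff00u) |
	       ((x << 8) & 0xff0000u) | (x << 24);
}
```

Two worked cases show the difference. For the partial sum 0x00012345, truncating to 16 bits before swapping gives swap16(0x2345) = 0x4523, while fold32(swap32(0x00012345)) = 0x4623, which matches swapping the correctly folded value 0x2346. For the carry, adding 0xffff0001 and 0x00020000 in 32 bits wraps to 0x00010001 and folds to 2, whereas fold64() of the 64-bit sum gives 3, equal to fold32(0xffff0001) + fold32(0x00020000).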


^ permalink raw reply related

* Re: [RFC 0/4] alternative capture mechanism
From: Bruce Richardson @ 2026-06-16 12:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev
In-Reply-To: <20260609210540.768074-1-stephen@networkplumber.org>

On Tue, Jun 09, 2026 at 02:02:01PM -0700, Stephen Hemminger wrote:
> This is an RFC for an alternative way to capture packets from a DPDK
> application. I gave a brief demo of a similar mechanism at the DPDK summit, but
> this is more complete. Capture runs in the primary process and is driven
> entirely over telemetry; no secondary process is involved.
> 
> A client asks the application to start capturing and passes it a file
> descriptor to write to. The application writes pcapng to that descriptor.
> A Wireshark extcap script is the intended front end, but the control path
> is just telemetry and the output is just a pipe, so other front ends are
> possible.
> 
>   1/4  telemetry: let a command receive file descriptors from the client
>   2/4  capture: the library
>   3/4  test: functional test
>   4/4  app: the Wireshark extcap script and its documentation
> 
> Setup and usage are in doc/guides/tools/wireshark_extcap.rst.
> 
> Primary process only for now; secondary-process capture is possible as a
> follow-on. Posting as RFC to get feedback on the approach.
> 
> The extcap script is dual licensed (BSD-3-Clause OR GPL-2.0-or-later) as
> it may be more useful in the Wireshark tree.
> 

One concern I have though - does this cause system calls to be made in the
fast path because we are writing to a passed-in FD? For performance
reasons, would it not be better to use a memory buffer for this, thereby
avoiding syscalls? For example, rather than passing in an FD to telemetry,
we could pass in a key to be passed to shmget (going old-school!), or a
name parameter for shm_open. Thereafter, with the memory buffer, we can use a
circular ring or similar to pass the data from app to client.

/Bruce
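
To make the circular-ring idea above concrete, here is a minimal single-producer, single-consumer byte ring using C11 atomics. It is only a sketch under stated assumptions: the names (cap_ring, cap_ring_init, cap_ring_put, cap_ring_get) and the ring size are hypothetical and do not appear in the RFC, and the step of placing the structure in a region obtained via shm_open() or shmget() and mapping it in both processes is left out.

```c
#include <stdatomic.h>
#include <stdint.h>

#define CAP_RING_SIZE 4096u /* must be a power of two */

/* Hypothetical layout; in real use this would live in shared memory. */
struct cap_ring {
	_Atomic uint32_t head; /* advanced only by the producer */
	_Atomic uint32_t tail; /* advanced only by the consumer */
	uint8_t data[CAP_RING_SIZE];
};

static void
cap_ring_init(struct cap_ring *r)
{
	atomic_init(&r->head, 0);
	atomic_init(&r->tail, 0);
}

/* Producer side: copy len bytes in, or return -1 if there is no room. */
static int
cap_ring_put(struct cap_ring *r, const void *buf, uint32_t len)
{
	const uint8_t *src = buf;
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
	uint32_t i;

	if (len > CAP_RING_SIZE - (head - tail))
		return -1;
	for (i = 0; i < len; i++)
		r->data[(head + i) & (CAP_RING_SIZE - 1)] = src[i];
	/* Publish the data only after it has been written. */
	atomic_store_explicit(&r->head, head + len, memory_order_release);
	return 0;
}

/* Consumer side: copy out up to max bytes, return the number copied. */
static uint32_t
cap_ring_get(struct cap_ring *r, void *buf, uint32_t max)
{
	uint8_t *dst = buf;
	uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	uint32_t n = head - tail;
	uint32_t i;

	if (n > max)
		n = max;
	for (i = 0; i < n; i++)
		dst[i] = r->data[(tail + i) & (CAP_RING_SIZE - 1)];
	/* Release the space only after the bytes have been copied out. */
	atomic_store_explicit(&r->tail, tail + n, memory_order_release);
	return n;
}
```

In such a design the producer never makes a system call: when the ring is full, cap_ring_put() returns -1 and the caller can count a dropped packet instead of blocking.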

> Stephen Hemminger (4):
>   telemetry: allow commands to receive file descriptors
>   capture: infrastructure wireshark packet capture
>   test: add test for capture hooks
>   usertools/dpdk-wireshark-extcap.py: script for external capture
> 
>  MAINTAINERS                            |   4 +
>  app/test/meson.build                   |   1 +
>  app/test/test_capture.c                | 365 +++++++++++
>  doc/guides/rel_notes/release_26_07.rst |  12 +
>  doc/guides/tools/index.rst             |   1 +
>  doc/guides/tools/wireshark_extcap.rst  | 155 +++++
>  lib/capture/capture.c                  | 821 +++++++++++++++++++++++++
>  lib/capture/capture_impl.h             |  56 ++
>  lib/capture/filter.c                   | 108 ++++
>  lib/capture/meson.build                |  19 +
>  lib/meson.build                        |   1 +
>  lib/telemetry/rte_telemetry.h          |  66 ++
>  lib/telemetry/telemetry.c              | 115 +++-
>  usertools/dpdk-wireshark-extcap.py     | 274 +++++++++
>  14 files changed, 1986 insertions(+), 12 deletions(-)
>  create mode 100644 app/test/test_capture.c
>  create mode 100644 doc/guides/tools/wireshark_extcap.rst
>  create mode 100644 lib/capture/capture.c
>  create mode 100644 lib/capture/capture_impl.h
>  create mode 100644 lib/capture/filter.c
>  create mode 100644 lib/capture/meson.build
>  create mode 100755 usertools/dpdk-wireshark-extcap.py
> 
> -- 
> 2.53.0
> 

^ permalink raw reply

* [DPDK/other Bug 952] unit tests fail when machine has more than 128 cores
From: bugzilla @ 2026-06-16 12:34 UTC (permalink / raw)
  To: dev
In-Reply-To: <bug-952-3@http.bugs.dpdk.org/>

http://bugs.dpdk.org/show_bug.cgi?id=952

Thomas Monjalon (thomas@monjalon.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #1 from Thomas Monjalon (thomas@monjalon.net) ---
Resolved in https://dpdk.org/id/c9eb695f16
test/atomic: scale test based on core count

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply

* Re: [RFC 1/4] telemetry: allow commands to receive file descriptors
From: Bruce Richardson @ 2026-06-16 12:32 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev
In-Reply-To: <20260609210540.768074-2-stephen@networkplumber.org>

On Tue, Jun 09, 2026 at 02:02:02PM -0700, Stephen Hemminger wrote:
> Add rte_telemetry_register_cmd_fd_arg() to register a command whose
> callback also receives file descriptors passed by the client as
> SCM_RIGHTS ancillary data. The callback owns the descriptors and must
> close them.
> 
> This lets a client open a file itself and hand the descriptor to the
> primary process, so DPDK never opens the path. That avoids path and
> permission problems and works across container filesystem namespaces.
> 
> Existing commands and clients are unaffected. If an unsolicited file
> descriptor is passed, it is closed.
> 
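
For readers unfamiliar with the mechanism in the quoted commit message,
descriptor passing over a Unix domain socket can be sketched roughly as
below. This is a minimal illustration in plain POSIX C, not code from the
patch: the helper names send_cmd_with_fd() and recv_cmd_with_fd() are
hypothetical, and only a single descriptor per message is handled.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized for exactly one descriptor, aligned for cmsghdr. */
union fd_ctrl {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical helper: send a command string plus one descriptor. */
static ssize_t
send_cmd_with_fd(int sock, const char *cmd, int fd)
{
	union fd_ctrl ctrl;
	struct iovec iov = { .iov_base = (void *)cmd, .iov_len = strlen(cmd) };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl.buf,
		.msg_controllen = sizeof(ctrl.buf),
	};
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0);
}

/*
 * Hypothetical helper: receive a command and at most one descriptor.
 * *fd is set to -1 when the peer passed none. The caller owns any
 * descriptor returned and must close it.
 */
static ssize_t
recv_cmd_with_fd(int sock, char *buf, size_t len, int *fd)
{
	union fd_ctrl ctrl;
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = ctrl.buf,
		.msg_controllen = sizeof(ctrl.buf),
	};
	struct cmsghdr *cmsg;
	ssize_t n;

	*fd = -1;
	n = recvmsg(sock, &msg, 0);
	if (n < 0)
		return n;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
			memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
			break;
		}
	}
	return n;
}
```

A client would open its output file, call send_cmd_with_fd() on the
connected socket, and then close its own copy. The receiver gets a new
descriptor that refers to the same open file, so it never has to resolve
the path itself.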

This scheme seems reasonable in general. My only concern is whether the
lack of potential Windows support is an issue. For regular telemetry, there
was always the option of a Windows implementation using regular
TCP/UDP/SCTP sockets bound to localhost. However, AFAIK there is no Windows
implementation of anything that supports passing file descriptors or
handles between processes.

Some other pieces of feedback inline below.

/Bruce

> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> ---
>  doc/guides/rel_notes/release_26_07.rst |   5 ++
>  lib/telemetry/rte_telemetry.h          |  66 ++++++++++++++
>  lib/telemetry/telemetry.c              | 115 ++++++++++++++++++++++---
>  3 files changed, 174 insertions(+), 12 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
> index b5285af5fe..d7a2df88c1 100644
> --- a/doc/guides/rel_notes/release_26_07.rst
> +++ b/doc/guides/rel_notes/release_26_07.rst
> @@ -141,6 +141,11 @@ New Features
>    Added AGENTS.md file for AI review
>    and supporting scripts to review patches and documentation.
>  
> +* **Added telemetry support for passing file descriptors.**
> +
> +  Add experimental telemetry callback ``rte_telemetry_register_cmd_fd_arg()``
> +  to allow command to receive file descriptors passed by client.
> +
>  
>  Removed Items
>  -------------
> diff --git a/lib/telemetry/rte_telemetry.h b/lib/telemetry/rte_telemetry.h
> index 0a58e518f7..3e32d2902b 100644
> --- a/lib/telemetry/rte_telemetry.h
> +++ b/lib/telemetry/rte_telemetry.h
> @@ -325,6 +325,37 @@ typedef int (*telemetry_cb)(const char *cmd, const char *params,
>  typedef int (*telemetry_arg_cb)(const char *cmd, const char *params, void *arg,
>  		struct rte_tel_data *info);
>  
> +/**
> + * This telemetry callback is used when registering a telemetry command with
> + * rte_telemetry_register_cmd_fd_arg().
> + *
> + * It behaves like telemetry_arg_cb, but additionally receives any file
> + * descriptors the client passed alongside the command as SCM_RIGHTS ancillary
> + * data. The callback takes ownership of these descriptors and is responsible
> + * for closing them.
> + *
> + * @param cmd
> + *   The cmd that was requested by the client.
> + * @param params
> + *   Contains data required by the callback function.
> + * @param arg
> + *   The opaque value that was passed to rte_telemetry_register_cmd_fd_arg().
> + * @param fds
> + *   Array of file descriptors received from the client. May be NULL when
> + *   n_fds is zero.
> + * @param n_fds
> + *   Number of file descriptors in the fds array.
> + * @param info
> + *   The information to be returned to the caller.
> + *
> + * @return
> + *   Length of buffer used on success.
> + * @return
> + *   Negative integer on error.
> + */
> +typedef int (*telemetry_fd_cb)(const char *cmd, const char *params, void *arg,
> +		const int *fds, unsigned int n_fds, struct rte_tel_data *info);
> +

Do we anticipate in future having callbacks taking more than one FD? Would
it not be simpler just to have a single fd parameter (which is -1 when no
fd is passed)?
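
To make the single-descriptor alternative concrete, a callback of that
shape might look something like the sketch below. The typedef name
single_fd_cb and the callback example_fd_cb() are hypothetical and not part
of the patch; the rte_tel_data output parameter is left out so the sketch
builds without DPDK headers, and the return values are simplified to 0 or a
negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical callback shape: one descriptor, -1 meaning "none passed". */
typedef int (*single_fd_cb)(const char *cmd, const char *params, void *arg,
		int fd);

/*
 * Example callback under that shape. It owns fd and closes it before
 * returning, mirroring the ownership rule in the quoted documentation.
 */
static int
example_fd_cb(const char *cmd, const char *params, void *arg, int fd)
{
	ssize_t n;

	(void)cmd;
	(void)params;
	(void)arg;

	if (fd < 0)
		return -EINVAL;	/* the client did not pass a descriptor */

	n = write(fd, "ok\n", 3);
	close(fd);
	return n == 3 ? 0 : -EIO;
}
```

With this shape there is no array length to validate, at the cost of
needing a new callback type again if a command ever requires two
descriptors.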

>  /**
>   * Used when registering a command and callback function with telemetry.
>   *
> @@ -368,6 +399,41 @@ __rte_experimental
>  int
>  rte_telemetry_register_cmd_arg(const char *cmd, telemetry_arg_cb fn, void *arg, const char *help);
>  
> +/**
> + * Register a command and a file-descriptor-aware callback with telemetry.
> + *
> + * The callback is invoked like rte_telemetry_register_cmd_arg(), but also
> + * receives any file descriptors the client passed alongside the command as
> + * SCM_RIGHTS ancillary data. This lets a client open a file (for example a
> + * capture output file) itself and hand the descriptor to the DPDK process,
> + * which never opens the path - avoiding path and permission concerns and
> + * working across container filesystem namespaces.
> + *
> + * Descriptors sent to a command registered with rte_telemetry_register_cmd()
> + * or rte_telemetry_register_cmd_arg() are rejected and the connection is
> + * closed.
> + *
> + * @param cmd
> + *   The command to register with telemetry.
> + * @param fn
> + *   Callback function to be called when the command is requested.
> + * @param arg
> + *   An opaque value that will be passed to the callback function.
> + * @param help
> + *   Help text for the command.
> + *
> + * @return
> + *   0 on success.
> + * @return
> + *   -EINVAL for invalid parameters failure.
> + * @return
> + *   -ENOMEM for mem allocation failure.
> + */
> +__rte_experimental
> +int
> +rte_telemetry_register_cmd_fd_arg(const char *cmd, telemetry_fd_cb fn, void *arg,
> +		const char *help);
> +

Do we want to make this experimental for later stabilization, or is this an
API that is best kept as internal-only? I'd tend towards keeping it
internal-only, rather than allowing apps to define callbacks which take
FDs.

>  /**
>   * @internal
>   * Free a container that has memory allocated.
> diff --git a/lib/telemetry/telemetry.c b/lib/telemetry/telemetry.c
> index b109d076d4..30d3ae3a13 100644
> --- a/lib/telemetry/telemetry.c
> +++ b/lib/telemetry/telemetry.c
> @@ -29,6 +29,8 @@
>  #define MAX_CMD_LEN 56
>  #define MAX_OUTPUT_LEN (1024 * 16)
>  #define MAX_CONNECTIONS 10
> +/* Maximum number of file descriptors a client may pass with one command. */
> +#define MAX_FDS 8
>  

As above, do we really need multiple FDs?

>  #ifndef RTE_EXEC_ENV_WINDOWS
>  static void *
> @@ -39,6 +41,7 @@ struct cmd_callback {
>  	char cmd[MAX_CMD_LEN];
>  	telemetry_cb fn;
>  	telemetry_arg_cb fn_arg;
> +	telemetry_fd_cb fn_fd;
>  	void *arg;
>  	char help[RTE_TEL_MAX_STRING_LEN];
>  };
> @@ -72,15 +75,15 @@ static RTE_ATOMIC(uint16_t) v2_clients;
>  #endif /* !RTE_EXEC_ENV_WINDOWS */
>  
>  static int
> -register_cmd(const char *cmd, const char *help,
> -	     telemetry_cb fn, telemetry_arg_cb fn_arg, void *arg)
> +register_cmd(const char *cmd, const char *help, telemetry_cb fn,
> +	     telemetry_arg_cb fn_arg, telemetry_fd_cb fn_fd, void *arg)
>  {
>  	struct cmd_callback *new_callbacks;
>  	const char *cmdp = cmd;
>  	int i = 0;
>  
> -	if (strlen(cmd) >= MAX_CMD_LEN || (fn == NULL && fn_arg == NULL) || cmd[0] != '/'
> -			|| strlen(help) >= RTE_TEL_MAX_STRING_LEN)
> +	if (strlen(cmd) >= MAX_CMD_LEN || (fn == NULL && fn_arg == NULL && fn_fd == NULL)
> +			|| cmd[0] != '/' || strlen(help) >= RTE_TEL_MAX_STRING_LEN)
>  		return -EINVAL;
>  
>  	while (*cmdp != '\0') {
> @@ -107,6 +110,7 @@ register_cmd(const char *cmd, const char *help,
>  	strlcpy(callbacks[i].cmd, cmd, MAX_CMD_LEN);
>  	callbacks[i].fn = fn;
>  	callbacks[i].fn_arg = fn_arg;
> +	callbacks[i].fn_fd = fn_fd;
>  	callbacks[i].arg = arg;
>  	strlcpy(callbacks[i].help, help, RTE_TEL_MAX_STRING_LEN);
>  	num_callbacks++;
> @@ -119,14 +123,22 @@ RTE_EXPORT_SYMBOL(rte_telemetry_register_cmd)
>  int
>  rte_telemetry_register_cmd(const char *cmd, telemetry_cb fn, const char *help)
>  {
> -	return register_cmd(cmd, help, fn, NULL, NULL);
> +	return register_cmd(cmd, help, fn, NULL, NULL, NULL);
>  }
>  
>  RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_telemetry_register_cmd_arg, 24.11)
>  int
>  rte_telemetry_register_cmd_arg(const char *cmd, telemetry_arg_cb fn, void *arg, const char *help)
>  {
> -	return register_cmd(cmd, help, NULL, fn, arg);
> +	return register_cmd(cmd, help, NULL, fn, NULL, arg);
> +}
> +
> +RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_telemetry_register_cmd_fd_arg, 26.07)
> +int
> +rte_telemetry_register_cmd_fd_arg(const char *cmd, telemetry_fd_cb fn, void *arg,
> +		const char *help)
> +{
> +	return register_cmd(cmd, help, NULL, NULL, fn, arg);
>  }
>  
>  #ifndef RTE_EXEC_ENV_WINDOWS
> @@ -368,13 +380,70 @@ output_json(const char *cmd, const struct rte_tel_data *d, int s)
>  		TMTY_LOG_LINE(ERR, "Error writing to socket: %s", strerror(errno));
>  }
>  
> +/*
> + * Receive a command and any file descriptors the client passed alongside it
> + * as SCM_RIGHTS ancillary data. The payload length is returned (0 if the
> + * client sent an empty message or closed the connection, negative on error).
> + * Descriptors that arrive are returned in fds[]/n_fds and are owned by the
> + * caller. MSG_CTRUNC means more descriptors were sent than the control buffer
> + * could hold; *ctrunc is set so the caller can reject the command, but the
> + * descriptors that did fit are still returned so they can be closed rather
> + * than leaked.
> + */
> +static int
> +recv_with_fds(int s, char *buf, size_t buf_len, int *fds, unsigned int *n_fds,
> +	      bool *ctrunc)
> +{
> +	char cmsgbuf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
> +	struct iovec iov = { .iov_base = buf, .iov_len = buf_len };
> +	struct msghdr msg = {
> +		.msg_iov = &iov,
> +		.msg_iovlen = 1,
> +		.msg_control = cmsgbuf,
> +		.msg_controllen = sizeof(cmsgbuf),
> +	};
> +	struct cmsghdr *cmsg;
> +	int bytes;
> +
> +	*n_fds = 0;
> +	*ctrunc = false;
> +
> +	bytes = recvmsg(s, &msg, 0);
> +	if (bytes < 0)
> +		return bytes;
> +
> +	if (msg.msg_flags & MSG_CTRUNC)
> +		*ctrunc = true;
> +
> +	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> +		if (cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
> +			continue;
> +		*n_fds = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
> +		memcpy(fds, CMSG_DATA(cmsg), *n_fds * sizeof(int));
> +		break;
> +	}
> +	return bytes;
> +}
> +
>  static void
> -perform_command(const struct cmd_callback *cb, const char *cmd, const char *param, int s)
> +close_fds(const int *fds, unsigned int n_fds)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < n_fds; i++)
> +		close(fds[i]);
> +}
> +
> +static void
> +perform_command(const struct cmd_callback *cb, const char *cmd, const char *param,
> +		const int *fds, unsigned int n_fds, int s)
>  {
>  	struct rte_tel_data data = {0};
>  	int ret;
>  
> -	if (cb->fn_arg != NULL)
> +	if (cb->fn_fd != NULL)
> +		ret = cb->fn_fd(cmd, param, cb->arg, fds, n_fds, &data);
> +	else if (cb->fn_arg != NULL)
>  		ret = cb->fn_arg(cmd, param, cb->arg, &data);
>  	else
>  		ret = cb->fn(cmd, param, &data);
> @@ -412,8 +481,11 @@ client_handler(void *sock_id)
>  	}
>  
>  	/* receive data is not null terminated */
> -	int bytes = read(s, buffer, sizeof(buffer) - 1);
> -	while (bytes > 0) {
> +	int fds[MAX_FDS];
> +	unsigned int n_fds = 0;
> +	bool ctrunc = false;
> +	int bytes = recv_with_fds(s, buffer, sizeof(buffer) - 1, fds, &n_fds, &ctrunc);
> +	while (bytes > 0 || (bytes == 0 && n_fds > 0)) {
>  		buffer[bytes] = 0;
>  		const char *cmd = strtok(buffer, ",");
>  		const char *param = strtok(NULL, "\0");
> @@ -429,9 +501,28 @@ client_handler(void *sock_id)
>  				}
>  			rte_spinlock_unlock(&callback_sl);
>  		}
> -		perform_command(&cb, cmd, param, s);
>  
> -		bytes = read(s, buffer, sizeof(buffer) - 1);
> +		/*
> +		 * File descriptors go only to a command that registered to
> +		 * receive them. A command that did not, or a truncated control
> +		 * message, is a client error: close the descriptors and drop the
> +		 * connection rather than silently discarding them.
> +		 */
> +		if (n_fds > 0 && (cb.fn_fd == NULL || ctrunc)) {
> +			TMTY_LOG_LINE(ERR,
> +				"Closing connection: %u file descriptor(s) passed to '%s'%s",
> +				n_fds, cmd ? cmd : "(none)",
> +				ctrunc ? " (truncated)" : " which does not accept them");
> +			close_fds(fds, n_fds);
> +			break;
> +		}
> +
> +		/* an fd-aware callback takes ownership of the descriptors */
> +		perform_command(&cb, cmd, param, fds, n_fds, s);
> +
> +		n_fds = 0;
> +		ctrunc = false;

The receive function always resets this to false anyway, so you may be able
to omit this line (assuming the compiler doesn't complain).

> +		bytes = recv_with_fds(s, buffer, sizeof(buffer) - 1, fds, &n_fds, &ctrunc);
>  	}
>  exit:
>  	close(s);
> -- 
> 2.53.0
> 

^ permalink raw reply

* [PATCH v10 1/1] net/mana: add device reset support
From: Wei Hu @ 2026-06-16 12:31 UTC (permalink / raw)
  To: dev, stephen; +Cc: longli, weh
In-Reply-To: <20260616123158.43583-1-weh@linux.microsoft.com>

From: Wei Hu <weh@microsoft.com>

Add support for handling hardware reset events in the MANA driver.
When the MANA kernel driver receives a hardware service event, it
initiates a device reset and notifies userspace via
IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing
an automatic teardown and recovery sequence.

The interrupt handler sets the device state, blocks new data path
bursts, waits for in-flight bursts to drain using per-queue atomic
flags, and spawns a control thread. The control thread performs
teardown immediately (dev_stop, secondary IPC, dev_close, MR cache
free) before waiting for the hardware recovery timer to fire. This
avoids blocking the EAL interrupt thread on multi-second IPC
timeouts and ibverbs calls. After the recovery delay, the thread
unregisters the interrupt handler, re-probes the PCI device,
reinitializes MR caches, and restarts queues. Each function owns
its own lock scope with no lock hand-off between threads.

Each queue has an atomic burst_state variable where bit 0 is the
in-burst flag and bit 1 is a blocked flag. The data path uses a
single compare-and-swap (0 to 1) to enter a burst, which fails
immediately if the blocked bit is set. The reset path sets the
blocked bit via atomic fetch-or and polls bit 0 to wait for
in-flight bursts to drain. This single-variable design avoids the
need for sequential consistency ordering.
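
As an illustration only, the burst_state protocol described above can be
sketched with C11 atomics as follows. The driver itself uses the DPDK
rte_atomic wrappers; the function and macro names here are hypothetical,
and the memory orderings are one plausible choice rather than a copy of
the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BURST_IN_BURST 0x1u	/* bit 0: a burst call is in flight */
#define BURST_BLOCKED  0x2u	/* bit 1: reset path has blocked the queue */

/* Data path: enter a burst. Fails if in-burst or blocked is already set. */
static bool
burst_enter(_Atomic uint32_t *state)
{
	uint32_t expected = 0;

	return atomic_compare_exchange_strong_explicit(state, &expected,
			BURST_IN_BURST, memory_order_acquire,
			memory_order_relaxed);
}

/* Data path: leave a burst, preserving the blocked bit if it was set. */
static void
burst_exit(_Atomic uint32_t *state)
{
	atomic_fetch_and_explicit(state, ~BURST_IN_BURST, memory_order_release);
}

/* Reset path: block new bursts, then wait for an in-flight one to drain. */
static void
burst_block_and_drain(_Atomic uint32_t *state)
{
	atomic_fetch_or_explicit(state, BURST_BLOCKED, memory_order_acq_rel);
	while (atomic_load_explicit(state, memory_order_acquire) & BURST_IN_BURST)
		;	/* busy-wait; a real driver would pause or yield here */
}

/* Recovery path: clear both bits so the data path CAS can succeed again. */
static void
burst_unblock(_Atomic uint32_t *state)
{
	atomic_store_explicit(state, 0, memory_order_release);
}
```

Because both flags live in one word, the CAS from 0 to 1 observes the
blocked bit and claims the in-burst bit in a single atomic step, so there
is no window between checking one flag and setting the other.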

A per-device mutex serializes the reset path with ethdev
operations. The mutex uses PTHREAD_PROCESS_SHARED for multi-process
support and is held across blocking IB verbs calls. A trylock
helper encapsulates the lock acquisition and device state check
for all ethdev operation wrappers. Operations that cannot wait
(configure, queue setup) return -EBUSY during reset, while
dev_stop and dev_close join the reset thread before acquiring
the lock to ensure proper sequencing.

The reset thread keeps reset_thread_active true throughout its
lifetime. mana_join_reset_thread uses rte_thread_equal to detect
the self-join case (when a recovery callback calls dev_stop or
dev_close from the reset thread itself) and calls
rte_thread_detach instead of join, so thread resources are freed
on exit. External callers join normally.

The condvar wait in the reset thread uses a predicate loop that
checks dev_state under reset_cond_mutex, so a PCI remove signal
that arrives before the thread enters the wait is not lost. The
PCI remove callback sets dev_state to RESET_FAILED under the
same mutex before signaling. A lock/unlock barrier on
reset_ops_lock in the PCI remove path ensures teardown has
completed before emitting the INTR_RMV event.
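
The predicate-loop wait described above follows the usual condition
variable pattern, sketched below with plain pthreads. The struct, enum and
function names are hypothetical, and the state is an ordinary variable
protected by the mutex rather than the atomic dev_state used in the driver.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

enum dev_state { DEV_RESETTING, DEV_RESET_FAILED, DEV_ACTIVE };

struct reset_ctx {
	pthread_mutex_t mtx;
	pthread_cond_t cond;
	enum dev_state state;	/* protected by mtx in this sketch */
};

/*
 * Wait up to delay_ms for the hardware recovery delay to elapse.
 * Returns true if the wait was cut short by a state change (for example
 * a PCI remove), false if the full delay elapsed.
 */
static bool
wait_for_recovery(struct reset_ctx *c, unsigned int delay_ms)
{
	struct timespec deadline;
	bool interrupted;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += delay_ms / 1000;
	deadline.tv_nsec += (long)(delay_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->mtx);
	/* Check the predicate before sleeping so an early signal is not lost. */
	while (c->state == DEV_RESETTING) {
		if (pthread_cond_timedwait(&c->cond, &c->mtx, &deadline) != 0)
			break;	/* ETIMEDOUT or another error: stop waiting */
	}
	interrupted = (c->state != DEV_RESETTING);
	pthread_mutex_unlock(&c->mtx);
	return interrupted;
}

/* Remove path: change the state under the same mutex, then signal. */
static void
signal_remove(struct reset_ctx *c)
{
	pthread_mutex_lock(&c->mtx);
	c->state = DEV_RESET_FAILED;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->mtx);
}
```

If signal_remove() runs before wait_for_recovery() is called, the loop
condition is already false on entry and the waiter returns at once instead
of sleeping for the whole delay.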

Multi-process support is included: secondary processes unmap and
remap doorbell pages via IPC during the reset enter and exit
phases. The secondary RESET_EXIT handler closes the received fd
unconditionally after processing, even when the doorbell page is
already mapped. Data path functions in both primary and secondary
processes check the device state atomically and return early when
the device is not active.

The driver emits RTE_ETH_EVENT_ERR_RECOVERING before entering the
reset path so that upper layers (e.g. netvsc) can switch their
data path before queues are stopped. The event is emitted outside
the reset lock to avoid deadlock if the callback calls dev_stop or
dev_close. On completion, the driver emits RECOVERY_SUCCESS or
RECOVERY_FAILED after releasing the lock. If a recovery callback
triggers dev_stop or dev_close, the self-join detection in
mana_join_reset_thread detaches the thread to avoid deadlock. If
the enter phase fails internally, RECOVERY_FAILED is sent
immediately so the application receives a terminal event. A PCI
device removal event callback distinguishes hot-remove from
service reset.

Documentation for the device reset feature is added in the MANA
NIC guide and the 26.07 release notes.

Signed-off-by: Wei Hu <weh@microsoft.com>
---
 doc/guides/nics/mana.rst               |   40 +
 doc/guides/rel_notes/release_26_07.rst |    8 +
 drivers/net/mana/mana.c                | 1088 ++++++++++++++++++++++--
 drivers/net/mana/mana.h                |   52 +-
 drivers/net/mana/mp.c                  |   92 +-
 drivers/net/mana/mr.c                  |    6 +-
 drivers/net/mana/rx.c                  |   23 +-
 drivers/net/mana/tx.c                  |   44 +-
 8 files changed, 1245 insertions(+), 108 deletions(-)

diff --git a/doc/guides/nics/mana.rst b/doc/guides/nics/mana.rst
index 0fcab6e2f6..08e345ea61 100644
--- a/doc/guides/nics/mana.rst
+++ b/doc/guides/nics/mana.rst
@@ -71,3 +71,43 @@ The user can specify below argument in devargs.
     The default value is not set,
     meaning all the NICs will be probed and loaded.
     User can specify multiple mac=xx:xx:xx:xx:xx:xx arguments for up to 8 NICs.
+
+Device Reset Support
+--------------------
+
+The MANA PMD supports automatic recovery from hardware service reset events.
+When the MANA kernel driver receives a hardware service event,
+it initiates a device reset and notifies userspace
+via ``IBV_EVENT_DEVICE_FATAL``.
+
+The driver handles this transparently through a two-phase reset flow:
+
+* **Enter phase**: The interrupt handler blocks new data path bursts
+  and waits for all in-flight burst calls to drain
+  using per-queue atomic flags,
+  then spawns a control thread for the remaining work.
+
+* **Teardown and exit phase**: The control thread tears down
+  IB resources and queues, unmaps secondary process doorbell pages,
+  and closes the device. After a delay for hardware recovery,
+  it re-probes the PCI device,
+  reinstalls the interrupt handler,
+  reinitializes resources, and restarts queues.
+
+The driver emits the following ethdev recovery events
+to notify upper layers (e.g. netvsc) of the reset lifecycle:
+
+``RTE_ETH_EVENT_ERR_RECOVERING``
+   Reset has started.
+
+``RTE_ETH_EVENT_RECOVERY_SUCCESS``
+   Device has recovered successfully.
+
+``RTE_ETH_EVENT_RECOVERY_FAILED``
+   Recovery failed.
+
+To distinguish a PCI hot-remove from a service reset,
+the driver registers for PCI device removal events.
+This requires the application to call ``rte_dev_event_monitor_start()``
+for removal events to be delivered
+(e.g. testpmd ``--hot-plug-handling`` option).
diff --git a/doc/guides/rel_notes/release_26_07.rst b/doc/guides/rel_notes/release_26_07.rst
index bd0cec2709..58e8c2422e 100644
--- a/doc/guides/rel_notes/release_26_07.rst
+++ b/doc/guides/rel_notes/release_26_07.rst
@@ -122,6 +122,14 @@ New Features
   Added AGENTS.md file for AI review
   and supporting scripts to review patches and documentation.
 
+* **Added device reset support to the MANA PMD.**
+
+  Added automatic recovery from hardware service reset events
+  in the MANA poll mode driver. The driver uses ethdev recovery events
+  (``RTE_ETH_EVENT_ERR_RECOVERING``, ``RTE_ETH_EVENT_RECOVERY_SUCCESS``,
+  ``RTE_ETH_EVENT_RECOVERY_FAILED``) to notify upper layers of the
+  reset lifecycle.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/mana/mana.c b/drivers/net/mana/mana.c
index 67396cda1f..0b72f711a1 100644
--- a/drivers/net/mana/mana.c
+++ b/drivers/net/mana/mana.c
@@ -103,6 +103,8 @@ mana_dev_configure(struct rte_eth_dev *dev)
 			      RTE_ETH_RX_OFFLOAD_VLAN_STRIP);
 
 	priv->num_queues = dev->data->nb_rx_queues;
+	DRV_LOG(DEBUG, "priv %p, port %u, dev port %u, num_queues: %u",
+		priv, priv->port_id, priv->dev_port, priv->num_queues);
 
 	manadv_set_context_attr(priv->ib_ctx, MANADV_CTX_ATTR_BUF_ALLOCATORS,
 				(void *)((uintptr_t)&(struct manadv_ctx_allocators){
@@ -214,8 +216,8 @@ mana_dev_start(struct rte_eth_dev *dev)
 
 	DRV_LOG(INFO, "TX/RX queues have started");
 
-	/* Enable datapath for secondary processes */
-	mana_mp_req_on_rxtx(dev, MANA_MP_REQ_START_RXTX);
+	/* Intentionally ignore errors — secondary may not be running */
+	(void)mana_mp_req_on_rxtx(dev, MANA_MP_REQ_START_RXTX);
 
 	ret = rxq_intr_enable(priv);
 	if (ret) {
@@ -242,26 +244,33 @@ mana_dev_stop(struct rte_eth_dev *dev)
 {
 	int ret;
 	struct mana_priv *priv = dev->data->dev_private;
-
-	rxq_intr_disable(priv);
+	enum mana_device_state state;
+
+	state = rte_atomic_load_explicit(&priv->dev_state,
+					 rte_memory_order_acquire);
+	if (state == MANA_DEV_ACTIVE ||
+	    state == MANA_DEV_RESET_FAILED) {
+		rxq_intr_disable(priv);
+		DRV_LOG(DEBUG, "rxq_intr_disable called");
+	}
 
 	dev->tx_pkt_burst = mana_tx_burst_removed;
 	dev->rx_pkt_burst = mana_rx_burst_removed;
 
-	/* Stop datapath on secondary processes */
-	mana_mp_req_on_rxtx(dev, MANA_MP_REQ_STOP_RXTX);
+	/* Intentionally ignore errors — secondary may not be running */
+	(void)mana_mp_req_on_rxtx(dev, MANA_MP_REQ_STOP_RXTX);
 
 	rte_wmb();
 
 	ret = mana_stop_tx_queues(dev);
 	if (ret) {
-		DRV_LOG(ERR, "failed to stop tx queues");
+		DRV_LOG(ERR, "failed to stop tx queues, ret %d", ret);
 		return ret;
 	}
 
 	ret = mana_stop_rx_queues(dev);
 	if (ret) {
-		DRV_LOG(ERR, "failed to stop tx queues");
+		DRV_LOG(ERR, "failed to stop rx queues, ret %d", ret);
 		return ret;
 	}
 
@@ -275,36 +284,66 @@ mana_dev_close(struct rte_eth_dev *dev)
 {
 	struct mana_priv *priv = dev->data->dev_private;
 	int ret;
+	enum mana_device_state state;
 
+	DRV_LOG(DEBUG, "Free MR for priv %p", priv);
 	mana_remove_all_mr(priv);
 
-	ret = mana_intr_uninstall(priv);
-	if (ret)
-		return ret;
+	state = rte_atomic_load_explicit(&priv->dev_state,
+					 rte_memory_order_acquire);
+	if (state == MANA_DEV_ACTIVE ||
+	    state == MANA_DEV_RESET_FAILED) {
+		ret = mana_intr_uninstall(priv);
+		if (ret)
+			return ret;
+	}
 
 	if (priv->ib_parent_pd) {
-		int err = ibv_dealloc_pd(priv->ib_parent_pd);
-		if (err)
-			DRV_LOG(ERR, "Failed to deallocate parent PD: %d", err);
+		ret = ibv_dealloc_pd(priv->ib_parent_pd);
+		if (ret)
+			DRV_LOG(ERR,
+				"Failed to deallocate parent PD: %d", ret);
 		priv->ib_parent_pd = NULL;
 	}
 
 	if (priv->ib_pd) {
-		int err = ibv_dealloc_pd(priv->ib_pd);
-		if (err)
-			DRV_LOG(ERR, "Failed to deallocate PD: %d", err);
+		ret = ibv_dealloc_pd(priv->ib_pd);
+		if (ret)
+			DRV_LOG(ERR, "Failed to deallocate PD: %d", ret);
 		priv->ib_pd = NULL;
 	}
 
-	ret = ibv_close_device(priv->ib_ctx);
-	if (ret) {
-		ret = errno;
-		return ret;
+	state = rte_atomic_load_explicit(&priv->dev_state,
+					 rte_memory_order_acquire);
+	if (state == MANA_DEV_ACTIVE ||
+	    state == MANA_DEV_RESET_FAILED) {
+		if (priv->ib_ctx) {
+			ret = ibv_close_device(priv->ib_ctx);
+			if (ret) {
+				ret = errno;
+				return ret;
+			}
+			priv->ib_ctx = NULL;
+		}
 	}
 
 	return 0;
 }
 
+/*
+ * Called from mana_pci_remove to free resources allocated
+ * during probe that are not freed by dev_close.
+ */
+static void
+mana_dev_free_resources(struct rte_eth_dev *dev)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+
+	pthread_mutex_destroy(&priv->reset_ops_lock);
+	pthread_mutex_destroy(&priv->reset_cond_mutex);
+	pthread_cond_destroy(&priv->reset_cond);
+}
+
 static int
 mana_dev_info_get(struct rte_eth_dev *dev,
 		  struct rte_eth_dev_info *dev_info)
@@ -391,6 +430,39 @@ mana_dev_info_get(struct rte_eth_dev *dev,
 	return 0;
 }
 
+/*
+ * Try to acquire the reset lock and verify the device is active.
+ * Returns 0 with lock held on success, or -EBUSY if the lock
+ * could not be acquired or the device is not in ACTIVE state.
+ */
+static int
+mana_reset_trylock(struct mana_priv *priv)
+{
+	if (pthread_mutex_trylock(&priv->reset_ops_lock))
+		return -EBUSY;
+
+	if (rte_atomic_load_explicit(&priv->dev_state,
+	    rte_memory_order_acquire) != MANA_DEV_ACTIVE) {
+		pthread_mutex_unlock(&priv->reset_ops_lock);
+		return -EBUSY;
+	}
+	return 0;
+}
+
+static int
+mana_dev_info_get_lock(struct rte_eth_dev *dev,
+		       struct rte_eth_dev_info *dev_info)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_dev_info_get(dev, dev_info);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
 static void
 mana_dev_tx_queue_info(struct rte_eth_dev *dev, uint16_t queue_id,
 		       struct rte_eth_txq_info *qinfo)
@@ -552,6 +624,22 @@ mana_dev_tx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_idx,
 	return ret;
 }
 
+static int
+mana_dev_tx_queue_setup_lock(struct rte_eth_dev *dev, uint16_t queue_idx,
+			     uint16_t nb_desc, unsigned int socket_id,
+			     const struct rte_eth_txconf *tx_conf)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_dev_tx_queue_setup(dev, queue_idx,
+				      nb_desc, socket_id, tx_conf);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
 static void
 mana_dev_tx_queue_release(struct rte_eth_dev *dev, uint16_t qid)
 {
@@ -629,6 +717,23 @@ mana_dev_rx_queue_setup(struct rte_eth_dev *dev, uint16_t queue_idx,
 	return ret;
 }
 
+static int
+mana_dev_rx_queue_setup_lock(struct rte_eth_dev *dev, uint16_t queue_idx,
+			     uint16_t nb_desc, unsigned int socket_id,
+			     const struct rte_eth_rxconf *rx_conf __rte_unused,
+			     struct rte_mempool *mp)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_dev_rx_queue_setup(dev, queue_idx, nb_desc,
+				      socket_id, rx_conf, mp);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
 static void
 mana_dev_rx_queue_release(struct rte_eth_dev *dev, uint16_t qid)
 {
@@ -820,33 +925,267 @@ mana_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
 	return mana_ifreq(priv, SIOCSIFMTU, &request);
 }
 
+static int
+mana_dev_configure_lock(struct rte_eth_dev *dev)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_dev_configure(dev);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
+static int
+mana_dev_start_lock(struct rte_eth_dev *dev)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_dev_start(dev);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
+/*
+ * Join the reset thread if it is active. Uses CAS on
+ * reset_thread_active to ensure only one caller joins.
+ * If called from the reset thread itself (e.g. via a recovery
+ * event callback that calls dev_stop/dev_close), detach instead
+ * of joining to avoid deadlock and let the thread self-free.
+ */
+static void
+mana_join_reset_thread(struct mana_priv *priv)
+{
+	bool expected = true;
+
+	if (rte_atomic_compare_exchange_strong_explicit(
+			&priv->reset_thread_active, &expected, false,
+			rte_memory_order_acq_rel,
+			rte_memory_order_acquire)) {
+		if (rte_thread_equal(rte_thread_self(),
+				     priv->reset_thread)) {
+			/* Self case: detach so resources are freed on
+			 * thread exit. Don't modify dev_state — the
+			 * caller (dev_stop_lock/dev_close_lock) handles
+			 * state transitions.
+			 */
+			rte_thread_detach(priv->reset_thread);
+			return;
+		}
+
+		pthread_mutex_lock(&priv->reset_cond_mutex);
+		rte_atomic_store_explicit(&priv->dev_state,
+			MANA_DEV_ACTIVE, rte_memory_order_release);
+		pthread_cond_signal(&priv->reset_cond);
+		pthread_mutex_unlock(&priv->reset_cond_mutex);
+		rte_thread_join(priv->reset_thread, NULL);
+	}
+}
+
+/*
+ * Clear per-queue burst_state so the data path CAS can succeed again.
+ * Must be called under reset_ops_lock when transitioning back to ACTIVE
+ * after a failed or aborted reset.
+ */
+static void
+mana_clear_burst_state(struct rte_eth_dev *dev)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int i;
+
+	for (i = 0; i < priv->num_queues; i++) {
+		struct mana_rxq *rxq = dev->data->rx_queues[i];
+		struct mana_txq *txq = dev->data->tx_queues[i];
+
+		if (rxq)
+			rte_atomic_store_explicit(&rxq->burst_state, 0,
+						  rte_memory_order_release);
+		if (txq)
+			rte_atomic_store_explicit(&txq->burst_state, 0,
+						  rte_memory_order_release);
+	}
+}
+
+/*
+ * Custom lock wrappers for dev_stop and dev_close.
+ * These join any active reset thread and use a blocking lock (not
+ * trylock) so they wait for any in-progress reset processing to
+ * finish, rather than returning -EBUSY. When the device is not in
+ * MANA_DEV_ACTIVE state, they transition state to MANA_DEV_ACTIVE.
+ */
+static int
+mana_dev_stop_lock(struct rte_eth_dev *dev)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	mana_join_reset_thread(priv);
+
+	pthread_mutex_lock(&priv->reset_ops_lock);
+
+	if (rte_atomic_load_explicit(&priv->dev_state,
+	    rte_memory_order_acquire) != MANA_DEV_ACTIVE) {
+		mana_clear_burst_state(dev);
+		rte_atomic_store_explicit(&priv->dev_state,
+			MANA_DEV_ACTIVE, rte_memory_order_release);
+		pthread_mutex_unlock(&priv->reset_ops_lock);
+		return 0;
+	}
+
+	ret = mana_dev_stop(dev);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
+static int
+mana_dev_close_lock(struct rte_eth_dev *dev)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	mana_join_reset_thread(priv);
+
+	pthread_mutex_lock(&priv->reset_ops_lock);
+
+	if (rte_atomic_load_explicit(&priv->dev_state,
+	    rte_memory_order_acquire) != MANA_DEV_ACTIVE) {
+		mana_clear_burst_state(dev);
+		rte_atomic_store_explicit(&priv->dev_state,
+			MANA_DEV_ACTIVE, rte_memory_order_release);
+	}
+
+	ret = mana_dev_close(dev);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
+static int
+mana_rss_hash_update_lock(struct rte_eth_dev *dev,
+			  struct rte_eth_rss_conf *rss_conf)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_rss_hash_update(dev, rss_conf);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
+static int
+mana_rss_hash_conf_get_lock(struct rte_eth_dev *dev,
+			    struct rte_eth_rss_conf *rss_conf)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_rss_hash_conf_get(dev, rss_conf);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
+static void
+mana_dev_tx_queue_release_lock(struct rte_eth_dev *dev, uint16_t qid)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+
+	if (mana_reset_trylock(priv)) {
+		DRV_LOG(ERR, "Device reset in progress, "
+			"mana_dev_tx_queue_release not called");
+		return;
+	}
+	mana_dev_tx_queue_release(dev, qid);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+}
+
+static void
+mana_dev_rx_queue_release_lock(struct rte_eth_dev *dev, uint16_t qid)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+
+	if (mana_reset_trylock(priv)) {
+		DRV_LOG(ERR, "Device reset in progress, "
+			"mana_dev_rx_queue_release not called");
+		return;
+	}
+	mana_dev_rx_queue_release(dev, qid);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+}
+
+static int
+mana_rx_intr_enable_lock(struct rte_eth_dev *dev, uint16_t rx_queue_id)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_rx_intr_enable(dev, rx_queue_id);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
+static int
+mana_rx_intr_disable_lock(struct rte_eth_dev *dev, uint16_t rx_queue_id)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_rx_intr_disable(dev, rx_queue_id);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
+static int
+mana_mtu_set_lock(struct rte_eth_dev *dev, uint16_t mtu)
+{
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	if (mana_reset_trylock(priv))
+		return -EBUSY;
+	ret = mana_mtu_set(dev, mtu);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return ret;
+}
+
 static const struct eth_dev_ops mana_dev_ops = {
-	.dev_configure		= mana_dev_configure,
-	.dev_start		= mana_dev_start,
-	.dev_stop		= mana_dev_stop,
-	.dev_close		= mana_dev_close,
-	.dev_infos_get		= mana_dev_info_get,
+	.dev_configure		= mana_dev_configure_lock,
+	.dev_start		= mana_dev_start_lock,
+	.dev_stop		= mana_dev_stop_lock,
+	.dev_close		= mana_dev_close_lock,
+	.dev_infos_get		= mana_dev_info_get_lock,
 	.txq_info_get		= mana_dev_tx_queue_info,
 	.rxq_info_get		= mana_dev_rx_queue_info,
 	.dev_supported_ptypes_get = mana_supported_ptypes,
-	.rss_hash_update	= mana_rss_hash_update,
-	.rss_hash_conf_get	= mana_rss_hash_conf_get,
-	.tx_queue_setup		= mana_dev_tx_queue_setup,
-	.tx_queue_release	= mana_dev_tx_queue_release,
-	.rx_queue_setup		= mana_dev_rx_queue_setup,
-	.rx_queue_release	= mana_dev_rx_queue_release,
-	.rx_queue_intr_enable	= mana_rx_intr_enable,
-	.rx_queue_intr_disable	= mana_rx_intr_disable,
+	.rss_hash_update	= mana_rss_hash_update_lock,
+	.rss_hash_conf_get	= mana_rss_hash_conf_get_lock,
+	.tx_queue_setup		= mana_dev_tx_queue_setup_lock,
+	.tx_queue_release	= mana_dev_tx_queue_release_lock,
+	.rx_queue_setup		= mana_dev_rx_queue_setup_lock,
+	.rx_queue_release	= mana_dev_rx_queue_release_lock,
+	.rx_queue_intr_enable	= mana_rx_intr_enable_lock,
+	.rx_queue_intr_disable	= mana_rx_intr_disable_lock,
 	.link_update		= mana_dev_link_update,
 	.stats_get		= mana_dev_stats_get,
 	.stats_reset		= mana_dev_stats_reset,
-	.mtu_set		= mana_mtu_set,
+	.mtu_set		= mana_mtu_set_lock,
 };
 
 static const struct eth_dev_ops mana_dev_secondary_ops = {
 	.stats_get = mana_dev_stats_get,
 	.stats_reset = mana_dev_stats_reset,
-	.dev_infos_get = mana_dev_info_get,
+	.dev_infos_get = mana_dev_info_get_lock,
 };
 
 uint16_t
@@ -1031,28 +1370,516 @@ mana_ibv_device_to_pci_addr(const struct ibv_device *device,
 	return 0;
 }
 
+static int mana_pci_probe(struct rte_pci_driver *pci_drv,
+			  struct rte_pci_device *pci_dev);
+static void mana_intr_handler(void *arg);
+static void mana_reset_exit(struct mana_priv *priv);
+
+/* Delay before initiating reset exit after reset enter completes */
+#define MANA_RESET_TIMER_US (15 * 1000000ULL) /* 15 seconds */
+
 /*
- * Interrupt handler from IB layer to notify this device is being removed.
+ * Callback for PCI device removal events from EAL.
+ * If the device is in reset (RESET_EXIT state), a removal event means
+ * the PCI device was hot-removed, not service-reset. Wake the reset
+ * thread via condvar and notify netvsc via RTE_ETH_EVENT_INTR_RMV.
+ */
+static void
+mana_pci_remove_event_cb(const char *device_name,
+			 enum rte_dev_event_type event, void *cb_arg)
+{
+	struct mana_priv *priv = cb_arg;
+	struct rte_eth_dev *dev;
+
+	if (event != RTE_DEV_EVENT_REMOVE)
+		return;
+
+	DRV_LOG(INFO, "PCI device %s removed", device_name);
+
+	/* Wake the reset thread immediately */
+	pthread_mutex_lock(&priv->reset_cond_mutex);
+	rte_atomic_store_explicit(&priv->dev_state,
+		MANA_DEV_RESET_FAILED, rte_memory_order_release);
+	pthread_cond_signal(&priv->reset_cond);
+	pthread_mutex_unlock(&priv->reset_cond_mutex);
+
+	/* Wait for the reset thread to finish teardown and release
+	 * reset_ops_lock before emitting INTR_RMV to the application.
+	 */
+	pthread_mutex_lock(&priv->reset_ops_lock);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+
+	dev = &rte_eth_devices[priv->port_id];
+	DRV_LOG(INFO, "Sending RTE_ETH_EVENT_INTR_RMV for port %u",
+		priv->port_id);
+	rte_eth_dev_callback_process(dev,
+		RTE_ETH_EVENT_INTR_RMV, NULL);
+}
+
+/*
+ * Reset thread: performs teardown immediately, waits for the
+ * recovery timer, then re-probes and restarts the device.
+ * Runs on a control thread so it can call blocking IPC, ibv
+ * teardown, and rte_intr_callback_unregister (which all must
+ * not run on the EAL interrupt thread).
+ */
+static uint32_t
+mana_reset_thread(void *arg)
+{
+	struct mana_priv *priv = (struct mana_priv *)arg;
+	struct rte_eth_dev *dev = &rte_eth_devices[priv->port_id];
+	struct timespec ts;
+	int ret;
+	int i;
+
+	DRV_LOG(INFO, "Reset thread started");
+
+	pthread_mutex_lock(&priv->reset_ops_lock);
+
+	/* Teardown: stop data path, unmap secondary doorbells, close device,
+	 * free MR caches. Must happen immediately — hardware may be gone.
+	 */
+	ret = mana_dev_stop(dev);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to stop mana dev ret %d", ret);
+		rte_atomic_store_explicit(&priv->dev_state,
+			MANA_DEV_RESET_FAILED, rte_memory_order_release);
+		goto reset_failed;
+	}
+
+	ret = mana_mp_req_on_rxtx(dev, MANA_MP_REQ_RESET_ENTER);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to reset secondary processes ret = %d",
+			ret);
+		rte_atomic_store_explicit(&priv->dev_state,
+			MANA_DEV_RESET_FAILED, rte_memory_order_release);
+		goto reset_failed;
+	}
+
+	ret = mana_dev_close(dev);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to close mana dev ret %d", ret);
+		rte_atomic_store_explicit(&priv->dev_state,
+			MANA_DEV_RESET_FAILED, rte_memory_order_release);
+		goto reset_failed;
+	}
+
+	for (i = 0; i < priv->num_queues; i++) {
+		struct mana_rxq *rxq = dev->data->rx_queues[i];
+		struct mana_txq *txq = dev->data->tx_queues[i];
+
+		DRV_LOG(DEBUG, "Free MR for priv = %p, rxq %u, txq %u",
+			priv, rxq->rxq_idx, txq->txq_idx);
+		mana_mr_btree_free(&rxq->mr_btree);
+		mana_mr_btree_free(&txq->mr_btree);
+	}
+
+	DRV_LOG(DEBUG, "Teardown complete");
+
+	rte_atomic_store_explicit(&priv->dev_state, MANA_DEV_RESET_EXIT,
+				     rte_memory_order_release);
+
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+
+	/* Wait for the recovery timer before re-probing.
+	 * Check dev_state under reset_cond_mutex before waiting:
+	 * if mana_pci_remove_event_cb already set RESET_FAILED
+	 * (under the same mutex), we skip the wait entirely.
+	 * This avoids losing a condvar signal that arrived before
+	 * we entered the wait.
+	 */
+	DRV_LOG(INFO, "Waiting %us for hardware recovery",
+		(unsigned int)(MANA_RESET_TIMER_US / 1000000));
+
+	clock_gettime(CLOCK_REALTIME, &ts);
+	ts.tv_sec += MANA_RESET_TIMER_US / 1000000;
+
+	pthread_mutex_lock(&priv->reset_cond_mutex);
+	while (rte_atomic_load_explicit(&priv->dev_state,
+	       rte_memory_order_acquire) == MANA_DEV_RESET_EXIT) {
+		if (pthread_cond_timedwait(&priv->reset_cond,
+		    &priv->reset_cond_mutex, &ts))
+			break; /* timeout */
+	}
+	pthread_mutex_unlock(&priv->reset_cond_mutex);
+
+	pthread_mutex_lock(&priv->reset_ops_lock);
+
+	if (rte_atomic_load_explicit(&priv->dev_state,
+	    rte_memory_order_acquire) != MANA_DEV_RESET_EXIT) {
+		DRV_LOG(INFO, "Reset thread: dev_state=%d, skipping exit",
+			(int)rte_atomic_load_explicit(&priv->dev_state,
+			rte_memory_order_acquire));
+		pthread_mutex_unlock(&priv->reset_ops_lock);
+		return 0;
+	}
+
+	DRV_LOG(INFO, "Reset thread: initiating reset exit");
+	mana_reset_exit(priv);
+	/* reset_ops_lock is released by mana_reset_exit_delay.
+	 * reset_thread_active remains true; the joiner
+	 * (mana_join_reset_thread) will either join or detach
+	 * (if called from this thread's own callback).
+	 */
+	return 0;
+
+reset_failed:
+	mana_clear_burst_state(dev);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+
+	DRV_LOG(INFO, "Sending RTE_ETH_EVENT_RECOVERY_FAILED for port %u",
+		priv->port_id);
+	rte_eth_dev_callback_process(dev,
+		RTE_ETH_EVENT_RECOVERY_FAILED, NULL);
+	return 0;
+}
+
+static void
+mana_reset_enter(struct mana_priv *priv)
+{
+	int ret;
+	int i;
+	struct rte_eth_dev *dev = &rte_eth_devices[priv->port_id];
+
+	/*
+	 * Lock ownership: mana_intr_handler acquires reset_ops_lock,
+	 * mana_reset_enter sets state/drains/spawns thread and releases it.
+	 * The reset thread independently acquires/releases the lock for
+	 * teardown and for the exit (re-probe) phase.
+	 */
+
+	rte_atomic_store_explicit(&priv->dev_state, MANA_DEV_RESET_ENTER,
+				     rte_memory_order_release);
+
+	DRV_LOG(DEBUG, "Entering device reset state");
+	DRV_LOG(DEBUG, "Resetting dev = %p, priv = %p", dev, priv);
+
+	/* Set the blocked bit on each queue's burst_state so new bursts
+	 * are rejected, then wait for any in-flight burst (bit 0) to finish.
+	 */
+	for (i = 0; i < priv->num_queues; i++) {
+		struct mana_rxq *rxq = dev->data->rx_queues[i];
+		struct mana_txq *txq = dev->data->tx_queues[i];
+
+		if (rxq)
+			rte_atomic_fetch_or_explicit(&rxq->burst_state,
+				MANA_BURST_BLOCKED,
+				rte_memory_order_release);
+		if (txq)
+			rte_atomic_fetch_or_explicit(&txq->burst_state,
+				MANA_BURST_BLOCKED,
+				rte_memory_order_release);
+	}
+
+	/* Wait for all in-flight burst calls to finish (bit 0 to clear) */
+	for (i = 0; i < priv->num_queues; i++) {
+		struct mana_rxq *rxq = dev->data->rx_queues[i];
+		struct mana_txq *txq = dev->data->tx_queues[i];
+
+		if (rxq)
+			while (rte_atomic_load_explicit(&rxq->burst_state,
+				    rte_memory_order_acquire) & 1)
+				rte_pause();
+		if (txq)
+			while (rte_atomic_load_explicit(&txq->burst_state,
+				    rte_memory_order_acquire) & 1)
+				rte_pause();
+	}
+
+	DRV_LOG(DEBUG, "All data path threads drained");
+
+	/* Join previous reset thread if it completed but was not joined.
+	 * Use CAS to avoid double-join if another path joined first.
+	 * Don't use mana_join_reset_thread() here — we are already in
+	 * RESET_ENTER state and must not change dev_state to ACTIVE.
+	 */
+	{
+		bool expected = true;
+
+		if (rte_atomic_compare_exchange_strong_explicit(
+				&priv->reset_thread_active, &expected, false,
+				rte_memory_order_acq_rel,
+				rte_memory_order_acquire))
+			rte_thread_join(priv->reset_thread, NULL);
+	}
+
+	ret = rte_thread_create_internal_control(&priv->reset_thread,
+						 "mana-reset",
+						 mana_reset_thread, priv);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to create reset thread ret %d", ret);
+		rte_atomic_store_explicit(&priv->dev_state,
+					  MANA_DEV_RESET_FAILED,
+					  rte_memory_order_release);
+		goto reset_failed;
+	}
+	rte_atomic_store_explicit(&priv->reset_thread_active,
+		true, rte_memory_order_release);
+
+	DRV_LOG(DEBUG, "Reset thread created");
+
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+	return;
+
+reset_failed:
+	mana_clear_burst_state(dev);
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+}
+
+static int
+mana_reset_exit_delay(void *arg)
+{
+	struct mana_priv *priv = (struct mana_priv *)arg;
+	int ret = 0;
+	int i;
+	struct rte_eth_dev *dev;
+	struct rte_pci_device *pci_dev;
+
+	DRV_LOG(DEBUG, "Delayed mana device reset completion processing");
+
+	/* If the app called dev_stop/dev_close during the timer window,
+	 * state is no longer RESET_EXIT. Nothing to do.
+	 */
+	if (rte_atomic_load_explicit(&priv->dev_state,
+	    rte_memory_order_acquire) != MANA_DEV_RESET_EXIT) {
+		DRV_LOG(DEBUG, "State is not RESET_EXIT, skipping");
+		pthread_mutex_unlock(&priv->reset_ops_lock);
+		return ret;
+	}
+
+	dev = &rte_eth_devices[priv->port_id];
+	pci_dev = RTE_CLASS_TO_BUS_DEVICE(dev, *pci_dev);
+
+	DRV_LOG(DEBUG, "Resetting dev = %p, priv = %p", dev, priv);
+
+	ret = ibv_close_device(priv->ib_ctx);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to close ibv device %d", ret);
+		rte_atomic_store_explicit(&priv->dev_state, MANA_DEV_RESET_FAILED,
+				     rte_memory_order_release);
+		goto out;
+	}
+	priv->ib_ctx = NULL;
+
+	ret = mana_pci_probe(NULL, pci_dev);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to probe mana pci dev ret %d", ret);
+		rte_atomic_store_explicit(&priv->dev_state, MANA_DEV_RESET_FAILED,
+				     rte_memory_order_release);
+		goto out;
+	}
+
+	/*
+	 * Init the local MR caches.
+	 */
+	for (i = 0; i < priv->num_queues; i++) {
+		struct mana_rxq *rxq = dev->data->rx_queues[i];
+		struct mana_txq *txq = dev->data->tx_queues[i];
+
+		ret = mana_mr_btree_init(&rxq->mr_btree,
+					 MANA_MR_BTREE_PER_QUEUE_N,
+					 rxq->socket);
+		if (ret) {
+			DRV_LOG(ERR, "Failed to init RXQ %d MR btree "
+				"on socket %u, ret %d", i, rxq->socket, ret);
+			goto mr_init_failed_rxq;
+		}
+
+		ret = mana_mr_btree_init(&txq->mr_btree,
+					 MANA_MR_BTREE_PER_QUEUE_N,
+					 txq->socket);
+		if (ret) {
+			DRV_LOG(ERR, "Failed to init TXQ %d MR btree "
+				"on socket %u, ret %d", i, txq->socket, ret);
+			goto mr_init_failed_txq;
+		}
+	}
+	DRV_LOG(DEBUG, "priv %p, num_queues %u", priv, priv->num_queues);
+
+	/* Start secondaries */
+	ret = mana_mp_req_on_rxtx(dev, MANA_MP_REQ_RESET_EXIT);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to start secondary processes ret = %d",
+			ret);
+		goto mr_init_failed_all;
+	}
+
+	ret = mana_dev_start(dev);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to start mana dev ret %d", ret);
+		goto mr_init_failed_all;
+	}
+
+	/* Clear per-queue burst_state before marking device active so
+	 * data path CAS can succeed again.
+	 */
+	for (i = 0; i < priv->num_queues; i++) {
+		struct mana_rxq *rxq = dev->data->rx_queues[i];
+		struct mana_txq *txq = dev->data->tx_queues[i];
+
+		if (rxq)
+			rte_atomic_store_explicit(&rxq->burst_state, 0,
+						  rte_memory_order_release);
+		if (txq)
+			rte_atomic_store_explicit(&txq->burst_state, 0,
+						  rte_memory_order_release);
+	}
+
+	rte_atomic_store_explicit(&priv->dev_state, MANA_DEV_ACTIVE,
+				     rte_memory_order_release);
+
+	DRV_LOG(DEBUG, "Exiting reset completion processing");
+	goto out;
+
+mr_init_failed_all:
+	i = priv->num_queues;
+	goto mr_init_failed_rxq;
+
+mr_init_failed_txq:
+	/* RXQ btree at index i was initialized, free it */
+	mana_mr_btree_free(&((struct mana_rxq *)
+			     dev->data->rx_queues[i])->mr_btree);
+
+mr_init_failed_rxq:
+	/* Free all fully initialized btrees for indices < i */
+	for (int j = 0; j < i; j++) {
+		struct mana_rxq *rxq = dev->data->rx_queues[j];
+		struct mana_txq *txq = dev->data->tx_queues[j];
+
+		mana_mr_btree_free(&rxq->mr_btree);
+		mana_mr_btree_free(&txq->mr_btree);
+	}
+	rte_atomic_store_explicit(&priv->dev_state, MANA_DEV_RESET_FAILED,
+				     rte_memory_order_release);
+
+out:
+	pthread_mutex_unlock(&priv->reset_ops_lock);
+
+	if (!ret) {
+		DRV_LOG(INFO, "Sending RTE_ETH_EVENT_RECOVERY_SUCCESS for port %u",
+			priv->port_id);
+		rte_eth_dev_callback_process(dev,
+			RTE_ETH_EVENT_RECOVERY_SUCCESS, NULL);
+	} else {
+		DRV_LOG(INFO, "Sending RTE_ETH_EVENT_RECOVERY_FAILED for port %u",
+			priv->port_id);
+		rte_eth_dev_callback_process(dev,
+			RTE_ETH_EVENT_RECOVERY_FAILED, NULL);
+	}
+	return ret;
+}
+
+static void
+mana_reset_exit(struct mana_priv *priv)
+{
+	int ret;
+
+	if (!priv) {
+		DRV_LOG(ERR, "Private structure invalid");
+		return;
+	}
+	DRV_LOG(DEBUG, "Entering device reset completion processing");
+
+	rxq_intr_disable(priv);
+
+	/* Unregister the interrupt handler. Since mana_reset_exit is always
+	 * called from mana_reset_thread (a non-interrupt thread), the
+	 * interrupt source is inactive and rte_intr_callback_unregister
+	 * succeeds directly.
+	 */
+	if (priv->intr_handle) {
+		ret = rte_intr_callback_unregister(priv->intr_handle,
+						   mana_intr_handler, priv);
+		if (ret < 0)
+			DRV_LOG(ERR, "Failed to unregister intr callback ret %d",
+				ret);
+		else
+			DRV_LOG(DEBUG, "%d intr callback(s) removed", ret);
+
+		rte_intr_instance_free(priv->intr_handle);
+		priv->intr_handle = NULL;
+	}
+
+	/* Proceed directly to reset exit delay (re-probe and restart).
+	 * No need for a separate thread - we are already on
+	 * mana_reset_thread which is a non-interrupt control thread.
+	 */
+	mana_reset_exit_delay(priv);
+}
+
+/*
+ * Interrupt handler from IB layer to notify this device is
+ * being removed or reset.
  */
 static void
 mana_intr_handler(void *arg)
 {
 	struct mana_priv *priv = arg;
 	struct ibv_context *ctx = priv->ib_ctx;
-	struct ibv_async_event event;
+	struct ibv_async_event event = { 0 };
+	struct rte_eth_dev *dev;
 
 	/* Read and ack all messages from IB device */
 	while (true) {
 		if (ibv_get_async_event(ctx, &event))
 			break;
 
-		if (event.event_type == IBV_EVENT_DEVICE_FATAL) {
-			struct rte_eth_dev *dev;
-
-			dev = &rte_eth_devices[priv->port_id];
-			if (dev->data->dev_conf.intr_conf.rmv)
+		switch (event.event_type) {
+		case IBV_EVENT_DEVICE_FATAL:
+			DRV_LOG(INFO, "IBV_EVENT_DEVICE_FATAL received, dev_state=%d",
+				(int)rte_atomic_load_explicit(&priv->dev_state,
+				rte_memory_order_acquire));
+			if (rte_atomic_load_explicit(&priv->dev_state,
+			    rte_memory_order_acquire) == MANA_DEV_ACTIVE) {
+				/* Notify upper layers (e.g. netvsc) before
+				 * acquiring the lock so they can switch data
+				 * path before mana stops queues. Emitting
+				 * outside the lock avoids deadlock if the
+				 * callback calls dev_stop/dev_close.
+				 */
+				dev = &rte_eth_devices[priv->port_id];
+				DRV_LOG(INFO,
+					"Sending RTE_ETH_EVENT_ERR_RECOVERING for port %u",
+					priv->port_id);
 				rte_eth_dev_callback_process(dev,
-					RTE_ETH_EVENT_INTR_RMV, NULL);
+					RTE_ETH_EVENT_ERR_RECOVERING,
+					NULL);
+
+				pthread_mutex_lock(&priv->reset_ops_lock);
+
+				/* Re-check after lock to avoid racing with
+				 * mana_pci_remove_event_cb which may have
+				 * set RESET_FAILED while we waited.
+				 */
+				if (rte_atomic_load_explicit(&priv->dev_state,
+				    rte_memory_order_acquire) !=
+				    MANA_DEV_ACTIVE) {
+					pthread_mutex_unlock(
+						&priv->reset_ops_lock);
+					break;
+				}
+
+				mana_reset_enter(priv);
+
+				if (rte_atomic_load_explicit(&priv->dev_state,
+				    rte_memory_order_acquire) ==
+				    MANA_DEV_RESET_FAILED) {
+					DRV_LOG(INFO,
+						"Sending RTE_ETH_EVENT_RECOVERY_FAILED for port %u",
+						priv->port_id);
+					rte_eth_dev_callback_process(dev,
+						RTE_ETH_EVENT_RECOVERY_FAILED,
+						NULL);
+				}
+			} else {
+				DRV_LOG(ERR, "Already in reset handling, dev_state=%d",
+					(int)rte_atomic_load_explicit(&priv->dev_state,
+					rte_memory_order_acquire));
+			}
+			break;
+
+		default:
+			break;
 		}
 
 		ibv_ack_async_event(&event);
@@ -1063,6 +1890,23 @@ static int
 mana_intr_uninstall(struct mana_priv *priv)
 {
 	int ret;
+	struct rte_eth_dev *dev;
+
+	if (!priv->intr_handle)
+		return 0;
+
+	/* Unregister PCI device removal event callback.
+	 * Do not retry on -EAGAIN to avoid deadlock: the callback
+	 * may be blocked waiting for reset_ops_lock which we hold.
+	 */
+	dev = &rte_eth_devices[priv->port_id];
+	if (dev->device) {
+		ret = rte_dev_event_callback_unregister(dev->device->name,
+			mana_pci_remove_event_cb, priv);
+		if (ret < 0 && ret != -ENOENT)
+			DRV_LOG(WARNING, "Failed to unregister PCI remove cb ret %d",
+				ret);
+	}
 
 	ret = rte_intr_callback_unregister(priv->intr_handle,
 					   mana_intr_handler, priv);
@@ -1072,6 +1916,7 @@ mana_intr_uninstall(struct mana_priv *priv)
 	}
 
 	rte_intr_instance_free(priv->intr_handle);
+	priv->intr_handle = NULL;
 
 	return 0;
 }
@@ -1127,6 +1972,16 @@ mana_intr_install(struct rte_eth_dev *eth_dev, struct mana_priv *priv)
 		goto free_intr;
 	}
 
+	/* Register for PCI device removal events to distinguish
+	 * PCI hot-remove from service reset. This requires the
+	 * application to call rte_dev_event_monitor_start() for
+	 * events to be delivered (e.g. testpmd --hot-plug-handling).
+	 */
+	ret = rte_dev_event_callback_register(eth_dev->device->name,
+					      mana_pci_remove_event_cb, priv);
+	if (ret)
+		DRV_LOG(WARNING, "Failed to register PCI remove event callback");
+
 	eth_dev->intr_handle = priv->intr_handle;
 	return 0;
 
@@ -1156,7 +2011,7 @@ mana_proc_priv_init(struct rte_eth_dev *dev)
 /*
  * Map the doorbell page for the secondary process through IB device handle.
  */
-static int
+int
 mana_map_doorbell_secondary(struct rte_eth_dev *eth_dev, int fd)
 {
 	struct mana_process_priv *priv = eth_dev->process_private;
@@ -1294,17 +2149,29 @@ mana_probe_port(struct ibv_device *ibdev, struct ibv_device_attr_ex *dev_attr,
 	char name[RTE_ETH_NAME_MAX_LEN];
 	int ret;
 	struct ibv_context *ctx = NULL;
+	bool is_reset = false;
+	pthread_mutexattr_t mattr;
+	pthread_condattr_t cattr;
 
 	rte_ether_format_addr(address, sizeof(address), addr);
-	DRV_LOG(INFO, "device located port %u address %s", port, address);
 
-	priv = rte_zmalloc_socket(NULL, sizeof(*priv), RTE_CACHE_LINE_SIZE,
-				  SOCKET_ID_ANY);
-	if (!priv)
-		return -ENOMEM;
+	DRV_LOG(DEBUG, "device located port %u address %s", port, address);
 
 	snprintf(name, sizeof(name), "%s_port%d", pci_dev->device.name, port);
 
+	eth_dev = rte_eth_dev_allocated(name);
+	if (eth_dev) {
+		is_reset = true;
+		priv = eth_dev->data->dev_private;
+		DRV_LOG(DEBUG, "Device reset for eth_dev %p priv %p",
+			eth_dev, priv);
+	} else {
+		priv = rte_zmalloc_socket(NULL, sizeof(*priv), RTE_CACHE_LINE_SIZE,
+					  SOCKET_ID_ANY);
+		if (!priv)
+			return -ENOMEM;
+	}
+
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
 		int fd;
 
@@ -1317,6 +2184,7 @@ mana_probe_port(struct ibv_device *ibdev, struct ibv_device_attr_ex *dev_attr,
 
 		eth_dev->device = &pci_dev->device;
 		eth_dev->dev_ops = &mana_dev_secondary_ops;
+
 		ret = mana_proc_priv_init(eth_dev);
 		if (ret)
 			goto failed;
@@ -1336,7 +2204,7 @@ mana_probe_port(struct ibv_device *ibdev, struct ibv_device_attr_ex *dev_attr,
 			goto failed;
 		}
 
-		/* fd is no not used after mapping doorbell */
+		/* fd is not used after mapping doorbell */
 		close(fd);
 
 		eth_dev->tx_pkt_burst = mana_tx_burst;
@@ -1355,22 +2223,6 @@ mana_probe_port(struct ibv_device *ibdev, struct ibv_device_attr_ex *dev_attr,
 		goto failed;
 	}
 
-	eth_dev = rte_eth_dev_allocate(name);
-	if (!eth_dev) {
-		ret = -ENOMEM;
-		goto failed;
-	}
-
-	eth_dev->data->mac_addrs =
-		rte_calloc("mana_mac", 1,
-			   sizeof(struct rte_ether_addr), 0);
-	if (!eth_dev->data->mac_addrs) {
-		ret = -ENOMEM;
-		goto failed;
-	}
-
-	rte_ether_addr_copy(addr, eth_dev->data->mac_addrs);
-
 	priv->ib_pd = ibv_alloc_pd(ctx);
 	if (!priv->ib_pd) {
 		DRV_LOG(ERR, "ibv_alloc_pd failed port %d", port);
@@ -1390,10 +2242,6 @@ mana_probe_port(struct ibv_device *ibdev, struct ibv_device_attr_ex *dev_attr,
 	}
 
 	priv->ib_ctx = ctx;
-	priv->port_id = eth_dev->data->port_id;
-	priv->dev_port = port;
-	eth_dev->data->dev_private = priv;
-	priv->dev_data = eth_dev->data;
 
 	priv->max_rx_queues = dev_attr->orig_attr.max_qp;
 	priv->max_tx_queues = dev_attr->orig_attr.max_qp;
@@ -1415,23 +2263,72 @@ mana_probe_port(struct ibv_device *ibdev, struct ibv_device_attr_ex *dev_attr,
 		name, priv->max_rx_queues, priv->max_rx_desc,
 		priv->max_send_sge, priv->max_mr_size);
 
-	rte_eth_copy_pci_info(eth_dev, pci_dev);
+	if (!is_reset) {
+		eth_dev = rte_eth_dev_allocate(name);
+		if (!eth_dev) {
+			ret = -ENOMEM;
+			goto failed;
+		}
 
-	/* Create async interrupt handler */
-	ret = mana_intr_install(eth_dev, priv);
-	if (ret) {
-		DRV_LOG(ERR, "Failed to install intr handler");
-		goto failed;
+		eth_dev->data->mac_addrs =
+			rte_calloc("mana_mac", 1,
+				   sizeof(struct rte_ether_addr), 0);
+		if (!eth_dev->data->mac_addrs) {
+			ret = -ENOMEM;
+			goto failed;
+		}
+
+		rte_ether_addr_copy(addr, eth_dev->data->mac_addrs);
+	} else {
+		/*
+		 * Reset path.
+		 */
+		rte_ether_format_addr(address, RTE_ETHER_ADDR_FMT_SIZE,
+				      eth_dev->data->mac_addrs);
+		DRV_LOG(DEBUG, "Found existing eth_dev %p with mac addr %s",
+			eth_dev, address);
+		DRV_LOG(DEBUG, "ib_ctx = %p", priv->ib_ctx);
+		goto out;
 	}
 
-	eth_dev->device = &pci_dev->device;
+	priv->port_id = eth_dev->data->port_id;
+	priv->dev_port = port;
+	eth_dev->data->dev_private = priv;
+	priv->dev_data = eth_dev->data;
+	rte_atomic_store_explicit(&priv->dev_state, MANA_DEV_ACTIVE,
+				     rte_memory_order_release);
+
+	rte_eth_copy_pci_info(eth_dev, pci_dev);
+
+	pthread_mutexattr_init(&mattr);
+	pthread_mutexattr_setpshared(&mattr, PTHREAD_PROCESS_SHARED);
+	pthread_mutex_init(&priv->reset_ops_lock, &mattr);
+	pthread_mutex_init(&priv->reset_cond_mutex, &mattr);
+	pthread_mutexattr_destroy(&mattr);
+
+	pthread_condattr_init(&cattr);
+	pthread_condattr_setpshared(&cattr, PTHREAD_PROCESS_SHARED);
+	pthread_cond_init(&priv->reset_cond, &cattr);
+	pthread_condattr_destroy(&cattr);
 
-	DRV_LOG(INFO, "device %s at port %u", name, eth_dev->data->port_id);
+	eth_dev->device = &pci_dev->device;
 
 	eth_dev->rx_pkt_burst = mana_rx_burst_removed;
 	eth_dev->tx_pkt_burst = mana_tx_burst_removed;
 	eth_dev->dev_ops = &mana_dev_ops;
 
+out:
+	/* Create async interrupt handler */
+	ret = mana_intr_install(eth_dev, priv);
+	if (ret) {
+		DRV_LOG(ERR, "Failed to install intr handler, ret %d", ret);
+		goto failed;
+	}
+	DRV_LOG(INFO, "mana_intr_install succeeded");
+
+	DRV_LOG(INFO, "device %s priv %p dev port %d at port %u",
+		name, priv, priv->dev_port, eth_dev->data->port_id);
+
 	rte_eth_dev_probing_finish(eth_dev);
 
 	return 0;
@@ -1439,20 +2336,29 @@ mana_probe_port(struct ibv_device *ibdev, struct ibv_device_attr_ex *dev_attr,
 failed:
 	/* Free the resource for the port failed */
 	if (priv) {
-		if (priv->ib_parent_pd)
+		if (priv->ib_parent_pd) {
 			ibv_dealloc_pd(priv->ib_parent_pd);
+			priv->ib_parent_pd = NULL;
+		}
 
-		if (priv->ib_pd)
+		if (priv->ib_pd) {
 			ibv_dealloc_pd(priv->ib_pd);
+			priv->ib_pd = NULL;
+		}
 	}
 
-	if (eth_dev)
-		rte_eth_dev_release_port(eth_dev);
+	if (!is_reset) {
+		if (eth_dev)
+			rte_eth_dev_release_port(eth_dev);
 
-	rte_free(priv);
+		rte_free(priv);
+	}
 
-	if (ctx)
+	if (ctx) {
 		ibv_close_device(ctx);
+		if (is_reset && priv)
+			priv->ib_ctx = NULL;
+	}
 
 	return ret;
 }
@@ -1617,7 +2523,17 @@ mana_pci_probe(struct rte_pci_driver *pci_drv __rte_unused,
 static int
 mana_dev_uninit(struct rte_eth_dev *dev)
 {
-	return mana_dev_close(dev);
+	struct mana_priv *priv = dev->data->dev_private;
+	int ret;
+
+	/* Join reset thread before teardown to ensure it has exited
+	 * before we destroy the condvar/mutex in free_resources.
+	 */
+	mana_join_reset_thread(priv);
+
+	ret = mana_dev_close(dev);
+	mana_dev_free_resources(dev);
+	return ret;
 }
 
 /*
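
Note on the recovery-timer wait in mana_reset_thread: the comment there
explains that dev_state is checked under reset_cond_mutex before waiting,
so a signal sent by mana_pci_remove_event_cb before the wait begins is not
lost. The following is a minimal standalone sketch of that pattern using
plain pthreads. It is not the driver code: struct waiter, wait_or_cancel()
and cancel_wait() are hypothetical names, and a bool stands in for the
dev_state comparison.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the reset_cond/reset_cond_mutex pair. */
struct waiter {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool cancelled;	/* predicate, only accessed under lock */
};

#define WAITER_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/*
 * Wait up to timeout_ms, or until cancel_wait() is called.
 * Returns true if the wait was cancelled, false on timeout.
 */
static bool
wait_or_cancel(struct waiter *w, long timeout_ms)
{
	struct timespec ts;
	bool cancelled;

	/* Absolute deadline on the condvar's default clock. */
	clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += timeout_ms / 1000;
	ts.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (ts.tv_nsec >= 1000000000L) {
		ts.tv_sec++;
		ts.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&w->lock);
	/* Predicate is tested before the first wait, so an earlier
	 * cancel_wait() makes this loop exit without sleeping.
	 */
	while (!w->cancelled) {
		if (pthread_cond_timedwait(&w->cond, &w->lock, &ts) != 0)
			break; /* timeout (or error) */
	}
	cancelled = w->cancelled;
	pthread_mutex_unlock(&w->lock);
	return cancelled;
}

/* Set the predicate and signal under the same mutex the waiter uses. */
static void
cancel_wait(struct waiter *w)
{
	pthread_mutex_lock(&w->lock);
	w->cancelled = true;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
}
```

Because the predicate is written and read under the same mutex, the order
of cancel_wait() and wait_or_cancel() does not matter: a cancel that
arrives first is seen by the initial loop test.
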
diff --git a/drivers/net/mana/mana.h b/drivers/net/mana/mana.h
index 79cc47b6ab..a7b301484a 100644
--- a/drivers/net/mana/mana.h
+++ b/drivers/net/mana/mana.h
@@ -5,6 +5,8 @@
 #ifndef __MANA_H__
 #define __MANA_H__
 
+#include <pthread.h>
+
 #define	PCI_VENDOR_ID_MICROSOFT		0x1414
 #define PCI_DEVICE_ID_MICROSOFT_MANA_PF	0x00b9
 #define PCI_DEVICE_ID_MICROSOFT_MANA	0x00ba
@@ -337,6 +339,26 @@ struct mana_process_priv {
 	void *db_page;
 };
 
+enum mana_device_state {
+	/* Normal running */
+	MANA_DEV_ACTIVE		= 0,
+	/* In reset enter processing */
+	MANA_DEV_RESET_ENTER	= 1,
+	/*
+	 * Reset enter processing completed.
+	 * Waiting for reset exit or in reset exit processing.
+	 */
+	MANA_DEV_RESET_EXIT	= 2,
+	/* Reset failed */
+	MANA_DEV_RESET_FAILED	= 3,
+};
+
+/* burst_state bit layout:
+ *   Bit 0: in-burst (set by data path CAS 0→1, cleared on exit).
+ *   Bit 1: blocked  (set by reset path to reject new bursts).
+ */
+#define MANA_BURST_BLOCKED	2
+
 struct mana_priv {
 	struct rte_eth_dev_data *dev_data;
 	struct mana_process_priv *process_priv;
@@ -368,6 +390,15 @@ struct mana_priv {
 	uint64_t max_mr_size;
 	struct mana_mr_btree mr_btree;
 	rte_spinlock_t	mr_btree_lock;
+	RTE_ATOMIC(enum mana_device_state) dev_state;
+	/* mutex for synchronizing mana reset and some mana_dev_ops callbacks */
+	pthread_mutex_t reset_ops_lock;
+	/* Reset thread ID, valid when reset_thread_active is true */
+	rte_thread_t reset_thread;
+	RTE_ATOMIC(bool) reset_thread_active;
+	/* Condvar to wake reset thread early on PCI remove */
+	pthread_mutex_t reset_cond_mutex;
+	pthread_cond_t reset_cond;
 };
 
 struct mana_txq_desc {
@@ -427,6 +458,14 @@ struct mana_txq {
 	struct mana_mr_btree mr_btree;
 	struct mana_stats stats;
 	unsigned int socket;
+	unsigned int txq_idx;
+
+	/*
+	 * Bit 0: in-burst flag (set by data path, cleared on exit).
+	 * Bit 1: blocked flag (set by reset path via fetch_or).
+	 * Data path CAS 0→1 to enter; fails if blocked bit is set.
+	 */
+	RTE_ATOMIC(uint32_t) burst_state;
 };
 
 struct mana_rxq {
@@ -462,6 +501,14 @@ struct mana_rxq {
 	struct mana_mr_btree mr_btree;
 
 	unsigned int socket;
+	unsigned int rxq_idx;
+
+	/*
+	 * Bit 0: in-burst flag (set by data path, cleared on exit).
+	 * Bit 1: blocked flag (set by reset path via fetch_or).
+	 * Data path CAS 0→1 to enter; fails if blocked bit is set.
+	 */
+	RTE_ATOMIC(uint32_t) burst_state;
 };
 
 extern int mana_logtype_driver;
@@ -543,6 +590,8 @@ enum mana_mp_req_type {
 	MANA_MP_REQ_CREATE_MR,
 	MANA_MP_REQ_START_RXTX,
 	MANA_MP_REQ_STOP_RXTX,
+	MANA_MP_REQ_RESET_ENTER,
+	MANA_MP_REQ_RESET_EXIT,
 };
 
 /* Pameters for IPC. */
@@ -563,8 +612,9 @@ void mana_mp_uninit_primary(void);
 void mana_mp_uninit_secondary(void);
 int mana_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev);
 int mana_mp_req_mr_create(struct mana_priv *priv, uintptr_t addr, uint32_t len);
+int mana_map_doorbell_secondary(struct rte_eth_dev *eth_dev, int fd);
 
-void mana_mp_req_on_rxtx(struct rte_eth_dev *dev, enum mana_mp_req_type type);
+int mana_mp_req_on_rxtx(struct rte_eth_dev *dev, enum mana_mp_req_type type);
 
 void *mana_alloc_verbs_buf(size_t size, void *data);
 void mana_free_verbs_buf(void *ptr, void *data __rte_unused);
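
Note on the burst_state bit layout added to mana.h above: the comments
describe a two-bit gate where the data path enters with a CAS from 0 to 1
and the reset path sets the blocked bit with fetch_or. Below is a rough
sketch of that gate in standard C11 atomics rather than the rte_atomic
wrappers. The helper names are hypothetical, and clearing bit 0 with
fetch_and on exit is an assumption, since the data-path exit code is not
part of this hunk.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BURST_IN_FLIGHT	1u	/* bit 0: a burst call is running */
#define BURST_BLOCKED	2u	/* bit 1: same value as MANA_BURST_BLOCKED */

/* Data path: enter a burst. Succeeds only from the idle state 0,
 * so it fails while another burst runs or the blocked bit is set.
 */
static bool
burst_enter(_Atomic uint32_t *state)
{
	uint32_t expected = 0;

	return atomic_compare_exchange_strong_explicit(state, &expected,
			BURST_IN_FLIGHT, memory_order_acquire,
			memory_order_relaxed);
}

/* Data path: leave the burst. Only bit 0 is cleared (assumed), so a
 * blocked bit set concurrently by the reset path is preserved.
 */
static void
burst_exit(_Atomic uint32_t *state)
{
	atomic_fetch_and_explicit(state, ~BURST_IN_FLIGHT,
				  memory_order_release);
}

/* Reset path: reject all new bursts on this queue. */
static void
burst_block(_Atomic uint32_t *state)
{
	atomic_fetch_or_explicit(state, BURST_BLOCKED,
				 memory_order_release);
}

/* Reset path: true once no burst is in flight (bit 0 clear). */
static bool
burst_drained(_Atomic uint32_t *state)
{
	return (atomic_load_explicit(state, memory_order_acquire) &
		BURST_IN_FLIGHT) == 0;
}
```

In this sketch the reset side would call burst_block() on every queue and
then poll burst_drained(), which corresponds to the two loops in
mana_reset_enter; storing 0 re-opens the gate, as mana_reset_exit_delay
does before marking the device active.
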
diff --git a/drivers/net/mana/mp.c b/drivers/net/mana/mp.c
index 72417fc0c7..1670f1ea9c 100644
--- a/drivers/net/mana/mp.c
+++ b/drivers/net/mana/mp.c
@@ -2,10 +2,13 @@
  * Copyright 2022 Microsoft Corporation
  */
 
+#include <sys/mman.h>
 #include <rte_malloc.h>
 #include <ethdev_driver.h>
 #include <rte_log.h>
+#include <rte_eal_paging.h>
 #include <stdlib.h>
+#include <unistd.h>
 
 #include <infiniband/verbs.h>
 
@@ -119,6 +122,23 @@ mana_mp_primary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 	return ret;
 }
 
+static int
+mana_mp_reset_enter(struct rte_eth_dev *dev)
+{
+	struct mana_process_priv *proc_priv = dev->process_private;
+
+	void *addr = proc_priv->db_page;
+
+	/* Reset the db_page to NULL */
+	proc_priv->db_page = NULL;
+
+	if (addr)
+		(void)munmap(addr, rte_mem_page_size());
+
+	DRV_LOG(DEBUG, "Secondary doorbell pages unmapped");
+	return 0;
+}
+
 static int
 mana_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 {
@@ -171,6 +191,52 @@ mana_mp_secondary_handle(const struct rte_mp_msg *mp_msg, const void *peer)
 		ret = rte_mp_reply(&mp_res, peer);
 		break;
 
+	case MANA_MP_REQ_RESET_ENTER:
+		DRV_LOG(INFO, "Port %u reset enter", dev->data->port_id);
+		res->result = mana_mp_reset_enter(dev);
+
+		ret = rte_mp_reply(&mp_res, peer);
+		break;
+
+	case MANA_MP_REQ_RESET_EXIT:
+		DRV_LOG(INFO, "Port %u reset exit", dev->data->port_id);
+		{
+			struct mana_process_priv *proc_priv =
+				dev->process_private;
+
+			if (proc_priv->db_page != NULL) {
+				DRV_LOG(DEBUG,
+					"Secondary doorbell already "
+					"mapped to %p",
+					proc_priv->db_page);
+				res->result = 0;
+			} else if (mp_msg->num_fds < 1) {
+				DRV_LOG(ERR,
+					"No FD in RESET_EXIT message");
+				res->result = -EINVAL;
+			} else {
+				ret = mana_map_doorbell_secondary(dev,
+							mp_msg->fds[0]);
+				if (ret) {
+					DRV_LOG(ERR,
+						"Failed secondary "
+						"doorbell map %d",
+						mp_msg->fds[0]);
+					res->result = -ENODEV;
+				} else {
+					res->result = 0;
+				}
+			}
+
+			/* Close the fd whenever present, even if
+			 * db_page was already mapped.
+			 */
+			if (mp_msg->num_fds >= 1)
+				close(mp_msg->fds[0]);
+		}
+		ret = rte_mp_reply(&mp_res, peer);
+		break;
+
 	default:
 		DRV_LOG(ERR, "Port %u unknown secondary MP type %u",
 			param->port_id, param->type);
@@ -254,7 +320,7 @@ mana_mp_req_verbs_cmd_fd(struct rte_eth_dev *dev)
 	}
 
 	ret = mp_res->fds[0];
-	DRV_LOG(ERR, "port %u command FD from primary is %d",
+	DRV_LOG(DEBUG, "port %u command FD from primary is %d",
 		dev->data->port_id, ret);
 exit:
 	free(mp_rep.msgs);
@@ -298,27 +364,36 @@ mana_mp_req_mr_create(struct mana_priv *priv, uintptr_t addr, uint32_t len)
 	return ret;
 }
 
-void
+int
 mana_mp_req_on_rxtx(struct rte_eth_dev *dev, enum mana_mp_req_type type)
 {
 	struct rte_mp_msg mp_req = { 0 };
 	struct rte_mp_msg *mp_res;
-	struct rte_mp_reply mp_rep;
+	struct rte_mp_reply mp_rep = { 0 };
 	struct mana_mp_param *res;
 	struct timespec ts = {.tv_sec = MANA_MP_REQ_TIMEOUT_SEC, .tv_nsec = 0};
-	int i, ret;
+	int i, ret = 0;
 
-	if (type != MANA_MP_REQ_START_RXTX && type != MANA_MP_REQ_STOP_RXTX) {
+	if (type != MANA_MP_REQ_START_RXTX && type != MANA_MP_REQ_STOP_RXTX &&
+	    type != MANA_MP_REQ_RESET_ENTER && type != MANA_MP_REQ_RESET_EXIT) {
 		DRV_LOG(ERR, "port %u unknown request (req_type %d)",
 			dev->data->port_id, type);
-		return;
+		return -EINVAL;
 	}
 
 	if (rte_atomic_load_explicit(&mana_shared_data->secondary_cnt, rte_memory_order_relaxed) == 0)
-		return;
+		return 0;
 
 	mp_init_msg(&mp_req, type, dev->data->port_id);
 
+	/* Include IB cmd FD for secondary doorbell remap */
+	if (type == MANA_MP_REQ_RESET_EXIT) {
+		struct mana_priv *priv = dev->data->dev_private;
+
+		mp_req.num_fds = 1;
+		mp_req.fds[0] = priv->ib_ctx->cmd_fd;
+	}
+
 	ret = rte_mp_request_sync(&mp_req, &mp_rep, &ts);
 	if (ret) {
 		if (rte_errno != ENOTSUP)
@@ -329,6 +404,7 @@ mana_mp_req_on_rxtx(struct rte_eth_dev *dev, enum mana_mp_req_type type)
 	if (mp_rep.nb_sent != mp_rep.nb_received) {
 		DRV_LOG(ERR, "port %u not all secondaries responded (%d)",
 			dev->data->port_id, type);
+		ret = -ETIMEDOUT;
 		goto exit;
 	}
 	for (i = 0; i < mp_rep.nb_received; i++) {
@@ -337,9 +413,11 @@ mana_mp_req_on_rxtx(struct rte_eth_dev *dev, enum mana_mp_req_type type)
 		if (res->result) {
 			DRV_LOG(ERR, "port %u request failed on secondary %d",
 				dev->data->port_id, i);
+			ret = res->result;
 			goto exit;
 		}
 	}
 exit:
 	free(mp_rep.msgs);
+	return ret;
 }
diff --git a/drivers/net/mana/mr.c b/drivers/net/mana/mr.c
index c4045141bc..8914f4cf04 100644
--- a/drivers/net/mana/mr.c
+++ b/drivers/net/mana/mr.c
@@ -314,8 +314,10 @@ mana_mr_btree_init(struct mana_mr_btree *bt, int n, int socket)
 void
 mana_mr_btree_free(struct mana_mr_btree *bt)
 {
-	rte_free(bt->table);
-	memset(bt, 0, sizeof(*bt));
+	if (bt && bt->table) {
+		rte_free(bt->table);
+		memset(bt, 0, sizeof(*bt));
+	}
 }
 
 int
diff --git a/drivers/net/mana/rx.c b/drivers/net/mana/rx.c
index 1b8ba1f3a9..aedb05d46f 100644
--- a/drivers/net/mana/rx.c
+++ b/drivers/net/mana/rx.c
@@ -36,6 +36,11 @@ mana_rq_ring_doorbell(struct mana_rxq *rxq)
 		db_page = process_priv->db_page;
 	}
 
+	if (!db_page) {
+		DP_LOG(ERR, "db_page is NULL, cannot ring RX doorbell");
+		return -EINVAL;
+	}
+
 	/* Hardware Spec specifies that software client should set 0 for
 	 * wqe_cnt for Receive Queues.
 	 */
@@ -172,7 +177,7 @@ mana_stop_rx_queues(struct rte_eth_dev *dev)
 
 	for (i = 0; i < priv->num_queues; i++)
 		if (dev->data->rx_queue_state[i] == RTE_ETH_QUEUE_STATE_STOPPED)
-			return -EINVAL;
+			return 0;
 
 	if (priv->rwq_qp) {
 		ret = ibv_destroy_qp(priv->rwq_qp);
@@ -256,6 +261,9 @@ mana_start_rx_queues(struct rte_eth_dev *dev)
 		struct mana_rxq *rxq = dev->data->rx_queues[i];
 		struct ibv_wq_init_attr wq_attr = {};
 
+		rxq->rxq_idx = i;
+		DRV_LOG(DEBUG, "assigning rxq_idx to %d", i);
+
 		manadv_set_context_attr(priv->ib_ctx,
 			MANADV_CTX_ATTR_BUF_ALLOCATORS,
 			(void *)((uintptr_t)&(struct manadv_ctx_allocators){
@@ -451,6 +459,16 @@ mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 	uint32_t pkt_len;
 	uint32_t i;
 	int polled = 0;
+	uint32_t expected = 0;
+
+	/* Single atomic CAS: enter burst only if device is active (0→1).
+	 * Fails immediately if reset path has set the blocked bit.
+	 */
+	if (unlikely(!rte_atomic_compare_exchange_strong_explicit(
+			&rxq->burst_state, &expected, 1,
+			rte_memory_order_acquire,
+			rte_memory_order_relaxed)))
+		return 0;
 
 repoll:
 	/* Polling on new completions if we have no backlog */
@@ -592,6 +610,9 @@ mana_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
 				wqe_consumed, ret);
 	}
 
+	rte_atomic_fetch_and_explicit(&rxq->burst_state, ~(uint32_t)1,
+				     rte_memory_order_release);
+
 	return pkt_received;
 }
 
diff --git a/drivers/net/mana/tx.c b/drivers/net/mana/tx.c
index 57dbbc3651..10f2212b5d 100644
--- a/drivers/net/mana/tx.c
+++ b/drivers/net/mana/tx.c
@@ -17,7 +17,7 @@ mana_stop_tx_queues(struct rte_eth_dev *dev)
 
 	for (i = 0; i < priv->num_queues; i++)
 		if (dev->data->tx_queue_state[i] == RTE_ETH_QUEUE_STATE_STOPPED)
-			return -EINVAL;
+			return 0;
 
 	for (i = 0; i < priv->num_queues; i++) {
 		struct mana_txq *txq = dev->data->tx_queues[i];
@@ -83,6 +83,9 @@ mana_start_tx_queues(struct rte_eth_dev *dev)
 
 		txq = dev->data->tx_queues[i];
 
+		txq->txq_idx = i;
+		DRV_LOG(DEBUG, "assigning txq_idx to %d", txq->txq_idx);
+
 		manadv_set_context_attr(priv->ib_ctx,
 			MANADV_CTX_ATTR_BUF_ALLOCATORS,
 			(void *)((uintptr_t)&(struct manadv_ctx_allocators){
@@ -190,10 +193,34 @@ mana_tx_burst(void *dpdk_txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 	void *db_page;
 	uint16_t pkt_sent = 0;
 	uint32_t num_comp, i;
+	uint32_t expected = 0;
 #ifdef RTE_ARCH_32
 	uint32_t wqe_count = 0;
 #endif
 
+	db_page = priv->db_page;
+	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
+		struct rte_eth_dev *dev =
+			&rte_eth_devices[priv->dev_data->port_id];
+		struct mana_process_priv *process_priv = dev->process_private;
+
+		db_page = process_priv->db_page;
+	}
+
+	/* Single atomic CAS: enter burst only if device is active (0→1).
+	 * Fails immediately if reset path has set the blocked bit.
+	 */
+	if (unlikely(!rte_atomic_compare_exchange_strong_explicit(
+			&txq->burst_state, &expected, 1,
+			rte_memory_order_acquire,
+			rte_memory_order_relaxed) || !db_page)) {
+		if (!expected) /* CAS succeeded but db_page NULL — undo */
+			rte_atomic_fetch_and_explicit(&txq->burst_state,
+						      ~(uint32_t)1,
+						      rte_memory_order_release);
+		return 0;
+	}
+
 	/* Process send completions from GDMA */
 	num_comp = gdma_poll_completion_queue(&txq->gdma_cq,
 			txq->gdma_comp_buf, txq->num_desc);
@@ -216,7 +243,8 @@ mana_tx_burst(void *dpdk_txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 		}
 
 		if (!desc->pkt) {
-			DP_LOG(ERR, "mana_txq_desc has a NULL pkt");
+			DP_LOG(ERR, "mana_txq_desc has a NULL pkt, priv %p, "
+			       "txq = %d", priv, txq->txq_idx);
 		} else {
 			txq->stats.bytes += desc->pkt->pkt_len;
 			rte_pktmbuf_free(desc->pkt);
@@ -474,15 +502,6 @@ mana_tx_burst(void *dpdk_txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 	}
 
 	/* Ring hardware door bell */
-	db_page = priv->db_page;
-	if (rte_eal_process_type() == RTE_PROC_SECONDARY) {
-		struct rte_eth_dev *dev =
-			&rte_eth_devices[priv->dev_data->port_id];
-		struct mana_process_priv *process_priv = dev->process_private;
-
-		db_page = process_priv->db_page;
-	}
-
 	if (pkt_sent) {
 #ifdef RTE_ARCH_32
 		ret = mana_ring_short_doorbell(db_page, GDMA_QUEUE_SEND,
@@ -501,5 +520,8 @@ mana_tx_burst(void *dpdk_txq, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 			DP_LOG(ERR, "mana_ring_doorbell failed ret %d", ret);
 	}
 
+	rte_atomic_fetch_and_explicit(&txq->burst_state, ~(uint32_t)1,
+				     rte_memory_order_release);
+
 	return pkt_sent;
 }
-- 
2.34.1


^ permalink raw reply related

* [PATCH v10 0/1] net/mana: add device reset support
From: Wei Hu @ 2026-06-16 12:31 UTC (permalink / raw)
  To: dev, stephen; +Cc: longli, weh

From: Wei Hu <weh@microsoft.com>

Add support for handling hardware service reset events in the
MANA driver. When the MANA kernel driver receives a hardware
service event, it initiates a device reset and notifies userspace
via IBV_EVENT_DEVICE_FATAL. The MANA PMD handles this by
performing an automatic teardown and recovery sequence.

The driver uses ethdev recovery events (ERR_RECOVERING,
RECOVERY_SUCCESS, RECOVERY_FAILED) to notify upper layers of
the reset lifecycle, and a PCI device removal event callback
to distinguish hot-remove from service reset.

Changes since v9:
- Fixed fd leak in the secondary RESET_EXIT IPC handler: when
  the doorbell page was already mapped, the fd from the message
  was not closed. Moved close(fd) outside the if/else so it runs
  unconditionally whenever an fd is present.

Changes since v8:
- Fixed reset thread resource leak: previously reset_thread_active
  was cleared before emitting recovery callbacks, so no join site
  would reap the thread. Now the flag stays true throughout the
  thread lifetime. mana_join_reset_thread detects the self-join
  case (callback calling dev_stop/dev_close from the reset thread)
  using rte_thread_equal and calls rte_thread_detach instead of
  join, so thread resources are freed on exit. External callers
  continue to join normally.
- Fixed lost condvar signal: added a predicate loop around
  pthread_cond_timedwait that checks dev_state under
  reset_cond_mutex. If mana_pci_remove_event_cb signals before
  the reset thread enters the wait, the wakeup is no longer lost.
  The PCI remove callback sets dev_state to RESET_FAILED under
  the same mutex before signaling.
- Added a lock/unlock barrier on reset_ops_lock in
  mana_pci_remove_event_cb to ensure teardown has completed
  before emitting the INTR_RMV event.
- Fixed mana_reset_exit_delay return type from uint32_t to int
  to match the negative error codes it stores.
- Removed unnecessary else-after-goto in mana_probe_port.

Changes since v7:
- Moved heavy teardown (dev_stop, IPC to secondaries, dev_close,
  MR btree free) from mana_reset_enter (EAL interrupt thread)
  to mana_reset_thread (control thread). The interrupt handler
  now only sets state, drains in-flight bursts, and spawns the
  thread. Teardown runs immediately in the control thread before
  the recovery timer wait, avoiding blocking the interrupt thread
  on multi-second IPC timeouts and ibverbs calls. Each function
  now owns its own lock scope with no lock hand-off between
  threads.
- Simplified burst_state from encoding device state in bits 1+
  to a single blocked flag (bit 1). Only one value was ever
  stored, so the multi-state encoding was misleading. Added
  MANA_BURST_BLOCKED constant.
- Updated mana.rst to reflect that teardown runs on the control
  thread, not the interrupt handler.

Changes since v6:
- Rebased onto latest upstream for-main
- Replaced removed RTE_ETH_DEV_TO_PCI macro with
  RTE_CLASS_TO_BUS_DEVICE (upstream commit 4757b8df04
  removed the old bus-specific ethdev convenience macros)

Changes since v5:
- Replaced RCU QSBR with per-queue atomic burst_state using a
  single-variable CAS design: bit 0 is the in-burst flag, bit 1
  is the blocked flag. The data path uses CAS(0→1) to enter
  burst and fetch_and(~1) to exit. The reset path uses fetch_or
  to set the blocked bit and polls bit 0 to drain in-flight
  bursts. This eliminates the two-variable Dekker pattern and the
  need for sequential consistency (seq_cst) ordering.
- Removed librte_rcu dependency
- Removed __rte_no_thread_safety_analysis annotations (no longer
  needed after mutex conversion)
- Moved ERR_RECOVERING event emission before acquiring
  reset_ops_lock and before mana_reset_enter, so upper layers
  (e.g. netvsc) can switch data path before mana stops queues.
  Emitting outside the lock avoids deadlock if the callback
  calls dev_stop or dev_close.
- Replaced MANA_OPS_*_LOCK macros with mana_reset_trylock()
  helper function and explicit per-operation wrappers
- Removed unused rte_alarm.h and rte_lock_annotations.h includes
- Added RECOVERY_FAILED event when mana_reset_enter fails
  internally, so the application always receives a terminal event
- Added mana_clear_burst_state() helper to clear per-queue
  burst_state on failure paths (reset_failed, dev_stop_lock,
  dev_close_lock) preventing permanent silent packet drop after
  a failed reset
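
The single-variable burst_state scheme described above can be sketched
with C11 atomics as follows. This is a simplified standalone
illustration rather than the driver code: the helper names and
BURST_IN_FLIGHT are hypothetical, standard <stdatomic.h> calls stand in
for the rte_atomic_* wrappers, and the memory ordering used on the reset
side is an assumption.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BURST_IN_FLIGHT 1u	/* bit 0: a burst is running (hypothetical name) */
#define BURST_BLOCKED   2u	/* bit 1: the reset path has blocked the queue */

/* Data path: enter only if the state is exactly 0 (idle, not blocked). */
static bool
burst_enter(_Atomic uint32_t *state)
{
	uint32_t expected = 0;

	return atomic_compare_exchange_strong_explicit(state, &expected,
			BURST_IN_FLIGHT,
			memory_order_acquire, memory_order_relaxed);
}

/* Data path: clear bit 0 on exit, leaving the blocked bit untouched. */
static void
burst_exit(_Atomic uint32_t *state)
{
	atomic_fetch_and_explicit(state, ~BURST_IN_FLIGHT,
				  memory_order_release);
}

/* Reset path: set the blocked bit; returns true if a burst was in flight. */
static bool
burst_block(_Atomic uint32_t *state)
{
	uint32_t old = atomic_fetch_or_explicit(state, BURST_BLOCKED,
						memory_order_acq_rel);

	return (old & BURST_IN_FLIGHT) != 0;
}

/* Reset path: poll this until it returns true to drain in-flight bursts. */
static bool
burst_drained(_Atomic uint32_t *state)
{
	return (atomic_load_explicit(state, memory_order_acquire) &
		BURST_IN_FLIGHT) == 0;
}
```

Once the blocked bit is set, the state can never again be exactly 0, so
every later burst_enter() fails until the reset path clears the word.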

Changes since v4:
- Fixed stale rte_spinlock_unlock call in mana_intr_handler that
  was missed during the spinlock-to-mutex conversion, causing a
  -Wincompatible-pointer-types warning

Changes since v3:
- Converted reset_ops_lock from rte_spinlock_t to pthread_mutex_t
  with PTHREAD_PROCESS_SHARED, since the lock is held across
  blocking IB verbs calls and IPC with 5s timeout
- Removed rte_dev_event_callback_unregister retry loop to avoid
  deadlock when interrupt thread and reset thread contend

Changes since v2:
- Added per-queue burst_state atomic variable with Dekker-like
  synchronization to block data path during reset without RCU
- Replaced rte_alarm with condvar + control thread for reset exit
- Made reset_thread_active atomic with CAS — flag is set by
  creator and only cleared by the joiner, not the thread itself
- Fixed second reset crash: removed reset thread join logic from
  mana_dev_close (inner function) to avoid corrupting dev_state
  when called from mana_reset_enter
- Made reset_thread_active RTE_ATOMIC(bool) with explicit ordering
- Added retry loop for rte_dev_event_callback_unregister on -EAGAIN
- Initialized condvar/mutex with PTHREAD_PROCESS_SHARED since priv
  is in hugepage shared memory
- Added re-check of dev_state after lock acquisition in
  mana_intr_handler to prevent racing with pci_remove_event_cb
- Replaced (void *)0 with NULL in mp.c
- Added lock ownership comment block at mana_reset_enter
- Documented rte_dev_event_monitor_start() requirement
- Added mana.rst documentation and release note

Changes since v1:
- Removed net/netvsc patch from this series
- Simplified reset exit: mana_reset_exit calls
  mana_reset_exit_delay directly instead of spawning a thread
- Added __rte_no_thread_safety_analysis annotations for clang
- Switched to rte_thread_create_internal_control
- Fixed declaration-after-statement style issues
- Removed unnecessary blank lines and stale comments

Wei Hu (1):
  net/mana: add device reset support

 doc/guides/nics/mana.rst               |   40 +
 doc/guides/rel_notes/release_26_07.rst |    8 +
 drivers/net/mana/mana.c                | 1088 ++++++++++++++++++++++--
 drivers/net/mana/mana.h                |   52 +-
 drivers/net/mana/mp.c                  |   92 +-
 drivers/net/mana/mr.c                  |    6 +-
 drivers/net/mana/rx.c                  |   23 +-
 drivers/net/mana/tx.c                  |   44 +-
 8 files changed, 1245 insertions(+), 108 deletions(-)

-- 
2.34.1


^ permalink raw reply

* [PATCH] drivers: update relaxed ordering policy for mlx5 mkeys
From: Maayan Kashani @ 2026-06-16 12:23 UTC (permalink / raw)
  To: dev
  Cc: mkashani, rasland, Viacheslav Ovsiienko, Dariusz Sosnowski,
	Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad

New adapters expose additional ordering capabilities.
Query the new caps and apply them when creating DevX mkeys via
mlx5_devx_mkey_attr_set_ordering(), which sets PCI relaxed ordering
and RAW=RO when relaxed order is supported.
Use this helper on Windows (still gated by the Haswell/Broadwell check)
and, when the relaxed-order-only flag is set, for Linux wrapped mkeys
and the crypto/regex/vdpa mkeys.
Linux wrapped mkeys continue to use the legacy Haswell/Broadwell rule for
IBV_ACCESS_RELAXED_ORDERING on the verbs MR.
Upcoming FW will require the correct ordering attributes to be set;
otherwise it fails to create the memory key.

Signed-off-by: Maayan Kashani <mkashani@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.c   |  6 ++++
 drivers/common/mlx5/mlx5_devx_cmds.c         | 31 ++++++++++++++++++++
 drivers/common/mlx5/mlx5_devx_cmds.h         |  9 ++++++
 drivers/common/mlx5/mlx5_prm.h               | 18 ++++++++++--
 drivers/common/mlx5/windows/mlx5_common_os.c |  8 ++---
 drivers/crypto/mlx5/mlx5_crypto.c            |  4 +++
 drivers/regex/mlx5/mlx5_regex_fastpath.c     |  5 ++++
 drivers/regex/mlx5/mlx5_rxp.c                |  4 +++
 drivers/vdpa/mlx5/mlx5_vdpa_mem.c            |  4 +++
 9 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index e3db6c41245..153709390d9 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -997,6 +997,7 @@ int
 mlx5_os_wrapped_mkey_create(void *ctx, void *pd, uint32_t pdn, void *addr,
 			    size_t length, struct mlx5_pmd_wrapped_mr *pmd_mr)
 {
+	struct mlx5_hca_attr hca_attr = { 0 };
 	struct mlx5_klm klm = {
 		.byte_count = length,
 		.address = (uintptr_t)addr,
@@ -1019,6 +1020,11 @@ mlx5_os_wrapped_mkey_create(void *ctx, void *pd, uint32_t pdn, void *addr,
 	klm.mkey = ibv_mr->lkey;
 	mkey_attr.addr = (uintptr_t)addr;
 	mkey_attr.size = length;
+	if (mlx5_devx_cmd_query_hca_attr(ctx, &hca_attr))
+		return -1;
+	/* If only relaxed order is allowed. */
+	if (hca_attr.mkc_order_write_after_write_ro_only)
+		mlx5_devx_mkey_attr_set_ordering(&mkey_attr, &hca_attr);
 	mkey = mlx5_devx_cmd_mkey_create(ctx, &mkey_attr);
 	if (!mkey) {
 		claim_zero(mlx5_glue->dereg_mr(ibv_mr));
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.c b/drivers/common/mlx5/mlx5_devx_cmds.c
index c4ac2aaceed..140b057ab47 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.c
+++ b/drivers/common/mlx5/mlx5_devx_cmds.c
@@ -331,6 +331,29 @@ mlx5_devx_cmd_flow_counter_query(struct mlx5_devx_obj *dcs,
 	return 0;
 }
 
+/**
+ * Apply PCI relaxed-ordering and read-after-write ordering to mkey attributes.
+ *
+ * @param[in, out] mkey_attr
+ *   Mkey attributes to update.
+ * @param[in] hca_attr
+ *   HCA capabilities from mlx5_devx_cmd_query_hca_attr().
+ */
+RTE_EXPORT_INTERNAL_SYMBOL(mlx5_devx_mkey_attr_set_ordering)
+void
+mlx5_devx_mkey_attr_set_ordering(struct mlx5_devx_mkey_attr *mkey_attr,
+				 const struct mlx5_hca_attr *hca_attr)
+{
+	if (!mkey_attr || !hca_attr)
+		return;
+
+	mkey_attr->relaxed_ordering_write = hca_attr->relaxed_ordering_write;
+	mkey_attr->relaxed_ordering_read =
+		hca_attr->relaxed_ordering_read || hca_attr->pci_relaxed_ordered_read;
+	if (hca_attr->mkc_order_read_after_write)
+		mkey_attr->read_after_write_ordering = MLX5_MKC_RAW_ORDERING_RO;
+}
+
 /**
  * Create a new mkey.
  *
@@ -417,6 +440,8 @@ mlx5_devx_cmd_mkey_create(void *ctx,
 	MLX5_SET(mkc, mkc, relaxed_ordering_write,
 		 attr->relaxed_ordering_write);
 	MLX5_SET(mkc, mkc, relaxed_ordering_read, attr->relaxed_ordering_read);
+	MLX5_SET(mkc, mkc, order_read_after_write,
+		 attr->read_after_write_ordering);
 	MLX5_SET64(mkc, mkc, start_addr, attr->addr);
 	MLX5_SET64(mkc, mkc, len, attr->size);
 	MLX5_SET(mkc, mkc, crypto_en, attr->crypto_en);
@@ -1003,6 +1028,12 @@ mlx5_devx_cmd_query_hca_attr(void *ctx,
 						relaxed_ordering_write);
 	attr->relaxed_ordering_read = MLX5_GET(cmd_hca_cap, hcattr,
 					       relaxed_ordering_read);
+	attr->pci_relaxed_ordered_read = MLX5_GET(cmd_hca_cap, hcattr,
+						  pci_relaxed_ordered_read);
+	attr->mkc_order_read_after_write = MLX5_GET(cmd_hca_cap, hcattr,
+						    mkc_order_read_after_write);
+	attr->mkc_order_write_after_write_ro_only = MLX5_GET(cmd_hca_cap, hcattr,
+							     mkc_order_write_after_write_ro_only);
 	attr->access_register_user = MLX5_GET(cmd_hca_cap, hcattr,
 					      access_register_user);
 	attr->eth_net_offloads = MLX5_GET(cmd_hca_cap, hcattr,
diff --git a/drivers/common/mlx5/mlx5_devx_cmds.h b/drivers/common/mlx5/mlx5_devx_cmds.h
index 82d949972bb..90beb2e9e6c 100644
--- a/drivers/common/mlx5/mlx5_devx_cmds.h
+++ b/drivers/common/mlx5/mlx5_devx_cmds.h
@@ -34,6 +34,7 @@ struct mlx5_devx_mkey_attr {
 	uint32_t pg_access:1;
 	uint32_t relaxed_ordering_write:1;
 	uint32_t relaxed_ordering_read:1;
+	uint32_t read_after_write_ordering:2;
 	uint32_t umr_en:1;
 	uint32_t crypto_en:2;
 	uint32_t set_remote_rw:1;
@@ -237,6 +238,9 @@ struct mlx5_hca_attr {
 	uint32_t vhca_id:16;
 	uint32_t relaxed_ordering_write:1;
 	uint32_t relaxed_ordering_read:1;
+	uint32_t pci_relaxed_ordered_read:1;
+	uint32_t mkc_order_read_after_write:1;
+	uint32_t mkc_order_write_after_write_ro_only:1;
 	uint32_t access_register_user:1;
 	uint32_t wqe_index_ignore:1;
 	uint32_t cross_channel:1;
@@ -748,6 +752,11 @@ int mlx5_devx_cmd_query_hca_attr(void *ctx,
 __rte_internal
 struct mlx5_devx_obj *mlx5_devx_cmd_mkey_create(void *ctx,
 					      struct mlx5_devx_mkey_attr *attr);
+
+__rte_internal
+void
+mlx5_devx_mkey_attr_set_ordering(struct mlx5_devx_mkey_attr *mkey_attr,
+				 const struct mlx5_hca_attr *hca_attr);
 __rte_internal
 int mlx5_devx_get_out_command_status(void *out);
 __rte_internal
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 3bb072a7fec..c2810194f8e 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -1463,7 +1463,9 @@ struct mlx5_ifc_mkc_bits {
 	u8 bsf_octword_size[0x20];
 	u8 reserved_at_120[0x80];
 	u8 translations_octword_size[0x20];
-	u8 reserved_at_1c0[0x19];
+	u8 reserved_at_1c0[0x16];
+	u8 order_read_after_write[0x2];
+	u8 reserved_at_1d8[0x1];
 	u8 relaxed_ordering_read[0x1];
 	u8 reserved_at_1da[0x1];
 	u8 log_page_size[0x5];
@@ -1478,6 +1480,13 @@ enum {
 	MLX5_MKEY_CRYPTO_ENABLED = 0x1,
 };
 
+/* MKC read_after_write_ordering field (2-bit, dword 0x38 bits 9:8). */
+enum mlx5_mkc_raw_ordering {
+	MLX5_MKC_RAW_ORDERING_SO = 0x0,
+	MLX5_MKC_RAW_ORDERING_SAO = 0x1,
+	MLX5_MKC_RAW_ORDERING_RO = 0x2,
+};
+
 struct mlx5_ifc_create_mkey_out_bits {
 	u8 status[0x8];
 	u8 reserved_at_8[0x18];
@@ -1827,7 +1836,8 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8 log_max_mcg[0x8];
 	u8 reserved_at_320[0x3];
 	u8 log_max_transport_domain[0x5];
-	u8 reserved_at_328[0x3];
+	u8 reserved_at_328[0x2];
+	u8 pci_relaxed_ordered_read[0x1];
 	u8 log_max_pd[0x5];
 	u8 reserved_at_330[0xb];
 	u8 log_max_xrcd[0x5];
@@ -1860,7 +1870,9 @@ struct mlx5_ifc_cmd_hca_cap_bits {
 	u8 ext_stride_num_range[0x1];
 	u8 reserved_at_3a1[0x2];
 	u8 log_max_stride_sz_rq[0x5];
-	u8 reserved_at_3a8[0x3];
+	u8 mkc_order_read_after_write[0x1];
+	u8 mkc_order_write_after_write_ro_only[0x1];
+	u8 reserved_at_3aa[0x1];
 	u8 log_min_stride_sz_rq[0x5];
 	u8 reserved_at_3b0[0x3];
 	u8 log_max_stride_sz_sq[0x5];
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.c b/drivers/common/mlx5/windows/mlx5_common_os.c
index c790c9a4aeb..bdafb95df98 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.c
+++ b/drivers/common/mlx5/windows/mlx5_common_os.c
@@ -384,7 +384,7 @@ mlx5_os_reg_mr(void *pd,
 {
 	struct mlx5_devx_mkey_attr mkey_attr;
 	struct mlx5_pd *mlx5_pd = (struct mlx5_pd *)pd;
-	struct mlx5_hca_attr attr;
+	struct mlx5_hca_attr attr = { 0 };
 	struct mlx5_devx_obj *mkey;
 	void *obj;
 
@@ -403,10 +403,8 @@ mlx5_os_reg_mr(void *pd,
 	mkey_attr.size = length;
 	mkey_attr.umem_id = ((struct mlx5_devx_umem *)(obj))->umem_id;
 	mkey_attr.pd = mlx5_pd->pdn;
-	if (!mlx5_haswell_broadwell_cpu) {
-		mkey_attr.relaxed_ordering_write = attr.relaxed_ordering_write;
-		mkey_attr.relaxed_ordering_read = attr.relaxed_ordering_read;
-	}
+	if (!mlx5_haswell_broadwell_cpu)
+		mlx5_devx_mkey_attr_set_ordering(&mkey_attr, &attr);
 	mkey = mlx5_devx_cmd_mkey_create(mlx5_pd->devx_ctx, &mkey_attr);
 	if (!mkey) {
 		claim_zero(mlx5_os_umem_dereg(obj));
diff --git a/drivers/crypto/mlx5/mlx5_crypto.c b/drivers/crypto/mlx5/mlx5_crypto.c
index dd0aabb6d75..448dd0c5a4e 100644
--- a/drivers/crypto/mlx5/mlx5_crypto.c
+++ b/drivers/crypto/mlx5/mlx5_crypto.c
@@ -97,7 +97,11 @@ mlx5_crypto_indirect_mkeys_prepare(struct mlx5_crypto_priv *priv,
 				   mlx5_crypto_mkey_update_t update_cb)
 {
 	uint32_t i;
+	struct mlx5_hca_attr *hca_attr = &priv->cdev->config.hca_attr;
 
+	/* If only relaxed order is allowed. */
+	if (hca_attr->mkc_order_write_after_write_ro_only)
+		mlx5_devx_mkey_attr_set_ordering(attr, hca_attr);
 	for (i = 0; i < qp->entries_n; i++) {
 		attr->klm_array = update_cb(priv, qp, i);
 		qp->mkey[i] = mlx5_devx_cmd_mkey_create(priv->cdev->ctx, attr);
diff --git a/drivers/regex/mlx5/mlx5_regex_fastpath.c b/drivers/regex/mlx5/mlx5_regex_fastpath.c
index 3207bcbc603..55f7411593a 100644
--- a/drivers/regex/mlx5/mlx5_regex_fastpath.c
+++ b/drivers/regex/mlx5/mlx5_regex_fastpath.c
@@ -755,9 +755,14 @@ mlx5_regexdev_setup_fastpath(struct mlx5_regex_priv *priv, uint32_t qp_id)
 	setup_qps(priv, qp);
 
 	if (priv->has_umr) {
+		struct mlx5_hca_attr *hca_attr = &priv->cdev->config.hca_attr;
+
 #ifdef HAVE_IBV_FLOW_DV_SUPPORT
 		attr.pd = priv->cdev->pdn;
 #endif
+		/* If only relaxed order is allowed. */
+		if (hca_attr->mkc_order_write_after_write_ro_only)
+			mlx5_devx_mkey_attr_set_ordering(&attr, hca_attr);
 		for (i = 0; i < qp->nb_desc; i++) {
 			attr.klm_num = MLX5_REGEX_MAX_KLM_NUM;
 			attr.klm_array = qp->jobs[i].imkey_array;
diff --git a/drivers/regex/mlx5/mlx5_rxp.c b/drivers/regex/mlx5/mlx5_rxp.c
index dda4a7fdb0b..b865c08b53c 100644
--- a/drivers/regex/mlx5/mlx5_rxp.c
+++ b/drivers/regex/mlx5/mlx5_rxp.c
@@ -54,6 +54,7 @@ rxp_create_mkey(struct mlx5_regex_priv *priv, void *ptr, size_t size,
 	uint32_t access, struct mlx5_regex_mkey *mkey)
 {
 	struct mlx5_devx_mkey_attr mkey_attr;
+	struct mlx5_hca_attr *hca_attr = &priv->cdev->config.hca_attr;
 
 	/* Register the memory. */
 	mkey->umem = mlx5_glue->devx_umem_reg(priv->cdev->ctx, ptr, size, access);
@@ -72,6 +73,9 @@ rxp_create_mkey(struct mlx5_regex_priv *priv, void *ptr, size_t size,
 #ifdef HAVE_IBV_FLOW_DV_SUPPORT
 	mkey_attr.pd = priv->cdev->pdn;
 #endif
+	/* If only relaxed order is allowed. */
+	if (hca_attr->mkc_order_write_after_write_ro_only)
+		mlx5_devx_mkey_attr_set_ordering(&mkey_attr, hca_attr);
 	mkey->mkey = mlx5_devx_cmd_mkey_create(priv->cdev->ctx, &mkey_attr);
 	if (!mkey->mkey) {
 		DRV_LOG(ERR, "Failed to create direct mkey!");
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_mem.c b/drivers/vdpa/mlx5/mlx5_vdpa_mem.c
index 4dfe800b8fc..8c9d169d2a8 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_mem.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_mem.c
@@ -179,6 +179,7 @@ static int
 mlx5_vdpa_create_indirect_mkey(struct mlx5_vdpa_priv *priv)
 {
 	struct mlx5_devx_mkey_attr mkey_attr;
+	struct mlx5_hca_attr *hca_attr = &priv->cdev->config.hca_attr;
 	struct mlx5_vdpa_query_mr *mrs =
 		(struct mlx5_vdpa_query_mr *)priv->mrs;
 	struct mlx5_vdpa_query_mr *entry;
@@ -242,6 +243,9 @@ mlx5_vdpa_create_indirect_mkey(struct mlx5_vdpa_priv *priv)
 	mkey_attr.pg_access = 0;
 	mkey_attr.klm_array = klm_array;
 	mkey_attr.klm_num = klm_index;
+	/* If only relaxed order is allowed. */
+	if (hca_attr->mkc_order_write_after_write_ro_only)
+		mlx5_devx_mkey_attr_set_ordering(&mkey_attr, hca_attr);
 	entry = &mrs[mem->nregions];
 	entry->mkey = mlx5_devx_cmd_mkey_create(priv->cdev->ctx, &mkey_attr);
 	if (!entry->mkey) {
-- 
2.21.0


^ permalink raw reply related

* [PATCH v6 21/21] net/txgbe: fix temperature track for AML NIC
From: Zaiyu Wang @ 2026-06-16 12:20 UTC (permalink / raw)
  To: dev; +Cc: Zaiyu Wang, stable, Jiawen Wu
In-Reply-To: <20260616122030.9688-1-zaiyuwang@trustnetic.com>

Previously, temperature tracking for the amlite NIC was handled by
firmware together with the hardware setup. However, the firmware-based
PHY configuration has proven to be unstable.

Re-add the temperature tracking function directly in the driver and
invoke it periodically to ensure the PHY remains calibrated. According
to the hardware recommendation, the tracking sequence should be run at
least every 100 ms to keep temperature drift within 5 °C. Considering
the software and hardware overhead, a 2-second interval is used as a
practical trade-off that still meets stability requirements while
minimizing performance impact.

The periodic tracking is implemented using a timer in the driver, and
the sequence itself is the same as the one originally performed during
link setup.

Fixes: fb6eb170dfa2 ("net/txgbe: add basic link configuration for Amber-Lite")
Cc: stable@dpdk.org

Signed-off-by: Zaiyu Wang <zaiyuwang@trustnetic.com>
---
 drivers/net/txgbe/txgbe_ethdev.c | 44 +++++++++++++++++++++++++++++++-
 drivers/net/txgbe/txgbe_ethdev.h |  1 +
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/net/txgbe/txgbe_ethdev.c b/drivers/net/txgbe/txgbe_ethdev.c
index 57803fe841..cb69fcd28f 100644
--- a/drivers/net/txgbe/txgbe_ethdev.c
+++ b/drivers/net/txgbe/txgbe_ethdev.c
@@ -2011,8 +2011,10 @@ txgbe_dev_start(struct rte_eth_dev *dev)
 	txgbe_filter_restore(dev);
 
 	hw->bp_event_interval = 100 * 1000;
-	if (hw->mac.type == txgbe_mac_aml || hw->mac.type == txgbe_mac_aml40)
+	if (hw->mac.type == txgbe_mac_aml || hw->mac.type == txgbe_mac_aml40) {
 		rte_eal_alarm_set(hw->bp_event_interval, txgbe_dev_e56_check_bp_event, dev);
+		rte_eal_alarm_set(1000 * 1000 * 2, txgbe_dev_check_aml_temp_event, dev);
+	}
 
 	if (tm_conf->root && !tm_conf->committed)
 		PMD_DRV_LOG(WARNING,
@@ -2060,6 +2062,7 @@ txgbe_dev_stop(struct rte_eth_dev *dev)
 
 	if (hw->mac.type == txgbe_mac_aml || hw->mac.type == txgbe_mac_aml40) {
 		rte_eal_alarm_cancel(txgbe_dev_e56_check_bp_event, dev);
+		rte_eal_alarm_cancel(txgbe_dev_check_aml_temp_event, dev);
 		rte_eal_alarm_cancel(txgbe_dev_setup_link_alarm_handler_aml, hw);
 	}
 
@@ -2932,6 +2935,45 @@ txgbe_dev_supported_ptypes_get(struct rte_eth_dev *dev, size_t *no_of_elements)
 	return NULL;
 }
 
+void txgbe_dev_check_aml_temp_event(void *param)
+{
+	struct rte_eth_dev *dev = (struct rte_eth_dev *)param;
+	struct txgbe_hw *hw = TXGBE_DEV_HW(dev);
+	uint32_t link_speed = 0, val = 0;
+	s32 status = 0;
+	int temp;
+
+	if (hw == NULL)
+		return;
+
+	status = txgbe_e56_get_temp(hw, &temp);
+	if (status)
+		temp = DEFAULT_TEMP;
+
+	if (!(temp - hw->temperature > 4 ||
+		hw->temperature - temp > 4))
+		goto out;
+
+	hw->temperature = temp;
+	val = rd32(hw, TXGBE_PORT);
+	if (val & TXGBE_AMLITE_LED_LINK_40G)
+		link_speed = TXGBE_LINK_SPEED_40GB_FULL;
+	else if (val & TXGBE_AMLITE_LED_LINK_25G)
+		link_speed = TXGBE_LINK_SPEED_25GB_FULL;
+	else
+		link_speed = TXGBE_LINK_SPEED_10GB_FULL;
+
+	rte_spinlock_lock(&hw->phy_lock);
+	if (hw->mac.type == txgbe_mac_aml)
+		txgbe_temp_track_seq(hw, link_speed);
+	else if (hw->mac.type == txgbe_mac_aml40)
+		txgbe_temp_track_seq_40g(hw, link_speed);
+	rte_spinlock_unlock(&hw->phy_lock);
+
+out:
+	rte_eal_alarm_set(1000 * 1000 * 2, txgbe_dev_check_aml_temp_event, dev);
+}
+
 void txgbe_dev_e56_check_bp_event(void *param)
 {
 	struct rte_eth_dev *dev = (struct rte_eth_dev *)param;
diff --git a/drivers/net/txgbe/txgbe_ethdev.h b/drivers/net/txgbe/txgbe_ethdev.h
index 309db3bfe9..c32c61d8bf 100644
--- a/drivers/net/txgbe/txgbe_ethdev.h
+++ b/drivers/net/txgbe/txgbe_ethdev.h
@@ -747,5 +747,6 @@ void txgbe_vlan_hw_strip_bitmap_set(struct rte_eth_dev *dev,
 		uint16_t queue, bool on);
 void txgbe_config_vlan_strip_on_all_queues(struct rte_eth_dev *dev,
 						  int mask);
+void txgbe_dev_check_aml_temp_event(void *param);
 void txgbe_dev_e56_check_bp_event(void *param);
 #endif /* _TXGBE_ETHDEV_H_ */
-- 
2.21.0.windows.1


^ permalink raw reply related
