From: Stephen Hemminger <stephen@networkplumber.org>
To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
Subject: Re: [PATCH v7 1/1] net/mana: add device reset support
Date: Mon, 8 Jun 2026 12:11:34 -0700 [thread overview]
Message-ID: <20260608121134.601e4f46@phoenix.local> (raw)
In-Reply-To: <20260608120824.287050-2-weh@linux.microsoft.com>
On Mon, 8 Jun 2026 05:08:24 -0700
Wei Hu <weh@linux.microsoft.com> wrote:
> From: Wei Hu <weh@microsoft.com>
>
> Add support for handling hardware reset events in the MANA driver.
> When the MANA kernel driver receives a hardware service event, it
> initiates a device reset and notifies userspace via
> IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing
> an automatic teardown and recovery sequence.
>
> The reset flow has two phases. In the enter phase, running on the
> EAL interrupt thread, the driver transitions the device state,
> waits for data path threads to drain using per-queue atomic flags,
> stops queues, tears down IB resources, and frees per-queue MR
> caches. A control thread is then spawned to handle the exit phase:
> it waits for the hardware to recover, unregisters the interrupt
> handler, re-probes the PCI device, reinitializes MR caches, and
> restarts queues.
>
> Each queue has an atomic burst_state variable where bit 0 is the
> in-burst flag and bits 1+ encode device state. The data path uses
> a single compare-and-swap (0 to 1) to enter a burst, which fails
> immediately if the reset path has set any state bits. The reset
> path sets state bits via atomic fetch-or and polls bit 0 to wait
> for in-flight bursts to drain. This single-variable design avoids
> the need for sequential consistency ordering.
>
> A per-device mutex serializes the reset path with ethdev
> operations. The mutex uses PTHREAD_PROCESS_SHARED for multi-process
> support and is held across blocking IB verbs calls. A trylock
> helper encapsulates the lock acquisition and device state check
> for all ethdev operation wrappers. Operations that cannot wait
> (configure, queue setup) return -EBUSY during reset, while
> dev_stop and dev_close join the reset thread before acquiring
> the lock to ensure proper sequencing. A CAS-based helper prevents
> double-join of the reset thread.
>
> Multi-process support is included: secondary processes unmap and
> remap doorbell pages via IPC during the reset enter and exit
> phases. Data path functions in both primary and secondary
> processes check the device state atomically and return early when
> the device is not active.
>
> The driver emits RTE_ETH_EVENT_ERR_RECOVERING before entering the
> reset path so that upper layers (e.g. netvsc) can switch their
> data path before queues are stopped. The event is emitted outside
> the reset lock to avoid deadlock if the callback calls dev_stop or
> dev_close. On completion, the driver emits RECOVERY_SUCCESS or
> RECOVERY_FAILED. If the enter phase fails internally,
> RECOVERY_FAILED is sent immediately so the application receives a
> terminal event. A PCI device removal event callback distinguishes
> hot-remove from service reset.
>
> Documentation for the device reset feature is added in the MANA
> NIC guide and the 26.07 release notes.
>
> Signed-off-by: Wei Hu <weh@microsoft.com>
> ---
This is is a rather complex set of state transitions so admit to relying
on AI as backup for tracking this. It still sees some errors here.
Worth asking the question, "what does mlx5 do?" and "should DPDK
EAL be doing this at the PCI layer instead?"
---
Much better - this addresses the RCU, the macros, the thread-safety-analysis
suppressions, and the callback-under-lock deadlock. The single-variable
burst_state CAS is a clean way to do the drain and the acquire/release
reasoning checks out.
One structural thing remains. The enter phase still runs the heavy teardown
on the EAL interrupt thread under reset_ops_lock: dev_stop, then
mana_mp_req_on_rxtx(RESET_ENTER) which is a blocking rte_mp_request_sync with
a multi-second timeout, then dev_close with its ibv calls. You already moved
the exit phase to a control thread because intr_callback_unregister cannot run
on the interrupt thread; the same argument applies to blocking IPC and verbs
teardown. A slow or absent secondary will stall the interrupt thread for the
MP timeout, and this is the blocking-under-a-sleeping-mutex pattern. Please
have the interrupt handler just set state and drain, then hand the rest of the
teardown to the control thread. That also removes the last lock hand-off
between functions/threads, so each function can own its lock.
Smaller points:
- RECOVERY_SUCCESS/FAILED are emitted from the reset thread. If the callback
calls dev_stop/dev_close, mana_join_reset_thread() joins the current thread
(EDEADLK, leaked handle). INTR_RMV is fine since it runs on the dev-event
thread.
- The burst_state comment says bits 1+ encode device state, but only
RESET_ENTER<<1 is ever stored - it is effectively a single "blocked" flag.
next prev parent reply other threads:[~2026-06-08 19:11 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-08 12:08 [PATCH v7 0/1] net/mana: add device reset support Wei Hu
2026-06-08 12:08 ` [PATCH v7 1/1] " Wei Hu
2026-06-08 19:11 ` Stephen Hemminger [this message]
2026-06-08 23:32 ` [EXTERNAL] " Long Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260608121134.601e4f46@phoenix.local \
--to=stephen@networkplumber.org \
--cc=dev@dpdk.org \
--cc=longli@microsoft.com \
--cc=weh@linux.microsoft.com \
--cc=weh@microsoft.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox