From: Stephen Hemminger <stephen@networkplumber.org>
To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
Subject: Re: [PATCH v8 0/1] net/mana: add device reset support
Date: Wed, 10 Jun 2026 09:56:33 -0700 [thread overview]
Message-ID: <20260610095633.0bf6c153@phoenix.local> (raw)
In-Reply-To: <cover.1781017284.git.weh@linux.microsoft.com>
On Wed, 10 Jun 2026 00:21:21 -0700
Wei Hu <weh@linux.microsoft.com> wrote:
> From: Wei Hu <weh@microsoft.com>
>
> Add support for handling hardware service reset events in the
> MANA driver. When the MANA kernel driver receives a hardware
> service event, it initiates a device reset and notifies userspace
> via IBV_EVENT_DEVICE_FATAL. The MANA PMD handles this by
> performing an automatic teardown and recovery sequence.
>
> The driver uses ethdev recovery events (ERR_RECOVERING,
> RECOVERY_SUCCESS, RECOVERY_FAILED) to notify upper layers of
> the reset lifecycle, and a PCI device removal event callback
> to distinguish hot-remove from service reset.
>
> Changes since v7:
> - Moved heavy teardown (dev_stop, IPC to secondaries, dev_close,
> MR btree free) from mana_reset_enter (EAL interrupt thread)
> to mana_reset_thread (control thread). The interrupt handler
> now only sets state, drains in-flight bursts, and spawns the
> thread. Teardown runs immediately in the control thread before
> the recovery timer wait, avoiding blocking the interrupt thread
> on multi-second IPC timeouts and ibverbs calls. Each function
> now owns its own lock scope with no lock hand-off between
> threads.
> - Fixed self-join deadlock: clear reset_thread_active before
> emitting RECOVERY_SUCCESS/FAILED callbacks from the reset
> thread. Without this, if the callback calls dev_stop/dev_close,
> mana_join_reset_thread attempts to join the current thread.
> - Simplified burst_state from encoding device state in bits 1+
> to a single blocked flag (bit 1). Only one value was ever
> stored, so the multi-state encoding was misleading. Added
> MANA_BURST_BLOCKED constant.
> - Updated mana.rst to reflect that teardown runs on the control
> thread, not the interrupt handler.
>
> Changes since v6:
> - Rebased onto latest upstream for-main
> - Replaced removed RTE_ETH_DEV_TO_PCI macro with
> RTE_CLASS_TO_BUS_DEVICE (upstream commit 4757b8df04
> removed the old bus-specific ethdev convenience macros)
>
> Changes since v5:
> - Replaced RCU QSBR with per-queue atomic burst_state using a
> single-variable CAS design: bit 0 is the in-burst flag, bit 1
> is the blocked flag. The data path uses CAS(0→1) to enter
> burst and fetch_and(~1) to exit. The reset path uses fetch_or
> to set the blocked bit and polls bit 0 to drain in-flight
> bursts. This eliminates the two-variable Dekker pattern and the
> need for sequential consistency (seq_cst) ordering.
> - Removed librte_rcu dependency
> - Removed __rte_no_thread_safety_analysis annotations (no longer
> needed after mutex conversion)
> - Moved ERR_RECOVERING event emission before acquiring
> reset_ops_lock and before mana_reset_enter, so upper layers
> (e.g. netvsc) can switch data path before mana stops queues.
> Emitting outside the lock avoids deadlock if the callback
> calls dev_stop or dev_close.
> - Replaced MANA_OPS_*_LOCK macros with mana_reset_trylock()
> helper function and explicit per-operation wrappers
> - Removed unused rte_alarm.h and rte_lock_annotations.h includes
> - Added RECOVERY_FAILED event when mana_reset_enter fails
> internally, so the application always receives a terminal event
> - Added mana_clear_burst_state() helper to clear per-queue
> burst_state on failure paths (reset_failed, dev_stop_lock,
> dev_close_lock) preventing permanent silent packet drop after
> a failed reset
>
> Changes since v4:
> - Fixed stale rte_spinlock_unlock call in mana_intr_handler that
> was missed during the spinlock-to-mutex conversion, causing a
> -Wincompatible-pointer-types warning
>
> Changes since v3:
> - Converted reset_ops_lock from rte_spinlock_t to pthread_mutex_t
> with PTHREAD_PROCESS_SHARED, since the lock is held across
> blocking IB verbs calls and IPC with 5s timeout
> - Removed rte_dev_event_callback_unregister retry loop to avoid
> deadlock: the callback itself blocks on reset_ops_lock, so
> retrying on -EAGAIN while holding the lock is a deadlock
> - Introduced mana_join_reset_thread() helper using CAS on
> reset_thread_active to prevent double-join undefined behavior
> - Added reset thread join in mana_dev_uninit to prevent thread
> leak on device removal
> - Fixed ibv handle leak: priv->ib_ctx is now only set to NULL
> after ibv_close_device succeeds
> - Fixed misleading "All secondary threads are quiescent" log in
> mana_mp_reset_enter — changed to "Secondary doorbell pages
> unmapped" since actual quiescence is enforced by the primary's
> per-queue atomic flag check before IPC is sent
> - Changed event list in mana.rst to RST definition list style
> - Squashed documentation into the feature patch per convention
>
> Changes since v2:
> - Fixed dev_state_qsv memory leak on device removal
> - Fixed reset thread TCB/stack leak: reset_thread_active is now
> only cleared by the joiner, not the thread itself
> - Fixed second reset crash: removed reset thread join logic from
> mana_dev_close (inner function) to avoid corrupting dev_state
> when called from mana_reset_enter
> - Made reset_thread_active RTE_ATOMIC(bool) with explicit ordering
> - Added retry loop for rte_dev_event_callback_unregister on -EAGAIN
> - Initialized condvar/mutex with PTHREAD_PROCESS_SHARED since priv
> is in hugepage shared memory
> - Added re-check of dev_state after lock acquisition in
> mana_intr_handler to prevent racing with pci_remove_event_cb
> - Replaced (void *)0 with NULL in mp.c
> - Added lock ownership comment block at mana_reset_enter
> - Documented rte_dev_event_monitor_start() requirement
> - Added mana.rst documentation and release note
>
> Changes since v1:
> - Removed net/netvsc patch from this series
> - Simplified reset exit: mana_reset_exit calls
> mana_reset_exit_delay directly instead of spawning a thread
> - Added __rte_no_thread_safety_analysis annotations for clang
> - Switched to rte_thread_create_internal_control
> - Fixed declaration-after-statement style issues
> - Removed unnecessary blank lines and stale comments
>
> Wei Hu (1):
> net/mana: add device reset support
>
> doc/guides/nics/mana.rst | 40 +
> doc/guides/rel_notes/release_26_07.rst | 8 +
> drivers/net/mana/mana.c | 1076 ++++++++++++++++++++++--
> drivers/net/mana/mana.h | 52 +-
> drivers/net/mana/mp.c | 89 +-
> drivers/net/mana/mr.c | 6 +-
> drivers/net/mana/rx.c | 23 +-
> drivers/net/mana/tx.c | 44 +-
> 8 files changed, 1230 insertions(+), 108 deletions(-)
>
AI review thread still sees some issues:
This is close. Teardown is off the interrupt thread now, the cross-thread lock
hand-off is gone, and the burst_state encoding reads cleanly. One real problem
left.
The reset thread leaks on every successful recovery. mana_reset_exit_delay and
the reset_failed path in mana_reset_thread clear reset_thread_active from inside
the thread itself, without joining. The thread is created joinable via
rte_thread_create_internal_control, so terminating it unjoined leaks its
resources, and because the flag is now false none of the join sites
(mana_join_reset_thread, the join-previous block in mana_reset_enter,
mana_dev_uninit) will ever reap it. The PCI-remove abort path leaves the flag
true and is reaped later, which is the inconsistency that exposes this: some
exits expect a join and some do not, and the latter have no reaper.
Simplest fix is to detach the reset thread (rte_thread_detach) and drop the
reset_thread_active / mana_join_reset_thread machinery, using reset_ops_lock and
dev_state for the dev_stop/dev_close sequencing instead. That removes the
self-join hazard too. If you keep the join, don't clear the flag from inside the
thread; have mana_join_reset_thread detect the self case and skip only the join.
Minor: the recovery condvar wait in mana_reset_thread is a bare cond_timedwait.
If pci_remove signals before the thread reaches the wait, the wakeup is lost and
removal isn't seen until the 15s timer expires. Use a dev_state predicate loop
under reset_cond_mutex.
prev parent reply other threads:[~2026-06-10 17:12 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-10 7:21 [PATCH v8 0/1] net/mana: add device reset support Wei Hu
2026-06-10 7:21 ` [PATCH v8 1/1] " Wei Hu
2026-06-10 16:56 ` Stephen Hemminger [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260610095633.0bf6c153@phoenix.local \
--to=stephen@networkplumber.org \
--cc=dev@dpdk.org \
--cc=longli@microsoft.com \
--cc=weh@linux.microsoft.com \
--cc=weh@microsoft.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.