From: Stephen Hemminger <stephen@networkplumber.org>
To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
Subject: Re: [PATCH v3 0/2] net/mana: add device reset support
Date: Tue, 26 May 2026 12:37:47 -0700 [thread overview]
Message-ID: <20260526123747.46871467@phoenix.local> (raw)
In-Reply-To: <20260522065943.126703-1-weh@linux.microsoft.com>
On Thu, 21 May 2026 23:59:41 -0700
Wei Hu <weh@linux.microsoft.com> wrote:
> From: Wei Hu <weh@microsoft.com>
>
> Add support for handling hardware service reset events in the
> MANA driver. When the MANA kernel driver receives a hardware
> service event, it initiates a device reset and notifies userspace
> via IBV_EVENT_DEVICE_FATAL. The MANA PMD handles this by
> performing an automatic teardown and recovery sequence.
>
> The driver uses ethdev recovery events (ERR_RECOVERING,
> RECOVERY_SUCCESS, RECOVERY_FAILED) to notify upper layers of
> the reset lifecycle, and a PCI device removal event callback
> to distinguish hot-remove from service reset.
>
> Changes since v2:
> - Fixed dev_state_qsv memory leak on device removal
> - Fixed reset thread TCB/stack leak: reset_thread_active is now
> only cleared by the joiner, not the thread itself
> - Fixed second reset crash: removed reset thread join logic from
> mana_dev_close (inner function) to avoid corrupting dev_state
> when called from mana_reset_enter
> - Made reset_thread_active RTE_ATOMIC(bool) with explicit ordering
> - Added retry loop for rte_dev_event_callback_unregister on -EAGAIN
> - Initialized condvar/mutex with PTHREAD_PROCESS_SHARED since priv
> is in hugepage shared memory
> - Added re-check of dev_state after lock acquisition in
> mana_intr_handler to prevent racing with pci_remove_event_cb
> - Replaced (void *)0 with NULL in mp.c
> - Added lock ownership comment block at mana_reset_enter
> - Documented rte_dev_event_monitor_start() requirement
> - Added mana.rst documentation and release note (patch 2/2)
>
> Changes since v1:
> - Removed net/netvsc patch from this series
> - Simplified reset exit: mana_reset_exit calls
> mana_reset_exit_delay directly instead of spawning a thread
> - Added __rte_no_thread_safety_analysis annotations for clang
> - Switched to rte_thread_create_internal_control
> - Fixed declaration-after-statement style issues
> - Removed unnecessary blank lines and stale comments
>
> Wei Hu (2):
> net/mana: add device reset support
> net/mana: add documentation for device reset support
>
> doc/guides/nics/mana.rst | 33 +
> doc/guides/rel_notes/release_26_07.rst | 8 +
> drivers/net/mana/mana.c | 995 ++++++++++++++++++++++---
> drivers/net/mana/mana.h | 33 +-
> drivers/net/mana/meson.build | 2 +-
> drivers/net/mana/mp.c | 89 ++-
> drivers/net/mana/mr.c | 6 +-
> drivers/net/mana/rx.c | 24 +-
> drivers/net/mana/tx.c | 40 +-
> 9 files changed, 1123 insertions(+), 107 deletions(-)
>
AI review found a few things...
Thanks for the v3. Most of the v2 items are addressed: dev_state_qsv
is freed via mana_dev_free_resources, reset_thread_active is now
RTE_ATOMIC(bool), the joiner owns the flag, pthread mutex/cond use
PROCESS_SHARED attrs, intr_handler re-checks state under the lock,
the lock hand-off has a clear comment, and a doc + release note
patch is included.
A few items remain or were introduced by the v2->v3 changes.
Errors
1. Deadlock in dev_close_lock + EAGAIN loop on PCI-remove callback.
The v3 fix for the discarded unregister return uses a busy-wait:
if (dev->device) {
do {
ret = rte_dev_event_callback_unregister(...);
} while (ret == -EAGAIN);
}
mana_intr_uninstall runs from mana_dev_close, which runs from
mana_dev_close_lock with reset_ops_lock held. EAL returns
-EAGAIN while cb_lst->active is 1 (callback dispatching). The
callback in question is mana_pci_remove_event_cb, which itself
blocks on rte_spinlock_lock(&priv->reset_ops_lock). Result:
dev_close holds the spinlock, the callback waits for the
spinlock, the loop waits for the callback to finish -- three-way
deadlock if a hot-remove event arrives during close.
Either drop the lock before unregistering and re-take it after,
or (better) use a sleeping mutex for reset_ops_lock so the
callback can complete while close is suspended.
2. reset_ops_lock is still rte_spinlock_t held across blocking work.
Unchanged from v2: the spinlock is held through
- mana_reset_enter: rte_rcu_qsbr_check spin,
mana_dev_stop,
mana_mp_req_on_rxtx (5s IPC timeout),
mana_dev_close
- mana_reset_exit_delay: ibv_close_device, mana_pci_probe,
mana_mp_req_on_rxtx, mana_dev_start
Any other ethdev op that hits dev_ops while reset is in flight
spins for tens of seconds. Blocking calls under a spinlock are
an Error; use a properly-initialised pthread_mutex_t (you
already have PROCESS_SHARED attrs handy).
Warnings
3. Secondary process: qsbr does not actually quiesce secondary lcores.
rte_rcu_qsbr_thread_register is only called from
mana_dev_configure, which the secondary never runs. The
secondary's rx_burst/tx_burst still call thread_online/offline
against the shared qsv, but the secondary tids are unregistered,
so they are invisible to rte_rcu_qsbr_check in the primary, and
the secondary MP handler (mana_mp_reset_enter) does not call
qsbr_check at all -- it just sets db_page = NULL and munmaps.
The dev_state check at the top of secondary tx_burst is racy:
the page can be munmapped after the in-loop read of db_page but
before the doorbell write at the bottom. The "All secondary
threads are quiescent" log line in mana_mp_reset_enter is not
true.
The secondary needs a real reader-side primitive -- its own
qsbr with secondary lcore registration, or an rwlock the MP
handler takes before munmap.
4. Double-join race between dev_close_lock and mana_reset_enter.
dev_close_lock signals + joins the reset thread *before* taking
reset_ops_lock. intr_handler may fire in that window:
1. dev_close_lock: pthread_cond_signal, rte_thread_join
2. intr_handler: take lock, see ACTIVE, call reset_enter
3. reset_enter: reset_thread_active still true (step 1
hasn't cleared it yet) -> rte_thread_join
on a thread that step 1 just joined -> UB
The pre-lock signal/join was added to avoid deadlock with the
reset thread holding the lock, which is fine in itself, but the
reset_thread_active update needs to be inside that critical
section, or reset_enter has to recheck under the lock.
5. mana_dev_close (non-_lock) does not join the reset thread.
mana_dev_uninit -> mana_dev_close path doesn't go through
dev_close_lock, so if pci_remove arrives while the reset thread
is alive, mana_dev_free_resources can destroy reset_cond_mutex /
reset_cond before the thread has exited. In practice the
pci_remove callback signals RESET_FAILED and the reset thread
should exit quickly, but the ordering isn't enforced. Either
join in mana_dev_uninit, or document the assumption.
Info
6. mana_reset_exit_delay sets priv->ib_ctx = NULL even when
ibv_close_device fails:
ret = ibv_close_device(priv->ib_ctx);
priv->ib_ctx = NULL;
if (ret) { ... goto out; }
The handle is leaked if close fails. Either keep the old
pointer on failure or accept that close-failure is unrecoverable
and free it anyway with a comment.
Patch 2/2
7. The events list is a term/description pattern and should be a
definition list per DPDK RST style:
``RTE_ETH_EVENT_ERR_RECOVERING``
Reset has started.
``RTE_ETH_EVENT_RECOVERY_SUCCESS``
Device has recovered successfully.
``RTE_ETH_EVENT_RECOVERY_FAILED``
Recovery failed.
8. Convention is to fold doc + release note updates into the same
commit as the feature. Patches 1/2 and 2/2 can be squashed.
next prev parent reply other threads:[~2026-05-26 19:37 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-22 6:59 [PATCH v3 0/2] net/mana: add device reset support Wei Hu
2026-05-22 6:59 ` [PATCH v3 1/2] " Wei Hu
2026-05-22 6:59 ` [PATCH v3 2/2] net/mana: add documentation for " Wei Hu
2026-05-26 19:37 ` Stephen Hemminger [this message]
2026-05-28 7:30 ` [EXTERNAL] Re: [PATCH v3 0/2] net/mana: add " Wei Hu
2026-05-28 18:24 ` Stephen Hemminger
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260526123747.46871467@phoenix.local \
--to=stephen@networkplumber.org \
--cc=dev@dpdk.org \
--cc=longli@microsoft.com \
--cc=weh@linux.microsoft.com \
--cc=weh@microsoft.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.