From: Stephen Hemminger <stephen@networkplumber.org>
To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
Subject: Re: [PATCH v2 1/1] net/mana: add device reset support
Date: Sun, 17 May 2026 20:48:58 -0700 [thread overview]
Message-ID: <20260517204858.7a8a9200@phoenix.local> (raw)
In-Reply-To: <20260509060113.61686-2-weh@linux.microsoft.com>
On Fri, 8 May 2026 23:01:13 -0700
Wei Hu <weh@linux.microsoft.com> wrote:
> From: Wei Hu <weh@microsoft.com>
>
> Add support for handling hardware reset events in the MANA driver.
> When the MANA kernel driver receives a hardware service event, it
> initiates a device reset and notifies userspace via
> IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing
> an automatic teardown and recovery sequence.
>
> The reset flow has two phases. In the enter phase, running on the
> EAL interrupt thread, the driver transitions the device state,
> waits for data path threads to reach a quiescent state using RCU,
> stops queues, tears down IB resources, and frees per-queue MR
> caches. A control thread is then spawned to handle the exit phase:
> it waits for the hardware to recover, unregisters the interrupt
> handler, re-probes the PCI device, reinitializes MR caches, and
> restarts queues.
>
> A per-device spinlock serializes the reset path with ethdev
> operations. Operations that cannot wait (configure, queue setup)
> return -EBUSY during reset, while dev_stop and dev_close join the
> reset thread and use a blocking lock to ensure proper sequencing.
>
> Multi-process support is included: secondary processes unmap and
> remap doorbell pages via IPC during the reset enter and exit
> phases. Data path functions in both primary and secondary
> processes check the device state atomically and return early when
> the device is not active.
>
> The driver uses ethdev recovery events to notify upper layers
> (e.g. netvsc) of the reset lifecycle: RTE_ETH_EVENT_ERR_RECOVERING
> on entry, RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED on completion. A PCI device removal
> event callback distinguishes hot-remove from service reset.
>
> Signed-off-by: Wei Hu <weh@microsoft.com>
> ---
Lots of issues found during AI review..
Subject: Re: [PATCH v2 1/1] net/mana: add device reset support
To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
In-Reply-To: <20260509060113.61686-2-weh@linux.microsoft.com>
References: <20260509060113.61686-1-weh@linux.microsoft.com>
<20260509060113.61686-2-weh@linux.microsoft.com>
On Fri, 8 May 2026 23:01:13 -0700, Wei Hu wrote:
> Add support for handling hardware reset events in the MANA driver.
Several issues need attention before this can be merged.
Errors
------
1. Memory leak: priv->dev_state_qsv is never freed on close.
In mana_probe_port (non-reset path) the qsbr buffer is allocated:
priv->dev_state_qsv = rte_zmalloc_socket("mana_rcu", sz, ...);
It is only freed in the probe-failure goto label. mana_dev_close
has no corresponding rte_free, so every successful close leaks
the allocation.
2. Joinable reset thread is leaked when reset completes on its own.
Both mana_reset_thread (skip path) and mana_reset_exit_delay
(normal completion) do:
priv->reset_thread_active = false;
rte_spinlock_unlock(&priv->reset_ops_lock);
return 0;
The pthread is still joinable at this point. If mana_dev_close
runs afterward it sees active==false and skips rte_thread_join,
leaving the TCB/stack unreaped. The flag must only be cleared by
whoever actually joins the thread; otherwise the reset thread
should be detached (and then dev_close cannot wait for it, which
is its own problem). Either way the current pattern is wrong.
3. Spinlock held across blocking operations.
priv->reset_ops_lock is rte_spinlock_t and is held across:
- mana_reset_enter: mana_dev_stop, mana_dev_close,
ibv_close_device,
mana_mp_req_on_rxtx (5s IPC timeout)
- mana_reset_exit_delay: ibv_close_device, mana_pci_probe,
mana_mp_req_on_rxtx, mana_dev_start
IPC with a 5-second timeout and device probe under a spinlock is
not acceptable. Use a sleeping mutex (pthread_mutex_t initialized
with PTHREAD_PROCESS_SHARED, since priv is in shared memory), or
split the lock so the long operations run outside it.
4. pthread_mutex_init / pthread_cond_init without PTHREAD_PROCESS_SHARED.
priv is allocated with rte_zmalloc_socket, so reset_cond_mutex and
reset_cond live in hugepage memory mapped into every secondary.
The patch initializes them with NULL attributes:
pthread_mutex_init(&priv->reset_cond_mutex, NULL);
pthread_cond_init(&priv->reset_cond, NULL);
Even if only the primary uses them today, placement in shared
memory requires the process-shared attribute. Otherwise behavior
is undefined and the choice will silently break the day a
secondary starts touching these fields.
Warnings
--------
5. Secondary data path has no synchronization with the doorbell unmap.
MANA_MP_REQ_RESET_ENTER causes the secondary MP handler to munmap
proc_priv->db_page and set it to NULL. The secondary's rx_burst /
tx_burst protect themselves only with an atomic state check:
rte_rcu_qsbr_thread_online(dstate_qsv, tid);
if (state != MANA_DEV_ACTIVE || !db_page) { ... return 0; }
But the qsbr only has primary threads registered (registration
happens in mana_dev_configure, which never runs in secondary), so
thread_online/offline in secondary do not block the primary's
qsbr_check. The MP handler in secondary therefore unmaps db_page
while a peer secondary lcore can still be inside rx/tx_burst with
a stale pointer. Result: SIGSEGV on the next doorbell write.
The secondary needs its own quiescence mechanism (a per-process
qsbr or a reader-side rwlock around the data path that the MP
handler acquires before unmap).
6. Return value of rte_dev_event_callback_unregister is discarded.
In mana_intr_uninstall:
if (dev->device)
rte_dev_event_callback_unregister(dev->device->name,
mana_pci_remove_event_cb,
priv);
The API returns -EAGAIN if the callback is currently executing.
When that happens the unregister silently fails, the callback
stays on the list, and any later free of priv produces a
use-after-free the next time the EAL device-event thread
dispatches.
7. PCI-remove callback depends on a monitor nobody starts.
rte_dev_event_callback_register only wires the callback into the
dispatch table; events are only delivered when something calls
rte_dev_event_monitor_start(). The patch never calls it. Whether
this works in practice depends on the application (testpmd,
netvsc) calling it first. At minimum document the requirement;
better, have the driver start the monitor on first probe and stop
it on last remove.
8. No documentation or release-note update.
This is a substantial new feature. doc/guides/rel_notes/ and
doc/guides/nics/mana.rst should both be updated to describe the
reset lifecycle, the RTE_ETH_EVENT_ERR_RECOVERING /
RECOVERY_SUCCESS / RECOVERY_FAILED behavior, and the dependency
on rte_dev_event_monitor_start.
9. priv->reset_thread_active is bool, accessed without atomic ops.
It is read in mana_dev_stop_lock and mana_dev_close_lock before
the spinlock is acquired (intentionally, to avoid deadlock with
the reset thread), and written from at least three threads (intr
handler, reset thread, dev_close). It should be RTE_ATOMIC(bool)
with explicit ordering. This compounds issue #2.
10. mana_intr_handler races with mana_pci_remove_event_cb on dev_state.
intr_handler does:
if (state == MANA_DEV_ACTIVE) {
rte_spinlock_lock(&priv->reset_ops_lock);
mana_reset_enter(priv); /* stores RESET_ENTER */
pci_remove_event_cb stores RESET_FAILED before taking the lock.
If intr_handler reads ACTIVE, pci_remove fires (sets
RESET_FAILED) while intr_handler is waiting on the lock,
intr_handler then acquires it and overwrites RESET_FAILED with
RESET_ENTER, and the driver attempts a reset on a device that
has already been physically removed. The state-set in
pci_remove_event_cb must occur under the lock, or the
intr_handler must re-check state after taking the lock.
Info
----
11. (void *)0 is used in mp.c (proc_priv->db_page = (void *)0; and
if (proc_priv->db_page != 0)). Use NULL.
12. The hand-off of reset_ops_lock from mana_reset_enter to
mana_reset_thread to mana_reset_exit_delay (released in the
latter) is hard to follow. A short comment block at
mana_reset_enter listing which functions take the lock and which
release it would help future maintainers.
13. The widened TOCTOU window in mana_tx_burst (db_page is now read
at the top of the function rather than just before the doorbell)
is not itself a bug, but it makes #5 worse. Once #5 is fixed
via a proper reader-side primitive, consider keeping the db_page
read inside the critical section.
prev parent reply other threads:[~2026-05-18 3:49 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-09 6:01 [PATCH v2 0/1] net/mana: add device reset support Wei Hu
2026-05-09 6:01 ` [PATCH v2 1/1] " Wei Hu
2026-05-18 3:48 ` Stephen Hemminger [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260517204858.7a8a9200@phoenix.local \
--to=stephen@networkplumber.org \
--cc=dev@dpdk.org \
--cc=longli@microsoft.com \
--cc=weh@linux.microsoft.com \
--cc=weh@microsoft.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox