Re: [PATCH v2 1/1] net/mana: add device reset support

DPDK-dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Stephen Hemminger <stephen@networkplumber.org>
To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
Subject: Re: [PATCH v2 1/1] net/mana: add device reset support
Date: Sun, 17 May 2026 20:48:58 -0700	[thread overview]
Message-ID: <20260517204858.7a8a9200@phoenix.local> (raw)
In-Reply-To: <20260509060113.61686-2-weh@linux.microsoft.com>

On Fri,  8 May 2026 23:01:13 -0700
Wei Hu <weh@linux.microsoft.com> wrote:

> From: Wei Hu <weh@microsoft.com>
> 
> Add support for handling hardware reset events in the MANA driver.
> When the MANA kernel driver receives a hardware service event, it
> initiates a device reset and notifies userspace via
> IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing
> an automatic teardown and recovery sequence.
> 
> The reset flow has two phases. In the enter phase, running on the
> EAL interrupt thread, the driver transitions the device state,
> waits for data path threads to reach a quiescent state using RCU,
> stops queues, tears down IB resources, and frees per-queue MR
> caches. A control thread is then spawned to handle the exit phase:
> it waits for the hardware to recover, unregisters the interrupt
> handler, re-probes the PCI device, reinitializes MR caches, and
> restarts queues.
> 
> A per-device spinlock serializes the reset path with ethdev
> operations. Operations that cannot wait (configure, queue setup)
> return -EBUSY during reset, while dev_stop and dev_close join the
> reset thread and use a blocking lock to ensure proper sequencing.
> 
> Multi-process support is included: secondary processes unmap and
> remap doorbell pages via IPC during the reset enter and exit
> phases. Data path functions in both primary and secondary
> processes check the device state atomically and return early when
> the device is not active.
> 
> The driver uses ethdev recovery events to notify upper layers
> (e.g. netvsc) of the reset lifecycle: RTE_ETH_EVENT_ERR_RECOVERING
> on entry, RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED on completion. A PCI device removal
> event callback distinguishes hot-remove from service reset.
> 
> Signed-off-by: Wei Hu <weh@microsoft.com>
> ---

Lots of issues found during AI review..

Subject: Re: [PATCH v2 1/1] net/mana: add device reset support

To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
In-Reply-To: <20260509060113.61686-2-weh@linux.microsoft.com>
References: <20260509060113.61686-1-weh@linux.microsoft.com>
 <20260509060113.61686-2-weh@linux.microsoft.com>

On Fri,  8 May 2026 23:01:13 -0700, Wei Hu wrote:
> Add support for handling hardware reset events in the MANA driver.

Several issues need attention before this can be merged.

Errors
------

1. Memory leak: priv->dev_state_qsv is never freed on close.

   In mana_probe_port (non-reset path) the qsbr buffer is allocated:

       priv->dev_state_qsv = rte_zmalloc_socket("mana_rcu", sz, ...);

   It is only freed in the probe-failure goto label.  mana_dev_close
   has no corresponding rte_free, so every successful close leaks
   the allocation.

2. Joinable reset thread is leaked when reset completes on its own.

   Both mana_reset_thread (skip path) and mana_reset_exit_delay
   (normal completion) do:

       priv->reset_thread_active = false;
       rte_spinlock_unlock(&priv->reset_ops_lock);
       return 0;

   The pthread is still joinable at this point.  If mana_dev_close
   runs afterward it sees active==false and skips rte_thread_join,
   leaving the TCB/stack unreaped.  The flag must only be cleared by
   whoever actually joins the thread; otherwise the reset thread
   should be detached (and then dev_close cannot wait for it, which
   is its own problem).  Either way the current pattern is wrong.

3. Spinlock held across blocking operations.

   priv->reset_ops_lock is rte_spinlock_t and is held across:

     - mana_reset_enter:      mana_dev_stop, mana_dev_close,
                              ibv_close_device,
                              mana_mp_req_on_rxtx (5s IPC timeout)
     - mana_reset_exit_delay: ibv_close_device, mana_pci_probe,
                              mana_mp_req_on_rxtx, mana_dev_start

   IPC with a 5-second timeout and device probe under a spinlock is
   not acceptable.  Use a sleeping mutex (pthread_mutex_t initialized
   with PTHREAD_PROCESS_SHARED, since priv is in shared memory), or
   split the lock so the long operations run outside it.

4. pthread_mutex_init / pthread_cond_init without PTHREAD_PROCESS_SHARED.

   priv is allocated with rte_zmalloc_socket, so reset_cond_mutex and
   reset_cond live in hugepage memory mapped into every secondary.
   The patch initializes them with NULL attributes:

       pthread_mutex_init(&priv->reset_cond_mutex, NULL);
       pthread_cond_init(&priv->reset_cond, NULL);

   Even if only the primary uses them today, placement in shared
   memory requires the process-shared attribute.  Otherwise behavior
   is undefined and the choice will silently break the day a
   secondary starts touching these fields.

Warnings
--------

5. Secondary data path has no synchronization with the doorbell unmap.

   MANA_MP_REQ_RESET_ENTER causes the secondary MP handler to munmap
   proc_priv->db_page and set it to NULL.  The secondary's rx_burst /
   tx_burst protect themselves only with an atomic state check:

       rte_rcu_qsbr_thread_online(dstate_qsv, tid);
       if (state != MANA_DEV_ACTIVE || !db_page) { ... return 0; }

   But the qsbr only has primary threads registered (registration
   happens in mana_dev_configure, which never runs in secondary), so
   thread_online/offline in secondary do not block the primary's
   qsbr_check.  The MP handler in secondary therefore unmaps db_page
   while a peer secondary lcore can still be inside rx/tx_burst with
   a stale pointer.  Result: SIGSEGV on the next doorbell write.

   The secondary needs its own quiescence mechanism (a per-process
   qsbr or a reader-side rwlock around the data path that the MP
   handler acquires before unmap).

6. Return value of rte_dev_event_callback_unregister is discarded.

   In mana_intr_uninstall:

       if (dev->device)
               rte_dev_event_callback_unregister(dev->device->name,
                                                 mana_pci_remove_event_cb,
                                                 priv);

   The API returns -EAGAIN if the callback is currently executing.
   When that happens the unregister silently fails, the callback
   stays on the list, and any later free of priv produces a
   use-after-free the next time the EAL device-event thread
   dispatches.

7. PCI-remove callback depends on a monitor nobody starts.

   rte_dev_event_callback_register only wires the callback into the
   dispatch table; events are only delivered when something calls
   rte_dev_event_monitor_start().  The patch never calls it.  Whether
   this works in practice depends on the application (testpmd,
   netvsc) calling it first.  At minimum document the requirement;
   better, have the driver start the monitor on first probe and stop
   it on last remove.

8. No documentation or release-note update.

   This is a substantial new feature.  doc/guides/rel_notes/ and
   doc/guides/nics/mana.rst should both be updated to describe the
   reset lifecycle, the RTE_ETH_EVENT_ERR_RECOVERING /
   RECOVERY_SUCCESS / RECOVERY_FAILED behavior, and the dependency
   on rte_dev_event_monitor_start.

9. priv->reset_thread_active is bool, accessed without atomic ops.

   It is read in mana_dev_stop_lock and mana_dev_close_lock before
   the spinlock is acquired (intentionally, to avoid deadlock with
   the reset thread), and written from at least three threads (intr
   handler, reset thread, dev_close).  It should be RTE_ATOMIC(bool)
   with explicit ordering.  This compounds issue #2.

10. mana_intr_handler races with mana_pci_remove_event_cb on dev_state.

    intr_handler does:

        if (state == MANA_DEV_ACTIVE) {
                rte_spinlock_lock(&priv->reset_ops_lock);
                mana_reset_enter(priv);   /* stores RESET_ENTER */

    pci_remove_event_cb stores RESET_FAILED before taking the lock.
    If intr_handler reads ACTIVE, pci_remove fires (sets
    RESET_FAILED) while intr_handler is waiting on the lock,
    intr_handler then acquires it and overwrites RESET_FAILED with
    RESET_ENTER, and the driver attempts a reset on a device that
    has already been physically removed.  The state-set in
    pci_remove_event_cb must occur under the lock, or the
    intr_handler must re-check state after taking the lock.

Info
----

11. (void *)0 is used in mp.c (proc_priv->db_page = (void *)0; and
    if (proc_priv->db_page != 0)).  Use NULL.

12. The hand-off of reset_ops_lock from mana_reset_enter to
    mana_reset_thread to mana_reset_exit_delay (released in the
    latter) is hard to follow.  A short comment block at
    mana_reset_enter listing which functions take the lock and which
    release it would help future maintainers.

13. The widened TOCTOU window in mana_tx_burst (db_page is now read
    at the top of the function rather than just before the doorbell)
    is not itself a bug, but it makes #5 worse.  Once #5 is fixed
    via a proper reader-side primitive, consider keeping the db_page
    read inside the critical section.

     prev parent reply	other threads:[~2026-05-18  3:49 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-09  6:01 [PATCH v2 0/1] net/mana: add device reset support Wei Hu
2026-05-09  6:01 ` [PATCH v2 1/1] " Wei Hu
2026-05-18  3:48   ` Stephen Hemminger [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260517204858.7a8a9200@phoenix.local \
    --to=stephen@networkplumber.org \
    --cc=dev@dpdk.org \
    --cc=longli@microsoft.com \
    --cc=weh@linux.microsoft.com \
    --cc=weh@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox