All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stephen Hemminger <stephen@networkplumber.org>
To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
Subject: Re: [PATCH v2 1/1] net/mana: add device reset support
Date: Sun, 17 May 2026 20:48:58 -0700	[thread overview]
Message-ID: <20260517204858.7a8a9200@phoenix.local> (raw)
In-Reply-To: <20260509060113.61686-2-weh@linux.microsoft.com>

On Fri,  8 May 2026 23:01:13 -0700
Wei Hu <weh@linux.microsoft.com> wrote:

> From: Wei Hu <weh@microsoft.com>
> 
> Add support for handling hardware reset events in the MANA driver.
> When the MANA kernel driver receives a hardware service event, it
> initiates a device reset and notifies userspace via
> IBV_EVENT_DEVICE_FATAL. The DPDK driver handles this by performing
> an automatic teardown and recovery sequence.
> 
> The reset flow has two phases. In the enter phase, running on the
> EAL interrupt thread, the driver transitions the device state,
> waits for data path threads to reach a quiescent state using RCU,
> stops queues, tears down IB resources, and frees per-queue MR
> caches. A control thread is then spawned to handle the exit phase:
> it waits for the hardware to recover, unregisters the interrupt
> handler, re-probes the PCI device, reinitializes MR caches, and
> restarts queues.
> 
> A per-device spinlock serializes the reset path with ethdev
> operations. Operations that cannot wait (configure, queue setup)
> return -EBUSY during reset, while dev_stop and dev_close join the
> reset thread and use a blocking lock to ensure proper sequencing.
> 
> Multi-process support is included: secondary processes unmap and
> remap doorbell pages via IPC during the reset enter and exit
> phases. Data path functions in both primary and secondary
> processes check the device state atomically and return early when
> the device is not active.
> 
> The driver uses ethdev recovery events to notify upper layers
> (e.g. netvsc) of the reset lifecycle: RTE_ETH_EVENT_ERR_RECOVERING
> on entry, RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED on completion. A PCI device removal
> event callback distinguishes hot-remove from service reset.
> 
> Signed-off-by: Wei Hu <weh@microsoft.com>
> ---

Lots of issues found during AI review..

Subject: Re: [PATCH v2 1/1] net/mana: add device reset support

To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
In-Reply-To: <20260509060113.61686-2-weh@linux.microsoft.com>
References: <20260509060113.61686-1-weh@linux.microsoft.com>
 <20260509060113.61686-2-weh@linux.microsoft.com>

On Fri,  8 May 2026 23:01:13 -0700, Wei Hu wrote:
> Add support for handling hardware reset events in the MANA driver.

Several issues need attention before this can be merged.

Errors
------

1. Memory leak: priv->dev_state_qsv is never freed on close.

   In mana_probe_port (non-reset path) the qsbr buffer is allocated:

       priv->dev_state_qsv = rte_zmalloc_socket("mana_rcu", sz, ...);

   It is only freed in the probe-failure goto label.  mana_dev_close
   has no corresponding rte_free, so every successful close leaks
   the allocation.

2. Joinable reset thread is leaked when reset completes on its own.

   Both mana_reset_thread (skip path) and mana_reset_exit_delay
   (normal completion) do:

       priv->reset_thread_active = false;
       rte_spinlock_unlock(&priv->reset_ops_lock);
       return 0;

   The pthread is still joinable at this point.  If mana_dev_close
   runs afterward it sees active==false and skips rte_thread_join,
   leaving the TCB/stack unreaped.  The flag must only be cleared by
   whoever actually joins the thread; otherwise the reset thread
   should be detached (and then dev_close cannot wait for it, which
   is its own problem).  Either way the current pattern is wrong.

3. Spinlock held across blocking operations.

   priv->reset_ops_lock is rte_spinlock_t and is held across:

     - mana_reset_enter:      mana_dev_stop, mana_dev_close,
                              ibv_close_device,
                              mana_mp_req_on_rxtx (5s IPC timeout)
     - mana_reset_exit_delay: ibv_close_device, mana_pci_probe,
                              mana_mp_req_on_rxtx, mana_dev_start

   IPC with a 5-second timeout and device probe under a spinlock is
   not acceptable.  Use a sleeping mutex (pthread_mutex_t initialized
   with PTHREAD_PROCESS_SHARED, since priv is in shared memory), or
   split the lock so the long operations run outside it.

4. pthread_mutex_init / pthread_cond_init without PTHREAD_PROCESS_SHARED.

   priv is allocated with rte_zmalloc_socket, so reset_cond_mutex and
   reset_cond live in hugepage memory mapped into every secondary.
   The patch initializes them with NULL attributes:

       pthread_mutex_init(&priv->reset_cond_mutex, NULL);
       pthread_cond_init(&priv->reset_cond, NULL);

   Even if only the primary uses them today, placement in shared
   memory requires the process-shared attribute.  Otherwise behavior
   is undefined and the choice will silently break the day a
   secondary starts touching these fields.

Warnings
--------

5. Secondary data path has no synchronization with the doorbell unmap.

   MANA_MP_REQ_RESET_ENTER causes the secondary MP handler to munmap
   proc_priv->db_page and set it to NULL.  The secondary's rx_burst /
   tx_burst protect themselves only with an atomic state check:

       rte_rcu_qsbr_thread_online(dstate_qsv, tid);
       if (state != MANA_DEV_ACTIVE || !db_page) { ... return 0; }

   But the qsbr only has primary threads registered (registration
   happens in mana_dev_configure, which never runs in secondary), so
   thread_online/offline in secondary do not block the primary's
   qsbr_check.  The MP handler in secondary therefore unmaps db_page
   while a peer secondary lcore can still be inside rx/tx_burst with
   a stale pointer.  Result: SIGSEGV on the next doorbell write.

   The secondary needs its own quiescence mechanism (a per-process
   qsbr or a reader-side rwlock around the data path that the MP
   handler acquires before unmap).

6. Return value of rte_dev_event_callback_unregister is discarded.

   In mana_intr_uninstall:

       if (dev->device)
               rte_dev_event_callback_unregister(dev->device->name,
                                                 mana_pci_remove_event_cb,
                                                 priv);

   The API returns -EAGAIN if the callback is currently executing.
   When that happens the unregister silently fails, the callback
   stays on the list, and any later free of priv produces a
   use-after-free the next time the EAL device-event thread
   dispatches.

7. PCI-remove callback depends on a monitor nobody starts.

   rte_dev_event_callback_register only wires the callback into the
   dispatch table; events are only delivered when something calls
   rte_dev_event_monitor_start().  The patch never calls it.  Whether
   this works in practice depends on the application (testpmd,
   netvsc) calling it first.  At minimum document the requirement;
   better, have the driver start the monitor on first probe and stop
   it on last remove.

8. No documentation or release-note update.

   This is a substantial new feature.  doc/guides/rel_notes/ and
   doc/guides/nics/mana.rst should both be updated to describe the
   reset lifecycle, the RTE_ETH_EVENT_ERR_RECOVERING /
   RECOVERY_SUCCESS / RECOVERY_FAILED behavior, and the dependency
   on rte_dev_event_monitor_start.

9. priv->reset_thread_active is bool, accessed without atomic ops.

   It is read in mana_dev_stop_lock and mana_dev_close_lock before
   the spinlock is acquired (intentionally, to avoid deadlock with
   the reset thread), and written from at least three threads (intr
   handler, reset thread, dev_close).  It should be RTE_ATOMIC(bool)
   with explicit ordering.  This compounds issue #2.

10. mana_intr_handler races with mana_pci_remove_event_cb on dev_state.

    intr_handler does:

        if (state == MANA_DEV_ACTIVE) {
                rte_spinlock_lock(&priv->reset_ops_lock);
                mana_reset_enter(priv);   /* stores RESET_ENTER */

    pci_remove_event_cb stores RESET_FAILED before taking the lock.
    If intr_handler reads ACTIVE, pci_remove fires (sets
    RESET_FAILED) while intr_handler is waiting on the lock,
    intr_handler then acquires it and overwrites RESET_FAILED with
    RESET_ENTER, and the driver attempts a reset on a device that
    has already been physically removed.  The state-set in
    pci_remove_event_cb must occur under the lock, or the
    intr_handler must re-check state after taking the lock.

Info
----

11. (void *)0 is used in mp.c (proc_priv->db_page = (void *)0; and
    if (proc_priv->db_page != 0)).  Use NULL.

12. The hand-off of reset_ops_lock from mana_reset_enter to
    mana_reset_thread to mana_reset_exit_delay (released in the
    latter) is hard to follow.  A short comment block at
    mana_reset_enter listing which functions take the lock and which
    release it would help future maintainers.

13. The widened TOCTOU window in mana_tx_burst (db_page is now read
    at the top of the function rather than just before the doorbell)
    is not itself a bug, but it makes #5 worse.  Once #5 is fixed
    via a proper reader-side primitive, consider keeping the db_page
    read inside the critical section.

  reply	other threads:[~2026-05-18  3:49 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-09  6:01 [PATCH v2 0/1] net/mana: add device reset support Wei Hu
2026-05-09  6:01 ` [PATCH v2 1/1] " Wei Hu
2026-05-18  3:48   ` Stephen Hemminger [this message]
     [not found]     ` <SA1PR21MB66836A17CA584763D4D3F8A6CE002@SA1PR21MB6683.namprd21.prod.outlook.com>
2026-05-20  6:32       ` [EXTERNAL] " Wei Hu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260517204858.7a8a9200@phoenix.local \
    --to=stephen@networkplumber.org \
    --cc=dev@dpdk.org \
    --cc=longli@microsoft.com \
    --cc=weh@linux.microsoft.com \
    --cc=weh@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.