All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stephen Hemminger <stephen@networkplumber.org>
To: Wei Hu <weh@linux.microsoft.com>
Cc: dev@dpdk.org, longli@microsoft.com, weh@microsoft.com
Subject: Re: [PATCH v9 0/1] net/mana: add device reset support
Date: Mon, 15 Jun 2026 11:50:28 -0700	[thread overview]
Message-ID: <20260615115028.5fa705c3@phoenix.local> (raw)
In-Reply-To: <20260612081723.27699-1-weh@linux.microsoft.com>

On Fri, 12 Jun 2026 01:17:22 -0700
Wei Hu <weh@linux.microsoft.com> wrote:

> From: Wei Hu <weh@microsoft.com>
> 
> Add support for handling hardware service reset events in the
> MANA driver. When the MANA kernel driver receives a hardware
> service event, it initiates a device reset and notifies userspace
> via IBV_EVENT_DEVICE_FATAL. The MANA PMD handles this by
> performing an automatic teardown and recovery sequence.
> 
> The driver uses ethdev recovery events (ERR_RECOVERING,
> RECOVERY_SUCCESS, RECOVERY_FAILED) to notify upper layers of
> the reset lifecycle, and a PCI device removal event callback
> to distinguish hot-remove from service reset.
> 
> Changes since v8:
> - Fixed reset thread resource leak: previously reset_thread_active
>   was cleared before emitting recovery callbacks, so no join site
>   would reap the thread. Now the flag stays true throughout the
>   thread lifetime. mana_join_reset_thread detects the self-join
>   case (callback calling dev_stop/dev_close from the reset thread)
>   using rte_thread_equal and calls rte_thread_detach instead of
>   join, so thread resources are freed on exit. External callers
>   continue to join normally.
> - Fixed lost condvar signal: added a predicate loop around
>   pthread_cond_timedwait that checks dev_state under
>   reset_cond_mutex. If mana_pci_remove_event_cb signals before
>   the reset thread enters the wait, the wakeup is no longer lost.
>   The PCI remove callback sets dev_state to RESET_FAILED under
>   the same mutex before signaling.
> - Added a lock/unlock barrier on reset_ops_lock in
>   mana_pci_remove_event_cb to ensure teardown has completed
>   before emitting the INTR_RMV event.
> - Fixed mana_reset_exit_delay return type from uint32_t to int
>   to match the negative error codes it stores.
> - Removed unnecessary else-after-goto in mana_probe_port.
> 
> Changes since v7:
> - Moved heavy teardown (dev_stop, IPC to secondaries, dev_close,
>   MR btree free) from mana_reset_enter (EAL interrupt thread)
>   to mana_reset_thread (control thread). The interrupt handler
>   now only sets state, drains in-flight bursts, and spawns the
>   thread. Teardown runs immediately in the control thread before
>   the recovery timer wait, avoiding blocking the interrupt thread
>   on multi-second IPC timeouts and ibverbs calls. Each function
>   now owns its own lock scope with no lock hand-off between
>   threads.
> - Simplified burst_state from encoding device state in bits 1+
>   to a single blocked flag (bit 1). Only one value was ever
>   stored, so the multi-state encoding was misleading. Added
>   MANA_BURST_BLOCKED constant.
> - Updated mana.rst to reflect that teardown runs on the control
>   thread, not the interrupt handler.
> 
> Changes since v6:
> - Rebased onto latest upstream for-main
> - Replaced removed RTE_ETH_DEV_TO_PCI macro with
>   RTE_CLASS_TO_BUS_DEVICE (upstream commit 4757b8df04
>   removed the old bus-specific ethdev convenience macros)
> 
> Changes since v5:
> - Replaced RCU QSBR with per-queue atomic burst_state using a
>   single-variable CAS design: bit 0 is the in-burst flag, bit 1
>   is the blocked flag. The data path uses CAS(0→1) to enter
>   burst and fetch_and(~1) to exit. The reset path uses fetch_or
>   to set the blocked bit and polls bit 0 to drain in-flight
>   bursts. This eliminates the two-variable Dekker pattern and the
>   need for sequential consistency (seq_cst) ordering.
> - Removed librte_rcu dependency
> - Removed __rte_no_thread_safety_analysis annotations (no longer
>   needed after mutex conversion)
> - Moved ERR_RECOVERING event emission before acquiring
>   reset_ops_lock and before mana_reset_enter, so upper layers
>   (e.g. netvsc) can switch data path before mana stops queues.
>   Emitting outside the lock avoids deadlock if the callback
>   calls dev_stop or dev_close.
> - Replaced MANA_OPS_*_LOCK macros with mana_reset_trylock()
>   helper function and explicit per-operation wrappers
> - Removed unused rte_alarm.h and rte_lock_annotations.h includes
> - Added RECOVERY_FAILED event when mana_reset_enter fails
>   internally, so the application always receives a terminal event
> - Added mana_clear_burst_state() helper to clear per-queue
>   burst_state on failure paths (reset_failed, dev_stop_lock,
>   dev_close_lock) preventing permanent silent packet drop after
>   a failed reset
> 
> Changes since v4:
> - Fixed stale rte_spinlock_unlock call in mana_intr_handler that
>   was missed during the spinlock-to-mutex conversion, causing a
>   -Wincompatible-pointer-types warning
> 
> Changes since v3:
> - Converted reset_ops_lock from rte_spinlock_t to pthread_mutex_t
>   with PTHREAD_PROCESS_SHARED, since the lock is held across
>   blocking IB verbs calls and IPC with 5s timeout
> - Removed rte_dev_event_callback_unregister retry loop to avoid
>   deadlock when interrupt thread and reset thread contend
> 
> Changes since v2:
> - Added per-queue burst_state atomic variable with Dekker-like
>   synchronization to block data path during reset without RCU
> - Replaced rte_alarm with condvar + control thread for reset exit
> - Made reset_thread_active atomic with CAS — flag is set by
>   creator and only cleared by the joiner, not the thread itself
> - Fixed second reset crash: removed reset thread join logic from
>   mana_dev_close (inner function) to avoid corrupting dev_state
>   when called from mana_reset_enter
> - Made reset_thread_active RTE_ATOMIC(bool) with explicit ordering
> - Added retry loop for rte_dev_event_callback_unregister on -EAGAIN
> - Initialized condvar/mutex with PTHREAD_PROCESS_SHARED since priv
>   is in hugepage shared memory
> - Added re-check of dev_state after lock acquisition in
>   mana_intr_handler to prevent racing with pci_remove_event_cb
> - Replaced (void *)0 with NULL in mp.c
> - Added lock ownership comment block at mana_reset_enter
> - Documented rte_dev_event_monitor_start() requirement
> - Added mana.rst documentation and release note
> 
> Changes since v1:
> - Removed net/netvsc patch from this series
> - Simplified reset exit: mana_reset_exit calls
>   mana_reset_exit_delay directly instead of spawning a thread
> - Added __rte_no_thread_safety_analysis annotations for clang
> - Switched to rte_thread_create_internal_control
> - Fixed declaration-after-statement style issues
> - Removed unnecessary blank lines and stale comments
> 
> Wei Hu (1):
>   net/mana: add device reset support
> 
>  doc/guides/nics/mana.rst               |   40 +
>  doc/guides/rel_notes/release_26_07.rst |    8 +
>  drivers/net/mana/mana.c                | 1088 ++++++++++++++++++++++--
>  drivers/net/mana/mana.h                |   52 +-
>  drivers/net/mana/mp.c                  |   89 +-
>  drivers/net/mana/mr.c                  |    6 +-
>  drivers/net/mana/rx.c                  |   23 +-
>  drivers/net/mana/tx.c                  |   44 +-
>  8 files changed, 1242 insertions(+), 108 deletions(-)
> 

Looks good, one bit of AI review.
---
The self-join and the condvar wait both look right now. The reset thread keeps
reset_thread_active true for its whole life and mana_join_reset_thread is the
only place that transitions it, using rte_thread_equal to detach on the
self-call and join otherwise, so the thread is reaped exactly once with no leak
and no self-join deadlock. The predicate loop under reset_cond_mutex closes the
lost-wakeup window. Reset logic is in good shape.

One small thing in mp.c: in the RESET_EXIT secondary handler the received fd is
only closed on the branch that maps it. If proc_priv->db_page is already
non-NULL the fd from the message is leaked. Close it whenever num_fds >= 1,
outside the if/else.
---
I can merge it as is, or you can send a revision to close that minor leak.

      parent reply	other threads:[~2026-06-15 18:50 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12  8:17 [PATCH v9 0/1] net/mana: add device reset support Wei Hu
2026-06-12  8:17 ` [PATCH v9 1/1] " Wei Hu
2026-06-15 18:50 ` Stephen Hemminger [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260615115028.5fa705c3@phoenix.local \
    --to=stephen@networkplumber.org \
    --cc=dev@dpdk.org \
    --cc=longli@microsoft.com \
    --cc=weh@linux.microsoft.com \
    --cc=weh@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.