All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stephen Hemminger <stephen@networkplumber.org>
To: longli@linux.microsoft.com
Cc: dev@dpdk.org, Wei Hu <weh@microsoft.com>,
	stable@dpdk.org, Dariusz Sosnowski <dsosnowski@nvidia.com>,
	Viacheslav Ovsiienko <viacheslavo@nvidia.com>,
	Bing Zhao <bingz@nvidia.com>, Ori Kam <orika@nvidia.com>,
	Suanming Mou <suanmingm@nvidia.com>,
	Matan Azrad <matan@nvidia.com>, Long Li <longli@microsoft.com>
Subject: Re: [PATCH v4 0/7] fix multi-process VF hotplug
Date: Thu, 26 Feb 2026 11:37:24 -0800	[thread overview]
Message-ID: <20260226113724.3b5ca20c@phoenix.local> (raw)
In-Reply-To: <20260226023940.961844-1-longli@linux.microsoft.com>

On Wed, 25 Feb 2026 18:39:31 -0800
longli@linux.microsoft.com wrote:

> From: Long Li <longli@microsoft.com>
> 
> Fix several issues with VF hotplug and multi-process support in
> netvsc, mana, mlx5, and mlx4 drivers:
> 
> - Fix race conditions between VSP notifications and DPDK device events
>   during VF add/remove, with proper locking of VF-related fields
> - Add multi-process communication infrastructure for coordinating VF
>   removal across primary and secondary processes
> - Fix Protection Domain resource leak on device close in mana
> - Fix devargs memory leak during VF hotplug in netvsc
> - Fix fast-path ops (rte_eth_fp_ops) setup in secondary processes for
>   mana, mlx5, and mlx4, preventing segfaults on VF hot-add
> 
> v4:
> - Patch 1: Check hn_vf_add() return value in netvsc_hotplug_retry
> - Patch 1: Track fresh_attach to avoid tearing down original VF
>   attachment when configure/start fails on an -EEXIST path
> - Patch 2: Move counter decrement and netvsc_uninit_once() after device
>   cleanup in eth_hn_remove() to prevent use-after-free of shared data
> - Patch 2: Clear netvsc_shared_data on init failure paths to prevent
>   dangling pointer
> 
> v3:
> - Fix review comments from v2
> 
> v2:
> - Initial rework of VF add/remove locking
> 
> Long Li (7):
>   net/netvsc: fix race conditions on VF add/remove events
>   net/netvsc: add multi-process VF device removal support
>   net/mana: fix PD resource leak on device close
>   net/netvsc: fix devargs memory leak on hotplug
>   net/mana: fix fast-path ops setup in secondary process
>   net/mlx5: fix fast-path ops setup in secondary process
>   net/mlx4: fix fast-path ops setup in secondary process
> 
>  drivers/net/mana/mana.c             |  14 ++
>  drivers/net/mana/mp.c               |   6 +
>  drivers/net/mlx4/mlx4_mp.c          |   4 +
>  drivers/net/mlx5/linux/mlx5_mp_os.c |   4 +
>  drivers/net/netvsc/hn_ethdev.c      | 300 +++++++++++++++++++++++++++-
>  drivers/net/netvsc/hn_nvs.h         |   6 +
>  drivers/net/netvsc/hn_rxtx.c        |  40 ++--
>  drivers/net/netvsc/hn_var.h         |   1 +
>  drivers/net/netvsc/hn_vf.c          | 148 ++++++++------
>  9 files changed, 431 insertions(+), 92 deletions(-)
> 

Looks to be in good shape, you may want to address these errors
flagged by AI review.

Critical Finding Summary
Patch	Severity	Issue
5, 6, 7	Error	START_RXTX handler does not restore rte_eth_fp_ops burst function pointers after STOP clears them — secondary Rx/Tx broken after STOP+START cycle
3	Warning	PD deallocation order: parent freed before child may cause EBUSY; reverse the order
1	Warning	switch_data_path failure leaves VF attached/started but unused
1	Warning	eth_hn_dev_init ignores hn_vf_add_unlocked error (pre-existing)
2	Info	secondary_cnt atomics use acquire/release where relaxed suffices


# DPDK VF Hotplug Patch Series v4 Review

**Series**: [PATCH v4 1/7] through [PATCH v4 7/7]
**Author**: Long Li <longli@microsoft.com>
**Reviewed against**: AGENTS.md DPDK Code Review Guidelines

---

## Summary

This series addresses race conditions and resource management issues in the netvsc/mana/mlx5/mlx4 drivers around VF hotplug, multi-process coordination, and fast-path ops setup. The locking rework in patch 1 is substantial and well-structured. The multi-process infrastructure in patch 2 is a significant addition. Commit messages throughout are excellent — clear root cause analysis, well-documented change rationale, and honest notes about limitations.

Subject lines are all within 60 characters, Fixes tags all have 12-character SHAs, tag ordering is correct throughout, and commit body line wrapping is clean.

---

## Patch 1: net/netvsc: fix race conditions on VF add/remove events

### Correctness analysis

The locking redesign is solid:

- **Rx/Tx fast path**: Moving the lock acquisition outside the `vf_vsc_switched` check eliminates the TOCTOU race where the flag was checked without the lock, then the VF could be removed before the lock was acquired. The old double-check pattern was racy.

- **`hn_vf_close`**: Upgrading from `rte_rwlock_read_lock` to `rte_rwlock_write_lock` is correct — the function modifies `vf_attached`, which is a write operation.

- **`hn_vf_detach`/`fresh_attach` pattern**: On `-EEXIST` from `hn_vf_attach`, the `fresh_attach = false` flag correctly prevents `hn_vf_detach` from tearing down a previously valid VF attachment on configure/start failure. This is a subtle and well-handled case.

- **Callback registration moved to `hn_vf_attach`**: Registering the RMV callback immediately on attach (with rollback on failure) is cleaner than the old placement in `hn_vf_configure`.

### Warning: `hn_vf_add_unlocked` does not clean up VF on `switch_data_path` failure

If `hn_nvs_set_datapath(NVS_DATAPATH_VF)` fails at the `switch_data_path` label, the function returns the error but leaves the VF attached and possibly started. The VF sits running but unused (traffic goes through synthetic path). The self-healing retry mechanism (re-entry through `hn_nvs_handle_vfassoc` or `netvsc_hotplug_retry`) would eventually retry, so this is not a crash bug, but a `goto detach` on data path switch failure would be cleaner and avoid the VF running in a wasted state.

### Warning: `eth_hn_dev_init` silently ignores `hn_vf_add_unlocked` error

```c
rte_rwlock_write_lock(&hv->vf_lock);
if (hv->vf_ctx.vf_vsp_reported && !hv->vf_ctx.vf_vsc_switched) {
    PMD_INIT_LOG(DEBUG, "Adding VF device");
    err = hn_vf_add_unlocked(eth_dev, hv);
}
rte_rwlock_write_unlock(&hv->vf_lock);

return 0;  /* err is ignored */
```

The `err` from `hn_vf_add_unlocked` is stored but never checked — the function always returns 0. This is pre-existing behavior, but worth noting as it could mask VF setup failures during init.

---

## Patch 2: net/netvsc: add multi-process VF device removal support

### Correctness analysis

The multi-process infrastructure is well-designed:

- **Shared memzone** for `secondary_cnt` with proper atomic operations
- **Spinlock** protecting the one-time init/uninit sequence
- **`rte_mp_request_sync`** with timeout for reliable cross-process coordination
- **ENOTSUP** handling for single-process mode

### Info: Memory ordering stronger than necessary

The `secondary_cnt` atomic uses `rte_memory_order_release` for stores and `rte_memory_order_acquire` for loads. Since this counter doesn't guard any other shared data (it's just a "should I bother sending MP requests?" optimization), `rte_memory_order_relaxed` would suffice. The current ordering is correct but unnecessarily strong.

### Info: `netvsc_mp_primary_handle` is a stub

```c
static int
netvsc_mp_primary_handle(const struct rte_mp_msg *mp_msg __rte_unused,
                          const void *peer __rte_unused)
{
    /* Stub function required for multi-process message handling registration */
    return 0;
}
```

The comment explains this is needed for registration. This is fine — the primary currently only sends, never receives. If future message types require primary handling, this stub is the extension point.

---

## Patch 3: net/mana: fix PD resource leak on device close

### Warning: PD deallocation order may be wrong

The code frees `ib_parent_pd` before `ib_pd`:

```c
if (priv->ib_parent_pd) {
    int err = ibv_dealloc_pd(priv->ib_parent_pd);
    ...
}
if (priv->ib_pd) {
    int err = ibv_dealloc_pd(priv->ib_pd);
    ...
}
```

If `ib_pd` is a child/derived PD of `ib_parent_pd`, `ibv_dealloc_pd(ib_parent_pd)` will fail with EBUSY because the child still holds a reference. The error is logged and the code continues, so `ib_pd` gets freed next, and `ibv_close_device` cleans up the orphaned parent PD. This works but is fragile.

Reversing the order (free `ib_pd` first, then `ib_parent_pd`) would be cleaner and avoid the EBUSY error on the parent. If the PDs are independent (no parent-child verbs relationship), the order doesn't matter, but the naming strongly suggests a hierarchy.

---

## Patch 4: net/netvsc: fix devargs memory leak on hotplug

No issues. The `rte_devargs_reset()` calls are correctly placed in both cleanup paths before `free(hot_ctx)`.

---

## Patches 5, 6, 7: net/mana, net/mlx5, net/mlx4: fix fast-path ops setup in secondary process

### Error: `rte_eth_fp_ops` burst function pointers not updated in START_RXTX handler

All three patches update `rte_eth_fp_ops[].rxq.data` and `txq.data` in the START handler, and update `rte_eth_fp_ops[].rx_pkt_burst` and `tx_pkt_burst` in the STOP handler. But the START handler does **not** restore the burst function pointers.

After a STOP → START cycle:
1. STOP sets `rte_eth_fp_ops[port].rx_pkt_burst = dummy` (newly added by this patch)
2. START sets `rte_eth_fp_ops[port].rxq.data = dev->data->rx_queues` but does NOT restore `rx_pkt_burst`
3. Secondary's `rte_eth_rx_burst()` calls `rte_eth_fp_ops[port].rx_pkt_burst(...)` which is still the dummy → **Rx/Tx broken**

Since `rte_eth_rx_burst()` uses `rte_eth_fp_ops` (process-local), not `dev->rx_pkt_burst` (shared), the secondary cannot receive or transmit after a STOP+START cycle even though `dev->rx_pkt_burst` was correctly set.

This may not manifest on the **first** start (where `rte_eth_dev_probing_finish()` sets fp_ops during probe), and it may be masked if the VF is re-probed during hot-add. But it will manifest on any in-place STOP+START cycle without re-probe.

**Suggested fix** — add to each START_RXTX handler:

For patch 5 (mana):
```c
rte_eth_fp_ops[param->port_id].rx_pkt_burst = mana_rx_burst;
rte_eth_fp_ops[param->port_id].tx_pkt_burst = mana_tx_burst;
```

For patch 6 (mlx5):
```c
rte_eth_fp_ops[param->port_id].rx_pkt_burst = dev->rx_pkt_burst;
rte_eth_fp_ops[param->port_id].tx_pkt_burst = dev->tx_pkt_burst;
```

For patch 7 (mlx4), same as mlx5.

Place these **before** the `rte_mb()` call so the barrier covers all fp_ops updates.

---

## Critical Finding Summary

| Patch | Severity | Issue |
|-------|----------|-------|
| 5, 6, 7 | **Error** | START_RXTX handler does not restore `rte_eth_fp_ops` burst function pointers after STOP clears them — secondary Rx/Tx broken after STOP+START cycle |
| 3 | Warning | PD deallocation order: parent freed before child may cause EBUSY; reverse the order |
| 1 | Warning | `switch_data_path` failure leaves VF attached/started but unused |
| 1 | Warning | `eth_hn_dev_init` ignores `hn_vf_add_unlocked` error (pre-existing) |
| 2 | Info | `secondary_cnt` atomics use acquire/release where relaxed suffices |

  parent reply	other threads:[~2026-02-26 19:37 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-26  2:39 [PATCH v4 0/7] fix multi-process VF hotplug longli
2026-02-26  2:39 ` [PATCH v4 1/7] net/netvsc: fix race conditions on VF add/remove events longli
2026-02-26  2:39 ` [PATCH v4 2/7] net/netvsc: add multi-process VF device removal support longli
2026-02-26 18:51   ` Stephen Hemminger
2026-02-27  0:03     ` [EXTERNAL] " Long Li
2026-02-26  2:39 ` [PATCH v4 3/7] net/mana: fix PD resource leak on device close longli
2026-02-26  2:39 ` [PATCH v4 4/7] net/netvsc: fix devargs memory leak on hotplug longli
2026-02-26  2:39 ` [PATCH v4 5/7] net/mana: fix fast-path ops setup in secondary process longli
2026-02-26  2:39 ` [PATCH v4 6/7] net/mlx5: " longli
2026-02-26  2:39 ` [PATCH v4 7/7] net/mlx4: " longli
2026-02-26 19:37 ` Stephen Hemminger [this message]
2026-02-27  1:02   ` [EXTERNAL] Re: [PATCH v4 0/7] fix multi-process VF hotplug Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260226113724.3b5ca20c@phoenix.local \
    --to=stephen@networkplumber.org \
    --cc=bingz@nvidia.com \
    --cc=dev@dpdk.org \
    --cc=dsosnowski@nvidia.com \
    --cc=longli@linux.microsoft.com \
    --cc=longli@microsoft.com \
    --cc=matan@nvidia.com \
    --cc=orika@nvidia.com \
    --cc=stable@dpdk.org \
    --cc=suanmingm@nvidia.com \
    --cc=viacheslavo@nvidia.com \
    --cc=weh@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.