public inbox for dev@dpdk.org
 help / color / mirror / Atom feed
From: Stephen Hemminger <stephen@networkplumber.org>
To: longli@linux.microsoft.com
Cc: dev@dpdk.org, Wei Hu <weh@microsoft.com>,
	stable@dpdk.org, Dariusz Sosnowski <dsosnowski@nvidia.com>,
	Viacheslav Ovsiienko <viacheslavo@nvidia.com>,
	Bing Zhao <bingz@nvidia.com>, Ori Kam <orika@nvidia.com>,
	Suanming Mou <suanmingm@nvidia.com>,
	Matan Azrad <matan@nvidia.com>, Long Li <longli@microsoft.com>
Subject: Re: [PATCH v4 0/7] fix multi-process VF hotplug
Date: Thu, 26 Feb 2026 11:37:24 -0800	[thread overview]
Message-ID: <20260226113724.3b5ca20c@phoenix.local> (raw)
In-Reply-To: <20260226023940.961844-1-longli@linux.microsoft.com>

On Wed, 25 Feb 2026 18:39:31 -0800
longli@linux.microsoft.com wrote:

> From: Long Li <longli@microsoft.com>
> 
> Fix several issues with VF hotplug and multi-process support in
> netvsc, mana, mlx5, and mlx4 drivers:
> 
> - Fix race conditions between VSP notifications and DPDK device events
>   during VF add/remove, with proper locking of VF-related fields
> - Add multi-process communication infrastructure for coordinating VF
>   removal across primary and secondary processes
> - Fix Protection Domain resource leak on device close in mana
> - Fix devargs memory leak during VF hotplug in netvsc
> - Fix fast-path ops (rte_eth_fp_ops) setup in secondary processes for
>   mana, mlx5, and mlx4, preventing segfaults on VF hot-add
> 
> v4:
> - Patch 1: Check hn_vf_add() return value in netvsc_hotplug_retry
> - Patch 1: Track fresh_attach to avoid tearing down original VF
>   attachment when configure/start fails on an -EEXIST path
> - Patch 2: Move counter decrement and netvsc_uninit_once() after device
>   cleanup in eth_hn_remove() to prevent use-after-free of shared data
> - Patch 2: Clear netvsc_shared_data on init failure paths to prevent
>   dangling pointer
> 
> v3:
> - Fix review comments from v2
> 
> v2:
> - Initial rework of VF add/remove locking
> 
> Long Li (7):
>   net/netvsc: fix race conditions on VF add/remove events
>   net/netvsc: add multi-process VF device removal support
>   net/mana: fix PD resource leak on device close
>   net/netvsc: fix devargs memory leak on hotplug
>   net/mana: fix fast-path ops setup in secondary process
>   net/mlx5: fix fast-path ops setup in secondary process
>   net/mlx4: fix fast-path ops setup in secondary process
> 
>  drivers/net/mana/mana.c             |  14 ++
>  drivers/net/mana/mp.c               |   6 +
>  drivers/net/mlx4/mlx4_mp.c          |   4 +
>  drivers/net/mlx5/linux/mlx5_mp_os.c |   4 +
>  drivers/net/netvsc/hn_ethdev.c      | 300 +++++++++++++++++++++++++++-
>  drivers/net/netvsc/hn_nvs.h         |   6 +
>  drivers/net/netvsc/hn_rxtx.c        |  40 ++--
>  drivers/net/netvsc/hn_var.h         |   1 +
>  drivers/net/netvsc/hn_vf.c          | 148 ++++++++------
>  9 files changed, 431 insertions(+), 92 deletions(-)
> 

Looks to be in good shape, you may want to address these errors
flagged by AI review.

Critical Finding Summary
Patch	Severity	Issue
5, 6, 7	Error	START_RXTX handler does not restore rte_eth_fp_ops burst function pointers after STOP clears them — secondary Rx/Tx broken after STOP+START cycle
3	Warning	PD deallocation order: parent freed before child may cause EBUSY; reverse the order
1	Warning	switch_data_path failure leaves VF attached/started but unused
1	Warning	eth_hn_dev_init ignores hn_vf_add_unlocked error (pre-existing)
2	Info	secondary_cnt atomics use acquire/release where relaxed suffices


# DPDK VF Hotplug Patch Series v4 Review

**Series**: [PATCH v4 1/7] through [PATCH v4 7/7]
**Author**: Long Li <longli@microsoft.com>
**Reviewed against**: AGENTS.md DPDK Code Review Guidelines

---

## Summary

This series addresses race conditions and resource management issues in the netvsc/mana/mlx5/mlx4 drivers around VF hotplug, multi-process coordination, and fast-path ops setup. The locking rework in patch 1 is substantial and well-structured. The multi-process infrastructure in patch 2 is a significant addition. Commit messages throughout are excellent — clear root cause analysis, well-documented change rationale, and honest notes about limitations.

Subject lines are all within 60 characters, Fixes tags all have 12-character SHAs, tag ordering is correct throughout, and commit body line wrapping is clean.

---

## Patch 1: net/netvsc: fix race conditions on VF add/remove events

### Correctness analysis

The locking redesign is solid:

- **Rx/Tx fast path**: Moving the lock acquisition outside the `vf_vsc_switched` check eliminates the TOCTOU race where the flag was checked without the lock, then the VF could be removed before the lock was acquired. The old double-check pattern was racy.

- **`hn_vf_close`**: Upgrading from `rte_rwlock_read_lock` to `rte_rwlock_write_lock` is correct — the function modifies `vf_attached`, which is a write operation.

- **`hn_vf_detach`/`fresh_attach` pattern**: On `-EEXIST` from `hn_vf_attach`, the `fresh_attach = false` flag correctly prevents `hn_vf_detach` from tearing down a previously valid VF attachment on configure/start failure. This is a subtle and well-handled case.

- **Callback registration moved to `hn_vf_attach`**: Registering the RMV callback immediately on attach (with rollback on failure) is cleaner than the old placement in `hn_vf_configure`.

### Warning: `hn_vf_add_unlocked` does not clean up VF on `switch_data_path` failure

If `hn_nvs_set_datapath(NVS_DATAPATH_VF)` fails at the `switch_data_path` label, the function returns the error but leaves the VF attached and possibly started. The VF sits running but unused (traffic goes through synthetic path). The self-healing retry mechanism (re-entry through `hn_nvs_handle_vfassoc` or `netvsc_hotplug_retry`) would eventually retry, so this is not a crash bug, but a `goto detach` on data path switch failure would be cleaner and avoid the VF running in a wasted state.

### Warning: `eth_hn_dev_init` silently ignores `hn_vf_add_unlocked` error

```c
rte_rwlock_write_lock(&hv->vf_lock);
if (hv->vf_ctx.vf_vsp_reported && !hv->vf_ctx.vf_vsc_switched) {
    PMD_INIT_LOG(DEBUG, "Adding VF device");
    err = hn_vf_add_unlocked(eth_dev, hv);
}
rte_rwlock_write_unlock(&hv->vf_lock);

return 0;  /* err is ignored */
```

The `err` from `hn_vf_add_unlocked` is stored but never checked — the function always returns 0. This is pre-existing behavior, but worth noting as it could mask VF setup failures during init.

---

## Patch 2: net/netvsc: add multi-process VF device removal support

### Correctness analysis

The multi-process infrastructure is well-designed:

- **Shared memzone** for `secondary_cnt` with proper atomic operations
- **Spinlock** protecting the one-time init/uninit sequence
- **`rte_mp_request_sync`** with timeout for reliable cross-process coordination
- **ENOTSUP** handling for single-process mode

### Info: Memory ordering stronger than necessary

The `secondary_cnt` atomic uses `rte_memory_order_release` for stores and `rte_memory_order_acquire` for loads. Since this counter doesn't guard any other shared data (it's just a "should I bother sending MP requests?" optimization), `rte_memory_order_relaxed` would suffice. The current ordering is correct but unnecessarily strong.

### Info: `netvsc_mp_primary_handle` is a stub

```c
static int
netvsc_mp_primary_handle(const struct rte_mp_msg *mp_msg __rte_unused,
                          const void *peer __rte_unused)
{
    /* Stub function required for multi-process message handling registration */
    return 0;
}
```

The comment explains this is needed for registration. This is fine — the primary currently only sends, never receives. If future message types require primary handling, this stub is the extension point.

---

## Patch 3: net/mana: fix PD resource leak on device close

### Warning: PD deallocation order may be wrong

The code frees `ib_parent_pd` before `ib_pd`:

```c
if (priv->ib_parent_pd) {
    int err = ibv_dealloc_pd(priv->ib_parent_pd);
    ...
}
if (priv->ib_pd) {
    int err = ibv_dealloc_pd(priv->ib_pd);
    ...
}
```

If `ib_pd` is a child/derived PD of `ib_parent_pd`, `ibv_dealloc_pd(ib_parent_pd)` will fail with EBUSY because the child still holds a reference. The error is logged and the code continues, so `ib_pd` gets freed next, and `ibv_close_device` cleans up the orphaned parent PD. This works but is fragile.

Reversing the order (free `ib_pd` first, then `ib_parent_pd`) would be cleaner and avoid the EBUSY error on the parent. If the PDs are independent (no parent-child verbs relationship), the order doesn't matter, but the naming strongly suggests a hierarchy.

---

## Patch 4: net/netvsc: fix devargs memory leak on hotplug

No issues. The `rte_devargs_reset()` calls are correctly placed in both cleanup paths before `free(hot_ctx)`.

---

## Patches 5, 6, 7: net/mana, net/mlx5, net/mlx4: fix fast-path ops setup in secondary process

### Error: `rte_eth_fp_ops` burst function pointers not updated in START_RXTX handler

All three patches update `rte_eth_fp_ops[].rxq.data` and `txq.data` in the START handler, and update `rte_eth_fp_ops[].rx_pkt_burst` and `tx_pkt_burst` in the STOP handler. But the START handler does **not** restore the burst function pointers.

After a STOP → START cycle:
1. STOP sets `rte_eth_fp_ops[port].rx_pkt_burst = dummy` (newly added by this patch)
2. START sets `rte_eth_fp_ops[port].rxq.data = dev->data->rx_queues` but does NOT restore `rx_pkt_burst`
3. Secondary's `rte_eth_rx_burst()` calls `rte_eth_fp_ops[port].rx_pkt_burst(...)` which is still the dummy → **Rx/Tx broken**

Since `rte_eth_rx_burst()` uses `rte_eth_fp_ops` (process-local), not `dev->rx_pkt_burst` (shared), the secondary cannot receive or transmit after a STOP+START cycle even though `dev->rx_pkt_burst` was correctly set.

This may not manifest on the **first** start (where `rte_eth_dev_probing_finish()` sets fp_ops during probe), and it may be masked if the VF is re-probed during hot-add. But it will manifest on any in-place STOP+START cycle without re-probe.

**Suggested fix** — add to each START_RXTX handler:

For patch 5 (mana):
```c
rte_eth_fp_ops[param->port_id].rx_pkt_burst = mana_rx_burst;
rte_eth_fp_ops[param->port_id].tx_pkt_burst = mana_tx_burst;
```

For patch 6 (mlx5):
```c
rte_eth_fp_ops[param->port_id].rx_pkt_burst = dev->rx_pkt_burst;
rte_eth_fp_ops[param->port_id].tx_pkt_burst = dev->tx_pkt_burst;
```

For patch 7 (mlx4), same as mlx5.

Place these **before** the `rte_mb()` call so the barrier covers all fp_ops updates.

---

## Critical Finding Summary

| Patch | Severity | Issue |
|-------|----------|-------|
| 5, 6, 7 | **Error** | START_RXTX handler does not restore `rte_eth_fp_ops` burst function pointers after STOP clears them — secondary Rx/Tx broken after STOP+START cycle |
| 3 | Warning | PD deallocation order: parent freed before child may cause EBUSY; reverse the order |
| 1 | Warning | `switch_data_path` failure leaves VF attached/started but unused |
| 1 | Warning | `eth_hn_dev_init` ignores `hn_vf_add_unlocked` error (pre-existing) |
| 2 | Info | `secondary_cnt` atomics use acquire/release where relaxed suffices |

  parent reply	other threads:[~2026-02-26 19:37 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-26  2:39 [PATCH v4 0/7] fix multi-process VF hotplug longli
2026-02-26  2:39 ` [PATCH v4 1/7] net/netvsc: fix race conditions on VF add/remove events longli
2026-02-26  2:39 ` [PATCH v4 2/7] net/netvsc: add multi-process VF device removal support longli
2026-02-26 18:51   ` Stephen Hemminger
2026-02-27  0:03     ` [EXTERNAL] " Long Li
2026-02-26  2:39 ` [PATCH v4 3/7] net/mana: fix PD resource leak on device close longli
2026-02-26  2:39 ` [PATCH v4 4/7] net/netvsc: fix devargs memory leak on hotplug longli
2026-02-26  2:39 ` [PATCH v4 5/7] net/mana: fix fast-path ops setup in secondary process longli
2026-02-26  2:39 ` [PATCH v4 6/7] net/mlx5: " longli
2026-02-26  2:39 ` [PATCH v4 7/7] net/mlx4: " longli
2026-02-26 19:37 ` Stephen Hemminger [this message]
2026-02-27  1:02   ` [EXTERNAL] Re: [PATCH v4 0/7] fix multi-process VF hotplug Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260226113724.3b5ca20c@phoenix.local \
    --to=stephen@networkplumber.org \
    --cc=bingz@nvidia.com \
    --cc=dev@dpdk.org \
    --cc=dsosnowski@nvidia.com \
    --cc=longli@linux.microsoft.com \
    --cc=longli@microsoft.com \
    --cc=matan@nvidia.com \
    --cc=orika@nvidia.com \
    --cc=stable@dpdk.org \
    --cc=suanmingm@nvidia.com \
    --cc=viacheslavo@nvidia.com \
    --cc=weh@microsoft.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox