public inbox for linux-kernel@vger.kernel.org
* [PATCH RFC] net/mlx5: check whether VFs are assigned before disabling SR-IOV
@ 2026-04-28 18:04 Max Boone via B4 Relay
  2026-04-29 12:38 ` Jason Gunthorpe
  0 siblings, 1 reply; 4+ messages in thread
From: Max Boone via B4 Relay @ 2026-04-28 18:04 UTC (permalink / raw)
  To: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: netdev, linux-rdma, linux-kernel, Max Boone

From: Max Boone <mboone@akamai.com>

When MLX5 cards are passed through to a VM, disabling SR-IOV by
setting the sriov_numvfs to 0 will render the machine unstable.

Other drivers (such as ixgbe, bnxt and octep) add this check to
see whether the VFs are passed through to a VM.

Signed-off-by: Max Boone <mboone@akamai.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c     | 11 +++++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 1507e881d..85fe89c00 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -240,7 +240,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev);
 int mlx5_sriov_attach(struct mlx5_core_dev *dev);
 void mlx5_sriov_detach(struct mlx5_core_dev *dev);
 int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs);
-void mlx5_sriov_disable(struct pci_dev *pdev, bool num_vf_change);
+int mlx5_sriov_disable(struct pci_dev *pdev, bool num_vf_change);
 int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count);
 int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id);
 int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index bf6f631cf..07c61a73b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -200,16 +200,23 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs)
 	return err;
 }
 
-void mlx5_sriov_disable(struct pci_dev *pdev, bool num_vf_change)
+int mlx5_sriov_disable(struct pci_dev *pdev, bool num_vf_change)
 {
 	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
 	struct devlink *devlink = priv_to_devlink(dev);
 	int num_vfs = pci_num_vf(dev->pdev);
 
+	if (pci_vfs_assigned(dev->pdev)) {
+		mlx5_core_warn(dev, "can't disable sriov, VFs are assigned\n");
+		return -EPERM;
+	}
+
 	pci_disable_sriov(pdev);
 	devl_lock(devlink);
 	mlx5_device_disable_sriov(dev, num_vfs, true, num_vf_change);
 	devl_unlock(devlink);
+
+	return 0;
 }
 
 int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs)
@@ -223,7 +230,7 @@ int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs)
 	if (num_vfs)
 		err = mlx5_sriov_enable(pdev, num_vfs);
 	else
-		mlx5_sriov_disable(pdev, true);
+		err = mlx5_sriov_disable(pdev, true);
 
 	if (!err)
 		sriov->num_vfs = num_vfs;

---
base-commit: dca922e019dd758b4c1b4bec8f1d509efddeaab4
change-id: 20260428-mlx5-sriov-in-use-check-5cc2a79638e5

Best regards,
-- 
Max Boone <mboone@akamai.com>




* Re: [PATCH RFC] net/mlx5: check whether VFs are assigned before disabling SR-IOV
  2026-04-28 18:04 [PATCH RFC] net/mlx5: check whether VFs are assigned before disabling SR-IOV Max Boone via B4 Relay
@ 2026-04-29 12:38 ` Jason Gunthorpe
  2026-04-29 13:29   ` Boone, Max
  0 siblings, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2026-04-29 12:38 UTC (permalink / raw)
  To: mboone
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-rdma, linux-kernel

On Tue, Apr 28, 2026 at 08:04:14PM +0200, Max Boone via B4 Relay wrote:
> From: Max Boone <mboone@akamai.com>
> 
> When MLX5 cards are passed through to a VM, disabling SR-IOV by
> setting the sriov_numvfs to 0 will render the machine unstable.

What? How does that happen?

> -void mlx5_sriov_disable(struct pci_dev *pdev, bool num_vf_change)
> +int mlx5_sriov_disable(struct pci_dev *pdev, bool num_vf_change)
>  {
>  	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
>  	struct devlink *devlink = priv_to_devlink(dev);
>  	int num_vfs = pci_num_vf(dev->pdev);
>  
> +	if (pci_vfs_assigned(dev->pdev)) {
> +		mlx5_core_warn(dev, "can't disable sriov, VFs are assigned\n");
> +		return -EPERM;
> +	}

*barf* WTF did this come from?

Grep says only Xen makes this true, so this is all working around some
Xen brokenness in their "assignment" ?

If people care about Xen, pci_is_dev_assigned() should be purged and
pciback should be fixed to not "make the machine unstable" when it is
removed during a VF teardown.

Or at the very least this nasty Xen intrusion should be placed in the
PCI core code and removed from the drivers.

Also, no, you can't fail mlx5_sriov_disable(); it is called during
driver remove and cannot fail in that flow.

Jason


* Re: [PATCH RFC] net/mlx5: check whether VFs are assigned before disabling SR-IOV
  2026-04-29 12:38 ` Jason Gunthorpe
@ 2026-04-29 13:29   ` Boone, Max
  2026-04-29 13:57     ` Jason Gunthorpe
  0 siblings, 1 reply; 4+ messages in thread
From: Boone, Max @ 2026-04-29 13:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org

> On Apr 29, 2026, at 2:38 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Tue, Apr 28, 2026 at 08:04:14PM +0200, Max Boone via B4 Relay wrote:
>> From: Max Boone <mboone@akamai.com>
>> 
>> When MLX5 cards are passed through to a VM, disabling SR-IOV by
>> setting the sriov_numvfs to 0 will render the machine unstable.
> 
> What? How does that happen?

Unstable is maybe a bit confusing phrasing on my part, “locks up”
might be a better description?

In short:
- Enable SR-IOV by setting sriov_numvfs to a positive value
- Pass a VF through to QEMU (or another process) via vfio-pci
- Disable SR-IOV by setting sriov_numvfs to zero
- QEMU processes freeze, shell that was writing to sysfs freezes
- SIGKILL doesn’t seem to have much effect, shutdown never completes

Python script to reproduce without QEMU:
- https://github.com/akamaxb/repro-vfio-sriov-removal/blob/main/vfio-sriov-bind.py

The script does the following:
  1. Require sriov_numvfs == 0 on the PF (report any existing users and exit if not)
  2. Add one SR-IOV VF
  3. Bind the VF to vfio-pci via driver_override + drivers_probe
  4. Open VFIO container + group, get device fd
  5. Create a KVM VM (registers an MMU notifier — required to trigger the race)
  6. Hold and wait for user input

To trigger the bug while the script is waiting, in another terminal:
    echo 0 > /sys/bus/pci/devices/<pf_device>/sriov_numvfs
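Steps 1 to 3 above can be sketched as follows. This is not the linked
script, only a minimal illustration that assumes the standard PCI sysfs
attributes (sriov_numvfs, virtfn0, driver_override, drivers_probe) and a
loaded vfio-pci module. The function names and the unbind step for an
auto-probed VF are assumptions made for the example.

```python
from pathlib import Path

SYSFS_PCI = Path("/sys/bus/pci")


def vfio_bind_writes(vf_bdf, currently_bound, root=SYSFS_PCI):
    """Ordered (path, value) sysfs writes that hand one VF to vfio-pci."""
    dev = Path(root) / "devices" / vf_bdf
    writes = []
    if currently_bound:
        # Assumption: release the VF from whichever driver auto-probed it.
        writes.append((dev / "driver" / "unbind", vf_bdf))
    # Only vfio-pci may claim the device from now on ...
    writes.append((dev / "driver_override", "vfio-pci"))
    # ... and the PCI core is asked to probe it again.
    writes.append((Path(root) / "drivers_probe", vf_bdf))
    return writes


def bind_first_vf(pf_bdf, root=SYSFS_PCI):
    """Steps 1-3: refuse if VFs exist, create one VF, bind it to vfio-pci."""
    pf = Path(root) / "devices" / pf_bdf
    numvfs = pf / "sriov_numvfs"
    if int(numvfs.read_text()) != 0:
        raise RuntimeError(f"{pf_bdf} already has VFs enabled")
    numvfs.write_text("1")
    # virtfn0 is a symlink from the PF to the newly created VF.
    vf_bdf = (pf / "virtfn0").resolve().name
    bound = (Path(root) / "devices" / vf_bdf / "driver").exists()
    for path, value in vfio_bind_writes(vf_bdf, bound, root):
        path.write_text(value)
    return vf_bdf
```

Running bind_first_vf() needs root and real hardware; vfio_bind_writes()
only builds the list of writes, so it can be inspected on any machine.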

On the vfio-pci end of it all, it prints these two lines to dmesg before it hangs:
- https://elixir.bootlin.com/linux/v7.0.1/source/drivers/vfio/pci/vfio_pci_core.c#L1826
- https://elixir.bootlin.com/linux/v7.0.1/source/drivers/vfio/vfio_main.c#L421

>> -void mlx5_sriov_disable(struct pci_dev *pdev, bool num_vf_change)
>> +int mlx5_sriov_disable(struct pci_dev *pdev, bool num_vf_change)
>>  {
>>  	struct mlx5_core_dev *dev  = pci_get_drvdata(pdev);
>>  	struct devlink *devlink = priv_to_devlink(dev);
>>  	int num_vfs = pci_num_vf(dev->pdev);
>>
>> +	if (pci_vfs_assigned(dev->pdev)) {
>> +		mlx5_core_warn(dev, "can't disable sriov, VFs are assigned\n");
>> +		return -EPERM;
>> +	}
> 
> *barf* WTF did this come from?

Hahaha, take your pick:
- https://elixir.bootlin.com/linux/v7.0.1/C/ident/pci_vfs_assigned

I followed the sysfs sriov_numvfs op for a couple of drivers and saw
that ixgbe (and others) had it plumbed in, so I presumed (sorry)
that this would fix it, or was an obvious omission given that the rest
are doing it. My bad for cargo-culting an artifact from Xen.

> Grep says only Xen makes this true, so this is all working around some
> Xen brokenness in their "assignment" ?

Yep, I see, looks like it.

> If people care about Xen, pci_is_dev_assigned() should be purged and
> pciback should be fixed to not "make the machine unstable" when it is
> removed during a VF teardown.
> 
> Or at the very least this nasty Xen intrusion should be placed in the
> PCI core code and removed from the drivers.
> 
> Also, no, you can't fail mlx5_sriov_disable(); it is called during
> driver remove and cannot fail in that flow.

Check. I can do some further digging and build a kernel with lockdep
to try and find what it is hanging on specifically. Unless something pops
to mind?

> 
> Jason




* Re: [PATCH RFC] net/mlx5: check whether VFs are assigned before disabling SR-IOV
  2026-04-29 13:29   ` Boone, Max
@ 2026-04-29 13:57     ` Jason Gunthorpe
  0 siblings, 0 replies; 4+ messages in thread
From: Jason Gunthorpe @ 2026-04-29 13:57 UTC (permalink / raw)
  To: Boone, Max
  Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
	Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Wed, Apr 29, 2026 at 01:29:57PM +0000, Boone, Max wrote:

> > On Tue, Apr 28, 2026 at 08:04:14PM +0200, Max Boone via B4 Relay wrote:
> >> From: Max Boone <mboone@akamai.com>
> >> 
> >> When MLX5 cards are passed through to a VM, disabling SR-IOV by
> >> setting the sriov_numvfs to 0 will render the machine unstable.
> > 
> > What? How does that happen?
> 
> Unstable is maybe a bit confusing phrasing on my part, “locks up”
> might be a better description?
> 
> In short:
> - Enable SR-IOV by setting sriov_numvfs to a positive value
> - Pass a VF through to QEMU (or another process) via vfio-pci
> - Disable SR-IOV by setting sriov_numvfs to zero
> - QEMU processes freeze, shell that was writing to sysfs freezes
> - SIGKILL doesn’t seem to have much effect, shutdown never completes

I'm not surprised, but this is definitely a bug in VFIO that should be
researched.

And is it right that pci_vfs_assigned() doesn't fix this anyhow since
only Xen activates it?

> Python script to reproduce without QEMU:
> - https://github.com/akamaxb/repro-vfio-sriov-removal/blob/main/vfio-sriov-bind.py
> 
> Does:
>   1. Require sriov_numvfs == 0 on the PF (report any existing users and exit if not)
>   2. Add one SR-IOV VF
>   3. Bind the VF to vfio-pci via driver_override + drivers_probe
>   4. Open VFIO container + group, get device fd
>   5. Create a KVM VM (registers an MMU notifier — required to trigger the race)
>   6. Hold and wait for user input
> 
> To trigger the bug while the script is waiting, in another terminal:
>     echo 0 > /sys/bus/pci/devices/<pf_device>/sriov_numvfs
> 
> On the vfio-pci end of it all, it prints these two lines to dmesg before it hangs:
> - https://elixir.bootlin.com/linux/v7.0.1/source/drivers/vfio/pci/vfio_pci_core.c#L1826

The VFIO protocol requires userspace to implement this event channel
and immediately close the VFIO FD when the driver is removed. The vfio
driver unbind sleeps and waits for this. It should not hang the
userspace vfio user, but you will get an unkillable process writing to
the sriov_numvfs sysfs waiting on this.

The above logging shows your test program doesn't implement the
protocol so I don't expect it to work.

vfio currently does not support kernel-led isolation of its fds.

> - https://elixir.bootlin.com/linux/v7.0.1/source/drivers/vfio/vfio_main.c#L421

And this is the driver remove described above, waiting for the FD to
close in response to the event.

qemu is supposed to implement the event protocol, so I don't have a
guess how you get a qemu to become unkillable - that seems to be the
primary bug here. Would be good to know what system call qemu is stuck
inside.
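One way to find out where a stuck process is blocked is to read the
per-thread files under /proc. The sketch below is only an illustration;
the function name is made up for the example. It assumes the kernel exposes
/proc/<pid>/task/<tid>/syscall, wchan and stack, and reading stack usually
needs root. In the syscall file the first field is the system call number,
"-1" means blocked outside a system call, and "running" means not blocked.

```python
from pathlib import Path


def thread_wait_report(pid, proc_root="/proc"):
    """Map each thread ID of pid to what the kernel reports it is blocked in."""
    task_dir = Path(proc_root) / str(pid) / "task"
    try:
        tids = sorted(int(entry.name) for entry in task_dir.iterdir())
    except OSError:
        return {}  # no such process, or /proc is not readable
    report = {}
    for tid in tids:
        fields = {}
        for name in ("syscall", "wchan", "stack"):
            try:
                fields[name] = (task_dir / str(tid) / name).read_text().strip()
            except OSError:
                # e.g. stack needs root, or the thread exited meanwhile
                fields[name] = None
        report[tid] = fields
    return report
```

QEMU is multi-threaded, so the report covers every thread rather than only
the main one.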

I'd modify your test to implement the event protocol: set up an eventfd
to receive an interrupt on VFIO_PCI_REQ_IRQ_INDEX, then trigger the
teardown sequence when the eventfd fires. That would sort out the
basic flow.
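A minimal sketch of that event protocol for a Python test program might
look like the following. It is an illustration under stated assumptions,
not the reproducer's code: the constants are copied by hand from
include/uapi/linux/vfio.h, the helper names are made up for the example,
and os.eventfd() needs Python 3.10 or newer on Linux.

```python
import fcntl
import os
import struct

# Values taken from include/uapi/linux/vfio.h.
VFIO_DEVICE_SET_IRQS = (ord(";") << 8) | (100 + 10)  # _IO(VFIO_TYPE, VFIO_BASE + 10)
VFIO_IRQ_SET_DATA_EVENTFD = 1 << 2
VFIO_IRQ_SET_ACTION_TRIGGER = 1 << 5
VFIO_PCI_REQ_IRQ_INDEX = 4

# struct vfio_irq_set header: argsz, flags, index, start, count (all __u32).
_IRQ_SET_HEADER = struct.Struct("=IIIII")


def build_req_irq_set(event_fd):
    """Pack a vfio_irq_set that routes the device-request IRQ to event_fd."""
    flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER
    argsz = _IRQ_SET_HEADER.size + 4  # header plus one 32-bit eventfd
    header = _IRQ_SET_HEADER.pack(argsz, flags, VFIO_PCI_REQ_IRQ_INDEX, 0, 1)
    return header + struct.pack("=i", event_fd)


def arm_device_request(device_fd):
    """Create an eventfd and register it for VFIO_PCI_REQ_IRQ_INDEX."""
    event_fd = os.eventfd(0)
    fcntl.ioctl(device_fd, VFIO_DEVICE_SET_IRQS, build_req_irq_set(event_fd))
    return event_fd


def release_on_request(device_fd, event_fd, other_fds=()):
    """Block until the kernel asks for the device back, then close the fds."""
    os.eventfd_read(event_fd)  # returns once the request IRQ fires
    for fd in (device_fd, *other_fds, event_fd):
        os.close(fd)
```

The test would call arm_device_request() after obtaining the device fd and
run release_on_request() while it waits, passing the group and container
fds as other_fds if it should drop those too.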

Jason

