From: Jason Gunthorpe <jgg@ziepe.ca>
To: "Boone, Max" <mboone@akamai.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
Tariq Toukan <tariqt@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH RFC] net/mlx5: check whether VFs are assigned before disabling SR-IOV
Date: Wed, 29 Apr 2026 10:57:57 -0300
Message-ID: <20260429135757.GO849557@ziepe.ca>
In-Reply-To: <DB8CFC33-0929-40F5-86CA-39D1CD84D415@akamai.com>
On Wed, Apr 29, 2026 at 01:29:57PM +0000, Boone, Max wrote:
> > On Tue, Apr 28, 2026 at 08:04:14PM +0200, Max Boone via B4 Relay wrote:
> >> From: Max Boone <mboone@akamai.com>
> >>
> >> When MLX5 cards are passed through to a VM, disabling SR-IOV by
> >> setting the sriov_numvfs to 0 will render the machine unstable.
> >
> > What? How does that happen?
>
> Unstable is maybe a bit confusing phrasing on my part, “locks up”
> might be a better description?
>
> In short:
> - Enable by setting sriov_numvfs to positive
> - vfio-pci passthrough to QEMU (or other process)
> - Disable by setting sriov_numvfs to zero
> - QEMU processes freeze, shell that was writing to sysfs freezes
> - SIGKILL doesn’t seem to have much effect, shutdown never completes
I'm not surprised, but this is definitely a bug in VFIO that should be
researched.
And is it right that pci_vfs_assigned() doesn't fix this anyhow since
only Xen activates it?
> Python script to reproduce without QEMU:
> - https://github.com/akamaxb/repro-vfio-sriov-removal/blob/main/vfio-sriov-bind.py
>
> Does:
> 1. Require sriov_numvfs == 0 on the PF (report any existing users and exit if not)
> 2. Add one SR-IOV VF
> 3. Bind the VF to vfio-pci via driver_override + drivers_probe
> 4. Open VFIO container + group, get device fd
> 5. Create a KVM VM (registers an MMU notifier — required to trigger the race)
> 6. Hold and wait for user input
>
> To trigger the bug while the script is waiting, in another terminal:
> echo 0 > /sys/bus/pci/devices/<pf_device>/sriov_numvfs
>
> On the vfio-pci end of it all, it prints these two lines to dmesg before it hangs:
> - https://elixir.bootlin.com/linux/v7.0.1/source/drivers/vfio/pci/vfio_pci_core.c#L1826
The VFIO protocol requires userspace to implement this event channel
and immediately close the VFIO FD when the driver is removed. The vfio
driver unbind sleeps and waits for this. It should not hang the
userspace vfio user, but the process writing to the sriov_numvfs sysfs
file will be unkillable while it waits on this.
The above logging shows your test program doesn't implement the
protocol so I don't expect it to work.
vfio currently does not support kernel-led isolation of its fds.
> - https://elixir.bootlin.com/linux/v7.0.1/source/drivers/vfio/vfio_main.c#L421
And this is the driver remove described above, waiting for the FD to
close in response to the event.
qemu is supposed to implement the event protocol, so I don't have a
guess how you get a qemu to become unkillable - that seems to be the
primary bug here. Would be good to know what system call qemu is stuck
inside.
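
One way to answer that question is to read /proc/<pid>/syscall for each
thread of the stuck process. The sketch below assumes the format
described in proc(5): either the word "running", or a syscall number
followed by arguments, with -1 meaning the thread is blocked outside a
syscall. The function names are hypothetical, and reading these files
typically needs root.

```python
from pathlib import Path


def parse_syscall_line(line):
    """Parse one line of /proc/<pid>/task/<tid>/syscall.

    Returns (state, syscall_nr); syscall_nr is None unless the thread
    is currently inside a system call.
    """
    fields = line.split()
    if not fields or fields[0] == "running":
        return ("running", None)
    nr = int(fields[0])
    if nr == -1:
        # Blocked in the kernel, but not inside a system call.
        return ("blocked-outside-syscall", None)
    return ("in-syscall", nr)


def stuck_threads(pid):
    """Yield (tid, state, syscall_nr) for every thread of pid."""
    for task in sorted(Path(f"/proc/{pid}/task").iterdir()):
        try:
            line = (task / "syscall").read_text()
        except OSError:
            # Thread exited, or we lack permission to inspect it.
            continue
        state, nr = parse_syscall_line(line)
        yield int(task.name), state, nr
```

Calling list(stuck_threads(qemu_pid)) on the frozen qemu would give the
syscall number per thread, which can then be looked up in the syscall
table for the architecture.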
I'd modify your test to implement the event protocol: set up an eventfd
to receive an interrupt on VFIO_PCI_REQ_IRQ_INDEX, then trigger the
teardown sequence when the eventfd triggers. That would clean out the
basic flow.
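
A minimal sketch of that change, in Python to match the reproducer. The
ioctl number, flag values and struct vfio_irq_set layout are written
out from include/uapi/linux/vfio.h as I understand it, so check them
against your headers. The helper names are hypothetical, os.eventfd()
needs Python 3.10 or newer, and a real user such as qemu has more state
to tear down than this.

```python
import fcntl
import os
import struct

# Assumed values from include/uapi/linux/vfio.h; verify before use.
VFIO_DEVICE_SET_IRQS = 0x3B6E          # _IO(';', 100 + 10)
VFIO_IRQ_SET_DATA_EVENTFD = 1 << 2
VFIO_IRQ_SET_ACTION_TRIGGER = 1 << 5
VFIO_PCI_REQ_IRQ_INDEX = 4

# struct vfio_irq_set: argsz, flags, index, start, count, then one s32 fd.
_IRQ_SET_FMT = "=IIIIIi"


def build_req_irq_set(event_fd):
    """Pack a vfio_irq_set that routes the device request IRQ to event_fd."""
    flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER
    return struct.pack(_IRQ_SET_FMT, struct.calcsize(_IRQ_SET_FMT), flags,
                       VFIO_PCI_REQ_IRQ_INDEX, 0, 1, event_fd)


def arm_request_irq(device_fd):
    """Create an eventfd and register it for VFIO_PCI_REQ_IRQ_INDEX."""
    efd = os.eventfd(0)
    fcntl.ioctl(device_fd, VFIO_DEVICE_SET_IRQS, build_req_irq_set(efd))
    return efd


def wait_and_teardown(event_fd, fds_to_close):
    """Block until the kernel signals the request, then close the given fds.

    Pass the VFIO device fd first so the waiting driver unbind can
    proceed as early as possible.
    """
    os.read(event_fd, 8)  # an eventfd read returns an 8-byte counter
    for fd in fds_to_close:
        os.close(fd)
```

In the reproducer this would replace the "hold and wait for user input"
step: call arm_request_irq(device_fd) once the device fd is open, then
wait_and_teardown(efd, [device_fd, group_fd, container_fd]), where those
fd names stand in for whatever the script calls them.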
Jason