* [PATCH] PCI/IOV: Fix recursive locking deadlock on pci_rescan_remove_lock
@ 2026-02-09 7:57 Ionut Nechita (Wind River)
2026-02-09 8:25 ` Sebastian Andrzej Siewior
2026-02-09 16:14 ` Bjorn Helgaas
0 siblings, 2 replies; 6+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-02-09 7:57 UTC
To: Bjorn Helgaas, linux-pci
Cc: Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
linux-rt-devel, linux-kernel, Ionut Nechita, Ionut Nechita
From: Ionut Nechita <ionut.nechita@windriver.com>
When a PCI device is hot-removed via sysfs (e.g., echo 1 > /sys/.../remove),
pci_stop_and_remove_bus_device_locked() acquires pci_rescan_remove_lock and
then recursively walks the bus hierarchy calling driver .remove() callbacks.

If the removed device is a PF with SR-IOV enabled (e.g., i40e, ice), the
driver's .remove() calls pci_disable_sriov() -> sriov_disable() ->
sriov_del_vfs() which also tries to acquire pci_rescan_remove_lock.
Since this is a non-recursive mutex and the same thread already holds it,
this results in a deadlock.

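For illustration only (not part of the patch): a userspace analogy of the self-deadlock, assuming a POSIX error-checking mutex stands in for the kernel mutex. The function name relock_result() is hypothetical. An error-checking mutex reports a second acquisition by the owning thread as EDEADLK instead of blocking, which is loosely comparable to a lock implementation that detects the recursion rather than hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Lock an error-checking mutex twice from the same thread and return the
 * result of the second attempt. With a default mutex the second call
 * could simply block forever.
 */
int relock_result(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t lock;
	int ret;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	pthread_mutex_init(&lock, &attr);
	pthread_mutexattr_destroy(&attr);

	/* Outer acquisition: stands in for the sysfs remove path. */
	ret = pthread_mutex_lock(&lock);
	assert(ret == 0);

	/* Inner acquisition by the same thread: stands in for sriov_del_vfs(). */
	ret = pthread_mutex_lock(&lock);

	pthread_mutex_unlock(&lock);
	pthread_mutex_destroy(&lock);
	return ret;
}
```

Calling relock_result() should yield EDEADLK, since POSIX requires error-checking mutexes to reject relocking by the owner.
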
On PREEMPT_RT kernels, where mutexes are backed by rtmutex with deadlock
detection, this immediately triggers:

  WARNING: CPU: 15 PID: 11730 at kernel/locking/rtmutex.c:1663
  Call Trace:
   mutex_lock+0x47/0x60
   sriov_disable+0x2a/0x100
   i40e_free_vfs+0x415/0x470 [i40e]
   i40e_remove+0x38d/0x3e0 [i40e]
   pci_device_remove+0x3b/0xb0
   device_release_driver_internal+0x193/0x200
   pci_stop_bus_device+0x81/0xb0
   pci_stop_and_remove_bus_device_locked+0x16/0x30
   remove_store+0x79/0x90

On non-RT kernels the same recursive acquisition silently hangs the calling
process, eventually causing netdev watchdog TX timeout splats.

This affects all drivers that call pci_disable_sriov() from their .remove()
callback (i40e, ice, and others).

Fix this by tracking the owner of pci_rescan_remove_lock and skipping the
redundant acquisition in sriov_del_vfs() when the current thread already
holds it. The VF removal is still serialized correctly because the caller
already holds the lock.

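For illustration only (not part of the patch): a minimal userspace sketch of the owner-tracking idea, assuming pthreads and C11 atomics in place of the kernel mutex, `current`, and READ_ONCE()/WRITE_ONCE(). All names here (lock_rescan_remove(), rescan_remove_locked(), del_vfs(), self_tag) are hypothetical stand-ins. The address of a thread-local variable is used as the thread identity.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic(void *) lock_owner;	/* NULL while nobody holds the lock */
static _Thread_local char self_tag;	/* its address identifies this thread */

void lock_rescan_remove(void)
{
	pthread_mutex_lock(&rescan_remove_lock);
	atomic_store(&lock_owner, (void *)&self_tag);
}

/* True only if the calling thread is the recorded owner. */
bool rescan_remove_locked(void)
{
	return atomic_load(&lock_owner) == (void *)&self_tag;
}

void unlock_rescan_remove(void)
{
	assert(rescan_remove_locked());
	atomic_store(&lock_owner, NULL);
	pthread_mutex_unlock(&rescan_remove_lock);
}

/* Returns true if this call took the lock itself, false if it was skipped. */
bool del_vfs(void)
{
	bool do_unlock = false;

	if (!rescan_remove_locked()) {
		lock_rescan_remove();
		do_unlock = true;
	}

	/* ... VF removal would happen here, under the lock ... */

	if (do_unlock)
		unlock_rescan_remove();
	return do_unlock;
}
```

The unlocked read of the owner is tolerable in this sketch because only the owning thread ever stores its own tag: any other thread sees either NULL or a foreign tag, and both compare unequal to its own.
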
Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
---
drivers/pci/iov.c | 23 +++++++++++++++++++++--
drivers/pci/pci.h | 1 +
drivers/pci/probe.c | 15 +++++++++++++++
3 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 00784a60ba80b..3a21cf9aaa747 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -763,12 +763,31 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 static void sriov_del_vfs(struct pci_dev *dev)
 {
 	struct pci_sriov *iov = dev->sriov;
+	bool do_unlock = false;
 	int i;

-	pci_lock_rescan_remove();
+	/*
+	 * If the current thread already holds pci_rescan_remove_lock (e.g.,
+	 * when pci_disable_sriov() is called from a driver's .remove() that
+	 * was invoked by pci_stop_and_remove_bus_device_locked()), skip
+	 * taking the lock to avoid a deadlock. The lock is non-recursive
+	 * and on PREEMPT_RT, where mutexes are rtmutexes, the deadlock is
+	 * detected immediately and produces an alarming WARNING splat. On
+	 * non-RT kernels the same recursive acquisition silently hangs.
+	 *
+	 * The VF removal below is still serialized correctly because the
+	 * caller already holds the lock.
+	 */
+	if (!pci_rescan_remove_locked()) {
+		pci_lock_rescan_remove();
+		do_unlock = true;
+	}
+
 	for (i = 0; i < iov->num_VFs; i++)
 		pci_iov_remove_virtfn(dev, i);
-	pci_unlock_rescan_remove();
+
+	if (do_unlock)
+		pci_unlock_rescan_remove();
 }

 static void sriov_disable(struct pci_dev *dev)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa0013..c1055d333e08a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -92,6 +92,7 @@ extern const unsigned char pcie_link_speed[];
 extern bool pci_early_dump;

 extern struct mutex pci_rescan_remove_lock;
+bool pci_rescan_remove_locked(void);

 bool pcie_cap_has_lnkctl(const struct pci_dev *dev);
 bool pcie_cap_has_lnkctl2(const struct pci_dev *dev);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 41183aed8f5d9..f058ffb51519c 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3540,19 +3540,34 @@ EXPORT_SYMBOL_GPL(pci_rescan_bus);
  * routines should always be executed under this mutex.
  */
 DEFINE_MUTEX(pci_rescan_remove_lock);
+static struct task_struct *pci_rescan_remove_owner;

 void pci_lock_rescan_remove(void)
 {
 	mutex_lock(&pci_rescan_remove_lock);
+	WRITE_ONCE(pci_rescan_remove_owner, current);
 }
 EXPORT_SYMBOL_GPL(pci_lock_rescan_remove);

 void pci_unlock_rescan_remove(void)
 {
+	WRITE_ONCE(pci_rescan_remove_owner, NULL);
 	mutex_unlock(&pci_rescan_remove_lock);
 }
 EXPORT_SYMBOL_GPL(pci_unlock_rescan_remove);

+/**
+ * pci_rescan_remove_locked - check if current thread holds the lock
+ *
+ * Returns true if the current thread already holds pci_rescan_remove_lock.
+ * This is used by PCI core functions that may be called both with and
+ * without the lock held, to avoid recursive locking deadlocks.
+ */
+bool pci_rescan_remove_locked(void)
+{
+	return READ_ONCE(pci_rescan_remove_owner) == current;
+}
+
 static int __init pci_sort_bf_cmp(const struct device *d_a,
 				  const struct device *d_b)
 {
--
2.52.0
* Re: [PATCH] PCI/IOV: Fix recursive locking deadlock on pci_rescan_remove_lock
2026-02-09 7:57 [PATCH] PCI/IOV: Fix recursive locking deadlock on pci_rescan_remove_lock Ionut Nechita (Wind River)
@ 2026-02-09 8:25 ` Sebastian Andrzej Siewior
2026-02-09 10:12 ` Niklas Schnelle
2026-02-09 16:14 ` Bjorn Helgaas
1 sibling, 1 reply; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-09 8:25 UTC
To: Ionut Nechita (Wind River), Niklas Schnelle
Cc: Bjorn Helgaas, linux-pci, Clark Williams, Steven Rostedt,
linux-rt-devel, linux-kernel, Ionut Nechita, Benjamin Block,
Farhan Ali, Julian Ruess
On 2026-02-09 09:57:07 [+0200], Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> When a PCI device is hot-removed via sysfs (e.g., echo 1 > /sys/.../remove),
> pci_stop_and_remove_bus_device_locked() acquires pci_rescan_remove_lock and
> then recursively walks the bus hierarchy calling driver .remove() callbacks.
>
> If the removed device is a PF with SR-IOV enabled (e.g., i40e, ice), the
> driver's .remove() calls pci_disable_sriov() -> sriov_disable() ->
> sriov_del_vfs() which also tries to acquire pci_rescan_remove_lock.
> Since this is a non-recursive mutex and the same thread already holds it,
> this results in a deadlock.
>
> On PREEMPT_RT kernels, where mutexes are backed by rtmutex with deadlock
> detection, this immediately triggers:
>
> WARNING: CPU: 15 PID: 11730 at kernel/locking/rtmutex.c:1663
> Call Trace:
> mutex_lock+0x47/0x60
> sriov_disable+0x2a/0x100
> i40e_free_vfs+0x415/0x470 [i40e]
> i40e_remove+0x38d/0x3e0 [i40e]
> pci_device_remove+0x3b/0xb0
> device_release_driver_internal+0x193/0x200
> pci_stop_bus_device+0x81/0xb0
> pci_stop_and_remove_bus_device_locked+0x16/0x30
> remove_store+0x79/0x90
>
> On non-RT kernels the same recursive acquisition silently hangs the calling
> process, eventually causing netdev watchdog TX timeout splats.
>
> This affects all drivers that call pci_disable_sriov() from their .remove()
> callback (i40e, ice, and others).
>
> Fix this by tracking the owner of pci_rescan_remove_lock and skipping the
> redundant acquisition in sriov_del_vfs() when the current thread already
> holds it. The VF removal is still serialized correctly because the caller
> already holds the lock.
This looks like the result of commit 05703271c3cdc ("PCI/IOV: Add PCI
rescan-remove locking when enabling/disabling SR-IOV").
> Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
Sebastian
* Re: [PATCH] PCI/IOV: Fix recursive locking deadlock on pci_rescan_remove_lock
2026-02-09 8:25 ` Sebastian Andrzej Siewior
@ 2026-02-09 10:12 ` Niklas Schnelle
2026-02-11 7:37 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 6+ messages in thread
From: Niklas Schnelle @ 2026-02-09 10:12 UTC
To: Sebastian Andrzej Siewior, Ionut Nechita (Wind River),
Benjamin Block
Cc: Bjorn Helgaas, linux-pci, Clark Williams, Steven Rostedt,
linux-rt-devel, linux-kernel, Ionut Nechita, Farhan Ali,
Julian Ruess
On Mon, 2026-02-09 at 09:25 +0100, Sebastian Andrzej Siewior wrote:
> On 2026-02-09 09:57:07 [+0200], Ionut Nechita (Wind River) wrote:
> > From: Ionut Nechita <ionut.nechita@windriver.com>
> >
> > When a PCI device is hot-removed via sysfs (e.g., echo 1 > /sys/.../remove),
> > pci_stop_and_remove_bus_device_locked() acquires pci_rescan_remove_lock and
> > then recursively walks the bus hierarchy calling driver .remove() callbacks.
> >
> > If the removed device is a PF with SR-IOV enabled (e.g., i40e, ice), the
> > driver's .remove() calls pci_disable_sriov() -> sriov_disable() ->
> > sriov_del_vfs() which also tries to acquire pci_rescan_remove_lock.
> > Since this is a non-recursive mutex and the same thread already holds it,
> > this results in a deadlock.
> >
> > On PREEMPT_RT kernels, where mutexes are backed by rtmutex with deadlock
> > detection, this immediately triggers:
> >
> > WARNING: CPU: 15 PID: 11730 at kernel/locking/rtmutex.c:1663
> > Call Trace:
> > mutex_lock+0x47/0x60
> > sriov_disable+0x2a/0x100
> > i40e_free_vfs+0x415/0x470 [i40e]
> > i40e_remove+0x38d/0x3e0 [i40e]
> > pci_device_remove+0x3b/0xb0
> > device_release_driver_internal+0x193/0x200
> > pci_stop_bus_device+0x81/0xb0
> > pci_stop_and_remove_bus_device_locked+0x16/0x30
> > remove_store+0x79/0x90
> >
> > On non-RT kernels the same recursive acquisition silently hangs the calling
> > process, eventually causing netdev watchdog TX timeout splats.
> >
> > This affects all drivers that call pci_disable_sriov() from their .remove()
> > callback (i40e, ice, and others).
> >
> > Fix this by tracking the owner of pci_rescan_remove_lock and skipping the
> > redundant acquisition in sriov_del_vfs() when the current thread already
> > holds it. The VF removal is still serialized correctly because the caller
> > already holds the lock.
>
> This looks like the result of commit 05703271c3cdc ("PCI/IOV: Add PCI
> rescan-remove locking when enabling/disabling SR-IOV").
>
> > Signed-off-by: Ionut Nechita <ionut.nechita@windriver.com>
>
> Sebastian
Agree, this looks related to the deadlock I later found with that
commit and that led to this revert+new fix that has now been queued
for the v6.20/v7.00 here:

https://lore.kernel.org/linux-pci/20251216-revert_sriov_lock-v3-0-dac4925a7621@linux.ibm.com/

That said I do find this approach interesting. Benjamin and I are
actually still looking into a related problem with not taking the
rescan/remove lock as part of vfio-pci tear down and there this
approach could work better than just moving the locking up into the
sysfs handler. So far we haven't found a good place to take the lock in
that path that doesn't suffer from the recursive locking in other
paths. On the other hand, conditionally taking a mutex is always a
little ugly in my opinion.

Thanks,
Niklas
* Re: [PATCH] PCI/IOV: Fix recursive locking deadlock on pci_rescan_remove_lock
2026-02-09 10:12 ` Niklas Schnelle
@ 2026-02-11 7:37 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 6+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-02-11 7:37 UTC
To: Niklas Schnelle
Cc: Ionut Nechita (Wind River), Benjamin Block, Bjorn Helgaas,
linux-pci, Clark Williams, Steven Rostedt, linux-rt-devel,
linux-kernel, Ionut Nechita, Farhan Ali, Julian Ruess
On 2026-02-09 11:12:36 [+0100], Niklas Schnelle wrote:
> Agree, this looks related to the deadlock I later found with that
> commit and that led to this revert+new fix that has now been queued
> for the v6.20/v7.00 here:
>
> https://lore.kernel.org/linux-pci/20251216-revert_sriov_lock-v3-0-dac4925a7621@linux.ibm.com/
So this particular problem is solved then.
> That said I do find this approach interesting. Benjamin and I are
> actually still looking into a related problem with not taking the
> rescan/remove lock as part of vfio-pci tear down and there this
> approach could work better than just moving the locking up into the
> sysfs handler. So far we haven't found a good place to take the lock in
> that path that doesn't suffer from the recursive locking in other
> paths. On the other hand conditionally taking a mutex is always a
> little ugly in my opinion.
If you could split the call chain so that one side says "I need the lock"
and the other "I don't need the lock", that would be nicer than this.
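For illustration only (not from the thread): a userspace sketch of that split, assuming pthreads in place of the kernel mutex. The names del_vfs(), del_vfs_locked() and remove_device_path() are hypothetical and do not describe the queued fix. The trylock check is a crude stand-in for the kernel's lockdep_assert_held().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;
static int vfs_present;		/* stand-in for the number of VFs */

void set_vfs(int n) { vfs_present = n; }
int vfs_remaining(void) { return vfs_present; }

/* "I don't need the lock": the caller must already hold it. */
void del_vfs_locked(void)
{
	/* The mutex must be taken; trylock then reports EBUSY. */
	assert(pthread_mutex_trylock(&rescan_remove_lock) == EBUSY);
	vfs_present = 0;
}

/* "I need the lock": entry point for callers that do not hold it. */
void del_vfs(void)
{
	pthread_mutex_lock(&rescan_remove_lock);
	del_vfs_locked();
	pthread_mutex_unlock(&rescan_remove_lock);
}

/* A remove path that already owns the lock, like the sysfs handler. */
void remove_device_path(void)
{
	pthread_mutex_lock(&rescan_remove_lock);
	del_vfs_locked();	/* no second acquisition, no self-deadlock */
	pthread_mutex_unlock(&rescan_remove_lock);
}
```

With this shape the decision is made at compile time by which function the caller picks, so no runtime owner check is needed.
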
> Thanks,
> Niklas
Sebastian
* Re: [PATCH] PCI/IOV: Fix recursive locking deadlock on pci_rescan_remove_lock
2026-02-09 7:57 [PATCH] PCI/IOV: Fix recursive locking deadlock on pci_rescan_remove_lock Ionut Nechita (Wind River)
2026-02-09 8:25 ` Sebastian Andrzej Siewior
@ 2026-02-09 16:14 ` Bjorn Helgaas
2026-02-22 11:29 ` Ionut Nechita (Wind River)
1 sibling, 1 reply; 6+ messages in thread
From: Bjorn Helgaas @ 2026-02-09 16:14 UTC
To: Ionut Nechita (Wind River)
Cc: Bjorn Helgaas, linux-pci, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
Ionut Nechita
On Mon, Feb 09, 2026 at 09:57:07AM +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
>
> When a PCI device is hot-removed via sysfs (e.g., echo 1 > /sys/.../remove),
> pci_stop_and_remove_bus_device_locked() acquires pci_rescan_remove_lock and
> then recursively walks the bus hierarchy calling driver .remove() callbacks.
>
> If the removed device is a PF with SR-IOV enabled (e.g., i40e, ice), the
> driver's .remove() calls pci_disable_sriov() -> sriov_disable() ->
> sriov_del_vfs() which also tries to acquire pci_rescan_remove_lock.
> Since this is a non-recursive mutex and the same thread already holds it,
> this results in a deadlock.
>
> On PREEMPT_RT kernels, where mutexes are backed by rtmutex with deadlock
> detection, this immediately triggers:
>
> WARNING: CPU: 15 PID: 11730 at kernel/locking/rtmutex.c:1663
> Call Trace:
> mutex_lock+0x47/0x60
> sriov_disable+0x2a/0x100
> i40e_free_vfs+0x415/0x470 [i40e]
> i40e_remove+0x38d/0x3e0 [i40e]
> pci_device_remove+0x3b/0xb0
> device_release_driver_internal+0x193/0x200
> pci_stop_bus_device+0x81/0xb0
> pci_stop_and_remove_bus_device_locked+0x16/0x30
> remove_store+0x79/0x90
>
> On non-RT kernels the same recursive acquisition silently hangs the calling
> process, eventually causing netdev watchdog TX timeout splats.
>
> This affects all drivers that call pci_disable_sriov() from their .remove()
> callback (i40e, ice, and others).
>
> Fix this by tracking the owner of pci_rescan_remove_lock and skipping the
> redundant acquisition in sriov_del_vfs() when the current thread already
> holds it. The VF removal is still serialized correctly because the caller
> already holds the lock.
Ionut, can you confirm whether Niklas's patches resolve this deadlock?
The following patches are queued for v7.0:

  2fa119c0e5e5 ("Revert "PCI/IOV: Add PCI rescan-remove locking when enabling/disabling SR-IOV"")
  a5338e365c45 ("PCI/IOV: Fix race between SR-IOV enable/disable and hotplug")

They are included in next-20260205. They're probably in earlier
linux-next kernels, too, but I guess linux-next doesn't keep older
tags anymore, so I don't know how to figure out exactly when they were
included. I put them in my tree on Feb 1.

Bjorn
* Re: [PATCH] PCI/IOV: Fix recursive locking deadlock on pci_rescan_remove_lock
2026-02-09 16:14 ` Bjorn Helgaas
@ 2026-02-22 11:29 ` Ionut Nechita (Wind River)
0 siblings, 0 replies; 6+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-02-22 11:29 UTC
To: helgaas
Cc: Ionut Nechita, Bjorn Helgaas, linux-pci,
Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
linux-rt-devel, linux-kernel, Ionut Nechita
From: Ionut Nechita <ionut.nechita@windriver.com>
Hi Bjorn,

I tested Niklas's patches (from next-20260205):

  2fa119c0e5e5 ("Revert "PCI/IOV: Add PCI rescan-remove locking when enabling/disabling SR-IOV"")
  a5338e365c45 ("PCI/IOV: Fix race between SR-IOV enable/disable and hotplug")

With an initial round of testing on a NIC controller using the i40e
driver, the deadlock no longer reproduces. SR-IOV enable/disable and
hot-remove via sysfs work as expected without triggering the recursive
locking warning on PREEMPT_RT or hanging on non-RT kernels.

I have another round of testing planned over the coming days to cover
additional scenarios, but so far everything looks good.

Thanks,
Ionut