Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
@ 2026-02-27 21:40 Michal Wajdeczko
  2026-02-27 21:54 ` ✗ CI.checkpatch: warning for " Patchwork
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Michal Wajdeczko @ 2026-02-27 21:40 UTC (permalink / raw)
  To: intel-xe; +Cc: Michal Wajdeczko

Since recent commit a5338e365c45 ("PCI/IOV: Fix race between SR-IOV
enable/disable and hotplug") the driver pci_driver.sriov_configure
hook is called with the mutex pci_rescan_remove_lock already taken.

As we are using this hook as-is during driver removal, we get:

 [ ] xe 0000:4d:00.0: [drm:xe_pci_sriov_configure [xe]] PF: disabling 1 VF
 [ ] ------------[ cut here ]------------
 [ ] debug_locks && !(lock_is_held(&(&pci_rescan_remove_lock)->dep_map) != 0)
 [ ] WARNING: drivers/pci/remove.c:130 at pci_stop_and_remove_bus_device+0x4c/0x50, CPU#32: rmmod/6476
 [ ] RIP: 0010:pci_stop_and_remove_bus_device+0x4c/0x50
 [ ] Call Trace:
 [ ]  <TASK>
 [ ]  pci_iov_remove_virtfn+0xd1/0x140
 [ ]  sriov_disable+0x42/0x100
 [ ]  pci_disable_sriov+0x34/0x50
 [ ]  xe_pci_sriov_configure+0x2d0/0x1150 [xe]
 [ ]  xe_pci_remove+0x7c/0x190 [xe]
 [ ]  pci_device_remove+0x41/0xb0
 [ ]  device_remove+0x43/0x80
 [ ]  device_release_driver_internal+0x215/0x280
 [ ]  driver_detach+0x50/0xb0
 [ ]  bus_remove_driver+0x86/0x120
 [ ]  driver_unregister+0x2f/0x60
 [ ]  pci_unregister_driver+0x22/0xc0
 [ ]  xe_unregister_pci_driver+0x15/0x20 [xe]
 [ ]  xe_exit+0x1f/0x34 [xe]

Fix that by taking a pci_rescan_remove_lock as it is now expected.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
---
 drivers/gpu/drm/xe/xe_pci.c       |  2 +-
 drivers/gpu/drm/xe/xe_pci_sriov.c | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
index 3ac99472d6dd..fb0abd768e67 100644
--- a/drivers/gpu/drm/xe/xe_pci.c
+++ b/drivers/gpu/drm/xe/xe_pci.c
@@ -1010,7 +1010,7 @@ static void xe_pci_remove(struct pci_dev *pdev)
 	struct xe_device *xe = pdev_to_xe_device(pdev);
 
 	if (IS_SRIOV_PF(xe))
-		xe_pci_sriov_configure(pdev, 0);
+		xe_pci_sriov_disable_vfs(pdev);
 
 	if (xe_survivability_mode_is_boot_enabled(xe))
 		return;
diff --git a/drivers/gpu/drm/xe/xe_pci_sriov.c b/drivers/gpu/drm/xe/xe_pci_sriov.c
index 3fd22034f03e..2a3fd3578ef2 100644
--- a/drivers/gpu/drm/xe/xe_pci_sriov.c
+++ b/drivers/gpu/drm/xe/xe_pci_sriov.c
@@ -239,6 +239,26 @@ int xe_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
 		return pf_disable_vfs(xe);
 }
 
+/**
+ * xe_pci_sriov_disable_vfs() - Disable all VFs.
+ * @pdev: the PF's &pci_dev
+ *
+ * This is a simple wrapper around our function that implements the
+ * pci_driver.sriov_configure hook but also takes a required mutex.
+ *
+ * Return: 0 on success or a negative error code on failure.
+ */
+int xe_pci_sriov_disable_vfs(struct pci_dev *pdev)
+{
+	int ret;
+
+	pci_lock_rescan_remove();
+	ret = xe_pci_sriov_configure(pdev, 0);
+	pci_unlock_rescan_remove();
+
+	return ret;
+}
+
 /**
  * xe_pci_sriov_get_vf_pdev() - Lookup the VF's PCI device using the VF identifier.
  * @pdev: the PF's &pci_dev
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* ✗ CI.checkpatch: warning for drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
  2026-02-27 21:40 [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs Michal Wajdeczko
@ 2026-02-27 21:54 ` Patchwork
  2026-02-27 21:54 ` ✗ CI.KUnit: failure " Patchwork
  2026-03-02  9:43 ` [PATCH] " Piotr Piórkowski
  2 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2026-02-27 21:54 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

== Series Details ==

Series: drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
URL   : https://patchwork.freedesktop.org/series/162346/
State : warning

== Summary ==

+ KERNEL=/kernel
+ git clone https://gitlab.freedesktop.org/drm/maintainer-tools mt
Cloning into 'mt'...
warning: redirecting to https://gitlab.freedesktop.org/drm/maintainer-tools.git/
+ git -C mt rev-list -n1 origin/master
1f57ba1afceae32108bd24770069f764d940a0e4
+ cd /kernel
+ git config --global --add safe.directory /kernel
+ git log -n1
commit 302a032f50b095b3e934f202c05299d12e008c71
Author: Michal Wajdeczko <michal.wajdeczko@intel.com>
Date:   Fri Feb 27 22:40:47 2026 +0100

    drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
    
    Since recent commit a5338e365c45 ("PCI/IOV: Fix race between SR-IOV
    enable/disable and hotplug") the driver pci_driver.sriov_configure
    hook is called with the mutex pci_rescan_remove_lock already taken.
    
    As we are using this hook as-is during driver removal, we get:
    
     [ ] xe 0000:4d:00.0: [drm:xe_pci_sriov_configure [xe]] PF: disabling 1 VF
     [ ] ------------[ cut here ]------------
     [ ] debug_locks && !(lock_is_held(&(&pci_rescan_remove_lock)->dep_map) != 0)
     [ ] WARNING: drivers/pci/remove.c:130 at pci_stop_and_remove_bus_device+0x4c/0x50, CPU#32: rmmod/6476
     [ ] RIP: 0010:pci_stop_and_remove_bus_device+0x4c/0x50
     [ ] Call Trace:
     [ ]  <TASK>
     [ ]  pci_iov_remove_virtfn+0xd1/0x140
     [ ]  sriov_disable+0x42/0x100
     [ ]  pci_disable_sriov+0x34/0x50
     [ ]  xe_pci_sriov_configure+0x2d0/0x1150 [xe]
     [ ]  xe_pci_remove+0x7c/0x190 [xe]
     [ ]  pci_device_remove+0x41/0xb0
     [ ]  device_remove+0x43/0x80
     [ ]  device_release_driver_internal+0x215/0x280
     [ ]  driver_detach+0x50/0xb0
     [ ]  bus_remove_driver+0x86/0x120
     [ ]  driver_unregister+0x2f/0x60
     [ ]  pci_unregister_driver+0x22/0xc0
     [ ]  xe_unregister_pci_driver+0x15/0x20 [xe]
     [ ]  xe_exit+0x1f/0x34 [xe]
    
    Fix that by taking a pci_rescan_remove_lock as it is now expected.
    
    Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
+ /mt/dim checkpatch a0a6a2fc5a08dc2e936c43e5d181dd0975a251a1 drm-intel
302a032f50b0 drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
-:15: WARNING:COMMIT_LOG_LONG_LINE: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#15: 
 [ ] debug_locks && !(lock_is_held(&(&pci_rescan_remove_lock)->dep_map) != 0)

-:82: WARNING:MISSING_FIXES_TAG: The commit message has 'Call Trace:', perhaps it also needs a 'Fixes:' tag?

total: 0 errors, 2 warnings, 0 checks, 34 lines checked



^ permalink raw reply	[flat|nested] 7+ messages in thread

* ✗ CI.KUnit: failure for drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
  2026-02-27 21:40 [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs Michal Wajdeczko
  2026-02-27 21:54 ` ✗ CI.checkpatch: warning for " Patchwork
@ 2026-02-27 21:54 ` Patchwork
  2026-03-02  9:43 ` [PATCH] " Piotr Piórkowski
  2 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2026-02-27 21:54 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

== Series Details ==

Series: drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
URL   : https://patchwork.freedesktop.org/series/162346/
State : failure

== Summary ==

+ trap cleanup EXIT
+ /kernel/tools/testing/kunit/kunit.py run --kunitconfig /kernel/drivers/gpu/drm/xe/.kunitconfig
ERROR:root:../drivers/gpu/drm/xe/xe_pci.c: In function ‘xe_pci_remove’:
../drivers/gpu/drm/xe/xe_pci.c:1013:17: error: implicit declaration of function ‘xe_pci_sriov_disable_vfs’ [-Werror=implicit-function-declaration]
 1013 |                 xe_pci_sriov_disable_vfs(pdev);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
make[7]: *** [../scripts/Makefile.build:289: drivers/gpu/drm/xe/xe_pci.o] Error 1
make[7]: *** Waiting for unfinished jobs....
make[6]: *** [../scripts/Makefile.build:546: drivers/gpu/drm/xe] Error 2
make[5]: *** [../scripts/Makefile.build:546: drivers/gpu/drm] Error 2
make[4]: *** [../scripts/Makefile.build:546: drivers/gpu] Error 2
make[3]: *** [../scripts/Makefile.build:546: drivers] Error 2
make[2]: *** [/kernel/Makefile:2101: .] Error 2
make[1]: *** [/kernel/Makefile:248: __sub-make] Error 2
make: *** [Makefile:248: __sub-make] Error 2

[21:54:15] Configuring KUnit Kernel ...
Generating .config ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
[21:54:19] Building KUnit Kernel ...
Populating config with:
$ make ARCH=um O=.kunit olddefconfig
Building with:
$ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=48
+ cleanup
++ stat -c %u:%g /kernel
+ chown -R 1003:1003 /kernel



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
  2026-02-27 21:40 [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs Michal Wajdeczko
  2026-02-27 21:54 ` ✗ CI.checkpatch: warning for " Patchwork
  2026-02-27 21:54 ` ✗ CI.KUnit: failure " Patchwork
@ 2026-03-02  9:43 ` Piotr Piórkowski
  2026-03-02 20:36   ` Matthew Brost
  2 siblings, 1 reply; 7+ messages in thread
From: Piotr Piórkowski @ 2026-03-02  9:43 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe

Michal Wajdeczko <michal.wajdeczko@intel.com> wrote on pią [2026-lut-27 22:40:47 +0100]:
> Since recent commit a5338e365c45 ("PCI/IOV: Fix race between SR-IOV
> enable/disable and hotplug") the driver pci_driver.sriov_configure
> hook is called with the mutex pci_rescan_remove_lock already taken.
> 
> As we are using this hook as-is during driver removal, we get:
> 
>  [ ] xe 0000:4d:00.0: [drm:xe_pci_sriov_configure [xe]] PF: disabling 1 VF
>  [ ] ------------[ cut here ]------------
>  [ ] debug_locks && !(lock_is_held(&(&pci_rescan_remove_lock)->dep_map) != 0)
>  [ ] WARNING: drivers/pci/remove.c:130 at pci_stop_and_remove_bus_device+0x4c/0x50, CPU#32: rmmod/6476
>  [ ] RIP: 0010:pci_stop_and_remove_bus_device+0x4c/0x50
>  [ ] Call Trace:
>  [ ]  <TASK>
>  [ ]  pci_iov_remove_virtfn+0xd1/0x140
>  [ ]  sriov_disable+0x42/0x100
>  [ ]  pci_disable_sriov+0x34/0x50
>  [ ]  xe_pci_sriov_configure+0x2d0/0x1150 [xe]
>  [ ]  xe_pci_remove+0x7c/0x190 [xe]
>  [ ]  pci_device_remove+0x41/0xb0
>  [ ]  device_remove+0x43/0x80
>  [ ]  device_release_driver_internal+0x215/0x280
>  [ ]  driver_detach+0x50/0xb0
>  [ ]  bus_remove_driver+0x86/0x120
>  [ ]  driver_unregister+0x2f/0x60
>  [ ]  pci_unregister_driver+0x22/0xc0
>  [ ]  xe_unregister_pci_driver+0x15/0x20 [xe]
>  [ ]  xe_exit+0x1f/0x34 [xe]
> 
> Fix that by taking a pci_rescan_remove_lock as it is now expected.
> 
> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> ---
>  drivers/gpu/drm/xe/xe_pci.c       |  2 +-
>  drivers/gpu/drm/xe/xe_pci_sriov.c | 20 ++++++++++++++++++++
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> index 3ac99472d6dd..fb0abd768e67 100644
> --- a/drivers/gpu/drm/xe/xe_pci.c
> +++ b/drivers/gpu/drm/xe/xe_pci.c
> @@ -1010,7 +1010,7 @@ static void xe_pci_remove(struct pci_dev *pdev)
>  	struct xe_device *xe = pdev_to_xe_device(pdev);
>  
>  	if (IS_SRIOV_PF(xe))
> -		xe_pci_sriov_configure(pdev, 0);
> +		xe_pci_sriov_disable_vfs(pdev);
>  
>  	if (xe_survivability_mode_is_boot_enabled(xe))
>  		return;
> diff --git a/drivers/gpu/drm/xe/xe_pci_sriov.c b/drivers/gpu/drm/xe/xe_pci_sriov.c
> index 3fd22034f03e..2a3fd3578ef2 100644
> --- a/drivers/gpu/drm/xe/xe_pci_sriov.c
> +++ b/drivers/gpu/drm/xe/xe_pci_sriov.c
> @@ -239,6 +239,26 @@ int xe_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
>  		return pf_disable_vfs(xe);
>  }
>  
> +/**
> + * xe_pci_sriov_disable_vfs() - Disable all VFs.
> + * @pdev: the PF's &pci_dev
> + *
> + * This is a simple wrapper around our function that implements the
> + * pci_driver.sriov_configure hook but also takes a required mutex.
> + *
> + * Return: 0 on success or a negative error code on failure.
> + */
> +int xe_pci_sriov_disable_vfs(struct pci_dev *pdev)
> +{
> +	int ret;
> +
> +	pci_lock_rescan_remove();
> +	ret = xe_pci_sriov_configure(pdev, 0);
> +	pci_unlock_rescan_remove();
> +
> +	return ret;
> +}
> +
LGTM:
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>


>  /**
>   * xe_pci_sriov_get_vf_pdev() - Lookup the VF's PCI device using the VF identifier.
>   * @pdev: the PF's &pci_dev
> -- 
> 2.47.1
> 

-- 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
  2026-03-02  9:43 ` [PATCH] " Piotr Piórkowski
@ 2026-03-02 20:36   ` Matthew Brost
  2026-03-02 21:47     ` Michal Wajdeczko
  0 siblings, 1 reply; 7+ messages in thread
From: Matthew Brost @ 2026-03-02 20:36 UTC (permalink / raw)
  To: Piotr Piórkowski; +Cc: Michal Wajdeczko, intel-xe

On Mon, Mar 02, 2026 at 10:43:36AM +0100, Piotr Piórkowski wrote:
> Michal Wajdeczko <michal.wajdeczko@intel.com> wrote on pią [2026-lut-27 22:40:47 +0100]:
> > Since recent commit a5338e365c45 ("PCI/IOV: Fix race between SR-IOV
> > enable/disable and hotplug") the driver pci_driver.sriov_configure
> > hook is called with the mutex pci_rescan_remove_lock already taken.
> > 
> > As we are using this hook as-is during driver removal, we get:
> > 
> >  [ ] xe 0000:4d:00.0: [drm:xe_pci_sriov_configure [xe]] PF: disabling 1 VF
> >  [ ] ------------[ cut here ]------------
> >  [ ] debug_locks && !(lock_is_held(&(&pci_rescan_remove_lock)->dep_map) != 0)
> >  [ ] WARNING: drivers/pci/remove.c:130 at pci_stop_and_remove_bus_device+0x4c/0x50, CPU#32: rmmod/6476
> >  [ ] RIP: 0010:pci_stop_and_remove_bus_device+0x4c/0x50
> >  [ ] Call Trace:
> >  [ ]  <TASK>
> >  [ ]  pci_iov_remove_virtfn+0xd1/0x140
> >  [ ]  sriov_disable+0x42/0x100
> >  [ ]  pci_disable_sriov+0x34/0x50
> >  [ ]  xe_pci_sriov_configure+0x2d0/0x1150 [xe]
> >  [ ]  xe_pci_remove+0x7c/0x190 [xe]
> >  [ ]  pci_device_remove+0x41/0xb0
> >  [ ]  device_remove+0x43/0x80
> >  [ ]  device_release_driver_internal+0x215/0x280
> >  [ ]  driver_detach+0x50/0xb0
> >  [ ]  bus_remove_driver+0x86/0x120
> >  [ ]  driver_unregister+0x2f/0x60
> >  [ ]  pci_unregister_driver+0x22/0xc0
> >  [ ]  xe_unregister_pci_driver+0x15/0x20 [xe]
> >  [ ]  xe_exit+0x1f/0x34 [xe]
> > 
> > Fix that by taking a pci_rescan_remove_lock as it is now expected.
> > 
> > Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > ---
> >  drivers/gpu/drm/xe/xe_pci.c       |  2 +-
> >  drivers/gpu/drm/xe/xe_pci_sriov.c | 20 ++++++++++++++++++++
> >  2 files changed, 21 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > index 3ac99472d6dd..fb0abd768e67 100644
> > --- a/drivers/gpu/drm/xe/xe_pci.c
> > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > @@ -1010,7 +1010,7 @@ static void xe_pci_remove(struct pci_dev *pdev)
> >  	struct xe_device *xe = pdev_to_xe_device(pdev);
> >  
> >  	if (IS_SRIOV_PF(xe))
> > -		xe_pci_sriov_configure(pdev, 0);
> > +		xe_pci_sriov_disable_vfs(pdev);
> >  
> >  	if (xe_survivability_mode_is_boot_enabled(xe))
> >  		return;
> > diff --git a/drivers/gpu/drm/xe/xe_pci_sriov.c b/drivers/gpu/drm/xe/xe_pci_sriov.c
> > index 3fd22034f03e..2a3fd3578ef2 100644
> > --- a/drivers/gpu/drm/xe/xe_pci_sriov.c
> > +++ b/drivers/gpu/drm/xe/xe_pci_sriov.c
> > @@ -239,6 +239,26 @@ int xe_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
> >  		return pf_disable_vfs(xe);
> >  }
> >  
> > +/**
> > + * xe_pci_sriov_disable_vfs() - Disable all VFs.
> > + * @pdev: the PF's &pci_dev
> > + *
> > + * This is a simple wrapper around our function that implements the
> > + * pci_driver.sriov_configure hook but also takes a required mutex.
> > + *
> > + * Return: 0 on success or a negative error code on failure.
> > + */
> > +int xe_pci_sriov_disable_vfs(struct pci_dev *pdev)
> > +{
> > +	int ret;
> > +
> > +	pci_lock_rescan_remove();
> > +	ret = xe_pci_sriov_configure(pdev, 0);
> > +	pci_unlock_rescan_remove();
> > +
> > +	return ret;
> > +}
> > +
> LGTM:
> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
> 

The lockdep reasons and the placement of the lock make sense to me.

I do have a question though, as I’m a little concerned about our driver
having to take a lock like pci_lock_rescan_remove...

Why is the xe_pci_sriov_disable_vfs call needed in pci_driver.remove?
In other words, why doesn’t the PCI core call pci_driver.sriov_configure
first?

I don’t see many examples of device drivers having to call
pci_lock_rescan_remove [1], which is why I’m asking. I’m wondering whether we
are missing an accepted flow for SR-IOV, and whether the need to take
pci_lock_rescan_remove just to silence lockdep is pointing to a larger
issue.

Matt

[1] https://elixir.bootlin.com/linux/v6.19.3/A/ident/pci_lock_rescan_remove

> 
> >  /**
> >   * xe_pci_sriov_get_vf_pdev() - Lookup the VF's PCI device using the VF identifier.
> >   * @pdev: the PF's &pci_dev
> > -- 
> > 2.47.1
> > 
> 
> -- 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
  2026-03-02 20:36   ` Matthew Brost
@ 2026-03-02 21:47     ` Michal Wajdeczko
  2026-03-02 22:51       ` Matthew Brost
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Wajdeczko @ 2026-03-02 21:47 UTC (permalink / raw)
  To: Matthew Brost; +Cc: intel-xe, Piotr Piórkowski



On 3/2/2026 9:36 PM, Matthew Brost wrote:
> On Mon, Mar 02, 2026 at 10:43:36AM +0100, Piotr Piórkowski wrote:
>> Michal Wajdeczko <michal.wajdeczko@intel.com> wrote on pią [2026-lut-27 22:40:47 +0100]:
>>> Since recent commit a5338e365c45 ("PCI/IOV: Fix race between SR-IOV
>>> enable/disable and hotplug") the driver pci_driver.sriov_configure
>>> hook is called with the mutex pci_rescan_remove_lock already taken.
>>>
>>> As we are using this hook as-is during driver removal, we get:
>>>
>>>  [ ] xe 0000:4d:00.0: [drm:xe_pci_sriov_configure [xe]] PF: disabling 1 VF
>>>  [ ] ------------[ cut here ]------------
>>>  [ ] debug_locks && !(lock_is_held(&(&pci_rescan_remove_lock)->dep_map) != 0)
>>>  [ ] WARNING: drivers/pci/remove.c:130 at pci_stop_and_remove_bus_device+0x4c/0x50, CPU#32: rmmod/6476
>>>  [ ] RIP: 0010:pci_stop_and_remove_bus_device+0x4c/0x50
>>>  [ ] Call Trace:
>>>  [ ]  <TASK>
>>>  [ ]  pci_iov_remove_virtfn+0xd1/0x140
>>>  [ ]  sriov_disable+0x42/0x100
>>>  [ ]  pci_disable_sriov+0x34/0x50
>>>  [ ]  xe_pci_sriov_configure+0x2d0/0x1150 [xe]
>>>  [ ]  xe_pci_remove+0x7c/0x190 [xe]
>>>  [ ]  pci_device_remove+0x41/0xb0
>>>  [ ]  device_remove+0x43/0x80
>>>  [ ]  device_release_driver_internal+0x215/0x280
>>>  [ ]  driver_detach+0x50/0xb0
>>>  [ ]  bus_remove_driver+0x86/0x120
>>>  [ ]  driver_unregister+0x2f/0x60
>>>  [ ]  pci_unregister_driver+0x22/0xc0
>>>  [ ]  xe_unregister_pci_driver+0x15/0x20 [xe]
>>>  [ ]  xe_exit+0x1f/0x34 [xe]
>>>
>>> Fix that by taking a pci_rescan_remove_lock as it is now expected.
>>>
>>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
>>> ---
>>>  drivers/gpu/drm/xe/xe_pci.c       |  2 +-
>>>  drivers/gpu/drm/xe/xe_pci_sriov.c | 20 ++++++++++++++++++++
>>>  2 files changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
>>> index 3ac99472d6dd..fb0abd768e67 100644
>>> --- a/drivers/gpu/drm/xe/xe_pci.c
>>> +++ b/drivers/gpu/drm/xe/xe_pci.c
>>> @@ -1010,7 +1010,7 @@ static void xe_pci_remove(struct pci_dev *pdev)
>>>  	struct xe_device *xe = pdev_to_xe_device(pdev);
>>>  
>>>  	if (IS_SRIOV_PF(xe))
>>> -		xe_pci_sriov_configure(pdev, 0);
>>> +		xe_pci_sriov_disable_vfs(pdev);
>>>  
>>>  	if (xe_survivability_mode_is_boot_enabled(xe))
>>>  		return;
>>> diff --git a/drivers/gpu/drm/xe/xe_pci_sriov.c b/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> index 3fd22034f03e..2a3fd3578ef2 100644
>>> --- a/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> +++ b/drivers/gpu/drm/xe/xe_pci_sriov.c
>>> @@ -239,6 +239,26 @@ int xe_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
>>>  		return pf_disable_vfs(xe);
>>>  }
>>>  
>>> +/**
>>> + * xe_pci_sriov_disable_vfs() - Disable all VFs.
>>> + * @pdev: the PF's &pci_dev
>>> + *
>>> + * This is a simple wrapper around our function that implements the
>>> + * pci_driver.sriov_configure hook but also takes a required mutex.
>>> + *
>>> + * Return: 0 on success or a negative error code on failure.
>>> + */
>>> +int xe_pci_sriov_disable_vfs(struct pci_dev *pdev)
>>> +{
>>> +	int ret;
>>> +
>>> +	pci_lock_rescan_remove();
>>> +	ret = xe_pci_sriov_configure(pdev, 0);
>>> +	pci_unlock_rescan_remove();
>>> +
>>> +	return ret;
>>> +}
>>> +
>> LGTM:
>> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
>>
> 
> The lockdep reasons and the placement of the lock make sense to me.
> 
> I do have a question though, as I’m a little concerned about our driver
> having to take a lock like pci_lock_rescan_remove...
> 
> Why is the xe_pci_sriov_disable_vfs call needed in pci_driver.remove?
> In other words, why doesn’t the PCI core call pci_driver.sriov_configure
> first?

dunno, but it does complain [1] when PF leaves VFs enabled

[1] https://elixir.bootlin.com/linux/v6.19.3/source/drivers/pci/iov.c#L1027

> 
> I don’t see many examples of device drivers having to call
> pci_lock_rescan_remove [1], which is why I’m asking. I’m wondering whether we
> are missing an accepted flow for SR-IOV, and whether the need to take
> pci_lock_rescan_remove just to silence lockdep is pointing to a larger
> issue.

at first I was assuming that there is just a new expectation for the
driver, but indeed it looks that we have a larger problem, as our CI
also found that in slightly different scenario [2] this mutex is already
taken by the PCI subsystem when calling the .remove callback:

 [ ]  Possible unsafe locking scenario:
 [ ]        CPU0
 [ ]        ----
 [ ]   lock(pci_rescan_remove_lock);
 [ ]   lock(pci_rescan_remove_lock);
 [ ] 
 *** DEADLOCK ***

 [ ] Call Trace:
 [ ]  dump_stack_lvl+0x91/0xf0
 [ ]  dump_stack+0x10/0x20
 [ ]  print_deadlock_bug+0x23f/0x320
 [ ]  __lock_acquire+0x146e/0x2790
 [ ]  lock_acquire+0xc4/0x2f0
 [ ]  __mutex_lock+0xb2/0x10e0
 [ ]  mutex_lock_nested+0x1b/0x30
 [ ]  pci_lock_rescan_remove+0x17/0x30
 [ ]  xe_pci_sriov_disable_vfs+0x12/0x40 [xe]
 [ ]  xe_pci_remove+0x7a/0x180 [xe]
 [ ]  pci_device_remove+0x41/0xb0
 [ ]  device_remove+0x43/0x80
 [ ]  device_release_driver_internal+0x215/0x280
 [ ]  device_release_driver+0x12/0x20
 [ ]  pci_stop_bus_device+0x69/0x90
 [ ]  pci_stop_and_remove_bus_device_locked+0x24/0x60
 [ ]  remove_store+0x85/0xa0
 [ ]  dev_attr_store+0x17/0x40

but it looks that this issue was already noticed and discussed [3]
in the PCI level, so lets wait and see how this will be solved.

[2] https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-162346v2/shard-bmg-7/igt@core_hotunplug@unplug-rescan.html
[3] https://lore.kernel.org/linux-pci/20260228120138.51197-4-ionut.nechita@windriver.com/


> 
> Matt
> 
> [1] https://elixir.bootlin.com/linux/v6.19.3/A/ident/pci_lock_rescan_remove
> 
>>
>>>  /**
>>>   * xe_pci_sriov_get_vf_pdev() - Lookup the VF's PCI device using the VF identifier.
>>>   * @pdev: the PF's &pci_dev
>>> -- 
>>> 2.47.1
>>>
>>
>> -- 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs
  2026-03-02 21:47     ` Michal Wajdeczko
@ 2026-03-02 22:51       ` Matthew Brost
  0 siblings, 0 replies; 7+ messages in thread
From: Matthew Brost @ 2026-03-02 22:51 UTC (permalink / raw)
  To: Michal Wajdeczko; +Cc: intel-xe, Piotr Piórkowski

On Mon, Mar 02, 2026 at 10:47:41PM +0100, Michal Wajdeczko wrote:
> 
> 
> On 3/2/2026 9:36 PM, Matthew Brost wrote:
> > On Mon, Mar 02, 2026 at 10:43:36AM +0100, Piotr Piórkowski wrote:
> >> Michal Wajdeczko <michal.wajdeczko@intel.com> wrote on pią [2026-lut-27 22:40:47 +0100]:
> >>> Since recent commit a5338e365c45 ("PCI/IOV: Fix race between SR-IOV
> >>> enable/disable and hotplug") the driver pci_driver.sriov_configure
> >>> hook is called with the mutex pci_rescan_remove_lock already taken.
> >>>
> >>> As we are using this hook as-is during driver removal, we get:
> >>>
> >>>  [ ] xe 0000:4d:00.0: [drm:xe_pci_sriov_configure [xe]] PF: disabling 1 VF
> >>>  [ ] ------------[ cut here ]------------
> >>>  [ ] debug_locks && !(lock_is_held(&(&pci_rescan_remove_lock)->dep_map) != 0)
> >>>  [ ] WARNING: drivers/pci/remove.c:130 at pci_stop_and_remove_bus_device+0x4c/0x50, CPU#32: rmmod/6476
> >>>  [ ] RIP: 0010:pci_stop_and_remove_bus_device+0x4c/0x50
> >>>  [ ] Call Trace:
> >>>  [ ]  <TASK>
> >>>  [ ]  pci_iov_remove_virtfn+0xd1/0x140
> >>>  [ ]  sriov_disable+0x42/0x100
> >>>  [ ]  pci_disable_sriov+0x34/0x50
> >>>  [ ]  xe_pci_sriov_configure+0x2d0/0x1150 [xe]
> >>>  [ ]  xe_pci_remove+0x7c/0x190 [xe]
> >>>  [ ]  pci_device_remove+0x41/0xb0
> >>>  [ ]  device_remove+0x43/0x80
> >>>  [ ]  device_release_driver_internal+0x215/0x280
> >>>  [ ]  driver_detach+0x50/0xb0
> >>>  [ ]  bus_remove_driver+0x86/0x120
> >>>  [ ]  driver_unregister+0x2f/0x60
> >>>  [ ]  pci_unregister_driver+0x22/0xc0
> >>>  [ ]  xe_unregister_pci_driver+0x15/0x20 [xe]
> >>>  [ ]  xe_exit+0x1f/0x34 [xe]
> >>>
> >>> Fix that by taking a pci_rescan_remove_lock as it is now expected.
> >>>
> >>> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
> >>> ---
> >>>  drivers/gpu/drm/xe/xe_pci.c       |  2 +-
> >>>  drivers/gpu/drm/xe/xe_pci_sriov.c | 20 ++++++++++++++++++++
> >>>  2 files changed, 21 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> >>> index 3ac99472d6dd..fb0abd768e67 100644
> >>> --- a/drivers/gpu/drm/xe/xe_pci.c
> >>> +++ b/drivers/gpu/drm/xe/xe_pci.c
> >>> @@ -1010,7 +1010,7 @@ static void xe_pci_remove(struct pci_dev *pdev)
> >>>  	struct xe_device *xe = pdev_to_xe_device(pdev);
> >>>  
> >>>  	if (IS_SRIOV_PF(xe))
> >>> -		xe_pci_sriov_configure(pdev, 0);
> >>> +		xe_pci_sriov_disable_vfs(pdev);
> >>>  
> >>>  	if (xe_survivability_mode_is_boot_enabled(xe))
> >>>  		return;
> >>> diff --git a/drivers/gpu/drm/xe/xe_pci_sriov.c b/drivers/gpu/drm/xe/xe_pci_sriov.c
> >>> index 3fd22034f03e..2a3fd3578ef2 100644
> >>> --- a/drivers/gpu/drm/xe/xe_pci_sriov.c
> >>> +++ b/drivers/gpu/drm/xe/xe_pci_sriov.c
> >>> @@ -239,6 +239,26 @@ int xe_pci_sriov_configure(struct pci_dev *pdev, int num_vfs)
> >>>  		return pf_disable_vfs(xe);
> >>>  }
> >>>  
> >>> +/**
> >>> + * xe_pci_sriov_disable_vfs() - Disable all VFs.
> >>> + * @pdev: the PF's &pci_dev
> >>> + *
> >>> + * This is a simple wrapper around our function that implements the
> >>> + * pci_driver.sriov_configure hook but also takes a required mutex.
> >>> + *
> >>> + * Return: 0 on success or a negative error code on failure.
> >>> + */
> >>> +int xe_pci_sriov_disable_vfs(struct pci_dev *pdev)
> >>> +{
> >>> +	int ret;
> >>> +
> >>> +	pci_lock_rescan_remove();
> >>> +	ret = xe_pci_sriov_configure(pdev, 0);
> >>> +	pci_unlock_rescan_remove();
> >>> +
> >>> +	return ret;
> >>> +}
> >>> +
> >> LGTM:
> >> Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
> >>
> > 
> > The lockdep reasons and the placement of the lock make sense to me.
> > 
> > I do have a question though, as I’m a little concerned about our driver
> > having to take a lock like pci_lock_rescan_remove...
> > 
> > Why is the xe_pci_sriov_disable_vfs call needed in pci_driver.remove?
> > In other words, why doesn’t the PCI core call pci_driver.sriov_configure
> > first?
> 
> dunno, but it does complain [1] when PF leaves VFs enabled
> 
> [1] https://elixir.bootlin.com/linux/v6.19.3/source/drivers/pci/iov.c#L1027
> 
> > 
> > I don’t see many examples of device drivers having to call
> > pci_lock_rescan_remove [1], which is why I’m asking. I’m wondering whether we
> > are missing an accepted flow for SR-IOV, and whether the need to take
> > pci_lock_rescan_remove just to silence lockdep is pointing to a larger
> > issue.
> 
> at first I was assuming that there is just a new expectation for the
> driver, but indeed it looks that we have a larger problem, as our CI
> also found that in slightly different scenario [2] this mutex is already
> taken by the PCI subsystem when calling the .remove callback:
> 
>  [ ]  Possible unsafe locking scenario:
>  [ ]        CPU0
>  [ ]        ----
>  [ ]   lock(pci_rescan_remove_lock);
>  [ ]   lock(pci_rescan_remove_lock);
>  [ ] 
>  *** DEADLOCK ***
> 
>  [ ] Call Trace:
>  [ ]  dump_stack_lvl+0x91/0xf0
>  [ ]  dump_stack+0x10/0x20
>  [ ]  print_deadlock_bug+0x23f/0x320
>  [ ]  __lock_acquire+0x146e/0x2790
>  [ ]  lock_acquire+0xc4/0x2f0
>  [ ]  __mutex_lock+0xb2/0x10e0
>  [ ]  mutex_lock_nested+0x1b/0x30
>  [ ]  pci_lock_rescan_remove+0x17/0x30
>  [ ]  xe_pci_sriov_disable_vfs+0x12/0x40 [xe]
>  [ ]  xe_pci_remove+0x7a/0x180 [xe]
>  [ ]  pci_device_remove+0x41/0xb0
>  [ ]  device_remove+0x43/0x80
>  [ ]  device_release_driver_internal+0x215/0x280
>  [ ]  device_release_driver+0x12/0x20
>  [ ]  pci_stop_bus_device+0x69/0x90
>  [ ]  pci_stop_and_remove_bus_device_locked+0x24/0x60
>  [ ]  remove_store+0x85/0xa0
>  [ ]  dev_attr_store+0x17/0x40
> 
> but it looks that this issue was already noticed and discussed [3]
> in the PCI level, so lets wait and see how this will be solved.
> 
> [2] https://intel-gfx-ci.01.org/tree/intel-xe/xe-pw-162346v2/shard-bmg-7/igt@core_hotunplug@unplug-rescan.html
> [3] https://lore.kernel.org/linux-pci/20260228120138.51197-4-ionut.nechita@windriver.com/
> 

+1. Generally taking PCI level (or core) locks in drivers is a big red
flag. Let's see how the above gets resolved.

Matt

> 
> > 
> > Matt
> > 
> > [1] https://elixir.bootlin.com/linux/v6.19.3/A/ident/pci_lock_rescan_remove
> > 
> >>
> >>>  /**
> >>>   * xe_pci_sriov_get_vf_pdev() - Lookup the VF's PCI device using the VF identifier.
> >>>   * @pdev: the PF's &pci_dev
> >>> -- 
> >>> 2.47.1
> >>>
> >>
> >> -- 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-03-02 22:52 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-27 21:40 [PATCH] drm/xe/pf: Take pci_rescan_remove_lock mutex when disabling VFs Michal Wajdeczko
2026-02-27 21:54 ` ✗ CI.checkpatch: warning for " Patchwork
2026-02-27 21:54 ` ✗ CI.KUnit: failure " Patchwork
2026-03-02  9:43 ` [PATCH] " Piotr Piórkowski
2026-03-02 20:36   ` Matthew Brost
2026-03-02 21:47     ` Michal Wajdeczko
2026-03-02 22:51       ` Matthew Brost

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox