regressions.lists.linux.dev archive mirror
* [REGRESSION] vfio gpu passthrough stopped working
@ 2025-08-07 15:31 cat
  2025-08-07 15:52 ` Greg KH
  0 siblings, 1 reply; 6+ messages in thread
From: cat @ 2025-08-07 15:31 UTC (permalink / raw)
  Cc: regressions, stable

#regzbot introduced: v6.12.34..v6.12.35

After upgrading to kernel 6.12.35, vfio passthrough for my GPU stopped working in a Windows VM: the guest sees the device in Device Manager but reports that it did not start correctly. I compared lspci logs in the VM before and after the upgrade to 6.12.35, and here are the changes I noticed:

- the reported link speed for the passthrough GPU has changed from 2.5 to 16GT/s
- the passthrough GPU has lost its 'BusMaster' and MSI enable flags
- a latency measurement feature appeared

These entries also began appearing in dmesg within the VM when the host kernel is 6.12.35 or above:

[    1.963177] nouveau 0000:01:00.0: sec2(gsp): mbox 1c503000 00000001
[    1.963296] nouveau 0000:01:00.0: sec2(gsp):booter-load: boot failed: -5
...
[    1.964580] nouveau 0000:01:00.0: gsp: init failed, -5
[    1.964641] nouveau 0000:01:00.0: init failed with -5
[    1.964681] nouveau: drm:00000000:00000080: init failed with -5
[    1.964721] nouveau 0000:01:00.0: drm: Device allocation failed: -5
[    1.966318] nouveau 0000:01:00.0: probe with driver nouveau failed with error -5
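
The -5 at the end of these lines follows the kernel convention of reporting a negated errno value, and errno 5 is EIO (I/O error). A minimal Python sketch, assuming the guest log was saved as text, that collects the codes from the "failed" lines and names them; the function name `decode_failures` is hypothetical and the sample is a subset of the lines above:

```python
import errno
import re

def decode_failures(log_text):
    """Map each numeric code found on a 'failed ... -N' line to its errno name."""
    codes = set()
    for line in log_text.splitlines():
        # Match a trailing "-N" on lines that mention a failure.
        m = re.search(r"failed\b.*?-(\d+)\s*$", line)
        if m:
            codes.add(int(m.group(1)))
    return {c: errno.errorcode.get(c, "unknown") for c in sorted(codes)}

LOG = """\
[    1.963296] nouveau 0000:01:00.0: sec2(gsp):booter-load: boot failed: -5
[    1.964580] nouveau 0000:01:00.0: gsp: init failed, -5
[    1.966318] nouveau 0000:01:00.0: probe with driver nouveau failed with error -5
"""

if __name__ == "__main__":
    print(decode_failures(LOG))
```

The name only says that an I/O operation failed somewhere during GSP boot; it does not by itself point at the IOMMU or at vfio.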


6.12.34 worked fine, and the latest 6.12 LTS release does not work either. I am using an Intel CPU and an NVIDIA GPU (for passthrough, and as my GPU on the Linux system).
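
The before/after lspci comparison described in this report can be automated. What follows is a minimal Python sketch, assuming the `lspci -vv` output for the GPU was saved as text on each kernel. The helper names `summarize` and `changed_fields` are hypothetical, and the two sample strings are written in the usual `lspci -vv` layout for illustration; they are not the reporter's actual output.

```python
import re

def summarize(lspci_text):
    """Extract the fields compared in the report from `lspci -vv` text."""
    def flag(pattern):
        m = re.search(pattern, lspci_text)
        return None if m is None else m.group(1) == "+"

    speed = re.search(r"LnkSta:\s*Speed\s+([\d.]+GT/s)", lspci_text)
    return {
        "bus_master": flag(r"BusMaster([+-])"),
        "msi_enable": flag(r"MSI:\s*Enable([+-])"),
        "link_speed": speed.group(1) if speed else None,
    }

def changed_fields(before, after):
    """Return {field: (old, new)} for every field that differs."""
    a, b = summarize(before), summarize(after)
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}

# Hypothetical samples, not taken from the thread.
BEFORE = """\
Control: I/O+ Mem+ BusMaster+ SpecCycle-
LnkSta: Speed 2.5GT/s, Width x16
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
"""
AFTER = """\
Control: I/O+ Mem+ BusMaster- SpecCycle-
LnkSta: Speed 16GT/s, Width x16
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
"""

if __name__ == "__main__":
    for field, (old, new) in changed_fields(BEFORE, AFTER).items():
        print(f"{field}: {old} -> {new}")
```

Only the three fields named in the report are extracted; the "latency measurement feature" is left out because the report does not say which capability line it refers to.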


* Re: [REGRESSION] vfio gpu passthrough stopped working
  2025-08-07 15:31 [REGRESSION] vfio gpu passthrough stopped working cat
@ 2025-08-07 15:52 ` Greg KH
  2025-08-07 18:31   ` Harshit Mogalapalli
  2025-08-08  4:40   ` cat
  0 siblings, 2 replies; 6+ messages in thread
From: Greg KH @ 2025-08-07 15:52 UTC (permalink / raw)
  To: cat; +Cc: regressions, stable

On Thu, Aug 07, 2025 at 03:31:17PM +0000, cat wrote:
> #regzbot introduced: v6.12.34..v6.12.35
> 
> After upgrade to kernel 6.12.35, vfio passthrough for my GPU has stopped working within a windows VM, it sees device in device manager but reports that it did not start correctly. I compared lspci logs in the vm before and after upgrade to 6.12.35, and here are the changes I noticed:
> 
> - the reported link speed for the passthrough GPU has changed from 2.5 to 16GT/s
> - the passthrough GPU has lost it's 'BusMaster' and MSI enable flags
> - latency measurement feature appeared
> 
> These entries also began appearing within the vm in dmesg when host kernel is 6.12.35 or above:
> 
> [    1.963177] nouveau 0000:01:00.0: sec2(gsp): mbox 1c503000 00000001
> [    1.963296] nouveau 0000:01:00.0: sec2(gsp):booter-load: boot failed: -5
> ...
> [    1.964580] nouveau 0000:01:00.0: gsp: init failed, -5
> [    1.964641] nouveau 0000:01:00.0: init failed with -5
> [    1.964681] nouveau: drm:00000000:00000080: init failed with -5
> [    1.964721] nouveau 0000:01:00.0: drm: Device allocation failed: -5
> [    1.966318] nouveau 0000:01:00.0: probe with driver nouveau failed with error -5
> 
> 
> 6.12.34 worked fine, and latest 6.12 LTS does not work either. I am using intel CPU and nvidia GPU (for passthrough, and as my GPU on linux system).

Can you use git bisect to find the offending commit?
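
For readers unfamiliar with it, git bisect runs a binary search over the commits between a known-good and a known-bad revision. The Python sketch below illustrates only that search idea on a linear list; it is not git's implementation (real history is a graph), and the commit labels `c1` to `c4`, the function name `first_bad`, and the `is_bad` callback are hypothetical.

```python
from typing import Callable, Sequence

def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known good, hi: oldest known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid   # regression is at mid or earlier
        else:
            lo = mid   # regression is after mid
    return commits[hi]

# Hypothetical history between the two releases named in the report.
HISTORY = ["v6.12.34", "c1", "c2", "c3", "c4", "v6.12.35"]

def demo_is_bad(commit: str) -> bool:
    # Stand-in for "build, boot, and test passthrough"; pretends c3 broke it.
    return HISTORY.index(commit) >= HISTORY.index("c3")

if __name__ == "__main__":
    print(first_bad(HISTORY, demo_is_bad))
```

Each call to `is_bad` corresponds to one build-and-boot cycle, so the number of kernels to test grows with the logarithm of the number of commits in the range.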




* Re: [REGRESSION] vfio gpu passthrough stopped working
  2025-08-07 15:52 ` Greg KH
@ 2025-08-07 18:31   ` Harshit Mogalapalli
  2025-08-08  4:40   ` cat
  1 sibling, 0 replies; 6+ messages in thread
From: Harshit Mogalapalli @ 2025-08-07 18:31 UTC (permalink / raw)
  To: Greg KH, cat; +Cc: regressions, stable

Hi,

On 07/08/25 21:22, Greg KH wrote:
> On Thu, Aug 07, 2025 at 03:31:17PM +0000, cat wrote:
>> #regzbot introduced: v6.12.34..v6.12.35
>>
>> After upgrade to kernel 6.12.35, vfio passthrough for my GPU has stopped working within a windows VM, it sees device in device manager but reports that it did not start correctly. I compared lspci logs in the vm before and after upgrade to 6.12.35, and here are the changes I noticed:
>>
>> - the reported link speed for the passthrough GPU has changed from 2.5 to 16GT/s
>> - the passthrough GPU has lost it's 'BusMaster' and MSI enable flags
>> - latency measurement feature appeared
>>
>> These entries also began appearing within the vm in dmesg when host kernel is 6.12.35 or above:
>>
>> [    1.963177] nouveau 0000:01:00.0: sec2(gsp): mbox 1c503000 00000001
>> [    1.963296] nouveau 0000:01:00.0: sec2(gsp):booter-load: boot failed: -5
>> ...
>> [    1.964580] nouveau 0000:01:00.0: gsp: init failed, -5
>> [    1.964641] nouveau 0000:01:00.0: init failed with -5
>> [    1.964681] nouveau: drm:00000000:00000080: init failed with -5
>> [    1.964721] nouveau 0000:01:00.0: drm: Device allocation failed: -5
>> [    1.966318] nouveau 0000:01:00.0: probe with driver nouveau failed with error -5
>>
>>
>> 6.12.34 worked fine, and latest 6.12 LTS does not work either. I am using intel CPU and nvidia GPU (for passthrough, and as my GPU on linux system).
> 
> Can you use git bisect to find the offending commit?

Additional note: I looked at the log and am listing some probably relevant
commits, in case bisection is too costly:

68e58f579121 PCI: dwc: ep: Correct PBA offset in .set_msix() callback
523815857b1e PCI: cadence-ep: Correct PBA offset in .set_msix() callback

These two might be interesting ones to consider. Please ignore this note 
if bisection is already in progress as these are pure guesses.


Thanks,
Harshit


* Re: [REGRESSION] vfio gpu passthrough stopped working
  2025-08-07 15:52 ` Greg KH
  2025-08-07 18:31   ` Harshit Mogalapalli
@ 2025-08-08  4:40   ` cat
  2025-08-08  9:00     ` cat
  1 sibling, 1 reply; 6+ messages in thread
From: cat @ 2025-08-08  4:40 UTC (permalink / raw)
  To: Greg KH; +Cc: regressions, stable

I will perform bisection, yes.



* Re: [REGRESSION] vfio gpu passthrough stopped working
  2025-08-08  4:40   ` cat
@ 2025-08-08  9:00     ` cat
  2025-08-08  9:22       ` Harshit Mogalapalli
  0 siblings, 1 reply; 6+ messages in thread
From: cat @ 2025-08-08  9:00 UTC (permalink / raw)
  To: Greg KH, regressions, stable

fb5873b779dd5858123c19bbd6959566771e2e83 is the first bad commit
commit fb5873b779dd5858123c19bbd6959566771e2e83
Author: Lu Baolu <baolu.lu@linux.intel.com>
Date:   Tue May 20 15:58:49 2025 +0800

     iommu/vt-d: Restore context entry setup order for aliased devices

     commit 320302baed05c6456164652541f23d2a96522c06 upstream.

     Commit 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
     changed the context entry setup during domain attachment from a
     set-and-check policy to a clear-and-reset approach. This inadvertently
     introduced a regression affecting PCI aliased devices behind PCIe-to-PCI
     bridges.

     Specifically, keyboard and touchpad stopped working on several Apple
     Macbooks with below messages:

      kernel: platform pxa2xx-spi.3: Adding to iommu group 20
      kernel: input: Apple SPI Keyboard as
  /devices/pci0000:00/0000:00:1e.3/pxa2xx-spi.3/spi_master/spi2/spi-APP000D:00/input/input0
      kernel: DMAR: DRHD: handling fault status reg 3
      kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
      0xffffa000 [fault reason 0x06] PTE Read access is not set
      kernel: DMAR: DRHD: handling fault status reg 3
      kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
      0xffffa000 [fault reason 0x06] PTE Read access is not set
      kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00
      kernel: DMAR: DRHD: handling fault status reg 3
      kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
      0xffffa000 [fault reason 0x06] PTE Read access is not set
      kernel: DMAR: DRHD: handling fault status reg 3
      kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00

     Fix this by restoring the previous context setup order.

     Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
     Closes: https://lore.kernel.org/all/4dada48a-c5dd-4c30-9c85-5b03b0aa01f0@bfh.ch/
     Cc: stable@vger.kernel.org
     Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
     Reviewed-by: Kevin Tian <kevin.tian@intel.com>
     Reviewed-by: Yi Liu <yi.l.liu@intel.com>
     Link: https://lore.kernel.org/r/20250514060523.2862195-1-baolu.lu@linux.intel.com
     Link: https://lore.kernel.org/r/20250520075849.755012-2-baolu.lu@linux.intel.com
     Signed-off-by: Joerg Roedel <jroedel@suse.de>
     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

  drivers/iommu/intel/iommu.c  | 11 +++++++++++
  drivers/iommu/intel/iommu.h  |  1 +
  drivers/iommu/intel/nested.c |  4 ++--
  3 files changed, 14 insertions(+), 2 deletions(-)



* Re: [REGRESSION] vfio gpu passthrough stopped working
  2025-08-08  9:00     ` cat
@ 2025-08-08  9:22       ` Harshit Mogalapalli
  0 siblings, 0 replies; 6+ messages in thread
From: Harshit Mogalapalli @ 2025-08-08  9:22 UTC (permalink / raw)
  To: cat, Greg KH, regressions, stable

Hi,


On 08/08/25 14:30, cat wrote:
> fb5873b779dd5858123c19bbd6959566771e2e83 is the first bad commit
> commit fb5873b779dd5858123c19bbd6959566771e2e83
> Author: Lu Baolu <baolu.lu@linux.intel.com>
> Date:   Tue May 20 15:58:49 2025 +0800
> 
>      iommu/vt-d: Restore context entry setup order for aliased devices
> 
>      commit 320302baed05c6456164652541f23d2a96522c06 upstream.
> 
>      Commit 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
>      changed the context entry setup during domain attachment from a
>      set-and-check policy to a clear-and-reset approach. This inadvertently
>      introduced a regression affecting PCI aliased devices behind PCIe-to-PCI
>      bridges.
> 
>      Specifically, keyboard and touchpad stopped working on several Apple
>      Macbooks with below messages:
> 
>       kernel: platform pxa2xx-spi.3: Adding to iommu group 20
>       kernel: input: Apple SPI Keyboard as
>   /devices/pci0000:00/0000:00:1e.3/pxa2xx-spi.3/spi_master/spi2/spi-APP000D:00/input/input0
>       kernel: DMAR: DRHD: handling fault status reg 3
>       kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
>       0xffffa000 [fault reason 0x06] PTE Read access is not set
>       kernel: DMAR: DRHD: handling fault status reg 3
>       kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
>       0xffffa000 [fault reason 0x06] PTE Read access is not set
>       kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00
>       kernel: DMAR: DRHD: handling fault status reg 3
>       kernel: DMAR: [DMA Read NO_PASID] Request device [00:1e.3] fault addr
>       0xffffa000 [fault reason 0x06] PTE Read access is not set
>       kernel: DMAR: DRHD: handling fault status reg 3
>       kernel: applespi spi-APP000D:00: Error writing to device: 01 0e 00 00
> 
>      Fix this by restoring the previous context setup order.
> 
>      Fixes: 2031c469f816 ("iommu/vt-d: Add support for static identity domain")
>      Closes: https://lore.kernel.org/all/4dada48a-c5dd-4c30-9c85-5b03b0aa01f0@bfh.ch/
>      Cc: stable@vger.kernel.org
>      Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>      Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>      Link: https://lore.kernel.org/r/20250514060523.2862195-1-baolu.lu@linux.intel.com
>      Link: https://lore.kernel.org/r/20250520075849.755012-2-baolu.lu@linux.intel.com
>      Signed-off-by: Joerg Roedel <jroedel@suse.de>
>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
>   drivers/iommu/intel/iommu.c  | 11 +++++++++++
>   drivers/iommu/intel/iommu.h  |  1 +
>   drivers/iommu/intel/nested.c |  4 ++--
>   3 files changed, 14 insertions(+), 2 deletions(-)
> 

This looks like a duplicate of
https://lore.kernel.org/linux-iommu/721D44AF820A4FEB+722679cb-2226-4287-8835-9251ad69a1ac@bbaa.fun/

The fix for that was
https://lore.kernel.org/all/468CF4B655888074+20250723120423.37924-1-bbaa@bbaa.fun/
which is present in 6.12.40, so updating to 6.12.40 will most likely
fix the issue.
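
To check whether a given 6.12.y host falls in the range discussed here, a version comparison is enough. The sketch below is a minimal Python example under two assumptions taken from this thread: the regression first appears in 6.12.35 (the bisect result) and the fix is present in 6.12.40 (stated above as the likely fix, not confirmed by the reporter). The names `FIRST_BAD`, `FIRST_FIXED`, and `in_reported_range` are hypothetical.

```python
import re

FIRST_BAD = (6, 12, 35)    # assumption: first release containing the bisected commit
FIRST_FIXED = (6, 12, 40)  # assumption: release said in this thread to carry the fix

def in_reported_range(release):
    """Return True/False for 6.12.y releases, None for any other series."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = tuple(int(part) for part in m.groups())
    if version[:2] != (6, 12):
        return None  # the thread says nothing about other stable series
    return FIRST_BAD <= version < FIRST_FIXED

if __name__ == "__main__":
    for rel in ("6.12.34", "6.12.35", "6.12.39-1-lts", "6.12.40"):
        print(rel, in_reported_range(rel))
```

On a Linux host, `platform.release()` from the Python standard library returns a release string of this shape and can be passed in directly.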



Thanks,
Harshit


