* [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
From: 70sp @ 2026-04-12 11:25 UTC
To: iommu@lists.linux.dev
Cc: baolu.lu@linux.intel.com, linux-kernel@vger.kernel.org,
regressions@lists.linux.dev
Hello,
I have been dealing with a regression when launching a Windows QEMU/KVM virtual machine with a GPU passed through.
When I launch the VM, it gets stuck on a white screen for about 2 minutes during boot, and Windows then reports NVIDIA’s Code 43 for the GPU.
I’m certain that the issue is not caused by anything in Windows or by related software in Linux, because I tried reinstalling my whole PC, including the Windows VM. I also reproduced the bug on an out-of-the-box Arch Linux install.
The first bad commit is either a98db518dde246e01ead53617dc0a30d6aaa3752 or c376a3456d8bef43ec556a98c0a04c35086c2737. I don’t know for sure which one introduced it, because during bisection I had to skip a98db518dde246e01ead53617dc0a30d6aaa3752: that kernel could not launch the virtual machine at all and failed with a different error (the guest didn’t even start booting). On kernels before these commits, the VM works flawlessly.
I have tested the latest mainline kernel and the issue is still present. I have been experiencing the issue since kernel 6.13, so I switched to the 6.12 LTS kernel, which doesn’t have this issue.
Configuration of my Linux install and hardware: https://pastebin.com/rcsyyYiK
.config: https://pastebin.com/RTQCBduD
dmesg errors: https://pastebin.com/84jPP81E
lspci: https://pastebin.com/qi29BSWi
#regzbot introduced: a98db518dde246e01ead53617dc0a30d6aaa3752..c376a3456d8bef43ec556a98c0a04c35086c2737
Best Regards,
Šimon Pospíchal
* Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
From: Baolu Lu @ 2026-04-13 6:47 UTC
To: 70sp, iommu@lists.linux.dev
Cc: linux-kernel@vger.kernel.org, regressions@lists.linux.dev
Before these commits, if a device was attached to a domain that didn't
perfectly match the hardware's capabilities (such as address width or
coherency), the kernel would dynamically adjust the domain to
accommodate the hardware.
Following these two commits, the driver now applies a "match or fail"
policy. If the domain is incompatible with the device's hardware
capabilities, it returns -EINVAL. This expects the caller to allocate a
new domain dedicated to that specific device and attempt the attachment
again.
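The "match or fail" flow described above can be sketched as a small userspace model. This is only an illustration, not the driver's code: the `toy_*` types and functions are hypothetical, and the compatibility rule (address width and coherency) is a simplified assumption based on the two examples named above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's structures. */
struct toy_caps {
	int  addr_width;	/* address width in bits */
	bool coherent;		/* page-walk coherency */
};

struct toy_domain { struct toy_caps caps; };

struct toy_device {
	struct toy_caps hw;			/* what the hardware supports */
	const struct toy_domain *attached;	/* NULL while unattached */
};

/* "Match or fail": the domain is never adjusted, only accepted or rejected. */
int toy_attach(struct toy_domain *dom, struct toy_device *dev)
{
	assert(dom && dev);

	/* Assumed rule: the domain must fit within the device's capabilities. */
	if (dom->caps.addr_width > dev->hw.addr_width)
		return -EINVAL;
	if (dom->caps.coherent != dev->hw.coherent)
		return -EINVAL;

	dev->attached = dom;
	return 0;
}

/* Caller side: on -EINVAL, set up a domain for this device and retry. */
int toy_attach_with_fallback(struct toy_domain *shared,
			     struct toy_domain *dedicated,
			     struct toy_device *dev)
{
	int ret = toy_attach(shared, dev);

	if (ret != -EINVAL)
		return ret;

	/* Models "allocate a new domain dedicated to that specific device". */
	dedicated->caps = dev->hw;
	return toy_attach(dedicated, dev);
}
```

In this model, the retry is the caller's responsibility: a first attach that returns -EINVAL leaves the device unattached until the caller supplies a matching domain.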
Can you please add a message line in paging_domain_compatible() to
verify whether it's a domain compatibility issue?
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 205debd76989..c7e1e0dfa250 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3111,8 +3111,10 @@ int paging_domain_compatible(struct iommu_domain *domain, struct device *dev)
 		ret = paging_domain_compatible_second_stage(dmar_domain, iommu);
 	else if (WARN_ON(true))
 		ret = -EINVAL;
-	if (ret)
+	if (ret) {
+		dev_info(dev, "domain is not compatible with device, ret = %d", ret);
 		return ret;
+	}
 
 	if (sm_supported(iommu) && !dev_is_real_dma_subdevice(dev) &&
 	    context_copied(iommu, info->bus, info->devfn))
Thanks,
baolu
* Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
From: 70sp @ 2026-04-14 9:22 UTC
To: Baolu Lu
Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
regressions@lists.linux.dev
I can confirm that the "domain is not compatible with device" message is nowhere to be seen.
I double-checked by also adding an else branch with a different message, and that one showed up several times with ret = 0 (for pci (iGPU) 0000:00:02.0, pcieport 0000:00:01.0, and vfio-pci (GTX 970) 0000:01:00.0 and 0000:01:00.1).
* Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
From: 70sp @ 2026-04-23 9:25 UTC
To: Baolu Lu
Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
regressions@lists.linux.dev
Hello,
sending a friendly reminder about this ongoing regression.
Thank you for your attention.
* Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
From: Baolu Lu @ 2026-04-27 7:15 UTC
To: 70sp
Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
regressions@lists.linux.dev
On 4/14/26 17:22, 70sp wrote:
> I can confirm that the "domain is not compatible with device" message is nowhere to be seen.
>
> I double-checked by also adding an else branch with a different message, and that one showed up several times with ret = 0 (for pci (iGPU) 0000:00:02.0, pcieport 0000:00:01.0, and vfio-pci (GTX 970) 0000:01:00.0 and 0000:01:00.1).
>
Hmm, it seems the domain is compatible with the device hardware and was
attached successfully. Perhaps you can try to check the differences
between these two domain attachments by dumping the root, context, and
PASID table entries and comparing the configurations of the success and
failure cases.
To do this, simply apply the change below with CONFIG_DMAR_DEBUG
enabled:
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4d0e65bc131d..bf303cfcf2ee 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1345,6 +1345,9 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (ret)
 		goto out_block_translation;
 
+	dmar_fault_dump_ptes(iommu, PCI_DEVID(info->bus, info->devfn),
+			     0, IOMMU_NO_PASID);
+
 	return 0;
 
 out_block_translation:
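
For matching the dumped entries to a device, it may help to see how the ID passed in the hunk above is packed. The following is a userspace sketch that mirrors the bit layout of the kernel's PCI_DEVFN() and PCI_DEVID() macros as I understand them; the function names `pci_devfn` and `pci_devid` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack slot (0-31) and function (0-7) into the 8-bit devfn value. */
uint8_t pci_devfn(unsigned int slot, unsigned int func)
{
	assert(slot < 32 && func < 8);
	return (uint8_t)((slot << 3) | func);
}

/* Pack bus and devfn into a 16-bit ID: bus in the high byte, devfn low. */
uint16_t pci_devid(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)(((uint16_t)bus << 8) | devfn);
}
```

Under this packing, the two GTX 970 functions at 0000:01:00.0 and 0000:01:00.1 correspond to 0x0100 and 0x0101, and the iGPU at 0000:00:02.0 corresponds to 0x0010.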
Thanks,
baolu
* Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
From: 70sp @ 2026-04-29 20:30 UTC
To: Baolu Lu
Cc: regressions@lists.linux.dev, iommu@lists.linux.dev,
linux-kernel@vger.kernel.org
So I have tested it on both a working kernel (v6.12) and an affected kernel (mainline).
I managed to get the domain translation structures on the affected kernel, but couldn't get them on the working one since reading the file always ended up in a system crash or a segmentation fault.
I also couldn't get the PASID table entries, because the GPU doesn't support PASID.
Domain translation structures:
GPU: https://pastebin.com/4EY5Wmsj
GPU audio device: https://pastebin.com/fLf7weii
I've also looked through dmesg on both kernels. On the affected kernel, the snippet below was the only output I found that differed at all; the working kernel reported basically nothing.
Referenced snippet of dmesg: https://pastebin.com/5e3jQ2BL