public inbox for linux-kernel@vger.kernel.org
* [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
@ 2026-04-12 11:25 70sp
  0 siblings, 0 replies; 3+ messages in thread
From: 70sp @ 2026-04-12 11:25 UTC (permalink / raw)
  To: iommu@lists.linux.dev
  Cc: baolu.lu@linux.intel.com, linux-kernel@vger.kernel.org,
	regressions@lists.linux.dev

Hello,

I am seeing a regression when launching a Windows QEMU/KVM virtual machine with a GPU passed through.

When the VM starts, it hangs for about 2 minutes at boot with a white screen, and once Windows is up the NVIDIA device reports code 43.

I’m certain that the issue is not caused by anything in Windows or by related software on the Linux side, because I reinstalled my whole PC, including the Windows VM. I also reproduced the bug on an out-of-the-box Arch Linux install.

The first bad commit is either a98db518dde246e01ead53617dc0a30d6aaa3752 or c376a3456d8bef43ec556a98c0a04c35086c2737. I can’t say for sure which one introduced it, because during bisection I had to skip a98db518dde246e01ead53617dc0a30d6aaa3752: with that commit the virtual machine failed to launch with a different error (it didn’t even start booting). On kernels before these commits, the VM works flawlessly.

I have tested the latest mainline kernel and the issue is still present. I have been experiencing it since kernel 6.13, so I switched to the 6.12 LTS kernel, which doesn’t have this issue.

Configuration of my Linux install and hardware: https://pastebin.com/rcsyyYiK
.config: https://pastebin.com/RTQCBduD
dmesg errors: https://pastebin.com/84jPP81E
lspci: https://pastebin.com/qi29BSWi

#regzbot introduced: a98db518dde246e01ead53617dc0a30d6aaa3752..c376a3456d8bef43ec556a98c0a04c35086c2737

Best Regards,
Šimon Pospíchal


* Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
       [not found] <z7Rkts0EsooZcnCjNfJaK6ursftFU8ubOkH0hcNjzxugEAwGsHLMblOdfoAx-gJzBZ-TMmeUtD7iWBngbFY5UCJCwdSmYqmCVMy-yhbirUk=@protonmail.com>
@ 2026-04-13  6:47 ` Baolu Lu
  2026-04-14  9:22   ` 70sp
  0 siblings, 1 reply; 3+ messages in thread
From: Baolu Lu @ 2026-04-13  6:47 UTC (permalink / raw)
  To: 70sp, iommu@lists.linux.dev
  Cc: linux-kernel@vger.kernel.org, regressions@lists.linux.dev

Before these commits, if a device was attached to a domain that didn't
perfectly match the hardware's capabilities (such as address width or
coherency), the kernel would dynamically adjust the domain to
accommodate the hardware.

Following these two commits, the driver now applies a "match or fail"
policy. If the domain is incompatible with the device's hardware
capabilities, it returns -EINVAL. This expects the caller to allocate a
new domain dedicated to that specific device and attempt the attachment
again.
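
The retry flow described above can be sketched in userspace C. This is only a minimal model, not the driver's code: struct hw_caps, struct domain, domain_compatible() and attach_with_fallback() are hypothetical names, and the two checks (address width, coherency) merely stand in for whatever the real compatibility test compares.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's domain and IOMMU structures. */
struct hw_caps {
	int addr_width;   /* widest address the hardware can translate, in bits */
	int coherent;     /* non-zero if coherent page walks are supported */
};

struct domain {
	int addr_width;
	int coherent;
};

/* "Match or fail": the domain is never adjusted, a mismatch is only reported. */
static int domain_compatible(const struct domain *d, const struct hw_caps *hw)
{
	if (d->addr_width > hw->addr_width)
		return -EINVAL;
	if (d->coherent && !hw->coherent)
		return -EINVAL;
	return 0;
}

/*
 * Caller side: try the shared domain first; on -EINVAL, fill in a domain
 * sized for this device and attach that one instead.
 */
static int attach_with_fallback(const struct domain *shared,
				const struct hw_caps *hw,
				struct domain *dedicated,
				const struct domain **attached)
{
	int ret;

	assert(shared && hw && dedicated && attached);

	ret = domain_compatible(shared, hw);
	if (ret == 0) {
		*attached = shared;
		return 0;
	}
	if (ret != -EINVAL)
		return ret;

	/* Build a domain that matches this device's hardware and retry. */
	dedicated->addr_width = hw->addr_width;
	dedicated->coherent = hw->coherent;
	ret = domain_compatible(dedicated, hw);
	if (ret == 0)
		*attached = dedicated;
	return ret;
}
```

In this model a 48-bit coherent shared domain is rejected for 39-bit non-coherent hardware, and the caller ends up attached to the dedicated domain. If the caller does not handle -EINVAL this way, the attach simply fails.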

Can you please add a message line in paging_domain_compatible() to
verify whether it's a domain compatibility issue?

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 205debd76989..c7e1e0dfa250 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3111,8 +3111,10 @@ int paging_domain_compatible(struct iommu_domain *domain, struct device *dev)
                 ret = paging_domain_compatible_second_stage(dmar_domain, iommu);
         else if (WARN_ON(true))
                 ret = -EINVAL;
-       if (ret)
+       if (ret) {
+               dev_info(dev, "domain is not compatible with device, ret = %d", ret);
                 return ret;
+       }

         if (sm_supported(iommu) && !dev_is_real_dma_subdevice(dev) &&
             context_copied(iommu, info->bus, info->devfn))

Thanks,
baolu


* Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
  2026-04-13  6:47 ` [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2 Baolu Lu
@ 2026-04-14  9:22   ` 70sp
  0 siblings, 0 replies; 3+ messages in thread
From: 70sp @ 2026-04-14  9:22 UTC (permalink / raw)
  To: Baolu Lu
  Cc: iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	regressions@lists.linux.dev

I can confirm that the "domain is not compatible with device" message is nowhere to be seen.

I double-checked by adding an else branch with a different message, and that one showed up several times, each time with ret = 0. It was printed for pci 0000:00:02.0 (iGPU), pcieport 0000:00:01.0, and vfio-pci 0000:01:00.0 and 0000:01:00.1 (GTX 970).
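
For readers following along, the two-branch check can be modelled in userspace C roughly as below. This is a sketch under assumptions: format_compat_result() is a hypothetical helper that formats a line rather than calling the kernel's dev_info(), and the wording of the else-branch message is made up, since the message actually used was not posted.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Userspace stand-in for the instrumented check: formats one line per
 * call instead of logging through the kernel's dev_info().
 */
static int format_compat_result(char *buf, size_t len, const char *dev, int ret)
{
	assert(buf && dev);

	if (ret)
		return snprintf(buf, len,
				"%s: domain is not compatible with device, ret = %d",
				dev, ret);

	/* Hypothetical wording: the else-branch message was not posted. */
	return snprintf(buf, len,
			"%s: domain is compatible with device, ret = %d",
			dev, ret);
}
```

Logging both outcomes shows that the check ran for each device and returned 0, rather than leaving open whether the function was reached at all.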



