kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
@ 2025-02-01  0:50 Sean Christopherson
  2025-02-01 14:25 ` Dionna Amalie Glaze
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Sean Christopherson @ 2025-02-01  0:50 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Paolo Bonzini
  Cc: linux-kernel, kvm, Sean Christopherson, Dionna Glaze, Peter Gonda,
	Jürgen Groß, Kirill Shutemov, Vitaly Kuznetsov,
	H . Peter Anvin, Binbin Wu, Tom Lendacky

Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
memory from TOLUD => 4GiB, as unconditionally writeback.

TDX in particular has created an impossible situation with MTRRs.  Because
TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
too simple).

Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
as WB, which in turn prevents device drivers from mapping device memory as
WC/UC-.

For the record, I hate this hack.  But it's the safest approach I can come
up with.  E.g. forcing ioremap() to always use WB scares me because it's
possible, however unlikely, that the kernel could try to map non-emulated
memory (that is presented as MMIO to the guest) as WC/UC-, and silently
forcing those mappings to WB could do weird things.

My initial thought was to effectively revert the offending commit and
skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(

Sean Christopherson (2):
  x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
  x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
    WB

 arch/x86/include/asm/mtrr.h        |  5 +++--
 arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
 arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
 3 files changed, 39 insertions(+), 8 deletions(-)


base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad
-- 
2.48.1.362.g079036d154-goog


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-01  0:50 Sean Christopherson
@ 2025-02-01 14:25 ` Dionna Amalie Glaze
  2025-02-03 18:14 ` Edgecombe, Rick P
  2025-07-08 14:24 ` Nikolay Borisov
  2 siblings, 0 replies; 30+ messages in thread
From: Dionna Amalie Glaze @ 2025-02-01 14:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Paolo Bonzini, linux-kernel, kvm, Peter Gonda,
	Jürgen Groß, Kirill Shutemov, Vitaly Kuznetsov,
	H . Peter Anvin, Binbin Wu, Tom Lendacky, Yamahata, Isaku,
	Xiang, Qinglan, Xu, Min M

On Fri, Jan 31, 2025 at 4:50 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
>
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
>
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
>
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
>
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
>

EDK2 has a bug tracker. Maybe this is still fixable on Intel's end.
Adding Qinglan, Isaku, and Min to comment.

> Sean Christopherson (2):
>   x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
>   x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
>     WB
>
>  arch/x86/include/asm/mtrr.h        |  5 +++--
>  arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
>  arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
>  3 files changed, 39 insertions(+), 8 deletions(-)
>
>
> base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad
> --
> 2.48.1.362.g079036d154-goog
>


-- 
-Dionna Glaze, PhD, CISSP, CCSP (she/her)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-01  0:50 Sean Christopherson
  2025-02-01 14:25 ` Dionna Amalie Glaze
@ 2025-02-03 18:14 ` Edgecombe, Rick P
  2025-02-03 20:33   ` Sean Christopherson
  2025-07-08 14:24 ` Nikolay Borisov
  2 siblings, 1 reply; 30+ messages in thread
From: Edgecombe, Rick P @ 2025-02-03 18:14 UTC (permalink / raw)
  To: tglx@linutronix.de, x86@kernel.org, mingo@redhat.com,
	pbonzini@redhat.com, seanjc@google.com, bp@alien8.de,
	dave.hansen@linux.intel.com
  Cc: pgonda@google.com, jgross@suse.com, Wu, Binbin,
	vkuznets@redhat.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, dionnaglaze@google.com,
	kvm@vger.kernel.org, thomas.lendacky@amd.com

On Fri, 2025-01-31 at 16:50 -0800, Sean Christopherson wrote:
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
> 
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
> 
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
> 
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
> 
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(

Oof. The missing context in 8e690b817e38 ("x86/kvm: Override default caching
mode for SEV-SNP and TDX"), is that since it is impossible to virtualize MTRRs
on TDX private memory (in the old way KVM used to do it) and there was no
upstream support yet, there looked like an opportunity to avoid strange "happens
to work" support that normal VMs ended up with. Instead KVM could just not
support TDX KVM MTRRs from the beginning. So part of the thinking was that we
could drop all TDX KVM MTRR hacks. (which I guess turned out to be wrong).

Since there is no upstream KVM TDX support yet, why isn't it an option to still
revert the EDKII commit too? It was a relatively recent change.


To me it seems that the normal KVM MTRR support is not ideal, because it is
still lying about what it is doing. For example, in the past there was an
attempt to use UC to prevent speculative execution accesses to sensitive data.
The KVM MTRR support only happens to work with existing guests, but not all
possible MTRR usages.

Since diverging from the architecture creates loose ends like that, we could
instead define some other way for EDKII to communicate the ranges to the kernel.
Like some simple KVM PV MSRs that are for communication only, and not
overlapping with architecture that expects to cause memory behavior. EDKII and
the kernel could be changed to use them.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-03 18:14 ` Edgecombe, Rick P
@ 2025-02-03 20:33   ` Sean Christopherson
  2025-02-03 23:01     ` Edgecombe, Rick P
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-02-03 20:33 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: tglx@linutronix.de, x86@kernel.org, mingo@redhat.com,
	pbonzini@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	pgonda@google.com, jgross@suse.com, Binbin Wu,
	vkuznets@redhat.com, hpa@zytor.com, linux-kernel@vger.kernel.org,
	kirill.shutemov@linux.intel.com, dionnaglaze@google.com,
	kvm@vger.kernel.org, thomas.lendacky@amd.com

On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> On Fri, 2025-01-31 at 16:50 -0800, Sean Christopherson wrote:
> > Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> > x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> > memory from TOLUD => 4GiB, as unconditionally writeback.
> > 
> > TDX in particular has created an impossible situation with MTRRs.  Because
> > TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> > was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> > too simple).
> > 
> > Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> > make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> > as WB, which in turn prevents device drivers from mapping device memory as
> > WC/UC-.
> > 
> > For the record, I hate this hack.  But it's the safest approach I can come
> > up with.  E.g. forcing ioremap() to always use WB scares me because it's
> > possible, however unlikely, that the kernel could try to map non-emulated
> > memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> > forcing those mappings to WB could do weird things.
> > 
> > My initial thought was to effectively revert the offending commit and
> > skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> > but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
> 
> Oof. The missing context in 8e690b817e38 ("x86/kvm: Override default caching
> mode for SEV-SNP and TDX"), is that since it is impossible to virtualize MTRRs
> on TDX private memory (in the old way KVM used to do it) and there was no
> upstream support yet, there looked like an opportunity to avoid strange "happens
> to work" support that normal VMs ended up with. Instead KVM could just not
> support TDX KVM MTRRs from the beginning. So part of the thinking was that we
> could drop all TDX KVM MTRR hacks. (which I guess turned out to be wrong).
> 
> Since there is no upstream KVM TDX support yet, why isn't it an option to still
> revert the EDKII commit too? It was a relatively recent change.

I'm fine with that route too, but it too is a band-aid.  Relying on the *untrusted*
hypervisor to essentially communicate memory maps is not a winning strategy. 

> To me it seems that the normal KVM MTRR support is not ideal, because it is
> still lying about what it is doing. For example, in the past there was an
> attempt to use UC to prevent speculative execution accesses to sensitive data.
> The KVM MTRR support only happens to work with existing guests, but not all
> possible MTRR usages.
> 
> Since diverging from the architecture creates loose ends like that, we could
> instead define some other way for EDKII to communicate the ranges to the kernel.
> Like some simple KVM PV MSRs that are for communication only, and not

Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
through the hypervisor to communicate information within the guest is asinine,
especially for CoCo VMs.

> overlapping with architecture that expects to cause memory behavior. EDKII and
> the kernel could be changed to use them.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-03 20:33   ` Sean Christopherson
@ 2025-02-03 23:01     ` Edgecombe, Rick P
  2025-02-04  0:27       ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Edgecombe, Rick P @ 2025-02-03 23:01 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com, Wu, Binbin,
	kirill.shutemov@linux.intel.com, mingo@redhat.com,
	pbonzini@redhat.com, tglx@linutronix.de,
	linux-kernel@vger.kernel.org, hpa@zytor.com, vkuznets@redhat.com,
	bp@alien8.de, jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > still
> > revert the EDKII commit too? It was a relatively recent change.
> 
> I'm fine with that route too, but it too is a band-aid.  Relying on the
> *untrusted*
> hypervisor to essentially communicate memory maps is not a winning strategy. 
> 
> > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > still lying about what it is doing. For example, in the past there was an
> > attempt to use UC to prevent speculative execution accesses to sensitive
> > data.
> > The KVM MTRR support only happens to work with existing guests, but not all
> > possible MTRR usages.
> > 
> > Since diverging from the architecture creates loose ends like that, we could
> > instead define some other way for EDKII to communicate the ranges to the
> > kernel.
> > Like some simple KVM PV MSRs that are for communication only, and not
> 
> Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> through the hypervisor to communicate information within the guest is asinine,
> especially for CoCo VMs.

Hmm, right.

So the other options could be:

1. Some TDX module feature to hold the ranges:
 - Con: Not shared with AMD

2. Re-use MTRRs for the communication, revert changes in guest and edk2:
 - Con: Creating more half support, when it's technically not required
 - Con: Still bouncing through the hypervisor
 - Pro: Design and code is clear

3. Create some new architectural definition, like a bit that means "MTRRs don't
actually work:
 - Con: Takes a long time, need to get agreement
 - Con: Still bouncing through the hypervisor
 - Pro: More pure solution

4. Do this series:
 - Pro: Looks ok to me
 - Cons: As explained in the patches, it's a bit hacky.
 - Cons: Could there be more cases than the legacy PCI hole?


I would kind of like to see something like 3, but 2 or 4 seem the only feasible
ones if we want to resolve this soon.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-03 23:01     ` Edgecombe, Rick P
@ 2025-02-04  0:27       ` Sean Christopherson
  2025-02-05  3:51         ` Edgecombe, Rick P
  2025-02-10 15:29         ` Binbin Wu
  0 siblings, 2 replies; 30+ messages in thread
From: Sean Christopherson @ 2025-02-04  0:27 UTC (permalink / raw)
  To: Rick P Edgecombe
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com, Binbin Wu,
	kirill.shutemov@linux.intel.com, mingo@redhat.com,
	pbonzini@redhat.com, tglx@linutronix.de,
	linux-kernel@vger.kernel.org, hpa@zytor.com, vkuznets@redhat.com,
	bp@alien8.de, jgross@suse.com, pgonda@google.com, x86@kernel.org

On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > > still
> > > revert the EDKII commit too? It was a relatively recent change.
> > 
> > I'm fine with that route too, but it too is a band-aid.  Relying on the
> > *untrusted*
> > hypervisor to essentially communicate memory maps is not a winning strategy. 
> > 
> > > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > > still lying about what it is doing. For example, in the past there was an
> > > attempt to use UC to prevent speculative execution accesses to sensitive
> > > data.
> > > The KVM MTRR support only happens to work with existing guests, but not all
> > > possible MTRR usages.
> > > 
> > > Since diverging from the architecture creates loose ends like that, we could
> > > instead define some other way for EDKII to communicate the ranges to the
> > > kernel.
> > > Like some simple KVM PV MSRs that are for communication only, and not
> > 
> > Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> > through the hypervisor to communicate information within the guest is asinine,
> > especially for CoCo VMs.
> 
> Hmm, right.
> 
> So the other options could be:
> 
> 1. Some TDX module feature to hold the ranges:
>  - Con: Not shared with AMD
> 
> 2. Re-use MTRRs for the communication, revert changes in guest and edk2:

Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
changes is necessary regardless of what happens in the kernel.  Or at the least,
somehow communicate to EDK2 users that ingesting those changes is a bad idea
unless the kernel has also been updated.

AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
guest kernel, there's no way to know if deploying those EDK2 changes is safe.
 
[*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf

>  - Con: Creating more half support, when it's technically not required
>  - Con: Still bouncing through the hypervisor

I assume by "Re-use MTRRs for the communication" you also mean updating the guest
to address the "everything is UC!" flaw, otherwise another con is:

   - Con: Doesn't address the performance issue with TDX guests "using" UC
          memory by default (unless there's yet more enabled).

Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
doing MTRR stuff as nonrmal?

>  - Pro: Design and code is clear
>
> 3. Create some new architectural definition, like a bit that means "MTRRs don't
> actually work:
>  - Con: Takes a long time, need to get agreement
>  - Con: Still bouncing through the hypervisor

Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
don't actually affect the memory type when running under KVM.

FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
overwhelming majority of guests.  That's not desirable long term because it
prevents the guest from using WC (via PAT) in situations where doing so is needed
for performance and/or correctness.

>  - Pro: More pure solution

MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
because there's never a valid mapping, i.e. there is no physical memory and thus
no memtype.  In other words, under KVM guests (and possibly other hypervisors),
MTRRs end up being nothing more than a communication channel between guest firmware
and the kernel.

The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
by the untrusted host.  But that's largely a future problem, unless someone has a
clever way to fix the kernel mess.

> 4. Do this series:
>  - Pro: Looks ok to me
>  - Cons: As explained in the patches, it's a bit hacky.
>  - Cons: Could there be more cases than the legacy PCI hole?
> 
> I would kind of like to see something like 3, but 2 or 4 seem the only feasible
> ones if we want to resolve this soon.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-04  0:27       ` Sean Christopherson
@ 2025-02-05  3:51         ` Edgecombe, Rick P
  2025-02-05  7:49           ` Xu, Min M
  2025-02-10 15:29         ` Binbin Wu
  1 sibling, 1 reply; 30+ messages in thread
From: Edgecombe, Rick P @ 2025-02-05  3:51 UTC (permalink / raw)
  To: seanjc@google.com, Xu, Min M
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com, Wu, Binbin,
	linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	hpa@zytor.com, vkuznets@redhat.com, bp@alien8.de, jgross@suse.com,
	x86@kernel.org, pgonda@google.com

+Min, can you comment?

3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return FALSE in
TD-Guest") turned out to be problematic in practice.

Full thread:
https://lore.kernel.org/kvm/20250201005048.657470-1-seanjc@google.com/

On Mon, 2025-02-03 at 16:27 -0800, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
> > On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
> > > > Since there is no upstream KVM TDX support yet, why isn't it an option to
> > > > still
> > > > revert the EDKII commit too? It was a relatively recent change.
> > > 
> > > I'm fine with that route too, but it too is a band-aid.  Relying on the
> > > *untrusted*
> > > hypervisor to essentially communicate memory maps is not a winning strategy. 
> > > 
> > > > To me it seems that the normal KVM MTRR support is not ideal, because it is
> > > > still lying about what it is doing. For example, in the past there was an
> > > > attempt to use UC to prevent speculative execution accesses to sensitive
> > > > data.
> > > > The KVM MTRR support only happens to work with existing guests, but not all
> > > > possible MTRR usages.
> > > > 
> > > > Since diverging from the architecture creates loose ends like that, we could
> > > > instead define some other way for EDKII to communicate the ranges to the
> > > > kernel.
> > > > Like some simple KVM PV MSRs that are for communication only, and not
> > > 
> > > Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
> > > through the hypervisor to communicate information within the guest is asinine,
> > > especially for CoCo VMs.
> > 
> > Hmm, right.
> > 
> > So the other options could be:
> > 
> > 1. Some TDX module feature to hold the ranges:
> >  - Con: Not shared with AMD
> > 
> > 2. Re-use MTRRs for the communication, revert changes in guest and edk2:
> 
> Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
> changes is necessary regardless of what happens in the kernel.  Or at the least,
> somehow communicate to EDK2 users that ingesting those changes is a bad idea
> unless the kernel has also been updated.
> 
> AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
> is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
> will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
> default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
> guest kernel, there's no way to know if deploying those EDK2 changes is safe.
>  
> [*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf
> 

Hmm. Since there is no upstream TDX KVM support, for it's part, I guess KVM
should still get a chance to define a cleaner solution (if there actually was a
cleaner solution). But yea, it would mean only components from after the
solution was settled could be used together for a fully working stack. And
it should probably be called out somehow. Maybe could be in the KVM TDX docs or
something.

Still seems like a thing to avoid if possible.

> >  - Con: Creating more half support, when it's technically not required
> >  - Con: Still bouncing through the hypervisor
> 
> I assume by "Re-use MTRRs for the communication" you also mean updating the guest
> to address the "everything is UC!" flaw, otherwise another con is:
> 
>    - Con: Doesn't address the performance issue with TDX guests "using" UC
>           memory by default (unless there's yet more enabled).

Hmm. This is quite the tangled corner.

> 
> Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
> doing MTRR stuff as nonrmal?

I'll have to get back to you on this one. Kirill probably could give a better
answer, but likely will not be able to follow up on this thread until next week.

> 
> >  - Pro: Design and code is clear
> > 
> > 3. Create some new architectural definition, like a bit that means "MTRRs don't
> > actually work:
> >  - Con: Takes a long time, need to get agreement
> >  - Con: Still bouncing through the hypervisor
> 
> Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
> don't actually affect the memory type when running under KVM.
> 
> FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
> KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
> overwhelming majority of guests.  That's not desirable long term because it
> prevents the guest from using WC (via PAT) in situations where doing so is needed
> for performance and/or correctness.
> 
> >  - Pro: More pure solution
> 
> MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
> it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
> desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
> because there's never a valid mapping, i.e. there is no physical memory and thus
> no memtype.  In other words, under KVM guests (and possibly other hypervisors),
> MTRRs end up being nothing more than a communication channel between guest firmware
> and the kernel.

Yea.

> 
> The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
> by the untrusted host.  But that's largely a future problem, unless someone has a
> clever way to fix the kernel mess.
> 
> 

Yea, I wondered about that too. I imagine the thinking was that since it is only
controlling shared memory, it can be untrusted.

And I guess the solution in this patchset is hypothetically a bit more locked
down in that respect.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-05  3:51         ` Edgecombe, Rick P
@ 2025-02-05  7:49           ` Xu, Min M
  0 siblings, 0 replies; 30+ messages in thread
From: Xu, Min M @ 2025-02-05  7:49 UTC (permalink / raw)
  To: Edgecombe, Rick P, seanjc@google.com, Wu, Binbin
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com,
	linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com,
	mingo@redhat.com, pbonzini@redhat.com, tglx@linutronix.de,
	hpa@zytor.com, vkuznets@redhat.com, bp@alien8.de, jgross@suse.com,
	x86@kernel.org, pgonda@google.com, Xu, Min M, Yao, Jiewen

On Wednesday, February 5, 2025 11:51 AM, Edgecombe, Rick P wrote:
> 
> +Min, can you comment?
> 
> 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return FALSE
> in
> TD-Guest") turned out to be problematic in practice.
> 
As the commit(3a3b12cbda) message mentioned that "For Linux kernel, there is a mechanism called SW defined MTRR introduced  by the patch https://lore.kernel.org/all/20230502120931.20719-4-jgross@suse.com/". 
From the discussion in below thread it seems the patch (SW defined MTR in kernel) has not been introduced into kernel master branch, right?
We need some time to investigate it and will give an update here. 
> Full thread:
> https://lore.kernel.org/kvm/20250201005048.657470-1-seanjc@google.com/
> 

Thanks
Min

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-04  0:27       ` Sean Christopherson
  2025-02-05  3:51         ` Edgecombe, Rick P
@ 2025-02-10 15:29         ` Binbin Wu
  1 sibling, 0 replies; 30+ messages in thread
From: Binbin Wu @ 2025-02-10 15:29 UTC (permalink / raw)
  To: Sean Christopherson, Rick P Edgecombe, Xu, Min M
  Cc: kvm@vger.kernel.org, dave.hansen@linux.intel.com,
	thomas.lendacky@amd.com, dionnaglaze@google.com, Binbin Wu,
	kirill.shutemov@linux.intel.com, mingo@redhat.com,
	pbonzini@redhat.com, tglx@linutronix.de,
	linux-kernel@vger.kernel.org, hpa@zytor.com, vkuznets@redhat.com,
	bp@alien8.de, jgross@suse.com, pgonda@google.com, x86@kernel.org,
	jiewen.yao, Binbin Wu



On 2/4/2025 8:27 AM, Sean Christopherson wrote:
> On Mon, Feb 03, 2025, Rick P Edgecombe wrote:
>> On Mon, 2025-02-03 at 12:33 -0800, Sean Christopherson wrote:
>>>> Since there is no upstream KVM TDX support yet, why isn't it an option to
>>>> still
>>>> revert the EDKII commit too? It was a relatively recent change.
>>> I'm fine with that route too, but it too is a band-aid.  Relying on the
>>> *untrusted*
>>> hypervisor to essentially communicate memory maps is not a winning strategy.
>>>
>>>> To me it seems that the normal KVM MTRR support is not ideal, because it is
>>>> still lying about what it is doing. For example, in the past there was an
>>>> attempt to use UC to prevent speculative execution accesses to sensitive
>>>> data.
>>>> The KVM MTRR support only happens to work with existing guests, but not all
>>>> possible MTRR usages.
>>>>
>>>> Since diverging from the architecture creates loose ends like that, we could
>>>> instead define some other way for EDKII to communicate the ranges to the
>>>> kernel.
>>>> Like some simple KVM PV MSRs that are for communication only, and not
>>> Hard "no" to any PV solution.  This isn't KVM specific, and as above, bouncing
>>> through the hypervisor to communicate information within the guest is asinine,
>>> especially for CoCo VMs.
>> Hmm, right.
>>
>> So the other options could be:
>>
>> 1. Some TDX module feature to hold the ranges:
>>   - Con: Not shared with AMD
>>
>> 2. Re-use MTRRs for the communication, revert changes in guest and edk2:
> Thinking more about how EDK2 is consumed downstream, I think reverting the EDK2
> changes is necessary regardless of what happens in the kernel.
IIUC, 071d2cfab8 ("OvmfPkg/Sec: Skip setup MTRR early in TD-Guest") was added
to avoid changing the setting of MTRR, which will trigger #VE by setting
CR0.CD=1.

And recently, 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always
return FALSE in TD-Guest") was added to avoid touching MTRR MSRs at all,
so that the MTRR MSRs support for TDX guests was dropped as described in
[PATCH 00/18] KVM: TDX: TDX "the rest" part [1].

If we want to revert the two commits, we need to:
1. Make sure that OVMF will not set CR0.CD to 1 for TDX guests, probably
    needs some kind of hack in OVMF.

2. Need to bring back the support of MTRR MSRs in KVM for TDX guests.
    TDX KVM basic support patch set v19 and earlier versions enforce default
    MTRR type as WB and disabled fixed/variable MTRR (by reporting MSR_MTRRcap
    as 0) for TDX guests, which was kind of half support and needed
    "clearcpuid=mtrr" to avoid toggling of CR0.CD.

    If we really want to rely on the OVMF to program the MTRR values, maybe we
    can treat MTRR MSRs for TDX guests as normal VMs, i.e., allow guests to
    read/write the values without any further virtualization.
    Of course, it needs to prompt the guest kernel to skip toggling CR0.CD for
    TDX guests.

[1] https://lore.kernel.org/kvm/20241210004946.3718496-1-binbin.wu@linux.intel.com


>    Or at the least,
> somehow communicate to EDK2 users that ingesting those changes is a bad idea
> unless the kernel has also been updated.
>
> AFAIK, Bring Your Own Firmware[*] isn't widely adopted, which means that the CSP
> is shipping the firmware.  And shipping OVMF/EDK2 with the "ignores MTRRs" code
> will cause problems for guests without commit 8e690b817e38 ("x86/kvm: Override
> default caching mode for SEV-SNP and TDX").  Since the host doesn't control the
> guest kernel, there's no way to know if deploying those EDK2 changes is safe.

Oh, yea.

And if we drop the MTRR MSRs access in KVM for TDX guests, an old guest kernel
without commit 8e690b817e38 ("x86/kvm: Override default caching mode for
SEV-SNP and TDX") will require "clearcpuid=mtrr". :(

>   
> [*] https://kvm-forum.qemu.org/2024/BYOF_-_KVM_Forum_2024_iWTioIP.pdf
>
>>   - Con: Creating more half support, when it's technically not required
>>   - Con: Still bouncing through the hypervisor
> I assume by "Re-use MTRRs for the communication" you also mean updating the guest
> to address the "everything is UC!" flaw, otherwise another con is:
>
>     - Con: Doesn't address the performance issue with TDX guests "using" UC
>            memory by default (unless there's yet more enabled).
>
> Presumably that can be accomplished by simply skipping the CR0.CD toggling, and
> doing MTRR stuff as nonrmal?
>
>>   - Pro: Design and code is clear
>>
>> 3. Create some new architectural definition, like a bit that means "MTRRs don't
>> actually work:
>>   - Con: Takes a long time, need to get agreement
>>   - Con: Still bouncing through the hypervisor
> Not for KVM guests.  As I laid out in my bug report, it's safe to assume MTRRs
> don't actually affect the memory type when running under KVM.
>
> FWIW, PAT doesn't "work" on most KVM Intel setups either, because of misguided
> KVM code that resulted in "Ignore Guest PAT" being set in all EPTEs for the
> overwhelming majority of guests.  That's not desirable long term because it
> prevents the guest from using WC (via PAT) in situations where doing so is needed
> for performance and/or correctness.
>
>>   - Pro: More pure solution
> MTRRs "not working" is a red herring.  The problem isn't that MTRRs don't work,
> it's that the kernel is (somewhat unknowingly) using MTRRs as a crutch to get the
> desired memtype for devices.  E.g. for emulated MMIO, MTRRs _can't_ be virtualized,
> because there's never a valid mapping, i.e. there is no physical memory and thus
> no memtype.  In other words, under KVM guests (and possibly other hypervisors),
> MTRRs end up being nothing more than a communication channel between guest firmware
> and the kernel.
>
> The gap for CoCo VMs is that using MTRRs is undesirable because they are controlled
> by the untrusted host.  But that's largely a future problem, unless someone has a
> clever way to fix the kernel mess.
>
>> 4. Do this series:
>>   - Pro: Looks ok to me
It looks OK to me too.
But as mentioned above, mismatch of OVMF, guest kernel, host kernel
version will cause problem.

>>   - Cons: As explained in the patches, it's a bit hacky.
Skipping toggling CR0.CD in guest kernel seems also a bit hacky.

>>   - Cons: Could there be more cases than the legacy PCI hole?
>>
>> I would kind of like to see something like 3, but 2 or 4 seem the only feasible
>> ones if we want to resolve this soon.


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-02-01  0:50 Sean Christopherson
  2025-02-01 14:25 ` Dionna Amalie Glaze
  2025-02-03 18:14 ` Edgecombe, Rick P
@ 2025-07-08 14:24 ` Nikolay Borisov
  2 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2025-07-08 14:24 UTC (permalink / raw)
  To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Paolo Bonzini
  Cc: linux-kernel, kvm, Dionna Glaze, Peter Gonda,
	Jürgen Groß, Kirill Shutemov, Vitaly Kuznetsov,
	H . Peter Anvin, Binbin Wu, Tom Lendacky



On 2/1/25 02:50, Sean Christopherson wrote:
> Attempt to hack around the SNP/TDX guest MTRR disaster by hijacking
> x86_platform.is_untracked_pat_range() to force the legacy PCI hole, i.e.
> memory from TOLUD => 4GiB, as unconditionally writeback.
> 
> TDX in particular has created an impossible situation with MTRRs.  Because
> TDX disallows toggling CR0.CD, TDX enabling decided the easiest solution
> was to ignore MTRRs entirely (because omitting CR0.CD write is obviously
> too simple).
> 
> Unfortunately, under KVM at least, the kernel subtly relies on MTRRs to
> make ACPI play nice with device drivers.  ACPI tries to map ranges it finds
> as WB, which in turn prevents device drivers from mapping device memory as
> WC/UC-.
> 
> For the record, I hate this hack.  But it's the safest approach I can come
> up with.  E.g. forcing ioremap() to always use WB scares me because it's
> possible, however unlikely, that the kernel could try to map non-emulated
> memory (that is presented as MMIO to the guest) as WC/UC-, and silently
> forcing those mappings to WB could do weird things.
> 
> My initial thought was to effectively revert the offending commit and
> skip the cache disabling/enabling, i.e. the problematic CR0.CD toggling,
> but unfortunately OVMF/EDKII has also added code to skip MTRR setup. :-(
> 
> Sean Christopherson (2):
>    x86/mtrr: Return success vs. "failure" from guest_force_mtrr_state()
>    x86/kvm: Override low memory above TOLUD to WB when MTRRs are forced
>      WB
> 
>   arch/x86/include/asm/mtrr.h        |  5 +++--
>   arch/x86/kernel/cpu/mtrr/generic.c | 11 +++++++----
>   arch/x86/kernel/kvm.c              | 31 ++++++++++++++++++++++++++++--
>   3 files changed, 39 insertions(+), 8 deletions(-)
> 
> 
> base-commit: fd8c09ad0d87783b9b6a27900d66293be45b7bad


This prevents TPM from functioning which in turn breaks attestation on 
TDX enabled guests. So what's the status of it?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
@ 2025-07-09 16:54 Jianxiong Gao
  2025-07-14  9:06 ` Binbin Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Jianxiong Gao @ 2025-07-09 16:54 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: binbin.wu, Borislav Petkov (AMD), Dave Hansen, Dionna Glaze,
	H. Peter Anvin, jgross, Kirill A. Shutemov, kvm, linux-kernel,
	Ingo Molnar, pbonzini, Peter Gonda, Sean Christopherson,
	Thomas Gleixner, Tom Lendacky, Vitaly Kuznetsov, x86

I tested this patch on top of commit 8e690b817e38, however we are
still experiencing the same failure.

-- 
Jianxiong Gao

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-09 16:54 [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Jianxiong Gao
@ 2025-07-14  9:06 ` Binbin Wu
  2025-07-14 11:24   ` Nikolay Borisov
  0 siblings, 1 reply; 30+ messages in thread
From: Binbin Wu @ 2025-07-14  9:06 UTC (permalink / raw)
  To: Jianxiong Gao, Nikolay Borisov, Sean Christopherson
  Cc: Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, Binbin Wu



On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
> I tested this patch on top of commit 8e690b817e38, however we are
> still experiencing the same failure.
>
I didn't reproduce the issue with QEMU.
After some comparison on how QEMU building the ACPI tables for HPET and TPM,

- For HPET, the HPET range is added as Operation Region:
     aml_append(dev,
         aml_operation_region("HPTM", AML_SYSTEM_MEMORY, aml_int(HPET_BASE),
                              HPET_LEN));

- For TPM, the range is added as 32-Bit Fixed Memory Range:
     if (TPM_IS_TIS_ISA(tpm_find())) {
         aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
                    TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
     }

So, in KVM, the code patch of TPM is different from the trace for HPET in the
patch https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@google.com/,
HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.

I tried to hack the code to map the region to WB first in tpm_tis driver to
trigger the error.
diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
index 9aa230a63616..62d303f88041 100644
--- a/drivers/char/tpm/tpm_tis.c
+++ b/drivers/char/tpm/tpm_tis.c
@@ -232,6 +232,7 @@ static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info)
         if (phy == NULL)
                 return -ENOMEM;

+       ioremap_cache(tpm_info->res.start, resource_size(&tpm_info->res));
         phy->iobase = devm_ioremap_resource(dev, &tpm_info->res);
         if (IS_ERR(phy->iobase))
                 return PTR_ERR(phy->iobase);
Then I got the same error
[ 4.606075] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
[ 4.607728] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

And with Sean's patch set, the issue can be resolved.

I guess google's VMM has built different ACPI table for TPM.
But according to my experiment, the issue should be able to be fixed by this
patch set, though I am not sure whether it will be the final solution or not.

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-14  9:06 ` Binbin Wu
@ 2025-07-14 11:24   ` Nikolay Borisov
  2025-07-15  2:53     ` Binbin Wu
  2025-07-23 14:34     ` Sean Christopherson
  0 siblings, 2 replies; 30+ messages in thread
From: Nikolay Borisov @ 2025-07-14 11:24 UTC (permalink / raw)
  To: Binbin Wu, Jianxiong Gao, Sean Christopherson
  Cc: Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe



On 14.07.25 г. 12:06 ч., Binbin Wu wrote:
> 
> 
> On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
>> I tested this patch on top of commit 8e690b817e38, however we are
>> still experiencing the same failure.
>>
> I didn't reproduce the issue with QEMU.
> After some comparison on how QEMU building the ACPI tables for HPET and 
> TPM,
> 
> - For HPET, the HPET range is added as Operation Region:
>      aml_append(dev,
>          aml_operation_region("HPTM", AML_SYSTEM_MEMORY, 
> aml_int(HPET_BASE),
>                               HPET_LEN));
> 
> - For TPM, the range is added as 32-Bit Fixed Memory Range:
>      if (TPM_IS_TIS_ISA(tpm_find())) {
>          aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
>                     TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
>      }
> 
> So, in KVM, the code patch of TPM is different from the trace for HPET 
> in the
> patch 
> https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@google.com/,
> HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.
> 
> I tried to hack the code to map the region to WB first in tpm_tis driver to
> trigger the error.
> diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
> index 9aa230a63616..62d303f88041 100644
> --- a/drivers/char/tpm/tpm_tis.c
> +++ b/drivers/char/tpm/tpm_tis.c
> @@ -232,6 +232,7 @@ static int tpm_tis_init(struct device *dev, struct 
> tpm_info *tpm_info)
>          if (phy == NULL)
>                  return -ENOMEM;
> 
> +       ioremap_cache(tpm_info->res.start, resource_size(&tpm_info->res));
>          phy->iobase = devm_ioremap_resource(dev, &tpm_info->res);
>          if (IS_ERR(phy->iobase))
>                  return PTR_ERR(phy->iobase);
> Then I got the same error
> [ 4.606075] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
> [ 4.607728] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with 
> error -12


The thing is we don't really want to get into the if (pcm != new_pcm) { 
branch, because even if it succeeds there then the mapping will be 
wrong, because we want accesses to the TPM to be uncached since that's 
an iomem region, whereas this error shows that the new_pcm is WB.

Also looking at memtype_reserve in it there is the following piece of code:

if (x86_platform.is_untracked_pat_range(start, end)) {
      7                 if (new_type) 

      6                         *new_type = _PAGE_CACHE_MODE_WB; 

      5                 return 0; 

      4         }


So if is_untracked_pat_range returns true then the cache mode will 
always be WB.


> 
> And with Sean's patch set, the issue can be resolved.
> 
> I guess google's VMM has built different ACPI table for TPM.
> But according to my experiment, the issue should be able to be fixed by 
> this
> patch set, though I am not sure whether it will be the final solution or 
> not.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-14 11:24   ` Nikolay Borisov
@ 2025-07-15  2:53     ` Binbin Wu
  2025-07-16  9:51       ` Binbin Wu
  2025-07-23 14:34     ` Sean Christopherson
  1 sibling, 1 reply; 30+ messages in thread
From: Binbin Wu @ 2025-07-15  2:53 UTC (permalink / raw)
  To: Sean Christopherson, Nikolay Borisov, Jianxiong Gao
  Cc: Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, kas, Xu, Min M, Binbin Wu



On 7/14/2025 7:24 PM, Nikolay Borisov wrote:
>
>
> On 14.07.25 г. 12:06 ч., Binbin Wu wrote:
>>
>>
>> On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
>>> I tested this patch on top of commit 8e690b817e38, however we are
>>> still experiencing the same failure.
>>>
>> I didn't reproduce the issue with QEMU.
>> After some comparison on how QEMU building the ACPI tables for HPET and TPM,
>>
>> - For HPET, the HPET range is added as Operation Region:
>>      aml_append(dev,
>>          aml_operation_region("HPTM", AML_SYSTEM_MEMORY, aml_int(HPET_BASE),
>>                               HPET_LEN));
>>
>> - For TPM, the range is added as 32-Bit Fixed Memory Range:
>>      if (TPM_IS_TIS_ISA(tpm_find())) {
>>          aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
>>                     TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
>>      }
>>
>> So, in KVM, the code patch of TPM is different from the trace for HPET in the
>> patch https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@google.com/,
>> HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.
>>
>> I tried to hack the code to map the region to WB first in tpm_tis driver to
>> trigger the error.
>> diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
>> index 9aa230a63616..62d303f88041 100644
>> --- a/drivers/char/tpm/tpm_tis.c
>> +++ b/drivers/char/tpm/tpm_tis.c
>> @@ -232,6 +232,7 @@ static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info)
>>          if (phy == NULL)
>>                  return -ENOMEM;
>>
>> +       ioremap_cache(tpm_info->res.start, resource_size(&tpm_info->res));
>>          phy->iobase = devm_ioremap_resource(dev, &tpm_info->res);
>>          if (IS_ERR(phy->iobase))
>>                  return PTR_ERR(phy->iobase);
>> Then I got the same error
>> [ 4.606075] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>> [ 4.607728] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
>
>
> The thing is we don't really want to get into the if (pcm != new_pcm) { branch, because even if it succeeds there then the mapping will be wrong, because we want accesses to the TPM to be uncached since that's an iomem region, whereas this error shows that the new_pcm is WB.
>
> Also looking at memtype_reserve in it there is the following piece of code:
>
> if (x86_platform.is_untracked_pat_range(start, end)) {
>      7                 if (new_type)
>      6                         *new_type = _PAGE_CACHE_MODE_WB;
>      5                 return 0;
>      4         }
>
>
> So if is_untracked_pat_range returns true then the cache mode will always be WB.
So there are two different things per my understanding:
1. The patch set https://lore.kernel.org/all/20250201005048.657470-3-seanjc@google.com/
     applied to "Guest" kernel should be able to dismiss the error message.
     In __ioremap_caller(), when branch "pcm != new_pcm" is triggered, since
     is_new_memtype_allowed() returns true for TPM range, the error message
     shouldn NOT be printed.
     Otherwise, it really confuses me.

2. Setting the PAT to WB for iomem region of TPM is not desired the target type.
    Per my understanding, it should be safe to program the "iomem" to WB.
    - If the iomem region is emulated, it's OK since the accesses will be
      trapped.
    - If the iomem region is passed-through, the EPT will use UC and the
      effective memory type will be UC.


Another solution is using MTRR MSRs as the communication channel b/t guest BIOS
and guest kernel, probably need to do the following things:
For guest BIOS (i.e, OVMF), allow it to program the MTRR MSRs
- Revert 071d2cfab8 ("OvmfPkg/Sec: Skip setup MTRR early in TD-Guest")
- Revert 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return
   FALSE in TD-Guest")
- Make sure OVMF doesn't toggle CR0.CD, which will trigger #VE.
For guest linux kernel
- Revert 8e690b817e38 ("x86/kvm: Override default caching mode for SEV-SNP and
   TDX")
- Make sure guest kernel doesn't toggle CR0.CD, which will trigger #VE.

Sean, do you think this solution is the direction to go?

>
>
>>
>> And with Sean's patch set, the issue can be resolved.
>>
>> I guess google's VMM has built different ACPI table for TPM.
>> But according to my experiment, the issue should be able to be fixed by this
>> patch set, though I am not sure whether it will be the final solution or not.
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-15  2:53     ` Binbin Wu
@ 2025-07-16  9:51       ` Binbin Wu
  0 siblings, 0 replies; 30+ messages in thread
From: Binbin Wu @ 2025-07-16  9:51 UTC (permalink / raw)
  To: Sean Christopherson, Nikolay Borisov, Jianxiong Gao
  Cc: Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, kas, Xu, Min M,
	Yao, Jiewen



On 7/15/2025 10:53 AM, Binbin Wu wrote:
>
>
> On 7/14/2025 7:24 PM, Nikolay Borisov wrote:
>>
>>
>> On 14.07.25 г. 12:06 ч., Binbin Wu wrote:
>>>
>>>
>>> On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
>>>> I tested this patch on top of commit 8e690b817e38, however we are
>>>> still experiencing the same failure.
>>>>
>>> I didn't reproduce the issue with QEMU.
>>> After some comparison on how QEMU building the ACPI tables for HPET and TPM,
>>>
>>> - For HPET, the HPET range is added as Operation Region:
>>>      aml_append(dev,
>>>          aml_operation_region("HPTM", AML_SYSTEM_MEMORY, aml_int(HPET_BASE),
>>>                               HPET_LEN));
>>>
>>> - For TPM, the range is added as 32-Bit Fixed Memory Range:
>>>      if (TPM_IS_TIS_ISA(tpm_find())) {
>>>          aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
>>>                     TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
>>>      }
>>>
>>> So, in KVM, the code patch of TPM is different from the trace for HPET in the
>>> patch https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@google.com/,
>>> HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.
>>>
>>> I tried to hack the code to map the region to WB first in tpm_tis driver to
>>> trigger the error.
>>> diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
>>> index 9aa230a63616..62d303f88041 100644
>>> --- a/drivers/char/tpm/tpm_tis.c
>>> +++ b/drivers/char/tpm/tpm_tis.c
>>> @@ -232,6 +232,7 @@ static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info)
>>>          if (phy == NULL)
>>>                  return -ENOMEM;
>>>
>>> +       ioremap_cache(tpm_info->res.start, resource_size(&tpm_info->res));
>>>          phy->iobase = devm_ioremap_resource(dev, &tpm_info->res);
>>>          if (IS_ERR(phy->iobase))
>>>                  return PTR_ERR(phy->iobase);
>>> Then I got the same error
>>> [ 4.606075] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>>> [ 4.607728] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
>>
>>
>> The thing is we don't really want to get into the if (pcm != new_pcm) { branch, because even if it succeeds there then the mapping will be wrong, because we want accesses to the TPM to be uncached since that's an iomem region, whereas this error shows that the new_pcm is WB.
>>
>> Also looking at memtype_reserve in it there is the following piece of code:
>>
>> if (x86_platform.is_untracked_pat_range(start, end)) {
>>      7                 if (new_type)
>>      6                         *new_type = _PAGE_CACHE_MODE_WB;
>>      5                 return 0;
>>      4         }
>>
>>
>> So if is_untracked_pat_range returns true then the cache mode will always be WB.
> So there are two different things per my understanding:
> 1. The patch set https://lore.kernel.org/all/20250201005048.657470-3-seanjc@google.com/
>     applied to "Guest" kernel should be able to dismiss the error message.
>     In __ioremap_caller(), when branch "pcm != new_pcm" is triggered, since
>     is_new_memtype_allowed() returns true for TPM range, the error message
>     shouldn NOT be printed.
>     Otherwise, it really confuses me.
>
> 2. Setting the PAT to WB for iomem region of TPM is not desired the target type.
>    Per my understanding, it should be safe to program the "iomem" to WB.
>    - If the iomem region is emulated, it's OK since the accesses will be
>      trapped.
>    - If the iomem region is passed-through, the EPT will use UC and the
>      effective memory type will be UC.
>
>
> Another solution is using MTRR MSRs as the communication channel b/t guest BIOS
> and guest kernel, probably need to do the following things:
> For guest BIOS (i.e, OVMF), allow it to program the MTRR MSRs
> - Revert 071d2cfab8 ("OvmfPkg/Sec: Skip setup MTRR early in TD-Guest")
> - Revert 3a3b12cbda ("UefiCpuPkg/MtrrLib: MtrrLibIsMtrrSupported always return
>   FALSE in TD-Guest")
> - Make sure OVMF doesn't toggle CR0.CD, which will trigger #VE.
One thing missing is that, for TDX guest, OVMF doesn't program MTRRs for ranges
[640 KB, 1 MB)  and [Uc32Base, 4 GB) to UC, as it does for non-TDX guest.

If MTRRs are intended to be used as a communication channel, this needs to be
added for TDX guests in OVMF.


> For guest linux kernel
> - Revert 8e690b817e38 ("x86/kvm: Override default caching mode for SEV-SNP and
>   TDX")
> - Make sure guest kernel doesn't toggle CR0.CD, which will trigger #VE.
>
> Sean, do you think this solution is the direction to go?
>
>>
>>
>>>
>>> And with Sean's patch set, the issue can be resolved.
>>>
>>> I guess google's VMM has built different ACPI table for TPM.
>>> But according to my experiment, the issue should be able to be fixed by this
>>> patch set, though I am not sure whether it will be the final solution or not.
>>
>
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-14 11:24   ` Nikolay Borisov
  2025-07-15  2:53     ` Binbin Wu
@ 2025-07-23 14:34     ` Sean Christopherson
  2025-07-24  3:16       ` Binbin Wu
  1 sibling, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-07-23 14:34 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: Binbin Wu, Jianxiong Gao, Borislav Petkov (AMD), Dave Hansen,
	Dionna Glaze, H. Peter Anvin, jgross, Kirill A. Shutemov, kvm,
	linux-kernel, Ingo Molnar, pbonzini, Peter Gonda, Thomas Gleixner,
	Tom Lendacky, Vitaly Kuznetsov, x86, Rick Edgecombe

On Mon, Jul 14, 2025, Nikolay Borisov wrote:
> On 14.07.25 г. 12:06 ч., Binbin Wu wrote:
> > On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
> > > I tested this patch on top of commit 8e690b817e38, however we are
> > > still experiencing the same failure.
> > > 
> > I didn't reproduce the issue with QEMU.
> > After some comparison on how QEMU building the ACPI tables for HPET and
> > TPM,
> > 
> > - For HPET, the HPET range is added as Operation Region:
> >      aml_append(dev,
> >          aml_operation_region("HPTM", AML_SYSTEM_MEMORY,
> > aml_int(HPET_BASE),
> >                               HPET_LEN));
> > 
> > - For TPM, the range is added as 32-Bit Fixed Memory Range:
> >      if (TPM_IS_TIS_ISA(tpm_find())) {
> >          aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
> >                     TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
> >      }
> > 
> > So, in KVM, the code patch of TPM is different from the trace for HPET in
> > the patch https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@google.com/,
> > HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.

Argh, I was looking at the wrong TPM resource when poking through QEMU.  I peeked
at TPM_PPI_ADDR_BASE, which gets an AML_SYSTEM_MEMORY entry, not TPM_TIS_ADDR_BASE.

*sigh*

Note, the HPET is also enumerated as a fixed resource:

    crs = aml_resource_template();
    aml_append(crs, aml_memory32_fixed(HPET_BASE, HPET_LEN, AML_READ_ONLY));
    aml_append(dev, aml_name_decl("_CRS", crs));

If I comment out the AML_SYSTEM_MEMORY entry for HPET, the kernel's auto-mapping
does NOT kick in (the kernel complains about required resources being missing,
but that's expected).  So I'm pretty sure it's the _lack_ of an AML_SYSTEM_MEMORY
entry for TPM TIS in QEMU's ACPI tables that make everything happy

I can't for the life of me suss out exactly what Google's ACPI tables will look
like.  I'll follow-up internally to try and get an answer on that front.

In the meantime, can someone who has reproduced the real issue get backtraces to
confirm or disprove that acpi_os_map_iomem() is trying to map the TPM TIS range
as WB?  E.g. with something like so:

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 2e7923844afe..6c3c40909ef9 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -528,6 +528,9 @@ int memtype_reserve(u64 start, u64 end, enum page_cache_mode req_type,
 
        start = sanitize_phys(start);
 
+       WARN(start == 0xFED40000,
+            "Mapping TPM TIS with req_type = %u\n", req_type);
+
        /*
         * The end address passed into this function is exclusive, but
         * sanitize_phys() expects an inclusive address.
---

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-23 14:34     ` Sean Christopherson
@ 2025-07-24  3:16       ` Binbin Wu
  2025-07-28 15:33         ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Binbin Wu @ 2025-07-24  3:16 UTC (permalink / raw)
  To: Sean Christopherson, Nikolay Borisov
  Cc: Jianxiong Gao, Borislav Petkov (AMD), Dave Hansen, Dionna Glaze,
	H. Peter Anvin, jgross, Kirill A. Shutemov, kvm, linux-kernel,
	Ingo Molnar, pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, Binbin Wu



On 7/23/2025 10:34 PM, Sean Christopherson wrote:
> On Mon, Jul 14, 2025, Nikolay Borisov wrote:
>> On 14.07.25 г. 12:06 ч., Binbin Wu wrote:
>>> On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
>>>> I tested this patch on top of commit 8e690b817e38, however we are
>>>> still experiencing the same failure.
>>>>
>>> I didn't reproduce the issue with QEMU.
>>> After some comparison on how QEMU building the ACPI tables for HPET and
>>> TPM,
>>>
>>> - For HPET, the HPET range is added as Operation Region:
>>>       aml_append(dev,
>>>           aml_operation_region("HPTM", AML_SYSTEM_MEMORY,
>>> aml_int(HPET_BASE),
>>>                                HPET_LEN));
>>>
>>> - For TPM, the range is added as 32-Bit Fixed Memory Range:
>>>       if (TPM_IS_TIS_ISA(tpm_find())) {
>>>           aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
>>>                      TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
>>>       }
>>>
>>> So, in KVM, the code patch of TPM is different from the trace for HPET in
>>> the patch https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@google.com/,
>>> HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.
> Argh, I was looking at the wrong TPM resource when poking through QEMU.  I peeked
> at TPM_PPI_ADDR_BASE, which gets an AML_SYSTEM_MEMORY entry, not TPM_TIS_ADDR_BASE.
>
> *sigh*
>
> Note, the HPET is also enumerated as a fixed resource:
IIUC, the key differences of aml_memory32_fixed and aml_operation_region:
- aml_memory32_fixed is a resource descriptor that tells the OS about a device's
   address range for configuration or driver purposes.
   It is not meant for ACPI code runtime access, it simply reports the memory
   region to the OS via ACPI resource methods like _CRS.
- aml_operation_region actually defines how AML code can access hardware or
   regions at runtime. Fields, which can be declared within an operation region,
   are then physically mapped to addresses that the AML interpreter will read or
   write during method execution.

For HPET, the method _STA will access the "VEND" and "PRD" fields within the
region. So when probe HPET in acpi_init(), method _STA will be called, and it
needs to map the region before accessing, which triggers ioremap_cache().
(See the call chain below)

>
>      crs = aml_resource_template();
>      aml_append(crs, aml_memory32_fixed(HPET_BASE, HPET_LEN, AML_READ_ONLY));
>      aml_append(dev, aml_name_decl("_CRS", crs));
>
> If I comment out the AML_SYSTEM_MEMORY entry for HPET, the kernel's auto-mapping
> does NOT kick in (the kernel complains about required resources being missing,
> but that's expected).  So I'm pretty sure it's the _lack_ of an AML_SYSTEM_MEMORY
> entry for TPM TIS in QEMU's ACPI tables that make everything happy

Only add an AML_SYSTEM_MEMORY entry as an operation region is not enough, it
also needs an ACPI method to access a field within the region during the ACPI
device probe.

For HPET, I checked the call chain is as following:
acpi_init
   ...
     acpi_bus_get_status
       acpi_bus_get_status_handle
         acpi_evaluate_integer
           acpi_evaluate_object
             acpi_ns_evaluate
               [ACPI_TYPE_METHOD] --> *acpi_ps_execute_method*
                 acpi_ps_parse_aml
                   acpi_ps_parse_loop
                     walk_state->ascending_callback -> acpi_ds_exec_end_op
                       acpi_ds_evaluate_name_path
                         acpi_ex_resolve_to_value
                           acpi_ex_resolve_node_to_value
                             acpi_ex_read_data_from_field
                               acpi_ex_extract_from_field
                                 acpi_ex_field_datum_io
                                   *acpi_ex_access_region*
                                     acpi_ev_address_space_dispatch
acpi_ex_system_memory_space_handler
                                         acpi_os_map_memory
                                           acpi_os_map_iomem
                                             ioremap_cache

>
> I can't for the life of me suss out exactly what Google's ACPI tables will look
> like.  I'll follow-up internally to try and get an answer on that front.

I guess google has defined a ACPI method to access the region for TPM TIS during
ACPI device probe.

>
> In the meantime, can someone who has reproduced the real issue get backtraces to
> confirm or disprove that acpi_os_map_iomem() is trying to map the TPM TIS range
> as WB?  E.g. with something like so:

I tried to add an AML_SYSTEM_MEMORY entry as operation region in the ACPI
table and modify the _STA method to access the region for TPM TIS in QEMU, then
the issue can be reproduced.

diff --git a/hw/tpm/tpm_tis_isa.c b/hw/tpm/tpm_tis_isa.c
index 876cb02cb5..aca2b2993f 100644
--- a/hw/tpm/tpm_tis_isa.c
+++ b/hw/tpm/tpm_tis_isa.c
@@ -143,6 +143,9 @@ static void build_tpm_tis_isa_aml(AcpiDevAmlIf *adev, Aml *scope)
      Aml *dev, *crs;
      TPMStateISA *isadev = TPM_TIS_ISA(adev);
      TPMIf *ti = TPM_IF(isadev);
+    Aml *field;
+    Aml *method;
+    Aml *test = aml_local(0);

      dev = aml_device("TPM");
      if (tpm_tis_isa_get_tpm_version(ti) == TPM_VERSION_2_0) {
@@ -152,7 +155,19 @@ static void build_tpm_tis_isa_aml(AcpiDevAmlIf *adev, Aml *scope)
          aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0C31")));
      }
      aml_append(dev, aml_name_decl("_UID", aml_int(1)));
-    aml_append(dev, aml_name_decl("_STA", aml_int(0xF)));
+
+    aml_append(dev, aml_operation_region("TPMM", AML_SYSTEM_MEMORY, aml_int(TPM_TIS_ADDR_BASE),
+                         TPM_TIS_ADDR_SIZE));
+
+    field = aml_field("TPMM", AML_DWORD_ACC, AML_LOCK, AML_PRESERVE);
+    aml_append(field, aml_named_field("TEST", 32));
+    aml_append(dev, field);
+
+    method = aml_method("_STA", 0, AML_NOTSERIALIZED);
+    aml_append(method, aml_store(aml_name("TEST"), test));
+    aml_append(method, aml_return(aml_int(0xF)));
+    aml_append(dev, method);

>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 2e7923844afe..6c3c40909ef9 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -528,6 +528,9 @@ int memtype_reserve(u64 start, u64 end, enum page_cache_mode req_type,
>   
>          start = sanitize_phys(start);
>   
> +       WARN(start == 0xFED40000,
> +            "Mapping TPM TIS with req_type = %u\n", req_type);
> +
>          /*
>           * The end address passed into this function is exclusive, but
>           * sanitize_phys() expects an inclusive address.
> ---


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-24  3:16       ` Binbin Wu
@ 2025-07-28 15:33         ` Sean Christopherson
  2025-07-30  7:34           ` Binbin Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-07-28 15:33 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Nikolay Borisov, Jianxiong Gao, Borislav Petkov (AMD),
	Dave Hansen, Dionna Glaze, H. Peter Anvin, jgross,
	Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar, pbonzini,
	Peter Gonda, Thomas Gleixner, Tom Lendacky, Vitaly Kuznetsov, x86,
	Rick Edgecombe, jiewen.yao

+Jiewen

Summary, with the questions at the end.

Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
the TPM driver's ioremap (with UC) failing because the kernel has already mapped
the range using a cachaeable mapping (WB).

 ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
 tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
actually honors the memtypes programmed into the virtual MTRRs.

It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
region (and potentially other regions) to be UC, so that the kernel ACPI driver's
attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
causes the ioremap infrastructure to reject the TPM driver's UC mapping.

IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]

      //
      // Operational region for TPM access
      //
      OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)

generates the problematic SystemMemory entry that triggers the ACPI driver's
auto-mapping logic.

QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
Memory32Fixed entry.

Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?

[1] https://github.com/tianocore/edk2/blob/master/SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl#L53
[2] https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03397.html

On Thu, Jul 24, 2025, Binbin Wu wrote:
> On 7/23/2025 10:34 PM, Sean Christopherson wrote:
> > On Mon, Jul 14, 2025, Nikolay Borisov wrote:
> > > On 14.07.25 г. 12:06 ч., Binbin Wu wrote:
> > > > On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
> > > > > I tested this patch on top of commit 8e690b817e38, however we are
> > > > > still experiencing the same failure.
> > > > > 
> > > > I didn't reproduce the issue with QEMU.
> > > > After some comparison on how QEMU building the ACPI tables for HPET and
> > > > TPM,
> > > > 
> > > > - For HPET, the HPET range is added as Operation Region:
> > > >       aml_append(dev,
> > > >           aml_operation_region("HPTM", AML_SYSTEM_MEMORY,
> > > > aml_int(HPET_BASE),
> > > >                                HPET_LEN));
> > > > 
> > > > - For TPM, the range is added as 32-Bit Fixed Memory Range:
> > > >       if (TPM_IS_TIS_ISA(tpm_find())) {
> > > >           aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
> > > >                      TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
> > > >       }
> > > > 
> > > > So, in KVM, the code patch of TPM is different from the trace for HPET in
> > > > the patch https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@google.com/,
> > > > HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.
> > Argh, I was looking at the wrong TPM resource when poking through QEMU.  I peeked
> > at TPM_PPI_ADDR_BASE, which gets an AML_SYSTEM_MEMORY entry, not TPM_TIS_ADDR_BASE.

...

> I guess google has defined a ACPI method to access the region for TPM TIS during
> ACPI device probe.
> 
> > 
> > In the meantime, can someone who has reproduced the real issue get backtraces to
> > confirm or disprove that acpi_os_map_iomem() is trying to map the TPM TIS range
> > as WB?  E.g. with something like so:

Got confirmation off-list that Google's ACPI tables due trigger the kernel's
cachable mapping logic for SYSTEM_MEMORY.

 Mapping TPM TIS with req_type = 0
 WARNING: CPU: 22 PID: 1 at arch/x86/mm/pat/memtype.c:530 memtype_reserve+0x2ab/0x460
 Modules linked in:
 CPU: 22 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.16.0-rc7+ #2 VOLUNTARY 
 Tainted: [W]=WARN
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/29/2025
 RIP: 0010:memtype_reserve+0x2ab/0x460
  __ioremap_caller+0x16d/0x3d0
  ioremap_cache+0x17/0x30
  x86_acpi_os_ioremap+0xe/0x20
  acpi_os_map_iomem+0x1f3/0x240
  acpi_os_map_memory+0xe/0x20
  acpi_ex_system_memory_space_handler+0x273/0x440
  acpi_ev_address_space_dispatch+0x176/0x4c0
  acpi_ex_access_region+0x2ad/0x530
  acpi_ex_field_datum_io+0xa2/0x4f0
  acpi_ex_extract_from_field+0x296/0x3e0
  acpi_ex_read_data_from_field+0xd1/0x460
  acpi_ex_resolve_node_to_value+0x2ee/0x530
  acpi_ex_resolve_to_value+0x1f2/0x540
  acpi_ds_evaluate_name_path+0x11b/0x190
  acpi_ds_exec_end_op+0x456/0x960
  acpi_ps_parse_loop+0x27a/0xa50
  acpi_ps_parse_aml+0x226/0x600
  acpi_ps_execute_method+0x172/0x3e0
  acpi_ns_evaluate+0x175/0x5f0
  acpi_evaluate_object+0x213/0x490
  acpi_evaluate_integer+0x6d/0x140
  acpi_bus_get_status+0x93/0x150
  acpi_add_single_object+0x43a/0x7c0
  acpi_bus_check_add+0x149/0x3a0
  acpi_bus_check_add_1+0x16/0x30
  acpi_ns_walk_namespace+0x22c/0x360
  acpi_walk_namespace+0x15c/0x170
  acpi_bus_scan+0x1dd/0x200
  acpi_scan_init+0xe5/0x2b0
  acpi_init+0x264/0x5b0
  do_one_initcall+0x5a/0x310
  kernel_init_freeable+0x34f/0x4f0
  kernel_init+0x1b/0x200
  ret_from_fork+0x186/0x1b0
  ret_from_fork_asm+0x1a/0x30
  </TASK>

> I tried to add an AML_SYSTEM_MEMORY entry as operation region in the ACPI
> table and modify the _STA method to access the region for TPM TIS in QEMU, then
> the issue can be reproduced.
> 
> diff --git a/hw/tpm/tpm_tis_isa.c b/hw/tpm/tpm_tis_isa.c
> index 876cb02cb5..aca2b2993f 100644
> --- a/hw/tpm/tpm_tis_isa.c
> +++ b/hw/tpm/tpm_tis_isa.c
> @@ -143,6 +143,9 @@ static void build_tpm_tis_isa_aml(AcpiDevAmlIf *adev, Aml *scope)
>      Aml *dev, *crs;
>      TPMStateISA *isadev = TPM_TIS_ISA(adev);
>      TPMIf *ti = TPM_IF(isadev);
> +    Aml *field;
> +    Aml *method;
> +    Aml *test = aml_local(0);
> 
>      dev = aml_device("TPM");
>      if (tpm_tis_isa_get_tpm_version(ti) == TPM_VERSION_2_0) {
> @@ -152,7 +155,19 @@ static void build_tpm_tis_isa_aml(AcpiDevAmlIf *adev, Aml *scope)
>          aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0C31")));
>      }
>      aml_append(dev, aml_name_decl("_UID", aml_int(1)));
> -    aml_append(dev, aml_name_decl("_STA", aml_int(0xF)));
> +
> +    aml_append(dev, aml_operation_region("TPMM", AML_SYSTEM_MEMORY, aml_int(TPM_TIS_ADDR_BASE),
> +                         TPM_TIS_ADDR_SIZE));
> +
> +    field = aml_field("TPMM", AML_DWORD_ACC, AML_LOCK, AML_PRESERVE);
> +    aml_append(field, aml_named_field("TEST", 32));
> +    aml_append(dev, field);
> +
> +    method = aml_method("_STA", 0, AML_NOTSERIALIZED);
> +    aml_append(method, aml_store(aml_name("TEST"), test));
> +    aml_append(method, aml_return(aml_int(0xF)));
> +    aml_append(dev, method);



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-28 15:33         ` Sean Christopherson
@ 2025-07-30  7:34           ` Binbin Wu
  2025-08-15 23:55             ` Korakit Seemakhupt
  2025-08-20  3:07             ` Vishal Annapurve
  0 siblings, 2 replies; 30+ messages in thread
From: Binbin Wu @ 2025-07-30  7:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Nikolay Borisov, Jianxiong Gao, Borislav Petkov (AMD),
	Dave Hansen, Dionna Glaze, H. Peter Anvin, jgross,
	Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar, pbonzini,
	Peter Gonda, Thomas Gleixner, Tom Lendacky, Vitaly Kuznetsov, x86,
	Rick Edgecombe, jiewen.yao



On 7/28/2025 11:33 PM, Sean Christopherson wrote:
> +Jiewen

Jiewen is out of the office until August 4th.

>
> Summary, with the questions at the end.
>
> Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
> the TPM driver's ioremap (with UC) failing because the kernel has already mapped
> the range using a cachaeable mapping (WB).
>
>   ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>   tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
>
> The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
> SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
> to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
> thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
> actually honors the memtypes programmed into the virtual MTRRs.
>
> It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
> region (and potentially other regions) to be UC, so that the kernel ACPI driver's
> attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
> forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
> causes the ioremap infrastructure to reject the TPM driver's UC mapping.
>
> IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
> EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]
>
>        //
>        // Operational region for TPM access
>        //
>        OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)
>
> generates the problematic SystemMemory entry that triggers the ACPI driver's
> auto-mapping logic.
>
> QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
> EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
> Memory32Fixed entry.
>
> Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
> ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
According to the ACPI spec 6.6, an operation region of SystemMemory has no
interface to specify the cacheable attribute.

One solution could be using MTRRs to communicate the memory attribute of legacy
PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
that "long-term, firmware should not be using MTRRs to communicate anything to
the kernel." So this solution is not preferred.

If not MTRRs, there should be an alternative way to do the job.
1. ACPI table
    According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
    Range Descriptor can specify the cacheable attribute.
    "Address Space Resource Descriptors" could be used to describe a memory range
    and the they can specify the cacheable attribute via "Type Specific Flags".
    One of the Address Space Resource Descriptors could be added to the ACPI
    table as a hint when the kernel do the mapping for operation region.
    (There is "System Physical Address (SPA) Range Structure", which also can
    specify the cacheable attribute. But it's should be used for NVDIMMs.)
2. EFI memory map descriptor
    EFI memory descriptor can specify the cacheable attribute. Firmware can add
    a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
    the mapping for operation region.

Operation region of SystemMemory is still needed if a "Control Method" of APCI
needs to access a field, e.g., the method _STA. Checking another descriptor for
cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
map descriptor" during the ACPI code doing the mapping for operation region
makes the code complicated.

Another thing is if long-term firmware should not be using MTRRs to to
communicate anything to the kernel. It seems it's safer to use ioremap() instead
of ioremap_cache() for MMIO resource when the kernel do the mapping for the
operation region access?

>
> [1] https://github.com/tianocore/edk2/blob/master/SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl#L53
> [2] https://lists.gnu.org/archive/html/qemu-devel/2018-02/msg03397.html
>
> On Thu, Jul 24, 2025, Binbin Wu wrote:
>> On 7/23/2025 10:34 PM, Sean Christopherson wrote:
>>> On Mon, Jul 14, 2025, Nikolay Borisov wrote:
>>>> On 14.07.25 г. 12:06 ч., Binbin Wu wrote:
>>>>> On 7/10/2025 12:54 AM, Jianxiong Gao wrote:
>>>>>> I tested this patch on top of commit 8e690b817e38, however we are
>>>>>> still experiencing the same failure.
>>>>>>
>>>>> I didn't reproduce the issue with QEMU.
>>>>> After some comparison on how QEMU building the ACPI tables for HPET and
>>>>> TPM,
>>>>>
>>>>> - For HPET, the HPET range is added as Operation Region:
>>>>>        aml_append(dev,
>>>>>            aml_operation_region("HPTM", AML_SYSTEM_MEMORY,
>>>>> aml_int(HPET_BASE),
>>>>>                                 HPET_LEN));
>>>>>
>>>>> - For TPM, the range is added as 32-Bit Fixed Memory Range:
>>>>>        if (TPM_IS_TIS_ISA(tpm_find())) {
>>>>>            aml_append(crs, aml_memory32_fixed(TPM_TIS_ADDR_BASE,
>>>>>                       TPM_TIS_ADDR_SIZE, AML_READ_WRITE));
>>>>>        }
>>>>>
>>>>> So, in KVM, the code patch of TPM is different from the trace for HPET in
>>>>> the patch https://lore.kernel.org/kvm/20250201005048.657470-3-seanjc@google.com/,
>>>>> HPET will trigger the code path acpi_os_map_iomem(), but TPM doesn't.
>>> Argh, I was looking at the wrong TPM resource when poking through QEMU.  I peeked
>>> at TPM_PPI_ADDR_BASE, which gets an AML_SYSTEM_MEMORY entry, not TPM_TIS_ADDR_BASE.
> ...
>
>> I guess google has defined a ACPI method to access the region for TPM TIS during
>> ACPI device probe.
>>
>>> In the meantime, can someone who has reproduced the real issue get backtraces to
>>> confirm or disprove that acpi_os_map_iomem() is trying to map the TPM TIS range
>>> as WB?  E.g. with something like so:
> Got confirmation off-list that Google's ACPI tables due trigger the kernel's
> cachable mapping logic for SYSTEM_MEMORY.
>
>   Mapping TPM TIS with req_type = 0
>   WARNING: CPU: 22 PID: 1 at arch/x86/mm/pat/memtype.c:530 memtype_reserve+0x2ab/0x460
>   Modules linked in:
>   CPU: 22 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.16.0-rc7+ #2 VOLUNTARY
>   Tainted: [W]=WARN
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/29/2025
>   RIP: 0010:memtype_reserve+0x2ab/0x460
>    __ioremap_caller+0x16d/0x3d0
>    ioremap_cache+0x17/0x30
>    x86_acpi_os_ioremap+0xe/0x20
>    acpi_os_map_iomem+0x1f3/0x240
>    acpi_os_map_memory+0xe/0x20
>    acpi_ex_system_memory_space_handler+0x273/0x440
>    acpi_ev_address_space_dispatch+0x176/0x4c0
>    acpi_ex_access_region+0x2ad/0x530
>    acpi_ex_field_datum_io+0xa2/0x4f0
>    acpi_ex_extract_from_field+0x296/0x3e0
>    acpi_ex_read_data_from_field+0xd1/0x460
>    acpi_ex_resolve_node_to_value+0x2ee/0x530
>    acpi_ex_resolve_to_value+0x1f2/0x540
>    acpi_ds_evaluate_name_path+0x11b/0x190
>    acpi_ds_exec_end_op+0x456/0x960
>    acpi_ps_parse_loop+0x27a/0xa50
>    acpi_ps_parse_aml+0x226/0x600
>    acpi_ps_execute_method+0x172/0x3e0
>    acpi_ns_evaluate+0x175/0x5f0
>    acpi_evaluate_object+0x213/0x490
>    acpi_evaluate_integer+0x6d/0x140
>    acpi_bus_get_status+0x93/0x150
>    acpi_add_single_object+0x43a/0x7c0
>    acpi_bus_check_add+0x149/0x3a0
>    acpi_bus_check_add_1+0x16/0x30
>    acpi_ns_walk_namespace+0x22c/0x360
>    acpi_walk_namespace+0x15c/0x170
>    acpi_bus_scan+0x1dd/0x200
>    acpi_scan_init+0xe5/0x2b0
>    acpi_init+0x264/0x5b0
>    do_one_initcall+0x5a/0x310
>    kernel_init_freeable+0x34f/0x4f0
>    kernel_init+0x1b/0x200
>    ret_from_fork+0x186/0x1b0
>    ret_from_fork_asm+0x1a/0x30
>    </TASK>
>
>> I tried to add an AML_SYSTEM_MEMORY entry as operation region in the ACPI
>> table and modify the _STA method to access the region for TPM TIS in QEMU, then
>> the issue can be reproduced.
>>
>> diff --git a/hw/tpm/tpm_tis_isa.c b/hw/tpm/tpm_tis_isa.c
>> index 876cb02cb5..aca2b2993f 100644
>> --- a/hw/tpm/tpm_tis_isa.c
>> +++ b/hw/tpm/tpm_tis_isa.c
>> @@ -143,6 +143,9 @@ static void build_tpm_tis_isa_aml(AcpiDevAmlIf *adev, Aml *scope)
>>       Aml *dev, *crs;
>>       TPMStateISA *isadev = TPM_TIS_ISA(adev);
>>       TPMIf *ti = TPM_IF(isadev);
>> +    Aml *field;
>> +    Aml *method;
>> +    Aml *test = aml_local(0);
>>
>>       dev = aml_device("TPM");
>>       if (tpm_tis_isa_get_tpm_version(ti) == TPM_VERSION_2_0) {
>> @@ -152,7 +155,19 @@ static void build_tpm_tis_isa_aml(AcpiDevAmlIf *adev, Aml *scope)
>>           aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0C31")));
>>       }
>>       aml_append(dev, aml_name_decl("_UID", aml_int(1)));
>> -    aml_append(dev, aml_name_decl("_STA", aml_int(0xF)));
>> +
>> +    aml_append(dev, aml_operation_region("TPMM", AML_SYSTEM_MEMORY, aml_int(TPM_TIS_ADDR_BASE),
>> +                         TPM_TIS_ADDR_SIZE));
>> +
>> +    field = aml_field("TPMM", AML_DWORD_ACC, AML_LOCK, AML_PRESERVE);
>> +    aml_append(field, aml_named_field("TEST", 32));
>> +    aml_append(dev, field);
>> +
>> +    method = aml_method("_STA", 0, AML_NOTSERIALIZED);
>> +    aml_append(method, aml_store(aml_name("TEST"), test));
>> +    aml_append(method, aml_return(aml_int(0xF)));
>> +    aml_append(dev, method);
>
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-30  7:34           ` Binbin Wu
@ 2025-08-15 23:55             ` Korakit Seemakhupt
  2025-08-18 11:07               ` Binbin Wu
  2025-08-20  3:07             ` Vishal Annapurve
  1 sibling, 1 reply; 30+ messages in thread
From: Korakit Seemakhupt @ 2025-08-15 23:55 UTC (permalink / raw)
  To: binbin.wu
  Cc: bp, dave.hansen, dionnaglaze, hpa, jgross, jiewen.yao, jxgao,
	kirill.shutemov, kvm, linux-kernel, mingo, nik.borisov, pbonzini,
	pgonda, rick.p.edgecombe, seanjc, tglx, thomas.lendacky, vkuznets,
	x86, vannapurve

>> On 5/28/2025 11:33 PM, Sean Christopherson wrote:
>>
>> Summary, with the questions at the end.
>>
>> Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
>> the TPM driver's ioremap (with UC) failing because the kernel has already mapped
>> the range using a cachaeable mapping (WB).
>>
>>   ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>>   tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
>>
>> The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
>> SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
>> to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
>> thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
>> actually honors the memtypes programmed into the virtual MTRRs.
>>
>> It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
>> region (and potentially other regions) to be UC, so that the kernel ACPI driver's
>> attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
>> forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
>> causes the ioremap infrastructure to reject the TPM driver's UC mapping.
>>
>> IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
>> EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]
>>
>>        //
>>        // Operational region for TPM access
>>        //
>>        OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)
>>
>> generates the problematic SystemMemory entry that triggers the ACPI driver's
>> auto-mapping logic.
>>
>> QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
>> EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
>> Memory32Fixed entry.
>>
>> Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
>> ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
> 
> On 7/30/2025 3:34 PM, Binbin Wu: Wrote
> 
> According to the ACPI spec 6.6, an operation region of SystemMemory has no
> interface to specify the cacheable attribute.
> 
> One solution could be using MTRRs to communicate the memory attribute of legacy
> PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
> that "long-term, firmware should not be using MTRRs to communicate anything to
> the kernel." So this solution is not preferred.
> 
> If not MTRRs, there should be an alternative way to do the job.
> 1. ACPI table
>     According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
>     Range Descriptor can specify the cacheable attribute.
>     "Address Space Resource Descriptors" could be used to describe a memory range
>     and the they can specify the cacheable attribute via "Type Specific Flags".
>     One of the Address Space Resource Descriptors could be added to the ACPI
>     table as a hint when the kernel do the mapping for operation region.
>     (There is "System Physical Address (SPA) Range Structure", which also can
>     specify the cacheable attribute. But it's should be used for NVDIMMs.)
> 2. EFI memory map descriptor
>     EFI memory descriptor can specify the cacheable attribute. Firmware can add
>     a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
>     the mapping for operation region.
> 
> Operation region of SystemMemory is still needed if a "Control Method" of APCI
> needs to access a field, e.g., the method _STA. Checking another descriptor for
> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
> map descriptor" during the ACPI code doing the mapping for operation region
> makes the code complicated.
> 
> Another thing is if long-term firmware should not be using MTRRs to to
> communicate anything to the kernel. It seems it's safer to use ioremap() instead
> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
> operation region access?

Even after changing the ACPI memory resource descriptor from 32-Bit Fixed 
Memory to DWordMemory with caching parameter set to uncached, the ACPI stack still 
tries to ioremap the memory as cachable. 

However, forcing the Operation Region to be PCI_Config instead of SystemMemory 
in the ACPI table seems to allow the vTPM device initilization to succeed as 
it avoids the vTPM region from getting ioremapped by the ACPI stack.

We have also verified that forcing the ACPI stack to use ioremap() instead of 
ioremap_cache() also allows vTPM to initialize properly. The reference change 
I made is below.

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index dae6a73be40e..2771d3f66d0a 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1821,7 +1821,7 @@ u64 x86_default_get_root_pointer(void)
 #ifdef CONFIG_XEN_PV
 void __iomem *x86_acpi_os_ioremap(acpi_physical_address phys, acpi_size size)
 {
-       return ioremap_cache(phys, size);
+       return ioremap(phys, size);
 }

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-15 23:55             ` Korakit Seemakhupt
@ 2025-08-18 11:07               ` Binbin Wu
  0 siblings, 0 replies; 30+ messages in thread
From: Binbin Wu @ 2025-08-18 11:07 UTC (permalink / raw)
  To: Korakit Seemakhupt
  Cc: bp, dave.hansen, dionnaglaze, hpa, jgross, jiewen.yao, jxgao,
	kirill.shutemov, kvm, linux-kernel, mingo, nik.borisov, pbonzini,
	pgonda, rick.p.edgecombe, seanjc, tglx, thomas.lendacky, vkuznets,
	x86, vannapurve



On 8/16/2025 7:55 AM, Korakit Seemakhupt wrote:
>>> On 5/28/2025 11:33 PM, Sean Christopherson wrote:
>>>
>>> Summary, with the questions at the end.
>>>
>>> Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
>>> the TPM driver's ioremap (with UC) failing because the kernel has already mapped
>>> the range using a cachaeable mapping (WB).
>>>
>>>    ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>>>    tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
>>>
>>> The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
>>> SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
>>> to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
>>> thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
>>> actually honors the memtypes programmed into the virtual MTRRs.
>>>
>>> It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
>>> region (and potentially other regions) to be UC, so that the kernel ACPI driver's
>>> attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
>>> forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
>>> causes the ioremap infrastructure to reject the TPM driver's UC mapping.
>>>
>>> IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
>>> EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]
>>>
>>>         //
>>>         // Operational region for TPM access
>>>         //
>>>         OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)
>>>
>>> generates the problematic SystemMemory entry that triggers the ACPI driver's
>>> auto-mapping logic.
>>>
>>> QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
>>> EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
>>> Memory32Fixed entry.
>>>
>>> Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
>>> ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
>> On 7/30/2025 3:34 PM, Binbin Wu: Wrote
>>
>> According to the ACPI spec 6.6, an operation region of SystemMemory has no
>> interface to specify the cacheable attribute.
>>
>> One solution could be using MTRRs to communicate the memory attribute of legacy
>> PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
>> that "long-term, firmware should not be using MTRRs to communicate anything to
>> the kernel." So this solution is not preferred.
>>
>> If not MTRRs, there should be an alternative way to do the job.
>> 1. ACPI table
>>      According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
>>      Range Descriptor can specify the cacheable attribute.
>>      "Address Space Resource Descriptors" could be used to describe a memory range
>>      and the they can specify the cacheable attribute via "Type Specific Flags".
>>      One of the Address Space Resource Descriptors could be added to the ACPI
>>      table as a hint when the kernel do the mapping for operation region.
>>      (There is "System Physical Address (SPA) Range Structure", which also can
>>      specify the cacheable attribute. But it's should be used for NVDIMMs.)
>> 2. EFI memory map descriptor
>>      EFI memory descriptor can specify the cacheable attribute. Firmware can add
>>      a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
>>      the mapping for operation region.
>>
>> Operation region of SystemMemory is still needed if a "Control Method" of APCI
>> needs to access a field, e.g., the method _STA. Checking another descriptor for
>> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
>> map descriptor" during the ACPI code doing the mapping for operation region
>> makes the code complicated.
>>
>> Another thing is if long-term firmware should not be using MTRRs to to
>> communicate anything to the kernel. It seems it's safer to use ioremap() instead
>> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
>> operation region access?
> Even after changing the ACPI memory resource descriptor from 32-Bit Fixed
> Memory to DWordMemory with caching parameter set to uncached, the ACPI stack still
> tries to ioremap the memory as cachable.

As mentioned above on the "Address Space Resource Descriptors", Linux kernel
currently doesn't use any other information when do the remap for Operation
Region.

If we expect the memory attribute type  in "Address Space Resource Descriptors"
to be used, it requires Linux kernel code change, and IMO this would be too
complicated.

>
> However, forcing the Operation Region to be PCI_Config instead of SystemMemory
> in the ACPI table seems to allow the vTPM device initilization to succeed as
> it avoids the vTPM region from getting ioremapped by the ACPI stack.

You can see the handler used for different space_id types in
acpi_ev_install_space_handler().

- For ACPI_ADR_SPACE_SYSTEM_MEMORY, the handler is
   acpi_ex_system_memory_space_handler(), which will remap the operation region
   before the ACPI method _STA accessing the fields inside the operation region.

- For ACPI_ADR_SPACE_PCI_CONFIG, the handler is
   acpi_ex_pci_config_space_handler(), PIO read/write is used when the offset
   is within 256 bytes.

I am curious how this could work by providing a fake PCI config space information.

Anyway, I don't think it's the right way.

>
> We have also verified that forcing the ACPI stack to use ioremap() instead of
> ioremap_cache() also allows vTPM to initialize properly. The reference change
> I made is below.

This may have some side effect on performance though.
See the commit 6d5bbf00d251, the reason why ACPI code use ioremap_cache():
     ACPI: Use ioremap_cache()

     Although the temporary boot-time ACPI table mappings
     were set up with CPU caching enabled, the permanent table
     mappings and AML run-time region memory accesses were
     set up with ioremap(), which on x86 is a synonym for
     ioremap_nocache().

     Changing this to ioremap_cache() improves performance as
     seen when accessing the tables via acpidump,
     or /sys/firmware/acpi/tables. It should also improve
     AML run-time performance.

>
> diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
> index dae6a73be40e..2771d3f66d0a 100644
> --- a/arch/x86/kernel/acpi/boot.c
> +++ b/arch/x86/kernel/acpi/boot.c
> @@ -1821,7 +1821,7 @@ u64 x86_default_get_root_pointer(void)
>   #ifdef CONFIG_XEN_PV
>   void __iomem *x86_acpi_os_ioremap(acpi_physical_address phys, acpi_size size)
>   {
> -       return ioremap_cache(phys, size);
> +       return ioremap(phys, size);
>   }
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-07-30  7:34           ` Binbin Wu
  2025-08-15 23:55             ` Korakit Seemakhupt
@ 2025-08-20  3:07             ` Vishal Annapurve
  2025-08-20 10:03               ` Binbin Wu
  1 sibling, 1 reply; 30+ messages in thread
From: Vishal Annapurve @ 2025-08-20  3:07 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Sean Christopherson, Nikolay Borisov, Jianxiong Gao,
	Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, jiewen.yao

On Wed, Jul 30, 2025 at 12:34 AM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>
>
>
> On 7/28/2025 11:33 PM, Sean Christopherson wrote:
> > +Jiewen
>
> Jiewen is out of the office until August 4th.

Hi Jiewen, can we get some help in deciding the next steps here?

>
> >
> > Summary, with the questions at the end.
> >
> > Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
> > the TPM driver's ioremap (with UC) failing because the kernel has already mapped
> > the range using a cachaeable mapping (WB).
> >
> >   ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
> >   tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
> >
> > The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
> > SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
> > to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
> > thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
> > actually honors the memtypes programmed into the virtual MTRRs.
> >
> > It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
> > region (and potentially other regions) to be UC, so that the kernel ACPI driver's
> > attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
> > forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
> > causes the ioremap infrastructure to reject the TPM driver's UC mapping.
> >
> > IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
> > EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]
> >
> >        //
> >        // Operational region for TPM access
> >        //
> >        OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)
> >
> > generates the problematic SystemMemory entry that triggers the ACPI driver's
> > auto-mapping logic.
> >
> > QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
> > EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
> > Memory32Fixed entry.
> >
> > Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
> > ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
> According to the ACPI spec 6.6, an operation region of SystemMemory has no
> interface to specify the cacheable attribute.
>
> One solution could be using MTRRs to communicate the memory attribute of legacy
> PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
> that "long-term, firmware should not be using MTRRs to communicate anything to
> the kernel." So this solution is not preferred.
>
> If not MTRRs, there should be an alternative way to do the job.
> 1. ACPI table
>     According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
>     Range Descriptor can specify the cacheable attribute.
>     "Address Space Resource Descriptors" could be used to describe a memory range
>     and the they can specify the cacheable attribute via "Type Specific Flags".
>     One of the Address Space Resource Descriptors could be added to the ACPI
>     table as a hint when the kernel do the mapping for operation region.
>     (There is "System Physical Address (SPA) Range Structure", which also can
>     specify the cacheable attribute. But it's should be used for NVDIMMs.)
> 2. EFI memory map descriptor
>     EFI memory descriptor can specify the cacheable attribute. Firmware can add
>     a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
>     the mapping for operation region.
>
> Operation region of SystemMemory is still needed if a "Control Method" of APCI
> needs to access a field, e.g., the method _STA. Checking another descriptor for
> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
> map descriptor" during the ACPI code doing the mapping for operation region
> makes the code complicated.
>
> Another thing is if long-term firmware should not be using MTRRs to to
> communicate anything to the kernel. It seems it's safer to use ioremap() instead
> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
> operation region access?
>

Would it work if instead of doubling down on declaring the low memory
above TOLUD as WB, guest kernel reserves the range as uncacheable by
default i.e. effectively simulating a ioremap before ACPI tries to map
the memory as WB?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-20  3:07             ` Vishal Annapurve
@ 2025-08-20 10:03               ` Binbin Wu
  2025-08-20 11:13                 ` Binbin Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Binbin Wu @ 2025-08-20 10:03 UTC (permalink / raw)
  To: Vishal Annapurve, Sean Christopherson
  Cc: Nikolay Borisov, Jianxiong Gao, Borislav Petkov (AMD),
	Dave Hansen, Dionna Glaze, H. Peter Anvin, jgross,
	Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar, pbonzini,
	Peter Gonda, Thomas Gleixner, Tom Lendacky, Vitaly Kuznetsov, x86,
	Rick Edgecombe, jiewen.yao



On 8/20/2025 11:07 AM, Vishal Annapurve wrote:
> On Wed, Jul 30, 2025 at 12:34 AM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>>
>>
>> On 7/28/2025 11:33 PM, Sean Christopherson wrote:
>>> +Jiewen
>> Jiewen is out of the office until August 4th.
> Hi Jiewen, can we get some help in deciding the next steps here?

Please see below.

>
>>> Summary, with the questions at the end.
>>>
>>> Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
>>> the TPM driver's ioremap (with UC) failing because the kernel has already mapped
>>> the range using a cachaeable mapping (WB).
>>>
>>>    ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>>>    tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
>>>
>>> The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
>>> SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
>>> to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
>>> thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
>>> actually honors the memtypes programmed into the virtual MTRRs.
>>>
>>> It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
>>> region (and potentially other regions) to be UC, so that the kernel ACPI driver's
>>> attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
>>> forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
>>> causes the ioremap infrastructure to reject the TPM driver's UC mapping.
>>>
>>> IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
>>> EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]
>>>
>>>         //
>>>         // Operational region for TPM access
>>>         //
>>>         OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)
>>>
>>> generates the problematic SystemMemory entry that triggers the ACPI driver's
>>> auto-mapping logic.
>>>
>>> QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
>>> EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
>>> Memory32Fixed entry.
>>>
>>> Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
>>> ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?

Checked with Jiewen offline.

He didn't think there was an existing interface to tell the OS to map a
OperationRegion of SystemMemory as UC via the ACPI table. He thought the
OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
alternative way.


>> According to the ACPI spec 6.6, an operation region of SystemMemory has no
>> interface to specify the cacheable attribute.
>>
>> One solution could be using MTRRs to communicate the memory attribute of legacy
>> PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
>> that "long-term, firmware should not be using MTRRs to communicate anything to
>> the kernel." So this solution is not preferred.
>>
>> If not MTRRs, there should be an alternative way to do the job.
>> 1. ACPI table
>>      According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
>>      Range Descriptor can specify the cacheable attribute.
>>      "Address Space Resource Descriptors" could be used to describe a memory range
>>      and the they can specify the cacheable attribute via "Type Specific Flags".
>>      One of the Address Space Resource Descriptors could be added to the ACPI
>>      table as a hint when the kernel do the mapping for operation region.
>>      (There is "System Physical Address (SPA) Range Structure", which also can
>>      specify the cacheable attribute. But it's should be used for NVDIMMs.)
>> 2. EFI memory map descriptor
>>      EFI memory descriptor can specify the cacheable attribute. Firmware can add
>>      a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
>>      the mapping for operation region.
>>
>> Operation region of SystemMemory is still needed if a "Control Method" of APCI
>> needs to access a field, e.g., the method _STA. Checking another descriptor for
>> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
>> map descriptor" during the ACPI code doing the mapping for operation region
>> makes the code complicated.
>>
>> Another thing is if long-term firmware should not be using MTRRs to to
>> communicate anything to the kernel. It seems it's safer to use ioremap() instead
>> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
>> operation region access?
>>
> Would it work if instead of doubling down on declaring the low memory
> above TOLUD as WB, guest kernel reserves the range as uncacheable by
> default i.e. effectively simulating a ioremap before ACPI tries to map
> the memory as WB?

It seems as hacky as this patch set?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-20 10:03               ` Binbin Wu
@ 2025-08-20 11:13                 ` Binbin Wu
  2025-08-20 17:56                   ` Sean Christopherson
  0 siblings, 1 reply; 30+ messages in thread
From: Binbin Wu @ 2025-08-20 11:13 UTC (permalink / raw)
  To: Sean Christopherson, Vishal Annapurve
  Cc: Nikolay Borisov, Jianxiong Gao, Borislav Petkov (AMD),
	Dave Hansen, Dionna Glaze, H. Peter Anvin, jgross,
	Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar, pbonzini,
	Peter Gonda, Thomas Gleixner, Tom Lendacky, Vitaly Kuznetsov, x86,
	Rick Edgecombe, jiewen.yao



On 8/20/2025 6:03 PM, Binbin Wu wrote:
>
>
> On 8/20/2025 11:07 AM, Vishal Annapurve wrote:
>> On Wed, Jul 30, 2025 at 12:34 AM Binbin Wu <binbin.wu@linux.intel.com> wrote:
>>>
>>>
>>> On 7/28/2025 11:33 PM, Sean Christopherson wrote:
>>>> +Jiewen
>>> Jiewen is out of the office until August 4th.
>> Hi Jiewen, can we get some help in deciding the next steps here?
>
> Please see below.
>
>>
>>>> Summary, with the questions at the end.
>>>>
>>>> Recent upstream kernels running in GCE SNP/TDX VMs fail to probe the TPM due to
>>>> the TPM driver's ioremap (with UC) failing because the kernel has already mapped
>>>> the range using a cachaeable mapping (WB).
>>>>
>>>>    ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>>>>    tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
>>>>
>>>> The "guilty" commit is 8e690b817e38 ("x86/kvm: Override default caching mode for
>>>> SEV-SNP and TDX"), which as the subject suggests, forces the kernel's MTRR memtype
>>>> to WB.  With SNP and TDX, the virtual MTRR state is (a) controlled by the VMM and
>>>> thus is untrusted, and (b) _should_ be irrelevant because no known hypervisor
>>>> actually honors the memtypes programmed into the virtual MTRRs.
>>>>
>>>> It turns out that the kernel has been relying on the MTRRs to force the TPM TIS
>>>> region (and potentially other regions) to be UC, so that the kernel ACPI driver's
>>>> attempts to map of SystemMemory entries as cacheable get forced to UC.  With MTRRs
>>>> forced WB, x86_acpi_os_ioremap() succeeds in creating a WB mapping, which in turn
>>>> causes the ioremap infrastructure to reject the TPM driver's UC mapping.
>>>>
>>>> IIUC, the TPM entry(s) in the ACPI tables for GCE VMs are derived (built?) from
>>>> EDK2's TPM ASL.  And (again, IIUC), this code in SecurityPkg/Tcg/Tcg2Acpi/Tpm.asl[1]
>>>>
>>>>         //
>>>>         // Operational region for TPM access
>>>>         //
>>>>         OperationRegion (TPMR, SystemMemory, 0xfed40000, 0x5000)
>>>>
>>>> generates the problematic SystemMemory entry that triggers the ACPI driver's
>>>> auto-mapping logic.
>>>>
>>>> QEMU-based VMs don't suffer the same fate, as QEMU intentionally[2] doesn't use
>>>> EDK2's AML for the TPM, and QEMU doesn't define a SystemMemory entry, just a
>>>> Memory32Fixed entry.
>>>>
>>>> Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
>>>> ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
>
> Checked with Jiewen offline.
>
> He didn't think there was an existing interface to tell the OS to map a
> OperationRegion of SystemMemory as UC via the ACPI table. He thought the
> OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
> alternative way.
>
>
>>> According to the ACPI spec 6.6, an operation region of SystemMemory has no
>>> interface to specify the cacheable attribute.
>>>
>>> One solution could be using MTRRs to communicate the memory attribute of legacy
>>> PCI hole to the kernel. But during the PUCK meeting last week, Sean mentioned
>>> that "long-term, firmware should not be using MTRRs to communicate anything to
>>> the kernel." So this solution is not preferred.
>>>
>>> If not MTRRs, there should be an alternative way to do the job.
>>> 1. ACPI table
>>>      According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
>>>      Range Descriptor can specify the cacheable attribute.
>>>      "Address Space Resource Descriptors" could be used to describe a memory range
>>>      and the they can specify the cacheable attribute via "Type Specific Flags".
>>>      One of the Address Space Resource Descriptors could be added to the ACPI
>>>      table as a hint when the kernel do the mapping for operation region.
>>>      (There is "System Physical Address (SPA) Range Structure", which also can
>>>      specify the cacheable attribute. But it's should be used for NVDIMMs.)
>>> 2. EFI memory map descriptor
>>>      EFI memory descriptor can specify the cacheable attribute. Firmware can add
>>>      a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
>>>      the mapping for operation region.
>>>
>>> Operation region of SystemMemory is still needed if a "Control Method" of APCI
>>> needs to access a field, e.g., the method _STA. Checking another descriptor for
>>> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
>>> map descriptor" during the ACPI code doing the mapping for operation region
>>> makes the code complicated.
>>>
>>> Another thing is if long-term firmware should not be using MTRRs to to
>>> communicate anything to the kernel. It seems it's safer to use ioremap() instead
>>> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
>>> operation region access?
>>>
>> Would it work if instead of doubling down on declaring the low memory
>> above TOLUD as WB, guest kernel reserves the range as uncacheable by
>> default i.e. effectively simulating a ioremap before ACPI tries to map
>> the memory as WB?
>
> It seems as hacky as this patch set?
>
>
Hi Sean,

Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
hole range as UC?

Is it less hacky?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-20 11:13                 ` Binbin Wu
@ 2025-08-20 17:56                   ` Sean Christopherson
  2025-08-21  3:30                     ` Binbin Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-08-20 17:56 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Vishal Annapurve, Nikolay Borisov, Jianxiong Gao,
	Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, jiewen.yao

On Wed, Aug 20, 2025, Binbin Wu wrote:
> On 8/20/2025 6:03 PM, Binbin Wu wrote:
> > > > > Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
> > > > > ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
> > 
> > Checked with Jiewen offline.
> > 
> > He didn't think there was an existing interface to tell the OS to map a
> > OperationRegion of SystemMemory as UC via the ACPI table. He thought the
> > OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
> > alternative way.
> > 
> > > > According to the ACPI spec 6.6, an operation region of SystemMemory has no
> > > > interface to specify the cacheable attribute.
> > > > 
> > > > One solution could be using MTRRs to communicate the memory attribute of legacy
> > > > PCI hole to the kernel.

So IIUC, there are no bugs anywhere, just a gap in specs that has been hidden
until now :-(

> > > > But during the PUCK meeting last week, Sean mentioned
> > > > that "long-term, firmware should not be using MTRRs to communicate anything to
> > > > the kernel." So this solution is not preferred.
> > > > 
> > > > If not MTRRs, there should be an alternative way to do the job.
> > > > 1. ACPI table
> > > >      According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
> > > >      Range Descriptor can specify the cacheable attribute.
> > > >      "Address Space Resource Descriptors" could be used to describe a memory range
> > > >      and the they can specify the cacheable attribute via "Type Specific Flags".
> > > >      One of the Address Space Resource Descriptors could be added to the ACPI
> > > >      table as a hint when the kernel do the mapping for operation region.
> > > >      (There is "System Physical Address (SPA) Range Structure", which also can
> > > >      specify the cacheable attribute. But it's should be used for NVDIMMs.)
> > > > 2. EFI memory map descriptor
> > > >      EFI memory descriptor can specify the cacheable attribute. Firmware can add
> > > >      a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
> > > >      the mapping for operation region.
> > > > 
> > > > Operation region of SystemMemory is still needed if a "Control Method" of APCI
> > > > needs to access a field, e.g., the method _STA. Checking another descriptor for
> > > > cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
> > > > map descriptor" during the ACPI code doing the mapping for operation region
> > > > makes the code complicated.
> > > > 
> > > > Another thing is if long-term firmware should not be using MTRRs to to
> > > > communicate anything to the kernel. It seems it's safer to use ioremap() instead
> > > > of ioremap_cache() for MMIO resource when the kernel do the mapping for the
> > > > operation region access?
> > > > 
> > > Would it work if instead of doubling down on declaring the low memory
> > > above TOLUD as WB, guest kernel reserves the range as uncacheable by
> > > default i.e. effectively simulating a ioremap before ACPI tries to map
> > > the memory as WB?
> > 
> > It seems as hacky as this patch set?
> > 
> > 
> Hi Sean,
> 
> Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
> I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
> hole range as UC?
> 
> Is it less hacky?

Oh!  That's a way better idea than my hack.  I missed that the kernel would still
consult MTRRs.

Compile tested only, but something like this?

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8ae750cde0c6..45c8871cdda1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -933,6 +933,13 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
 
 static void __init kvm_init_platform(void)
 {
+       u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
+       struct mtrr_var_range pci_hole = {
+               .base_lo = tolud | X86_MEMTYPE_UC,
+               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
+               .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
+       };
+
        if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
            kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
                unsigned long nr_pages;
@@ -982,8 +989,12 @@ static void __init kvm_init_platform(void)
        kvmclock_init();
        x86_platform.apic_post_init = kvm_apic_init;
 
-       /* Set WB as the default cache mode for SEV-SNP and TDX */
-       guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
+       /*
+        * Set WB as the default cache mode for SEV-SNP and TDX, with a single
+        * UC range for the legacy PCI hole, e.g. so that devices that expect
+        * to get UC/WC mappings don't get surprised with WB.
+        */
+       guest_force_mtrr_state(&pci_hole, 1, MTRR_TYPE_WRBACK);
 }
 
 #if defined(CONFIG_AMD_MEM_ENCRYPT)

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-20 17:56                   ` Sean Christopherson
@ 2025-08-21  3:30                     ` Binbin Wu
  2025-08-21  5:23                       ` Binbin Wu
  0 siblings, 1 reply; 30+ messages in thread
From: Binbin Wu @ 2025-08-21  3:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vishal Annapurve, Nikolay Borisov, Jianxiong Gao,
	Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, jiewen.yao



On 8/21/2025 1:56 AM, Sean Christopherson wrote:
> On Wed, Aug 20, 2025, Binbin Wu wrote:
>> On 8/20/2025 6:03 PM, Binbin Wu wrote:
>>>>>> Presumably this an EDK2 bug?  If it's not an EDK2 bug, then how is the kernel's
>>>>>> ACPI driver supposed to know that some ranges of SystemMemory must be mapped UC?
>>> Checked with Jiewen offline.
>>>
>>> He didn't think there was an existing interface to tell the OS to map a
>>> OperationRegion of SystemMemory as UC via the ACPI table. He thought the
>>> OS/ACPI driver still needed to rely on MTRRs for the hint before there was an
>>> alternative way.
>>>
>>>>> According to the ACPI spec 6.6, an operation region of SystemMemory has no
>>>>> interface to specify the cacheable attribute.
>>>>>
>>>>> One solution could be using MTRRs to communicate the memory attribute of legacy
>>>>> PCI hole to the kernel.
> So IIUC, there are no bugs anywhere, just a gap in specs that has been hidden
> until now :-(
>
>>>>> But during the PUCK meeting last week, Sean mentioned
>>>>> that "long-term, firmware should not be using MTRRs to communicate anything to
>>>>> the kernel." So this solution is not preferred.
>>>>>
>>>>> If not MTRRs, there should be an alternative way to do the job.
>>>>> 1. ACPI table
>>>>>       According to the ACPI spec, neither operation region nor 32-Bit Fixed Memory
>>>>>       Range Descriptor can specify the cacheable attribute.
>>>>>       "Address Space Resource Descriptors" could be used to describe a memory range
>>>>>       and the they can specify the cacheable attribute via "Type Specific Flags".
>>>>>       One of the Address Space Resource Descriptors could be added to the ACPI
>>>>>       table as a hint when the kernel do the mapping for operation region.
>>>>>       (There is "System Physical Address (SPA) Range Structure", which also can
>>>>>       specify the cacheable attribute. But it's should be used for NVDIMMs.)
>>>>> 2. EFI memory map descriptor
>>>>>       EFI memory descriptor can specify the cacheable attribute. Firmware can add
>>>>>       a EFI memory descriptor for the TPM TIS device as a hint when the kernel do
>>>>>       the mapping for operation region.
>>>>>
>>>>> Operation region of SystemMemory is still needed if a "Control Method" of APCI
>>>>> needs to access a field, e.g., the method _STA. Checking another descriptor for
>>>>> cacheable attribute, either "Address Space Resource Descriptor" or "EFI memory
>>>>> map descriptor" during the ACPI code doing the mapping for operation region
>>>>> makes the code complicated.
>>>>>
>>>>> Another thing is if long-term firmware should not be using MTRRs to to
>>>>> communicate anything to the kernel. It seems it's safer to use ioremap() instead
>>>>> of ioremap_cache() for MMIO resource when the kernel do the mapping for the
>>>>> operation region access?
>>>>>
>>>> Would it work if instead of doubling down on declaring the low memory
>>>> above TOLUD as WB, guest kernel reserves the range as uncacheable by
>>>> default i.e. effectively simulating a ioremap before ACPI tries to map
>>>> the memory as WB?
>>> It seems as hacky as this patch set?
>>>
>>>
>> Hi Sean,
>>
>> Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
>> I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
>> hole range as UC?
>>
>> Is it less hacky?
> Oh!  That's a way better idea than my hack.  I missed that the kernel would still
> consult MTRRs.
>
> Compile tested only, but something like this?
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 8ae750cde0c6..45c8871cdda1 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -933,6 +933,13 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
>   
>   static void __init kvm_init_platform(void)
>   {
> +       u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
> +       struct mtrr_var_range pci_hole = {
> +               .base_lo = tolud | X86_MEMTYPE_UC,
> +               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
> +               .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
> +       };
> +

This value of tolud  may not meet the range size and alignment requirement for
variable MTRR.

Variable MTRR has requirement for range size and alignment:
For ranges greater than 4 KBytes, each range must be of length 2^n and its base
address must be aligned on a 2^n boundary, where n is a value equal to or
greater than 12. The base-address alignment value cannot be less than its length.

In my setup, the value of tolud is 0x7FF7C000, it requires 3 variable MTRRs to
meet the requirement, i.e.,
- 7FF7 C000  ~   7FF8 0000
- 7FF8 0000  ~   8000 0000
- 8000 0000  ~ 1 0000 0000

I checks the implementation in EDK2, in order to fit the legacy PCI hole into
one variable MTRR, it has some assumption to truncate the size and round up the
base address in PlatformQemuUc32BaseInitialization():
     ...
     ASSERT (
       PlatformInfoHob->HostBridgeDevId == INTEL_Q35_MCH_DEVICE_ID ||
       PlatformInfoHob->HostBridgeDevId == INTEL_82441_DEVICE_ID
       );
     ...
     //
     // Start with the [LowerMemorySize, 4GB) range. Make sure one
     // variable MTRR suffices by truncating the size to a whole power of two,
     // while keeping the end affixed to 4GB. This will round the base up.
     //
     PlatformInfoHob->Uc32Size = GetPowerOfTwo32 ((UINT32)(SIZE_4GB - PlatformInfoHob->LowMemory));
     PlatformInfoHob->Uc32Base = (UINT32)(SIZE_4GB - PlatformInfoHob->Uc32Size);
     //
     // Assuming that LowerMemorySize is at least 1 byte, Uc32Size is at most 2GB.
     // Therefore Uc32Base is at least 2GB.
     //
     ASSERT (PlatformInfoHob->Uc32Base >= BASE_2GB);

I am not sure if KVM can do such assumption.
Otherwise, KVM needs to calculate the ranges to meet the requirement. :(


>          if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
>              kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
>                  unsigned long nr_pages;
> @@ -982,8 +989,12 @@ static void __init kvm_init_platform(void)
>          kvmclock_init();
>          x86_platform.apic_post_init = kvm_apic_init;
>   
> -       /* Set WB as the default cache mode for SEV-SNP and TDX */
> -       guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
> +       /*
> +        * Set WB as the default cache mode for SEV-SNP and TDX, with a single
> +        * UC range for the legacy PCI hole, e.g. so that devices that expect
> +        * to get UC/WC mappings don't get surprised with WB.
> +        */
> +       guest_force_mtrr_state(&pci_hole, 1, MTRR_TYPE_WRBACK);
>   }
>   
>   #if defined(CONFIG_AMD_MEM_ENCRYPT)


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-21  3:30                     ` Binbin Wu
@ 2025-08-21  5:23                       ` Binbin Wu
  2025-08-21  6:02                         ` Jürgen Groß
  2025-08-21 15:27                         ` Sean Christopherson
  0 siblings, 2 replies; 30+ messages in thread
From: Binbin Wu @ 2025-08-21  5:23 UTC (permalink / raw)
  To: Sean Christopherson, Vishal Annapurve
  Cc: Nikolay Borisov, Jianxiong Gao, Borislav Petkov (AMD),
	Dave Hansen, Dionna Glaze, H. Peter Anvin, jgross,
	Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar, pbonzini,
	Peter Gonda, Thomas Gleixner, Tom Lendacky, Vitaly Kuznetsov, x86,
	Rick Edgecombe, jiewen.yao



On 8/21/2025 11:30 AM, Binbin Wu wrote:
>
>
> On 8/21/2025 1:56 AM, Sean Christopherson wrote:
>> On Wed, Aug 20, 2025, Binbin Wu wrote:
[...]
>>> Hi Sean,
>>>
>>> Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
>>> I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
>>> hole range as UC?
>>>
>>> Is it less hacky?
>> Oh!  That's a way better idea than my hack.  I missed that the kernel would still
>> consult MTRRs.
>>
>> Compile tested only, but something like this?
>>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 8ae750cde0c6..45c8871cdda1 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -933,6 +933,13 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
>>     static void __init kvm_init_platform(void)
>>   {
>> +       u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
>> +       struct mtrr_var_range pci_hole = {
>> +               .base_lo = tolud | X86_MEMTYPE_UC,
>> +               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
>> +               .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
>> +       };
>> +
>
> This value of tolud  may not meet the range size and alignment requirement for
> variable MTRR.
>
> Variable MTRR has requirement for range size and alignment:
> For ranges greater than 4 KBytes, each range must be of length 2^n and its base
> address must be aligned on a 2^n boundary, where n is a value equal to or
> greater than 12. The base-address alignment value cannot be less than its length.

Wait, Linux kernel converts MTRR register values to MTRR state (base and size) and
cache it for later lookups (refer to map_add_var()). I.e., in Linux kernel,
only the cached state will be used.

These MTRR register values are never programmed when using
guest_force_mtrr_state() , so even the values doesn't meet the
requirement from hardware perspective, Linux kernel can still get the right base and size.

No bothering to force the base and size alignment.
But a comment would be helpful.
Also, BIT(11) could be replaced by MTRR_PHYSMASK_V.

How about:
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 90097df4eafd..a9582ffc3088 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -934,9 +934,15 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
  static void __init kvm_init_platform(void)
  {
         u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
+       /*
+        * The range's base address and size may not meet the alignment
+        * requirement for variable MTRR. However, Linux guest never
+        * programs MTRRs when forcing guest MTRR state, no bothering to
+        * enforce the base and range size alignment.
+        */
         struct mtrr_var_range pci_hole = {
                 .base_lo = tolud | X86_MEMTYPE_UC,
-               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
+               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | MTRR_PHYSMASK_V,
                 .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
         };


I tested it in my setup, it can fix the issue of TPM driver failure with the
modified ACPI table for TPM in QEMU.


Hi Vishal,
Could you test it with google's VMM?


>
> In my setup, the value of tolud is 0x7FF7C000, it requires 3 variable MTRRs to
> meet the requirement, i.e.,
> - 7FF7 C000  ~   7FF8 0000
> - 7FF8 0000  ~   8000 0000
> - 8000 0000  ~ 1 0000 0000
>
> I checks the implementation in EDK2, in order to fit the legacy PCI hole into
> one variable MTRR, it has some assumption to truncate the size and round up the
> base address in PlatformQemuUc32BaseInitialization():
>     ...
>     ASSERT (
>       PlatformInfoHob->HostBridgeDevId == INTEL_Q35_MCH_DEVICE_ID ||
>       PlatformInfoHob->HostBridgeDevId == INTEL_82441_DEVICE_ID
>       );
>     ...
>     //
>     // Start with the [LowerMemorySize, 4GB) range. Make sure one
>     // variable MTRR suffices by truncating the size to a whole power of two,
>     // while keeping the end affixed to 4GB. This will round the base up.
>     //
>     PlatformInfoHob->Uc32Size = GetPowerOfTwo32 ((UINT32)(SIZE_4GB - PlatformInfoHob->LowMemory));
>     PlatformInfoHob->Uc32Base = (UINT32)(SIZE_4GB - PlatformInfoHob->Uc32Size);
>     //
>     // Assuming that LowerMemorySize is at least 1 byte, Uc32Size is at most 2GB.
>     // Therefore Uc32Base is at least 2GB.
>     //
>     ASSERT (PlatformInfoHob->Uc32Base >= BASE_2GB);
>
> I am not sure if KVM can do such assumption.
> Otherwise, KVM needs to calculate the ranges to meet the requirement. :(
>
>
>>          if (cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) &&
>> kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL)) {
>>                  unsigned long nr_pages;
>> @@ -982,8 +989,12 @@ static void __init kvm_init_platform(void)
>>          kvmclock_init();
>>          x86_platform.apic_post_init = kvm_apic_init;
>>   -       /* Set WB as the default cache mode for SEV-SNP and TDX */
>> -       guest_force_mtrr_state(NULL, 0, MTRR_TYPE_WRBACK);
>> +       /*
>> +        * Set WB as the default cache mode for SEV-SNP and TDX, with a single
>> +        * UC range for the legacy PCI hole, e.g. so that devices that expect
>> +        * to get UC/WC mappings don't get surprised with WB.
>> +        */
>> +       guest_force_mtrr_state(&pci_hole, 1, MTRR_TYPE_WRBACK);
>>   }
>>     #if defined(CONFIG_AMD_MEM_ENCRYPT)
>
>


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-21  5:23                       ` Binbin Wu
@ 2025-08-21  6:02                         ` Jürgen Groß
  2025-08-21 15:27                         ` Sean Christopherson
  1 sibling, 0 replies; 30+ messages in thread
From: Jürgen Groß @ 2025-08-21  6:02 UTC (permalink / raw)
  To: Binbin Wu, Sean Christopherson, Vishal Annapurve
  Cc: Nikolay Borisov, Jianxiong Gao, Borislav Petkov (AMD),
	Dave Hansen, Dionna Glaze, H. Peter Anvin, Kirill A. Shutemov,
	kvm, linux-kernel, Ingo Molnar, pbonzini, Peter Gonda,
	Thomas Gleixner, Tom Lendacky, Vitaly Kuznetsov, x86,
	Rick Edgecombe, jiewen.yao


[-- Attachment #1.1.1: Type: text/plain, Size: 3328 bytes --]

On 21.08.25 07:23, Binbin Wu wrote:
> 
> 
> On 8/21/2025 11:30 AM, Binbin Wu wrote:
>>
>>
>> On 8/21/2025 1:56 AM, Sean Christopherson wrote:
>>> On Wed, Aug 20, 2025, Binbin Wu wrote:
> [...]
>>>> Hi Sean,
>>>>
>>>> Since guest_force_mtrr_state() also supports to force MTRR variable ranges,
>>>> I am wondering if we could use guest_force_mtrr_state() to set the legacy PCI
>>>> hole range as UC?
>>>>
>>>> Is it less hacky?
>>> Oh!  That's a way better idea than my hack.  I missed that the kernel would 
>>> still
>>> consult MTRRs.
>>>
>>> Compile tested only, but something like this?
>>>
>>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>>> index 8ae750cde0c6..45c8871cdda1 100644
>>> --- a/arch/x86/kernel/kvm.c
>>> +++ b/arch/x86/kernel/kvm.c
>>> @@ -933,6 +933,13 @@ static void kvm_sev_hc_page_enc_status(unsigned long 
>>> pfn, int npages, bool enc)
>>>     static void __init kvm_init_platform(void)
>>>   {
>>> +       u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
>>> +       struct mtrr_var_range pci_hole = {
>>> +               .base_lo = tolud | X86_MEMTYPE_UC,
>>> +               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
>>> +               .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
>>> +       };
>>> +
>>
>> This value of tolud  may not meet the range size and alignment requirement for
>> variable MTRR.
>>
>> Variable MTRR has requirement for range size and alignment:
>> For ranges greater than 4 KBytes, each range must be of length 2^n and its base
>> address must be aligned on a 2^n boundary, where n is a value equal to or
>> greater than 12. The base-address alignment value cannot be less than its length.
> 
> Wait, Linux kernel converts MTRR register values to MTRR state (base and size) and
> cache it for later lookups (refer to map_add_var()). I.e., in Linux kernel,
> only the cached state will be used.
> 
> These MTRR register values are never programmed when using
> guest_force_mtrr_state() , so even the values doesn't meet the
> requirement from hardware perspective, Linux kernel can still get the right base 
> and size.
> 
> No bothering to force the base and size alignment.
> But a comment would be helpful.
> Also, BIT(11) could be replaced by MTRR_PHYSMASK_V.
> 
> How about:
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 90097df4eafd..a9582ffc3088 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -934,9 +934,15 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, 
> int npages, bool enc)
>   static void __init kvm_init_platform(void)
>   {
>          u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;

I'd prefer to avoid open coding PFN_PHYS() here.

> +       /*
> +        * The range's base address and size may not meet the alignment
> +        * requirement for variable MTRR. However, Linux guest never
> +        * programs MTRRs when forcing guest MTRR state, no bothering to
> +        * enforce the base and range size alignment.
> +        */

Maybe a related comment should be added to guest_force_mtrr_state() in
order to avoid breaking this workaround again.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3743 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-21  5:23                       ` Binbin Wu
  2025-08-21  6:02                         ` Jürgen Groß
@ 2025-08-21 15:27                         ` Sean Christopherson
  2025-08-28  0:07                           ` Sean Christopherson
  1 sibling, 1 reply; 30+ messages in thread
From: Sean Christopherson @ 2025-08-21 15:27 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Vishal Annapurve, Nikolay Borisov, Jianxiong Gao,
	Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, jiewen.yao

On Thu, Aug 21, 2025, Binbin Wu wrote:
> On 8/21/2025 11:30 AM, Binbin Wu wrote:
> > Variable MTRR has requirement for range size and alignment:
> > For ranges greater than 4 KBytes, each range must be of length 2^n and its base
> > address must be aligned on a 2^n boundary, where n is a value equal to or
> > greater than 12. The base-address alignment value cannot be less than its length.
> 
> Wait, Linux kernel converts MTRR register values to MTRR state (base and size) and
> cache it for later lookups (refer to map_add_var()). I.e., in Linux kernel,
> only the cached state will be used.
> 
> These MTRR register values are never programmed when using
> guest_force_mtrr_state() , so even the values doesn't meet the requirement
> from hardware perspective, Linux kernel can still get the right base and
> size.

Yeah.  I forget what happens if the ranges don't meet the power-of-2 requirements,
but the mask+match logic should work jus tfine.

> No bothering to force the base and size alignment.
> But a comment would be helpful.
> Also, BIT(11) could be replaced by MTRR_PHYSMASK_V.

Ha!  I spent a good 5 minutes looking for a #define couldn't find one.  What a
bizarre name...

> How about:
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 90097df4eafd..a9582ffc3088 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -934,9 +934,15 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
>  static void __init kvm_init_platform(void)
>  {
>         u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
> +       /*
> +        * The range's base address and size may not meet the alignment
> +        * requirement for variable MTRR. However, Linux guest never
> +        * programs MTRRs when forcing guest MTRR state, no bothering to
> +        * enforce the base and range size alignment.
> +        */
>         struct mtrr_var_range pci_hole = {
>                 .base_lo = tolud | X86_MEMTYPE_UC,
> -               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
> +               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | MTRR_PHYSMASK_V,
>                 .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
>         };
> 
> 
> I tested it in my setup, it can fix the issue of TPM driver failure with the
> modified ACPI table for TPM in QEMU.
> 
> 
> Hi Vishal,
> Could you test it with google's VMM?

Vishal is OOO for a few days.  I pinged our internal bug tracker, I'll find
someone to test.

Thanks much!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX
  2025-08-21 15:27                         ` Sean Christopherson
@ 2025-08-28  0:07                           ` Sean Christopherson
  0 siblings, 0 replies; 30+ messages in thread
From: Sean Christopherson @ 2025-08-28  0:07 UTC (permalink / raw)
  To: Binbin Wu
  Cc: Vishal Annapurve, Nikolay Borisov, Jianxiong Gao,
	Borislav Petkov (AMD), Dave Hansen, Dionna Glaze, H. Peter Anvin,
	jgross, Kirill A. Shutemov, kvm, linux-kernel, Ingo Molnar,
	pbonzini, Peter Gonda, Thomas Gleixner, Tom Lendacky,
	Vitaly Kuznetsov, x86, Rick Edgecombe, jiewen.yao

On Thu, Aug 21, 2025, Sean Christopherson wrote:
> On Thu, Aug 21, 2025, Binbin Wu wrote:
> > On 8/21/2025 11:30 AM, Binbin Wu wrote:
> > > Variable MTRR has requirement for range size and alignment:
> > > For ranges greater than 4 KBytes, each range must be of length 2^n and its base
> > > address must be aligned on a 2^n boundary, where n is a value equal to or
> > > greater than 12. The base-address alignment value cannot be less than its length.
> > 
> > Wait, Linux kernel converts MTRR register values to MTRR state (base and size) and
> > cache it for later lookups (refer to map_add_var()). I.e., in Linux kernel,
> > only the cached state will be used.
> > 
> > These MTRR register values are never programmed when using
> > guest_force_mtrr_state() , so even the values doesn't meet the requirement
> > from hardware perspective, Linux kernel can still get the right base and
> > size.
> 
> Yeah.  I forget what happens if the ranges don't meet the power-of-2 requirements,
> but the mask+match logic should work jus tfine.
> 
> > No bothering to force the base and size alignment.
> > But a comment would be helpful.
> > Also, BIT(11) could be replaced by MTRR_PHYSMASK_V.
> 
> Ha!  I spent a good 5 minutes looking for a #define couldn't find one.  What a
> bizarre name...
> 
> > How about:
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index 90097df4eafd..a9582ffc3088 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -934,9 +934,15 @@ static void kvm_sev_hc_page_enc_status(unsigned long pfn, int npages, bool enc)
> >  static void __init kvm_init_platform(void)
> >  {
> >         u64 tolud = e820__end_of_low_ram_pfn() << PAGE_SHIFT;
> > +       /*
> > +        * The range's base address and size may not meet the alignment
> > +        * requirement for variable MTRR. However, Linux guest never
> > +        * programs MTRRs when forcing guest MTRR state, no bothering to
> > +        * enforce the base and range size alignment.
> > +        */
> >         struct mtrr_var_range pci_hole = {
> >                 .base_lo = tolud | X86_MEMTYPE_UC,
> > -               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | BIT(11),
> > +               .mask_lo = (u32)(~(SZ_4G - tolud - 1)) | MTRR_PHYSMASK_V,
> >                 .mask_hi = (BIT_ULL(boot_cpu_data.x86_phys_bits) - 1) >> 32,
> >         };
> > 
> > 
> > I tested it in my setup, it can fix the issue of TPM driver failure with the
> > modified ACPI table for TPM in QEMU.
> > 
> > 
> > Hi Vishal,
> > Could you test it with google's VMM?
> 
> Vishal is OOO for a few days.  I pinged our internal bug tracker, I'll find
> someone to test.

Got confirmation this fixes the vTPM woes with Google's VMM.  v2 incoming...

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-08-28  0:07 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-09 16:54 [PATCH 0/2] x86/kvm: Force legacy PCI hole as WB under SNP/TDX Jianxiong Gao
2025-07-14  9:06 ` Binbin Wu
2025-07-14 11:24   ` Nikolay Borisov
2025-07-15  2:53     ` Binbin Wu
2025-07-16  9:51       ` Binbin Wu
2025-07-23 14:34     ` Sean Christopherson
2025-07-24  3:16       ` Binbin Wu
2025-07-28 15:33         ` Sean Christopherson
2025-07-30  7:34           ` Binbin Wu
2025-08-15 23:55             ` Korakit Seemakhupt
2025-08-18 11:07               ` Binbin Wu
2025-08-20  3:07             ` Vishal Annapurve
2025-08-20 10:03               ` Binbin Wu
2025-08-20 11:13                 ` Binbin Wu
2025-08-20 17:56                   ` Sean Christopherson
2025-08-21  3:30                     ` Binbin Wu
2025-08-21  5:23                       ` Binbin Wu
2025-08-21  6:02                         ` Jürgen Groß
2025-08-21 15:27                         ` Sean Christopherson
2025-08-28  0:07                           ` Sean Christopherson
  -- strict thread matches above, loose matches on Subject: below --
2025-02-01  0:50 Sean Christopherson
2025-02-01 14:25 ` Dionna Amalie Glaze
2025-02-03 18:14 ` Edgecombe, Rick P
2025-02-03 20:33   ` Sean Christopherson
2025-02-03 23:01     ` Edgecombe, Rick P
2025-02-04  0:27       ` Sean Christopherson
2025-02-05  3:51         ` Edgecombe, Rick P
2025-02-05  7:49           ` Xu, Min M
2025-02-10 15:29         ` Binbin Wu
2025-07-08 14:24 ` Nikolay Borisov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).