From: Sean Christopherson <seanjc@google.com>
To: dan.j.williams@intel.com
Cc: Chao Gao <chao.gao@intel.com>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: VMXON for TDX (was: Re: [PATCH] KVM: VMX: Flush shadow VMCS on emergency reboot)
Date: Fri, 10 Oct 2025 14:22:26 -0700
Message-ID: <aOl5EutrdL_OlVOO@google.com>
In-Reply-To: <68e85e503ae05_29801003f@dwillia2-mobl4.notmuch>
On Thu, Oct 09, 2025, dan.j.williams@intel.com wrote:
> Hey, while we're on the topic of being off topic about that branch, I
> mentioned to Chao that I was looking to post patches for a simple VMX
> module. He pointed me here to say "I think Sean is already working on
> it". This posting would be to save you the trouble of untangling that
> branch, at least in the near term.
>
> Here's the work-in-progress shortlog:
>
> Chao Gao (2):
> x86/virt/tdx: Move TDX per-CPU initialization into tdx_enable()
> coco/tdx-host: Introduce a "tdx_host" device
>
> Dan Williams (5):
> x86/virt/tdx: Drop KVM_INTEL dependency
> x86/reboot: Convert cpu_emergency_disable_virtualization() to notifier chain

I don't actually think we want a notifier chain, because the order matters. I
think what we want instead is a daisy-chain of notifiers (I swear I'm not trying
to troll y'all). If we go with a standard notifier chain, there will be a hidden
dependency on KVM's notifier running first (to do VMCLEAR) and what you appear to
be calling vmx.ko's notifier running second.
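
To make that ordering dependency concrete, here's a minimal sketch (hypothetical
helper names, not actual kernel code); an explicit call sequence makes the
VMCLEAR-before-VMXOFF requirement visible, whereas a notifier chain buries it in
registration order:

/*
 * Hypothetical sketch.  With a standard notifier chain, the ordering
 * requirement is implicit in whatever order callbacks were registered:
 *
 *     atomic_notifier_call_chain(&emergency_virt_chain, 0, NULL);
 *
 * With an explicit sequence, it's obvious that KVM must VMCLEAR its
 * loaded (and shadow) VMCSes before anything executes VMXOFF:
 */
static void cpu_emergency_disable_virtualization(void)
{
        /* 1. KVM clears any VMCS that is "current" on this CPU ... */
        kvm_emergency_vmclear_all();       /* hypothetical */

        /* 2. ... and only then is it safe to tear down VMX operation. */
        x86_virt_emergency_vmxoff();       /* hypothetical */
}
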
> x86/reboot: Introduce CONFIG_VIRT_X86
> KVM: VMX: Rename vmx_init() to kvm_vmx_init()
> KVM: VMX: Move VMX from kvm_intel.ko to vmx.ko
Spoiler for the wall of text below, but I don't think it makes sense to have a
vmx.ko, or an intermediate module of any kind.
> I can put the finishing touches on it and send it out against
> kvm-x86/next after -rc1 is out, or hold off if your extraction is going
> ok.
It went ok, but it's dead in the water.
> At least my approach to the VMX module leaves VMCS clearing with KVM.
Yes, I came to this conclusion as well.

TL;DR: I don't think we should pursue my idea (read: pipe dream) of extracting
all core virtualization functionality to allow multiple hypervisors. It's
not _that_ hard (my "not 6 months of work" statement still holds true),
but it raises a number of policy-related questions and would add non-trivial
complexity, for no immediate benefit (and potentially no benefit even in the
long run).

For all intents and purposes, moving the VMCS tracking to a central module commits
to fully supporting multiple in-kernel hypervisors. After extracting/rebasing
almost all of the internal code, I realized that fully supporting another hypervisor
without actually having another hypervisor in-hand (and that we want to support
upstream) is an absurd amount of churn and complexity for no practical benefit.

For the internal implementation, we had punted on a few key annoyances because they
were way out-of-scope for our use case. Sadly, it's not even the hardware side of
things that adds the most complexity; it's all the paravirt crud. Specifically,
KVM's support for various Hyper-V features, e.g. Hyper-V's Enlightened VMCS (eVMCS),
used when KVM is running as a VM on Hyper-V, or on KVM emulating Hyper-V.

Whether or not to use eVMCS is in some ways a system wide policy decision (currently
it's a KVM module param), because the control state is per CPU. Our internal approach
was to limit support for multiple KVM modules to bare metal kernel configurations,
but that doesn't really fly when trying to make a truly generic hypervisor framework.

I'm pretty sure we could solve that by having KVM purge the eVMCS pointer when a
vCPU is scheduled out (to avoid another hypervisor unintentionally using KVM's eVMCS),
but that's a whole new pile of complexity, especially if we tried to have proactive
clearing limited to eVMCS (currently, KVM lazily does VMCLEAR via an IPI shootdown
if a VMCS needs to be migrated to a different pCPU; it's easy enough to have KVM
eagerly do VMCLEAR (on sched out), but it's not obvious that it'd be a performance
win).
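
For reference, a rough sketch of the lazy vs. eager tradeoff (loosely paraphrased;
the struct/function names only approximate KVM's, and do_vmclear_ipi is a stand-in
for KVM's actual IPI handler):

/*
 * Lazy (today): the VMCS stays "current" on its last pCPU and is only
 * VMCLEARed, via IPI, when it needs to migrate to a different pCPU.
 */
static void loaded_vmcs_migrate(struct loaded_vmcs *v, int new_cpu)
{
        if (v->cpu != -1 && v->cpu != new_cpu)
                smp_call_function_single(v->cpu, do_vmclear_ipi, v, 1);
        v->cpu = new_cpu;
}

/*
 * Eager: VMCLEAR on sched-out so nothing (including an eVMCS pointer)
 * is left resident on the pCPU.  Whether the extra VMCLEARs are a net
 * win is the open question.
 */
static void loaded_vmcs_sched_out(struct loaded_vmcs *v)
{
        vmcs_clear(v->vmcs);
        v->cpu = -1;
}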

There are a few other things of a similar nature, where it's not _too_ hard to
massage some piece of functionality into common infrastructure, but as I was
mentally working through them, I kept thinking "this isn't going to end well".

When we were specifically targeting multiple KVM instances, it wasn't anywhere
near as scary, because while each KVM instance could obviously run different
code, the general architecture and wants of each active hypervisor would be the
same.

E.g. whether or not to enable SNP's new ciphertext hiding features, and how many
ASIDs to reserve for SNP+ VMs, is another system wide policy decision that is a
non-issue when the goal is purely to be able to run different versions of the same
hypervisor. But it's less clear cut how that should be handled when the "other"
hypervisor is an unknown, faceless entity.
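
Purely to illustrate the shape of that problem (hypothetical sketch, not KVM
code): hardware provides a single ASID range, and *somebody* has to decide,
system wide, how it gets carved up.

/*
 * Hypothetical.  With one hypervisor, the reservation is just a module
 * param; with an unknown second hypervisor, it's unclear who owns the
 * knob or how the two would negotiate.
 */
static unsigned int snp_asid_reserve;   /* admin-chosen policy */

static void partition_asids(unsigned int min_asid, unsigned int max_asid,
                            unsigned int *snp_max, unsigned int *sev_min)
{
        *snp_max = min_asid + snp_asid_reserve - 1;
        *sev_min = *snp_max + 1;
        if (*sev_min > max_asid + 1)
                *sev_min = max_asid + 1;  /* over-reserved; no legacy SEV ASIDs left */
}
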
Ditto for the "obvious" things like the user-return MSR framework. That "needs"
to be system wide because the current value in hardware is per-CPU, but in quotes
because it presupposes that another hypervisor would even want to play the same
games.
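
For reference, the rough shape of that framework (loosely paraphrased, not the
actual KVM code): the hypervisor writes guest values into hardware and only
restores the userspace/host values when the CPU actually returns to userspace,
which is why the tracking is inherently per-CPU and thus "system wide".

#define NR_URET_MSRS   8                /* arbitrary for the sketch */

struct uret_msr {
        u32 msr;
        u64 host_value;
        u64 curr_value;
};

struct uret_msrs {
        struct uret_msr vals[NR_URET_MSRS];
};

static DEFINE_PER_CPU(struct uret_msrs, uret_msrs);

static void set_uret_msr(int slot, u64 value)
{
        struct uret_msr *m = &this_cpu_ptr(&uret_msrs)->vals[slot];

        if (value != m->curr_value) {
                wrmsrl(m->msr, value);
                m->curr_value = value;
        }
        /* m->host_value is restored by a return-to-userspace callback. */
}
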
So, while it's definitely doable to separate the core virtualization bits from
KVM, I think the added complexity (in code and in the mental model) would make
it all _extremely_ premature infrastructure.
> Main goal is not regress any sequencing around VMX.
Ya, that requires some care.

A big takeaway/revelation for me from going through the above exercise of
trying and failing (well, giving up) to extract _all_ system-wide virtualization
functionality is that not only is trying to generically support other hypervisors
undesirable, but if we commit to NOT doing that, then we really can special case
the low level VMXON + EFER.SVME logic.

What I mean by that is we don't need a fully generic, standalone module to provide
VMXON+VMXOFF, we only need common tracking. KVM is the only module user, and TDX
is a super special snowflake that already needs to provide a pile of its own
functionality. KVM (obviously) already has the necessary hooks to handle things
like CPU bringup/teardown and reboot, and adding them to the TDX subsystem code is
easy enough.

And more importantly, except for VMXON+VMXOFF, the _only_ requirements TDX brings
to the table are one-shot, unidirectional things (e.g. TDH_SYS_LP_INIT). I.e.
there's no need for multi-user notifier chains, because the core code can blast
VMXOFF in an emergency, and clean reboots can be easily handled through syscore.
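
A very loose sketch of that shape (hypothetical names, details will differ from
the actual patches):

/* Core x86 virt code; KVM and the TDX code are the only users. */
static DEFINE_PER_CPU(bool, virt_vmxon_done);

void x86_virt_mark_vmxon(bool on)               /* hypothetical */
{
        this_cpu_write(virt_vmxon_done, on);
}

/* Emergency reboot: just blast VMXOFF, no multi-user notifier chain. */
void x86_virt_emergency_vmxoff(void)
{
        if (this_cpu_read(virt_vmxon_done)) {
                cpu_vmxoff();           /* existing low-level helper, or equivalent */
                this_cpu_write(virt_vmxon_done, false);
        }
}

/* Clean reboot/shutdown goes through syscore. */
static void x86_virt_shutdown(void)
{
        x86_virt_emergency_vmxoff();
}

static struct syscore_ops x86_virt_syscore_ops = {
        .shutdown = x86_virt_shutdown,
};
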
I've got patches that appear to work (writing the code didn't take long at all
once I pieced together the idea). I'll post an RFC and we can go from there.