public inbox for kvm@vger.kernel.org
From: Sean Christopherson <seanjc@google.com>
To: Mathias Krause <minipli@grsecurity.net>
Cc: kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit
Date: Thu, 30 Mar 2023 10:12:34 -0700
Message-ID: <ZCXDAiUOnsL3fRBj@google.com>
In-Reply-To: <677169b4-051f-fcae-756b-9a3e1bb9f8fe@grsecurity.net>

On Thu, Mar 30, 2023, Mathias Krause wrote:
> On 22.03.23 02:37, Mathias Krause wrote:
> > Guests like grsecurity that make heavy use of CR0.WP to implement kernel
> > level W^X will suffer from the implied VMEXITs.
> > 
> > With EPT there is no need to intercept a guest change of CR0.WP, so
> > simply make it a guest owned bit if we can do so.
> > 
> > This implies that a read of a guest's CR0.WP bit might need a VMREAD.
> > However, the only potentially affected user seems to be kvm_init_mmu()
> > which is a heavy operation to begin with. But also most callers already
> > cache the full value of CR0 anyway, so no additional VMREAD is needed.
> > The only exception is nested_vmx_load_cr3().
> > 
> > This change is VMX-specific, as SVM has no such fine grained control
> > register intercept control.
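For reference, the mechanism this relies on (per the Intel SDM) works as follows. Each CR0 bit is either host-owned or guest-owned. A host-owned bit is set in the CR0 guest/host mask: guest reads of it return the CR0 read shadow, and a MOV to CR0 that would give it a value different from the shadow causes a VM exit. A guest-owned bit is clear in the mask, so the guest reads and writes GUEST_CR0 directly. Making WP guest-owned therefore means clearing bit 16 in the mask, which is also why KVM may need a VMREAD to learn the current WP value. The following is a minimal bash sketch of those rules, not KVM code. The function names are illustrative only, and 0x80050033 is just a typical Linux CR0 value (PG, AM, WP, NE, ET, MP and PE set):

```shell
#!/usr/bin/env bash
# Illustrative model of the VMX CR0 guest/host mask rules (not KVM code).
# Mask bit set   => host-owned: reads see the read shadow, changes may exit.
# Mask bit clear => guest-owned: reads/writes hit GUEST_CR0 directly.
readonly CR0_WP=$(( 1 << 16 ))

# CR0 value a guest "MOV from CR0" observes: host-owned bits come from the
# read shadow, guest-owned bits from the real GUEST_CR0 field.
guest_visible_cr0() {
    local guest_cr0=$1 shadow=$2 mask=$3
    echo $(( (guest_cr0 & ~mask) | (shadow & mask) ))
}

# Succeeds (returns 0) iff a guest "MOV to CR0" of new_val would VM exit,
# i.e. some host-owned bit differs from the read shadow.
mov_to_cr0_exits() {
    local new_val=$1 shadow=$2 mask=$3
    (( ((new_val ^ shadow) & mask) != 0 ))
}
```

With WP host-owned (mask set to $CR0_WP), clearing WP by going from 0x80050033 to 0x80040033 exits. With WP guest-owned (mask 0), it does not, and that VM exit is the one this patch avoids.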
> 
> Just a heads up! We did more tests, especially with the backports we did
> internally already, and ran into a bug when running a nested guest on an
> ESXi host.
> 
> Setup is like: ESXi (L0) -> Linux (L1) -> Linux (L2)
> 
> The Linux system, especially the kernel, is the same for L1 and L2. It's
> a grsecurity kernel, so makes use of toggling CR0.WP at runtime.
> 
> The bug we see is that when L2 disables CR0.WP and tries to write to an
> r/o memory region (implicitly to the r/o GDT via LTR in our use case),
> this triggers a fault (EPT violation?) that gets ignored by L1, as,
> apparently, everything is fine from its point of view.

It's not an EPT violation if there's a triple fault.  Given that you're dumping
the VMCS from handle_triple_fault(), I assume that L2 gets an unexpected #PF that
leads to a triple fault in L2.
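To tell the two cases apart from a raw VM_EXIT_REASON value (for instance, one taken from a VMCS dump), note that bits 15:0 hold the basic exit reason. Per the Intel SDM's exit-reason table, 2 is a triple fault and 48 is an EPT violation. The sketch below maps only a handful of reasons and uses a made-up function name. It assumes the value is given in hex, with or without a 0x prefix:

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode the basic exit reason (bits 15:0) of a raw
# VMX VM_EXIT_REASON value given in hex; upper bits are flags.
decode_exit_reason() {
    local raw=${1#0x}                      # accept "00000030" or "0x30"
    local basic=$(( 16#$raw & 0xffff ))    # force base 16, keep bits 15:0
    case $basic in
        0)  echo "EXCEPTION_NMI" ;;
        2)  echo "TRIPLE_FAULT" ;;
        48) echo "EPT_VIOLATION" ;;
        49) echo "EPT_MISCONFIG" ;;
        *)  echo "OTHER($basic)" ;;
    esac
}
```

The explicit 16# base matters because bash would otherwise read a zero-padded value like 00000030 as octal.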

Just to make sure we're on the same page: L1 is still alive after this, correct?

> I suspect the L0 VMM is at fault here, as the VMCS structures look
> good, IMO. Here is a dump of vmx->loaded_vmcs in handle_triple_fault():

...

> The "host" (which is our L1 VMM, I guess) has CR0.WP enabled and that is
> what I think confuses ESXi into enforcing the read-only property on the L2
> guest as well -- for unknown reasons so far.

...

> I tried to reproduce the bug with different KVM-based L0 VMMs (with and
> without this series; vanilla and grsecurity kernels) but had no luck. That's
> why I suspect an ESXi bug.
   
...

> I'm leaning towards making CR0.WP guest owned only iff we're running on bare
> metal or the VMM is KVM to not play whack-a-mole for all the VMMs that
> might have similar bugs. (Will try to test a few others here as well.)
> However, that would prevent them from getting the performance gain, so
> I'd rather have this fixed / worked around in KVM instead.

Before we start putting bandaids on this, let's (a) confirm this appears to be
an issue with ESXi and (b) pull in VMware folks to get their input.

> Any ideas how to investigate this further?

Does the host in question support UMIP?
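A quick way to check this from inside a Linux guest is to look for the umip flag in /proc/cpuinfo. This only reflects what the kernel in that guest sees (i.e. what L0 exposes), not necessarily the physical CPU. In the sketch below, has_cpu_flag is a hypothetical helper; its optional second argument exists only so it can be pointed at a saved cpuinfo file:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a feature flag appears as a whole word in
# the first "flags" line of a cpuinfo-style file (default: /proc/cpuinfo).
has_cpu_flag() {
    local flag=$1 file=${2:-/proc/cpuinfo}
    grep -m1 '^flags' "$file" | grep -qw -- "$flag"
}
```

Usage in L1: `has_cpu_flag umip && echo "UMIP advertised"`.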

Can you capture a tracepoint log from L1 to verify that L1 KVM is _not_ injecting
any kind of exception?  E.g. to get the KVM kitchen sink:

  echo 1 > /sys/kernel/debug/tracing/tracing_on
  echo 1 > /sys/kernel/debug/tracing/events/kvm/enable

  cat /sys/kernel/debug/tracing/trace > log

Or if that's too noisy, a more targeted trace (exception injection + emulation)
would be:

  echo 1 > /sys/kernel/debug/tracing/tracing_on

  echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_emulate_insn/enable
  echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_inj_exception/enable
  echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_entry/enable
  echo 1 > /sys/kernel/debug/tracing/events/kvm/kvm_exit/enable
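The targeted sequence above can be wrapped in a few helpers. They also clear the buffer first and stop tracing before dumping it, so the log covers only the reproducer. This is a sketch built on a few assumptions:

- The function names are hypothetical.
- TRACE_DIR defaults to the debugfs path used above. Newer kernels also expose tracefs at /sys/kernel/tracing.
- Writing an empty line to the trace file clears the ring buffer, as described in the ftrace documentation.
- The kvm_exit pattern is kept loose because its exact line format varies between kernel versions.

The helpers must run as root in L1:

```shell
#!/usr/bin/env bash
# Hypothetical helpers around the tracefs commands above; run as root in L1.
TRACE_DIR=${TRACE_DIR:-/sys/kernel/debug/tracing}

# The targeted event set: emulation, exception injection, entry and exit.
KVM_EVENTS=(kvm_emulate_insn kvm_inj_exception kvm_entry kvm_exit)

kvm_trace_start() {
    local ev
    echo > "$TRACE_DIR/trace"            # empty the ring buffer first
    for ev in "${KVM_EVENTS[@]}"; do
        echo 1 > "$TRACE_DIR/events/kvm/$ev/enable"
    done
    echo 1 > "$TRACE_DIR/tracing_on"
}

kvm_trace_stop() {
    local log=$1
    echo 0 > "$TRACE_DIR/tracing_on"     # freeze the buffer before reading it
    cat "$TRACE_DIR/trace" > "$log"
}

# Number of exception injections traced; 0 means none were recorded.
count_injected_exceptions() {
    grep -c 'kvm_inj_exception:' "$1" || true
}

# Number of traced exits whose reason was decoded as TRIPLE_FAULT.
count_triple_fault_exits() {
    grep -c 'kvm_exit:.*TRIPLE_FAULT' "$1" || true
}
```

Typical flow: `kvm_trace_start`, run the L2 reproducer, then `kvm_trace_stop log` and `count_injected_exceptions log`. A zero count alongside a triple-fault exit would support the theory that L1 KVM is not injecting anything.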

> PS: ...should have left the chicken bit of v3 to be able to disable the
> feature by a module parameter ;)

A chicken bit isn't a good solution for this sort of thing.  Toggling a KVM module
param requires (a) knowing that it exists and (b) knowing the conditions under which
it is/isn't safe to toggle the bit.

E.g. if this ends up being an ESXi L0 bug, then an option might be to add something
in vmware_platform_setup() to communicate the bug to KVM so that KVM can precisely
disable the optimization on affected platforms.
