Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU

From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com
Subject: Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU
Date: Tue, 5 Sep 2023 13:18:23 -0700
Message-ID: <ZPeND9WFHR2Xx8BM@google.com>
In-Reply-To: <ZPWBM5DDC6MKINUe@yzhao56-desk.sh.intel.com>

On Mon, Sep 04, 2023, Yan Zhao wrote:
> ...
> > > Actually, I don't even completely understand how you're seeing CoW behavior in
> > > the first place.  No sane guest should blindly read (or execute) uninitialized
> > > memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > > QEMU doesn't support Hyper-V's enlightenment that lets the guest assume memory has
> > > been zeroed by the hypervisor.  If KSM is to blame, then my answer is to turn off
> > > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > > that KSM is wildly insecure for the guest, especially given the number of speculative
> > > execution attacks these days).
> > I'm running a Linux guest.
> > KSM is not turned on in either the guest or the host.
> > Both guest and host have transparent huge pages enabled.
> > 
> > The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> > which will cause
> >     (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
> >         mapped.
> >     (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
> >         which will fail
> > 
> > The guest then writes the GFN.
> > This step will trigger .change_pte() (preceded by a huge PMD split in the huge page case).
> > 
> > My guest is surely a sane guest. But currently I can't find out why
> > certain pages are read before being written.
> > I will get back to you with the reason once I figure it out, after my long vacation.
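
For readers following along, the read-then-write pattern described above has a
simple userspace analogue. The sketch below is not KVM code, and the helper
name is made up for illustration. It assumes a Linux host, where a read fault
on fresh private anonymous memory is typically backed by the shared zero page
and the later write forces a CoW break onto a private page. In the VM case it
is that second step that would reach .change_pte().

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): touch a fresh anonymous page
 * read-first, then write it.  Returns 0 if the page read back as zero and
 * the subsequent write stuck, -1 on any failure.
 */
int read_then_write_anon_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	volatile unsigned char *p;
	unsigned char before, after;
	void *map;

	if (page_size <= 0)
		return -1;

	map = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	p = map;

	/* Read fault: Linux typically maps the shared zero page read-only. */
	before = p[0];

	/* Write fault: the zero page is replaced by a private page (CoW). */
	p[0] = 0xa5;
	after = p[0];

	munmap(map, (size_t)page_size);
	return (before == 0 && after == 0xa5) ? 0 : -1;
}
```

Calling read_then_write_anon_page() from a test program and checking for 0 is
enough to exercise both faults; observing the actual PFN change would need
/proc/self/pagemap and is left out here.
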
> Finally I figured out the reason.
> 
> Except for 4 pages that were read before being written by the vBIOS (I just
> want to skip finding out why the vBIOS does this), the remaining thousands of
> pages were read before being written by the guest Linux kernel.
> 
> If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
> "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or
> "init_on_free=1", this read-before-write problem goes away.
> 
> However, turning on those configs has side effects, as stated in the kernel
> config help text:
> "all page allocator and slab allocator memory will be zeroed when allocated,
> eliminating many kinds of "uninitialized heap memory" flaws, especially
> heap content exposures. The performance impact varies by workload, but most
> cases see <1% impact. Some synthetic workloads have measured as high as 7%."
> 
> Without the above two configs, or with init_on_alloc=0 && init_on_free=0,
> the root cause of all the reads of uninitialized heap memory is related to

Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a
real solution.

> page cache pages of the guest virtual devices (specifically the virtual IDE
> device in my case).

Why are you using IDE?  IDE is comically slow compared to VirtIO, and VirtIO has
been broadly supported for something like 15 years, even on Windows.

> The reason for this unconditional read of page into bounce buffer
> (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
> is explained in the code:
> 
> /*
>  * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
>  * to the tlb buffer, if we knew for sure the device will
>  * overwrite the entire current content. But we don't. Thus
>  * unconditional bounce may prevent leaking swiotlb content (i.e.
>  * kernel memory) to user-space.
>  */
> 
> If we neglect this risk and make a change like
> -       swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> +       if (dir != DMA_FROM_DEVICE)
> +               swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> 
> the issue of pages being read before written by the guest kernel just goes away.
> 
> I don't think it's a swiotlb bug, because to prevent leaking swiotlb
> content, if the target page content is not first copied to the swiotlb's
> bounce buffer, then the bounce buffer needs to be initialized to 0.
> However, swiotlb_tbl_map_single() does not know whether the target page
> is initialized or not. Zeroing would then cause page content to be lost
> if the device does not overwrite the entire memory.
> 
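
The trade-off in that comment can be shown with a toy userspace model. This is
only a sketch of the reasoning, not swiotlb code: the function names, the
8-byte buffer, and the 0xee "stale" pattern are all assumptions made for the
example. It models map, a short device write, and the copy back on unmap, with
and without the copy into the bounce slot at map time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_LEN 8

/*
 * Toy model of a DMA_FROM_DEVICE transfer through a bounce slot.
 * 'orig' is the caller's buffer, 'bounce' is the bounce slot (which may hold
 * stale data from a previous user), and the device writes only 'dev_len'
 * bytes.  'copy_on_map' selects whether orig is copied into the slot first.
 */
void model_dma_from_device(unsigned char *orig, unsigned char *bounce,
			   const unsigned char *dev_data, size_t dev_len,
			   int copy_on_map)
{
	if (dev_len > BUF_LEN)
		dev_len = BUF_LEN;

	if (copy_on_map)
		memcpy(bounce, orig, BUF_LEN);	/* map: this is the read of orig */

	memcpy(bounce, dev_data, dev_len);	/* device overwrites a prefix only */
	memcpy(orig, bounce, BUF_LEN);		/* unmap: copy the slot back */
}

/* Returns 1 if stale bounce bytes end up in the caller's buffer, else 0. */
int stale_data_leaks(int copy_on_map)
{
	unsigned char orig[BUF_LEN], bounce[BUF_LEN];
	const unsigned char dev_data[4] = { 1, 2, 3, 4 };
	size_t i;

	memset(orig, 0x00, sizeof(orig));
	memset(bounce, 0xee, sizeof(bounce));	/* stale content in the slot */

	model_dma_from_device(orig, bounce, dev_data, sizeof(dev_data),
			      copy_on_map);

	for (i = sizeof(dev_data); i < BUF_LEN; i++)
		if (orig[i] == 0xee)
			return 1;
	return 0;
}
```

With copy_on_map set, the tail of the caller's buffer keeps its original
bytes, at the cost of reading the (possibly never written) page at map time.
Without it, the tail picks up whatever was left in the slot.
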
> > 
> > > 
> > > If there's something else going on, i.e. if your VM really is somehow generating
> > > reads before writes, and if we really want to optimize use cases that can't use
> > > hugepages for whatever reason, I would much prefer to do something like add a
> > > memslot flag to state that the memslot should *always* be mapped writable.  Because
> > Will check if this flag is necessary after figuring out the reason.
> As explained above, I think it's a valid and not uncommon practice for the
> guest kernel to read uninitialized heap memory.

Heh, for some definitions of valid.  

> And the host admin may not know exactly when it's appropriate to apply the
> memslot flag.

Yeah, a memslot flag is too fine-grained.

> Do you think it's good to make the "always write_fault = true" solution enabled
> by default?

Sadly, probably not, because that would regress setups that do want to utilize
CoW, e.g. I'm pretty sure requesting everything to be writable would be a big
negative for KSM.

I do think we should add a KVM knob though.  Regardless of the validity or frequency
of the guest behavior, and even though userspace can also work around this by
preallocating guest memory, I am struggling to think of any reason outside of KSM
where CoW semantics are desirable.

Ooh, actually, maybe we could do

	static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);

and then cross our fingers that that doesn't regress some other funky setups.
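
To make the idea a bit more concrete, here is a userspace model of how such a
knob could feed into the "always write_fault = true" behavior discussed above.
This is one possible reading, not KVM code: the knob name force_write_fault,
the MODEL_CONFIG_KSM macro (a stand-in for IS_ENABLED(CONFIG_KSM)), and the
helper function are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for IS_ENABLED(CONFIG_KSM); define as 1 to model a KSM build. */
#ifndef MODEL_CONFIG_KSM
#define MODEL_CONFIG_KSM 0
#endif

/* Hypothetical knob; the real name is left as <name_tbd> in the thread. */
static bool force_write_fault = !MODEL_CONFIG_KSM;

/*
 * Decide whether a guest fault should be resolved as a write fault on the
 * host side.  Guest writes always are; guest reads are upgraded only when
 * the knob is set and the memslot is writable, so no CoW break is needed
 * when the guest writes the page later.
 */
bool model_fault_as_write(bool guest_is_write, bool memslot_writable)
{
	if (guest_is_write)
		return true;

	return force_write_fault && memslot_writable;
}
```

In this model a read fault on a read-only memslot is never upgraded, and a
build with MODEL_CONFIG_KSM set to 1 keeps the existing CoW-friendly default.
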
