All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com
Subject: Re: [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU
Date: Tue, 5 Sep 2023 13:18:23 -0700	[thread overview]
Message-ID: <ZPeND9WFHR2Xx8BM@google.com> (raw)
In-Reply-To: <ZPWBM5DDC6MKINUe@yzhao56-desk.sh.intel.com>

On Mon, Sep 04, 2023, Yan Zhao wrote:
> ...
> > > Actually, I don't even completely understand how you're seeing CoW behavior in
> > > the first place.  No sane guest should blindly read (or execute) uninitialized
> > > memory.  IIUC, you're not running a Windows guest, and even if you are, AFAIK
> > > QEMU doesn't support Hyper-V's enlightment that lets the guest assume memory has
> > > been zeroed by the hypervisor.  If KSM is to blame, then my answer it to turn off
> > > KSM, because turning on KSM is antithetical to guest performance (not to mention
> > > that KSM is wildly insecure for the guest, especially given the number of speculative
> > > execution attacks these days).
> > I'm running a linux guest.
> > KSM is not turned on both in guest and host.
> > Both guest and host have turned on transparent huge page.
> > 
> > The guest first reads a GFN in a writable memslot (which is for "pc.ram"),
> > which will cause
> >     (1) KVM first sends a GUP without FOLL_WRITE, leaving a huge_zero_pfn or a zero-pfn
> >         mapped.
> >     (2) KVM calls get_user_page_fast_only() with FOLL_WRITE as the memslot is writable,
> >         which will fail
> > 
> > The guest then writes the GFN.
> > This step will trigger (huge pmd split for huge page case) and .change_pte().
> > 
> > My guest is surely a sane guest. But currently I can't find out why
> > certain pages are read before write.
> > Will return back to you the reason after figuring it out after my long vacation.
> Finally I figured out the reason.
> 
> Except 4 pages were read before written from vBIOS (I just want to skip finding
> out why vBIOS does this), the remaining thousands of pages were read before
> written from the guest Linux kernel.
> 
> If the guest kernel were configured with "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" or
> "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", or booted with param "init_on_alloc=1" or
> "init_on_free=1", this read before written problem goes away.
> 
> However, turning on those configs has side effects as said in kernel config
> message:
> "all page allocator and slab allocator memory will be zeroed when allocated,
> eliminating many kinds of "uninitialized heap memory" flaws, especially
> heap content exposures. The performance impact varies by workload, but most
> cases see <1% impact. Some synthetic workloads have measured as high as 7%."
> 
> If without the above two configs, or if with init_on_alloc=0 && init_on_free=0,
> the root cause for all the reads of uninitialized heap memory are related to

Yeah, forcing the guest to pre-initialize all memory is a hack-a-fix and not a
real solution.

> page cache pages of the guest virtual devices (specifically the virtual IDE
> device in my case).

Why are you using IDE?  IDE is comically slow compared to VirtIO, and VirtIO has
been broadly supported for something like 15 years, even on Windows.

> The reason for this unconditional read of page into bounce buffer
> (caused by "swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE)")
> is explained in the code:
> 
> /*
>  * When dir == DMA_FROM_DEVICE we could omit the copy from the orig
>  * to the tlb buffer, if we knew for sure the device will
>  * overwrite the entire current content. But we don't. Thus
>  * unconditional bounce may prevent leaking swiotlb content (i.e.
>  * kernel memory) to user-space.
>  */
> 
> If we neglect this risk and do changes like
> -       swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> +       if (dir != DMA_FROM_DEVICE)
> +               swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
> 
> the issue of pages read before written from guest kernel just went away.
> 
> I don't think it's a swiotlb bug, because to prevent leaking swiotlb
> content, if target page content is not copied firstly to the swiotlb's
> bounce buffer, then the bounce buffer needs to be initialized to 0.
> However, swiotlb_tbl_map_single() does not know whether the target page
> is initialized or not. Then, it would cause page content to be trimmed
> if device does not overwrite the entire memory.
> 
> > 
> > > 
> > > If there's something else going on, i.e. if your VM really is somehow generating
> > > reads before writes, and if we really want to optimize use cases that can't use
> > > hugepages for whatever reason, I would much prefer to do something like add a
> > > memslot flag to state that the memslot should *always* be mapped writable.  Because
> > Will check if this flag is necessary after figuring out the reason.
> As explained above, I think it's a valid and non-rare practice in guest kernel to
> cause read of uninitialized heap memory.

Heh, for some definitions of valid.  

> And the host admin may not know exactly when it's appropriate to apply the
> memslot flag.

Yeah, a memslot flag is too fine-grained.

> Do you think it's good to make the "always write_fault = true" solution enabled
> by default?

Sadly, probably not, because that would regress setups that do want to utilize
CoW, e.g. I'm pretty sure requesting everything to be writable would be a big
negative for KSM.

I do think we should add a KVM knob though.  Regardless of the validity or frequency
of the guest behavior, and even though userspace can also workaround this by
preallocating guest memory, I am struggling to think of any reason outside of KSM
where CoW semantics are desirable.

Ooh, actually, maybe we could do

	static bool <name_tbd> = !IS_ENABLED(CONFIG_KSM);

and then cross our fingers that that doesn't regress some other funky setups.

  parent reply	other threads:[~2023-09-05 20:19 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-08  8:50 [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU Yan Zhao
2023-08-08  8:53 ` [PATCH 1/2] KVM: x86/mmu: Remove dead code in .change_pte() handler in x86 " Yan Zhao
2023-08-08  8:54 ` [PATCH 2/2] KVM: x86/mmu: prefetch SPTE directly in x86 TDP MMU's change_pte() handler Yan Zhao
2023-08-16 18:18 ` [PATCH 0/2] KVM: x86/mmu: .change_pte() optimization in TDP MMU Sean Christopherson
2023-08-17  0:00   ` Yan Zhao
2023-08-17 17:53     ` Sean Christopherson
2023-08-18 10:17       ` Yan Zhao
2023-08-18 13:46         ` Sean Christopherson
2023-09-04  7:03         ` Yan Zhao
2023-09-05 18:59           ` Sean Christopherson
2023-09-05 19:30             ` Linus Torvalds
2023-09-06  0:29             ` Robin Murphy
2023-09-06 14:44               ` Sean Christopherson
2023-09-06 16:18                 ` Robin Murphy
2023-09-06 16:46                   ` Sean Christopherson
2023-09-08  8:18                   ` Christoph Hellwig
2023-09-05 20:18           ` Sean Christopherson [this message]
2023-09-06  1:51             ` Yan Zhao
2023-09-06 22:17             ` Paolo Bonzini
2023-09-07  0:51               ` Sean Christopherson
2023-09-07  0:36                 ` Yan Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZPeND9WFHR2Xx8BM@google.com \
    --to=seanjc@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.