From: "Michael S. Tsirkin" <mst@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: "Wang, Wei W" <wei.w.wang@intel.com>,
Nitesh Narayan Lal <nitesh@redhat.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"pbonzini@redhat.com" <pbonzini@redhat.com>,
"lcapitulino@redhat.com" <lcapitulino@redhat.com>,
"pagupta@redhat.com" <pagupta@redhat.com>,
"yang.zhang.wz@gmail.com" <yang.zhang.wz@gmail.com>,
"riel@surriel.com" <riel@surriel.com>,
"dodgen@google.com" <dodgen@google.com>,
"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
"dhildenb@redhat.com" <dhildenb@redhat.com>,
"aarcange@redhat.com" <aarcange@redhat.com>
Subject: Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
Date: Wed, 13 Feb 2019 14:08:51 -0500
Message-ID: <20190213140734-mutt-send-email-mst@kernel.org>
In-Reply-To: <a0035ddf-c871-b4e5-ca56-61da5c192dac@redhat.com>
On Wed, Feb 13, 2019 at 06:59:24PM +0100, David Hildenbrand wrote:
> >>>
> >>>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >>>> candidates for removal and if the host is low on memory, only scanning the
> >>>> guest page tables is sufficient to free up memory.
> >>>>
> >>>> But both points might just be an implementation detail in the example you
> >>>> describe.
> >>>
> >>> Yes, it is an implementation detail. I think DONTNEED would be easier
> >>> for the first step.
> >>>
> >>>>
> >>>>>
> >>>>> In above 2), get_free_page_hints clears the bits which indicates that those
> >>>> pages are not ready to be used by the guest yet. Why?
> >>>>> This is because 3) will unmap the underlying physical pages from EPT.
> >>>> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >>>> faults will get a new host page to set up the related EPT entry. If guest uses
> >>>> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >>>> violation happens and the guest will use the same physical page that will be
> >>>> unmapped and given to other host threads. So we need to make sure that
> >>>> the guest free page is usable only after step 3 finishes.
> >>>>>
> >>>>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> >>>> means step 2) above has happened and step 4) hasn't been reached. In this
> >>>> case, we can either have arch_alloc_page() busywaiting a bit till 4) is done
> >>>> for that page Or better to have a balloon callback which prioritize 3) and 4)
> >>>> to make this page usable by the guest.
> >>>>
> >>>> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >>>> page (along with other pages) is just being freed by the hypervisor.
> >>>> It has to busy-wait, no chance to prioritize.
> >>>
> >>> I meant this:
> >>> With this approach, essentially the free pages have 2 states:
> >>> ready free page: the page is on the free list and it has "1" in the bitmap
> >>> non-ready free page: the page is on the free list and it has "0" in the bitmap
> >>> Ready free pages are those who can be allocated to use.
> >>> Non-ready free pages are those who are in progress of being reported to
> >>> host and the related EPT mapping is about to be zapped.
> >>>
> >>> The non-ready pages are inserted into the report_vq and waiting for the
> >>> host to zap the mappings one by one. After the mapping gets zapped
> >>> (which means the backing host page has been taken away), host acks to
> >>> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
> >>
> >> Yes, that's how I understood your approach. The interesting part is
> >> where somebody finds a buddy page and wants to allocate it.
> >>
> >>>
> >>> So the non-ready free page may happen to be used when they are waiting in
> >>> the report_vq to be handled by the host to zap the mapping, balloon could
> >>> have a fast path to notify the host:
> >>> "page 0x1000 is about to be used, don’t zap the mapping when you get
> >>> 0x1000 from the report_vq" /*option [1] */
> >>
> >> This requires coordination and in any case there will be a scenario
> >> where you have to wait for the hypervisor to eventually finish a madv
> >> call. You can just try to make that scenario less likely.
> >>
> >> What you propose is synchronous in the worst case. Getting pages of the
> >> buddy makes it possible to have it done completely asynchronous. Nobody
> >> allocating a page has to wait.
> >>
> >>>
> >>> Or
> >>>
> >>> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> >>> so that the free page will be marked as ready free page and the guest can use it".
> >>> This option will generate an extra EPT violation and QEMU page fault to get a new host
> >>> page to back the guest ready free page.
> >>
> >> Again, coordination with the hypervisor while allocating a page. That is
> >> to be avoided in any case.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> Using bitmaps to record free page hints don't need to take the free pages
> >>>> off the buddy list and return them later, which needs to go through the long
> >>>> allocation/free code path.
> >>>>>
> >>>>
> >>>> Yes, but it means that any process is able to get stuck on such a page for as
> >>>> long as it takes to report the free pages to the hypervisor and for it to call
> >>>> madvise(pfn_start, DONTNEED) on any such page.
> >>>
> >>> This only happens when the guest thread happens to get allocated on a page which is
> >>> being reported to the host. Using option [1] above will avoid this.
> >>
> >> I think getting pages out of the buddy system temporarily is the only
> >> way we can avoid somebody else stumbling over a page currently getting
> >> reported by the hypervisor. Otherwise, as I said, there are scenarios
> >> where an allocating VCPU has to wait for the hypervisor to finish the
> >> "freeing" task. While you can try to "speedup" that scenario -
> >> "hypervisor please prioritize" you cannot avoid it. There will be busy
> >> waiting.
> >
> > Right - there has to be waiting. But it does not have to be busy -
> > if you can defer page use until interrupt, that's one option.
> > Further if you are ready to exit to hypervisor it does not have to be
> > busy waiting. In particular right now virtio does not have a capability
> > to stop queue processing by device. We could add that if necessary. In
> > that case, you would stop queue and detach buffers. It is already
> > possible by resetting the balloon. Naturally there is no magic - you
> > exit to hypervisor and block there. It's not all that great
> > in that VCPU does not run at all. But it is not busy waiting.
>
> Of course, you can always yield to the hypervisor and not call it busy
> waiting. From the guest point of view, it is busy waiting. The VCPU is
> not making progress. If I am not wrong, one can easily construct examples
> where all VCPUs in the guest are waiting for the hypervisor to
> madv(dontneed) pages. I don't like that approach.
>
> Especially if temporarily getting pages out of the buddy resolves these
> issues and seems to work.
Well, the hypervisor can send a signal and interrupt the dontneed work.
But yes I prefer not blocking the VCPU too.
I also prefer MADV_FREE generally.
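
For readers less familiar with the two madvise(2) modes mentioned here, a
minimal userspace sketch follows. It is not code from the patch series, and
the helper name discard_and_peek() is made up for illustration. It assumes
Linux and a private anonymous mapping. With MADV_DONTNEED the next read
faults in a fresh zero page. With MADV_FREE (Linux 4.5 and later) the kernel
may reclaim the page lazily, so the read can return either the old contents
or zero.

```c
#define _DEFAULT_SOURCE /* MAP_ANONYMOUS and madvise() on glibc */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: dirty one anonymous page, apply the given
 * madvise() advice, then return the first byte read back.
 * Returns -1 on any error.
 */
int discard_and_peek(int advice)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);              /* make sure the page is populated */

    if (madvise(p, len, advice) != 0) {
        munmap(p, len);
        return -1;
    }

    int first = p[0];                  /* re-touch the page after the hint */
    munmap(p, len);
    return first;
}
```

Calling discard_and_peek(MADV_DONTNEED) should give 0, because the backing
page is dropped immediately. The same call with MADV_FREE can give 0xAB or 0
depending on whether the host has reclaimed the page yet.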
>
> --
>
> Thanks,
>
> David / dhildenb
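
The ready / non-ready bitmap scheme debated in the quoted text can be modeled
in a few lines of C. This is only a sketch under stated assumptions, not the
code from the series: the 64-page bitmap and the function names
mark_free_ready(), start_report(), host_ack() and try_alloc() are all
hypothetical. A set bit means "free and ready", a cleared bit on a free page
means "being reported to the host".

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 64                      /* toy size: one 64-bit word */

static _Atomic uint64_t ready_bitmap;  /* bit set => ready free page */

/* Atomically clear the page's bit; report whether it was set before. */
static bool test_and_clear(size_t pfn)
{
    if (pfn >= NPAGES)
        return false;
    uint64_t mask = UINT64_C(1) << pfn;
    return (atomic_fetch_and(&ready_bitmap, ~mask) & mask) != 0;
}

/* Atomically set the page's bit. */
static void set_ready(size_t pfn)
{
    if (pfn < NPAGES)
        atomic_fetch_or(&ready_bitmap, UINT64_C(1) << pfn);
}

/* Page was freed by the guest: it is a ready free page. */
void mark_free_ready(size_t pfn) { set_ready(pfn); }

/* Step 2: the page is queued for reporting, so it becomes non-ready. */
bool start_report(size_t pfn) { return test_and_clear(pfn); }

/* Step 4: the host zapped the mapping and acked; the page is ready again. */
void host_ack(size_t pfn) { set_ready(pfn); }

/*
 * Allocation-time check: true if the page was ready (bit now cleared).
 * false means the page is still in flight to the host, which is the
 * case where the allocating VCPU would have to wait.
 */
bool try_alloc(size_t pfn) { return test_and_clear(pfn); }
```

The false return from try_alloc() is exactly the window David objects to:
between start_report() and host_ack() an allocator that lands on the page has
nothing useful to do. Taking pages out of the buddy allocator while they are
being reported removes that window, because no allocator can find them.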