From: "Michael S. Tsirkin" <mst@redhat.com>
To: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: David Hildenbrand <david@redhat.com>,
	"Wang, Wei W" <wei.w.wang@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>,
	"lcapitulino@redhat.com" <lcapitulino@redhat.com>,
	"pagupta@redhat.com" <pagupta@redhat.com>,
	"yang.zhang.wz@gmail.com" <yang.zhang.wz@gmail.com>,
	"riel@surriel.com" <riel@surriel.com>,
	"dodgen@google.com" <dodgen@google.com>,
	"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
	"dhildenb@redhat.com" <dhildenb@redhat.com>,
	"aarcange@redhat.com" <aarcange@redhat.com>
Subject: Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
Date: Wed, 13 Feb 2019 12:09:15 -0500	[thread overview]
Message-ID: <20190213120733-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <6198f4b9-47ad-2647-73de-da057541c45f@redhat.com>

On Wed, Feb 13, 2019 at 07:17:13AM -0500, Nitesh Narayan Lal wrote:
> 
> On 2/13/19 4:19 AM, David Hildenbrand wrote:
> > On 13.02.19 09:55, Wang, Wei W wrote:
> >> On Tuesday, February 12, 2019 5:24 PM, David Hildenbrand wrote:
> >>> Global means all VCPUs will potentially be competing for a single lock when
> >>> freeing/allocating a page, no? What if you have 64 VCPUs allocating/freeing
> >>> memory like crazy?
> >> I think the key point is that the 64 vcpus won't allocate/free on the same page simultaneously, so no need to have a global big lock, isn’t it?
> >> I think atomic operations on the bitmap would be enough.
> > If you have to resize/alloc/coordinate who will report, you will need
> > locking. Especially, I doubt that there is an atomic xbitmap (prove me
> > wrong :) ).
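For what it's worth, per-bit operations on a plain fixed-size bitmap are lock-free, which is the property Wei seems to be relying on; it is the resizing that an xbitmap would need where locking becomes unavoidable. A minimal userspace sketch, with C11 atomics standing in for the kernel's test_and_set_bit()/clear_bit():

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace sketch only: a fixed-size bitmap with lock-free per-bit ops.
 * Growing such a structure cannot be done with a single atomic op, which
 * is where a resizable xbitmap would need locking. */
#define BITS_PER_WORD (8 * sizeof(unsigned long))
static _Atomic unsigned long bitmap[4]; /* 256 bits, all initially 0 */

/* Atomically set a bit; return its previous value. */
static bool atomic_test_and_set(unsigned int bit)
{
    unsigned long mask = 1UL << (bit % BITS_PER_WORD);
    unsigned long old = atomic_fetch_or(&bitmap[bit / BITS_PER_WORD], mask);
    return old & mask;
}

/* Atomically clear a bit. */
static void atomic_clear_bit(unsigned int bit)
{
    atomic_fetch_and(&bitmap[bit / BITS_PER_WORD],
                     ~(1UL << (bit % BITS_PER_WORD)));
}
```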
> >
> >>> (I assume some kind of locking is required even if the bitmap would be
> >>> atomic. Also, doesn't xbitmap mean that we eventually have to allocate
> >>> memory at places where we don't want to - e.g. from arch_free_page ?)
> >> arch_free_page is called from free_pages_prepare, why can't we have memory allocation there?
> > I remember we were stumbling over some issues that were non-trivial. I
> > am not 100% sure yet anymore, but allocating memory while deep down in
> > the freeing part of MM core smells like "be careful".
> >
> >> It would also be doable to find a preferred place to preallocate some amount of memory for the bitmap.
> > That makes things very ugly. Especially, preallocation will most likely
> > require locking.
> >
> >>> That's the big benefit of taking the pages of the buddy free list. Other VCPUs
> >>> won't stumble over them, waiting for them to get freed in the hypervisor.
> >> As also mentioned above, I think other vcpus will not allocate/free on the same page that is in progress of being allocated/freed.
> > If a page is in the buddy but stuck in some other bitmap, there is
> > nothing stopping another VCPU from trying to allocate it. Nitesh has
> > been fighting with this problem already :)
> >
> >>> This sounds more like "the host requests to get free pages once in a while"
> >>> compared to "the host is always informed about free pages". At the time
> >>> where the host actually has to ask the guest (e.g. because the host is low on
> >>> memory), it might be to late to wait for guest action.
> >> Option 1: Host asks for free pages:
> >> Not necessary to ask only when the host has been in memory pressure.
> >> This could be the orchestration layer's job to monitor the host memory usage.
> >> For example, people could set the condition "when 50% of the host memory
> >> has been used, start to ask a guest for some amount of free pages" 
> >>
> >> Option 2: Guest actively offers free pages:
> >> Add a balloon callback to arch_free_page so that whenever a page gets freed, its gfn
> >> will be filled into the balloon's report_vq and the host will take away the backing
> >> host page.
> >>
> >> Both options can be implemented. But I think option 1 would be more
> >> efficient as the guest free pages are offered on demand.  
> > Yes, but as I mentioned this has other drawbacks. Relying on a guest
> > to free up memory when you really need it is not going to work. It might
> > work for some scenarios but should not dictate the design. It is a good
> > start though if it makes things easier.
> >
> > Enabling/disabling free page hinting by the hypervisor via some
> > mechanism is, on the other hand, a good idea. "I have plenty of free
> > space, don't worry".
> >
> >>> Nitesh uses MADV_FREE here (as far as I recall :) ), to only mark pages as
> >>> candidates for removal and if the host is low on memory, only scanning the
> >>> guest page tables is sufficient to free up memory.
> >>>
> >>> But both points might just be an implementation detail in the example you
> >>> describe.
> >> Yes, it is an implementation detail. I think DONTNEED would be easier
> >> for the first step.
> >>
> >>>> In 2) above, get_free_page_hints clears the bits, which indicates that those
> >>> pages are not ready to be used by the guest yet. Why?
> >>>> This is because 3) will unmap the underlying physical pages from EPT.
> >>> Normally, when guest re-visits those pages, EPT violations and QEMU page
> >>> faults will get a new host page to set up the related EPT entry. If guest uses
> >>> that page before the page gets unmapped (i.e. right before step 3), no EPT
> >>> violation happens and the guest will use the same physical page that will be
> >>> unmapped and given to other host threads. So we need to make sure that
> >>> the guest free page is usable only after step 3 finishes.
> >>>> Back to arch_alloc_page(), it needs to check if the allocated pages
> >>>> have "1" set in the bitmap, if that's true, just clear the bits. Otherwise, it
> >>> means step 2) above has happened and step 4) hasn't been reached. In this
> >>> case, we can either have arch_alloc_page() busy-wait a bit till 4) is done
> >>> for that page, or better, have a balloon callback which prioritizes 3) and 4)
> >>> to make this page usable by the guest.
> >>>
> >>> Regarding the latter, the VCPU allocating a page cannot do anything if the
> >>> page (along with other pages) is just being freed by the hypervisor.
> >>> It has to busy-wait, no chance to prioritize.
> >> I meant this:
> >> With this approach, essentially the free pages have 2 states:
> >> ready free page: the page is on the free list and it has "1" in the bitmap
> >> non-ready free page: the page is on the free list and it has "0" in the bitmap
> >> Ready free pages are those that can be allocated for use.
> >> Non-ready free pages are those that are in the process of being reported to
> >> the host and whose EPT mappings are about to be zapped.
> >>
> >> The non-ready pages are inserted into the report_vq and waiting for the
> >> host to zap the mappings one by one. After the mapping gets zapped
> >> (which means the backing host page has been taken away), host acks to
> >> the guest to mark the free page as ready free page (set the bit to 1 in the bitmap).
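The two-state lifecycle Wei describes could be sketched as follows; this is a hypothetical userspace model, not the actual balloon code, and the function names are invented:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the two free-page states described in the thread:
 * bit = 1: "ready free page", on the free list and safe to allocate;
 * bit = 0: "non-ready free page", queued on report_vq, EPT zap pending. */
#define MAX_PFN 64
static _Atomic unsigned long ready_bits = ~0UL; /* every page starts ready */

static void guest_report_page(unsigned int pfn)
{
    /* Step 2: clear the bit before the gfn is pushed onto report_vq. */
    atomic_fetch_and(&ready_bits, ~(1UL << pfn));
}

static void host_ack_page(unsigned int pfn)
{
    /* Step 4: the host zapped the EPT mapping and acked; usable again. */
    atomic_fetch_or(&ready_bits, 1UL << pfn);
}

static bool page_ready(unsigned int pfn)
{
    /* arch_alloc_page() would check this and busy-wait (or notify) if 0. */
    return atomic_load(&ready_bits) & (1UL << pfn);
}
```

The race David points out sits between guest_report_page() and host_ack_page(): nothing prevents another VCPU from picking the page off the buddy list while its bit is 0, and that window is exactly where the busy-waiting comes from.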
> > Yes, that's how I understood your approach. The interesting part is
> > where somebody finds a buddy page and wants to allocate it.
> >
> >> So a non-ready free page may happen to be used while it is waiting in
> >> the report_vq to be handled by the host to zap the mapping; the balloon could
> >> have a fast path to notify the host:
> >> "page 0x1000 is about to be used, don’t zap the mapping when you get
> >> 0x1000 from the report_vq"  /*option [1] */
> > This requires coordination and in any case there will be a scenario
> > where you have to wait for the hypervisor to eventually finish a madv
> > call. You can just try to make that scenario less likely.
> >
> > What you propose is synchronous in the worst case. Taking pages off the
> > buddy makes it possible to do this completely asynchronously. Nobody
> > allocating a page has to wait.
> >
> >> Or
> >>
> >> "page 0x1000 is about to be used, please zap the mapping NOW, i.e. do 3) and 4) above,
> >> so that the free page will be marked as ready free page and the guest can use it".
> >> This option will generate an extra EPT violation and QEMU page fault to get a new host
> >> page to back the guest ready free page.
> > Again, coordination with the hypervisor while allocating a page. That is
> > to be avoided in any case.
> >
> >>>> Using bitmaps to record free page hints doesn't require taking the free pages
> >>> off the buddy list and returning them later, which would go through the long
> >>> allocation/free code path.
> >>> Yes, but it means that any process is able to get stuck on such a page for as
> >>> long as it takes to report the free pages to the hypervisor and for it to call
> >>> madvise(pfn_start, DONTNEED) on any such page.
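For reference, the DONTNEED semantics under discussion: once the host calls madvise(MADV_DONTNEED) on an anonymous range, the old contents are discarded and the next access faults in a fresh zero page, which is why a page must not be touched mid-report. A small Linux-only userspace demo:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Linux-only demo: after MADV_DONTNEED, an anonymous page is discarded
 * and the next access is backed by a fresh zero page. Returns the byte
 * read back after the advice, i.e. 0 on success, -1 on mmap failure. */
static int demo_dontneed(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 42;                      /* dirty the page */
    madvise(p, psz, MADV_DONTNEED); /* drop the backing page */
    int val = p[0];                 /* refault: zero-filled */
    munmap(p, psz);
    return val;
}
```

MADV_FREE, which Nitesh's series uses, is lazier: the contents may survive until the kernel actually needs the memory, so a read can still return the old data until reclaim happens.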
> >> This only happens when the guest thread happens to get allocated on a page which is
> >> being reported to the host. Using option [1] above will avoid this.
> > I think getting pages out of the buddy system temporarily is the only
> > way we can avoid somebody else stumbling over a page currently getting
> > reported by the hypervisor. Otherwise, as I said, there are scenarios
> > where an allocating VCPU has to wait for the hypervisor to finish the
> > "freeing" task. While you can try to "speedup" that scenario -
> > "hypervisor please prioritize" you cannot avoid it. There will be busy
> > waiting.
> >
> > I don't believe what you describe is going to work (especially the not
> > locking part when working with global resources).
> >
> > What would be interesting is to see if something like a xbitmap could be
> > used instead of the per-vcpu list. 
> Yeap, exactly.
> > Nitesh, do you remember what the
> > problem was with allocating memory from these hooks? Was it a locking issue?
> In the previous implementation, the issue was due to locking. In the
> current implementation, having an allocation under these hooks will
> result in lots of isolation failures under memory pressure.

But then we shouldn't be giving the host memory when under pressure
at all, should we?

> By the above statement, are you referring to having a dynamic array
> to hold the freed pages?
> If so, that is an idea Andrea also suggested to get around this fixed
> array size issue.
> >
> > Thanks!
> >
> >> Best,
> >> Wei
> >>
> >
> -- 
> Regards
> Nitesh
> 

