From: Wei Yang <richard.weiyang@huawei.com>
To: Liang Li <liang.z.li@intel.com>
Cc: rkagan@virtuozzo.com, linux-kernel@vger.kernel.org,
ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
simhan@hpe.com, quintela@redhat.com, qemu-devel@nongnu.org,
dgilbert@redhat.com, jitendra.kolhe@hpe.com,
mohan_parthasarathy@hpe.com, amit.shah@redhat.com,
pbonzini@redhat.com, rth@twiddle.net
Subject: Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages
Date: Wed, 23 Mar 2016 09:37:15 +0800
Message-ID: <20160323013715.GB13750@linux-gk3p>
In-Reply-To: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>

Hi, Liang

This is a very clear document of your work; I appreciate it a lot.
Below are some of my personal opinions and questions.
On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
>I have sent the RFC version of the patch set for live migration
>optimization by skipping the processing of free pages in the ram bulk
>stage, and I have received a lot of comments. The related threads can
>be found at:
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
>
Actually there are two threads, a QEMU thread and a kernel thread. It
would be clearer for the audience if you just listed the first mail of
each of the two threads.
>To make things easier, I wrote this doc about the possible designs
>and my choices. Comments are welcome!
>
>Content
>=======
>1. Background
>2. Why not use virtio-balloon
>3. Virtio interface
>4. Constructing the free page bitmap
>5. Tightening the free page bitmap
>6. Handling page cache in the guest
>7. APIs for live migration
>8. Pseudo code
>
>Details
>=======
>1. Background
>As we know, in the ram bulk stage of live migration, the current QEMU
>implementation marks all of the guest's RAM pages as dirty; every page
>is then checked for being a zero page, and depending on the result of
>that check the page content is sent to the destination. This process
>consumes quite a lot of CPU cycles and network bandwidth.
>
>>From guest's point of view, there are some pages currently not used by

I see that in your original RFC patch and in your RFC doc this line
starts with an extra '>' character. Does it have a special purpose?

>the guest; the guest doesn't care about the content of these pages.
>Free pages are exactly this kind of pages, not used by the guest. We
>can make use of this fact and skip processing the free pages in the
>ram bulk stage; this saves a lot of CPU cycles, reduces the network
>traffic, and obviously speeds up the live migration process.
>
>Usually, only the guest has the information about its free pages. But
>it's possible to let the guest tell QEMU its free page information by
>some mechanism, e.g. through the virtio interface. Once QEMU gets the
>free page information, it can skip processing these free pages in the
>ram bulk stage by clearing the corresponding bits of the migration
>bitmap.
>
>2. Why not use virtio-balloon
>Actually, virtio-balloon can do a similar thing by inflating the
>balloon before live migration, but its performance is not good: for an
>8GB idle guest that has just booted, it takes about 5.7 seconds to
>inflate the balloon to 7GB, while it takes only 25ms to get a valid
>free page bitmap from the guest. There are several reasons for the bad
>performance of virtio-balloon:
>a. allocating pages (5%, 304ms)
>b. sending PFNs to host (71%, 4194ms)
>c. address translation and madvise() operation (24%, 1423ms)
>Debugging shows that the time spent on these operations is listed in
>the brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
>large value, such as 16384, the time spent sending the PFNs can be
>reduced to about 400ms, but it's still too long.
>
>Obviously, the virtio-balloon mechanism has a bigger performance
>impact on the guest than the approach we are trying to implement.
>
>3. Virtio interface
>There are three different ways of using the virtio interface to
>send the free page information.
>a. Extend the current virtio device
>The virtio spec has already defined some virtio devices, and we can
>extend one of these devices so as to use it to transport the free page
>information. It requires modifying the virtio spec.
>
>b. Implement a new virtio device
>Implementing a brand new virtio device to exchange information
>between host and guest is another choice. It requires modifying the
>virtio spec too.
>
>c. Make use of virtio-serial (Amit's suggestion, my choice)
>It's possible to make use of virtio-serial for communication between
>host and guest; the benefit of this solution is that there is no need
>to modify the virtio spec.
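
For readers less familiar with virtio-serial: on the guest side the
port shows up as a character device, so the exchange can be as simple
as read/write on that device. A minimal sketch of the guest side is
below; the port name "org.qemu.free_page_bitmap" and the helper are
made up for illustration, not taken from your patch.

    /*
     * Guest-side sketch (illustrative only): answer QEMU's request by
     * writing the bitmap to a virtio-serial port. The port name is
     * hypothetical.
     */
    #include <fcntl.h>
    #include <stddef.h>
    #include <unistd.h>

    static int send_free_page_bitmap(const void *bitmap, size_t len)
    {
        int fd = open("/dev/virtio-ports/org.qemu.free_page_bitmap",
                      O_WRONLY);
        ssize_t n;

        if (fd < 0)
            return -1;
        n = write(fd, bitmap, len); /* len is in bytes */
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
    }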
>
>4. Constructing the free page bitmap
>To minimize the space for saving free page information, it’s better to
>use a bitmap to describe the free pages. There are two ways to
>construct the free page bitmap.
>
>a. Construct the free page bitmap on demand (My choice)
>The guest can allocate memory for the free page bitmap only when it
>receives the request from QEMU, and set the free page bitmap by
>traversing the free page list. The advantage of this way is that it's
>quite simple and easy to implement. The disadvantage is that the
>traversal may take quite a long time when there are a lot of free
>pages. (About 20ms for 7GB of free pages)
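
For what it's worth, option (a) could be modeled on the kernel's
existing mark_free_pages() used by hibernation. A minimal sketch,
assuming the bitmap is indexed by PFN and large enough, and ignoring
the latency cost of holding zone->lock for the whole walk:

    /*
     * Sketch only, modeled on mark_free_pages() in mm/page_alloc.c:
     * walk the buddy free lists and set one bit per free page.
     */
    static void build_free_page_bitmap(unsigned long *bitmap)
    {
        struct zone *zone;
        struct page *page;
        unsigned long flags, pfn;
        unsigned int order, t;

        for_each_populated_zone(zone) {
            spin_lock_irqsave(&zone->lock, flags);
            for (order = 0; order < MAX_ORDER; order++) {
                for (t = 0; t < MIGRATE_TYPES; t++) {
                    list_for_each_entry(page,
                            &zone->free_area[order].free_list[t], lru) {
                        pfn = page_to_pfn(page);
                        /* a free page of 'order' covers 2^order PFNs */
                        bitmap_set(bitmap, pfn, 1UL << order);
                    }
                }
            }
            spin_unlock_irqrestore(&zone->lock, flags);
        }
    }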
>
>b. Update the free page bitmap when allocating/freeing pages
>Another choice is to allocate the memory for the free page bitmap when
>the guest boots, and then update the bitmap when allocating/freeing
>pages. It needs more modification to the memory management code in the
>guest. The advantage of this way is that the guest can respond to
>QEMU's request for a free page bitmap very quickly, no matter how many
>free pages there are in the guest. Do the kernel guys like this?
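
On the extra modification option (b) needs: one way to keep the core
mm changes small might be to reuse the existing arch_alloc_page() /
arch_free_page() hooks. A rough sketch, where free_page_bitmap is a
hypothetical global and synchronization is deliberately left out:

    /*
     * Sketch of option (b): keep the bitmap in sync on every
     * allocation/free. Real code would need atomic bitmap ops or a
     * lock, and a way to size the bitmap at boot.
     */
    void arch_free_page(struct page *page, int order)
    {
        bitmap_set(free_page_bitmap, page_to_pfn(page), 1UL << order);
    }

    void arch_alloc_page(struct page *page, int order)
    {
        bitmap_clear(free_page_bitmap, page_to_pfn(page), 1UL << order);
    }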
>
>5. Tightening the free page bitmap
>Finally, the free page bitmap has to be combined with
>ramlist.dirty_memory to filter out the free pages.

In exec.c, the variable name is ram_list. Using the same name in the
code and in the doc would make it easier for the audience to
understand.

>We should make sure that bit N in the free page bitmap and bit N in
>ramlist.dirty_memory correspond to the same guest page.
>On some archs, like x86, there are 'holes' in the physical address
>space, which means there are no actual physical RAM pages behind some
>PFNs. So some arch-specific information is needed to construct a
>proper free page bitmap.
>
>migration dirty page bitmap:
> ---------------------
> |a|b|c|d|e|f|g|h|i|j|
> ---------------------
>loose free page bitmap:
> -----------------------------
> |a|b|c|d|e|f| | | | |g|h|i|j|
> -----------------------------
>tight free page bitmap:
> ---------------------
> |a|b|c|d|e|f|g|h|i|j|
> ---------------------
>
>There are two places where the free page bitmap can be tightened:
>a. In the guest
>Constructing the free page bitmap in the guest requires adding the
>arch-related code in the guest for building a tight bitmap. The
>advantage of this way is that less memory is needed to store the free
>page bitmap.
>b. In QEMU (My choice)
>Constructing the free page bitmap in QEMU is more flexible: we can get
>a loose free page bitmap which contains the holes, and then filter out
>the holes in QEMU. The advantage of this way is that we can keep the
>kernel code as simple as possible; the disadvantage is that more
>memory is needed to save the loose free page bitmap. Because this is
>mainly a QEMU feature, doing all the related things in QEMU is better
>if possible.
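
To illustrate what "filter out the holes in QEMU" means, here is a
simplified sketch matching your diagrams above. The ram_range[]
description of the guest's RAM layout is hypothetical (in practice
QEMU would derive it from the machine's memory map), and a real
implementation would copy whole word ranges instead of single bits:

    /*
     * Sketch: copy only the bits backed by real RAM, dropping the PFN
     * holes, so bit N of the result matches bit N of
     * ram_list.dirty_memory. Uses QEMU's bitops.h helpers.
     */
    struct ram_range {
        unsigned long start_pfn;
        unsigned long end_pfn; /* exclusive */
    };

    static void tighten_bitmap(const unsigned long *loose,
                               unsigned long *tight,
                               const struct ram_range *ranges,
                               int nr_ranges)
    {
        unsigned long dst = 0; /* next bit in the tight bitmap */
        unsigned long pfn;
        int i;

        for (i = 0; i < nr_ranges; i++) {
            for (pfn = ranges[i].start_pfn; pfn < ranges[i].end_pfn;
                 pfn++) {
                if (test_bit(pfn, loose))
                    set_bit(dst, tight);
                dst++;
            }
        }
    }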
>
>6. Handling page cache in the guest
>The memory used for page cache in the guest changes depending on the
>workload. If the guest runs some block-IO-intensive workload, there

Would this improvement still be a big benefit when the guest has only
a few free pages? I think Case 2 of your performance data mimics this
kind of case, but there the memory-consuming task is stopped before
migration. If it kept running, would we still perform better than
before?

I am wondering whether it is possible to have a (configurable)
threshold for deciding when to use the free page bitmap optimization.

>will be lots of pages used for the page cache, and only a few free
>pages will be left in the guest. In order to get more free pages, we
>can choose to ask the guest to drop some page cache. Because dropping
>the page cache may lead to performance degradation, only the clean
>cache should be dropped, and we should let the user decide whether to
>do this.
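
On the "only the clean cache should be dropped" point: Linux already
exposes exactly that knob. Writing "1" to /proc/sys/vm/drop_caches
frees only clean pagecache; dirty pages are left alone, so no data is
lost. A guest agent could do roughly the following:

    /*
     * Sketch: drop only clean pagecache in the guest. Whether to call
     * this should remain the user's decision, as the doc says.
     */
    #include <fcntl.h>
    #include <unistd.h>

    static int drop_clean_page_cache(void)
    {
        int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
        int ret;

        if (fd < 0)
            return -1;
        ret = (write(fd, "1", 1) == 1) ? 0 : -1;
        close(fd);
        return ret;
    }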
>
>7. APIs for live migration
>To make things work, the following APIs should be implemented.
>
>a. Get memory info of the guest, like this:
>bool get_guest_mem_info(struct guest_mem_info *info);
>
>struct guest_mem_info is defined as below:
>
>struct guest_mem_info {
>    uint64_t free_pages_num;   // count of the guest's free pages
>    uint64_t cached_pages_num; // total count of cached pages
>    uint64_t max_pfn;          // the max PFN of the guest
>};
>
>Return value:
>false, when QEMU or the guest can't support this operation.
>true, on success.
>
>b. Request the guest's current free page information.
>int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);
>
>Return value:
>-1, when QEMU or the guest can't support this operation.
>0, when the free page bitmap is still being constructed.
>1, when a valid free page bitmap is ready.
>
>c. Tighten the free page bitmap
>unsigned long *tighten_free_page_bmap(unsigned long *bitmap);
>
>This is an arch-specific function that rebuilds the loose free page
>bitmap into a tight bitmap which can easily be combined with
>ramlist.dirty_memory.
>
>8. Pseudo code
>Dirty page logging should be enabled before getting the free page
>information from the guest. This is important because, during the
>process of getting the free pages, some of them may be allocated and
>written by the guest; dirty page logging can trace these pages. The
>pseudo code looks like this:
>
> -----------------------------------------------
> MigrationState *s = migrate_get_current();
> ...
>
> memory_global_dirty_log_start();
>
> if (get_guest_mem_info(&info)) {
>        while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
>               s->state != MIGRATION_STATUS_CANCELLING) {
>            usleep(1000); // sleep for 1 ms
>        }
>
>        tight_free_page_bitmap = tighten_free_page_bmap(free_page_bitmap);
>        filter_out_guest_free_pages(tight_free_page_bitmap);
> }
>
> migration_bitmap_sync();
> ...
>
> -----------------------------------------------
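
One more thought on filter_out_guest_free_pages(), which the doc
leaves abstract: after tightening, it presumably reduces to clearing
the reported-free bits from the migration bitmap, word by word. A
sketch, with migration_bitmap and guest_page_count as hypothetical
names:

    /*
     * Sketch: migration_bitmap &= ~free_page_bitmap, so the bulk stage
     * skips pages the guest reported as free. bitmap_andnot() is the
     * existing QEMU bitmap.h helper.
     */
    static void filter_out_guest_free_pages(unsigned long *tight_free_bitmap)
    {
        bitmap_andnot(migration_bitmap, migration_bitmap,
                      tight_free_bitmap, guest_page_count);
    }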
--
Richard Yang
Help you, Help me