From: Wei Yang <richard.weiyang@huawei.com>
To: Liang Li <liang.z.li@intel.com>
Cc: rkagan@virtuozzo.com, linux-kernel@vger.kernel.org,
	ehabkost@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
	simhan@hpe.com, quintela@redhat.com, qemu-devel@nongnu.org,
	dgilbert@redhat.com, jitendra.kolhe@hpe.com,
	mohan_parthasarathy@hpe.com, amit.shah@redhat.com,
	pbonzini@redhat.com, rth@twiddle.net
Subject: Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages
Date: Wed, 23 Mar 2016 09:37:15 +0800
Message-ID: <20160323013715.GB13750@linux-gk3p>
In-Reply-To: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>

Hi, Liang

This is a very clear documentation of your work, I appreciate it a lot. Below
are some of my personal opinions and questions.

On Tue, Mar 22, 2016 at 03:43:49PM +0800, Liang Li wrote:
>I have sent the RFC version patch set for live migration optimization
>by skipping processing the free pages in the ram bulk stage and
>received a lot of comments. The related threads can be found at:
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
>
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html 
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
>https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html
>

Actually there are two threads here, a QEMU one and a kernel one. It would be
clearer for the audience if you just listed the first mail of each of the two
threads.

>To make things easier, I wrote this doc about the possible designs
>and my choices. Comments are welcome! 
>
>Content
>=======
>1. Background
>2. Why not use virtio-balloon
>3. Virtio interface
>4. Constructing free page bitmap
>5. Tighten free page bitmap
>6. Handling page cache in the guest
>7. APIs for live migration
>8. Pseudo code 
>
>Details
>=======
>1. Background
>As we know, in the ram bulk stage of live migration, the current QEMU live
>migration implementation marks all of the guest's RAM pages as dirty. All
>these pages are first checked for being zero pages, and the page content is
>sent to the destination depending on the result of that check; this process
>consumes quite a lot of CPU cycles and network bandwidth.
>
>>From guest's point of view, there are some pages currently not used by

I see that in your original RFC patch and in this RFC doc, this line starts
with an extra '>' character. Does it have a special purpose? (It looks like
mbox "From "-escaping added by the mail tools.)

>the guest; the guest doesn't care about the content of these pages. Free
>pages are exactly this kind of page, not used by the guest. We can make use
>of this fact and skip processing the free pages in the ram bulk stage; this
>saves a lot of CPU cycles, reduces the network traffic, and obviously speeds
>up the live migration process.
>
>Usually, only the guest has the information about its free pages. But it's
>possible to let the guest tell QEMU its free page information by some
>mechanism, e.g. through the virtio interface. Once QEMU gets the free page
>information, it can skip processing these free pages in the ram bulk stage
>by clearing the corresponding bits of the migration bitmap.
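
Just to check my understanding of the core operation: on the QEMU side this
should boil down to something like the sketch below. bitmap_andnot() is
QEMU's generic bitmap helper, and the function name is borrowed from your
pseudo code in section 8.

    /* dirty &= ~free: a page the guest reports as free need not be sent */
    static void filter_out_guest_free_pages(unsigned long *migration_bitmap,
                                            const unsigned long *free_bitmap,
                                            long nbits)
    {
        bitmap_andnot(migration_bitmap, migration_bitmap, free_bitmap, nbits);
    }
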
>
>2. Why not use virtio-balloon
>Actually, virtio-balloon can do a similar thing by inflating the balloon
>before live migration, but its performance is not good: for an 8GB idle
>guest that has just booted, it takes about 5.7 seconds to inflate the
>balloon to 7GB, while it takes only 25ms to get a valid free page bitmap
>from the guest. There are several reasons for the bad performance of
>virtio-balloon:
>a. allocating pages (5%, 304ms)
>b. sending PFNs to host (71%, 4194ms)
>c. address translation and madvise() operation (24%, 1423ms)
>Debugging shows that the time spent on each of these operations is as
>listed in the brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to
>a larger value, such as 16384, the time spent on sending the PFNs can be
>reduced to about 400ms, but it's still too long.
>
>Obviously, the virtio-balloon mechanism has a bigger performance impact on
>the guest than the approach we are trying to implement.
>
>3. Virtio interface
>There are three different ways of using the virtio interface to
>send the free page information.
>a. Extend the current virtio device
>The virtio spec has already defined some virtio devices, and we could
>extend one of them to transport the free page information. This requires
>modifying the virtio spec.
>
>b. Implement a new virtio device
>Implementing a brand new virtio device to exchange information
>between host and guest is another choice. It requires modifying the
>virtio spec too.
>
>c. Make use of virtio-serial (Amit's suggestion, my choice)
>It's possible to make use of virtio-serial for communication between host
>and guest; the benefit of this solution is that there is no need to modify
>the virtio spec.
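
For the virtio-serial option, the guest-side loop could be as simple as the
sketch below. The port name and the one-byte request framing are only my
assumptions for illustration, nothing defined in this doc.

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Guest side: serve QEMU's free page requests over a virtio-serial
       port; named ports show up as /dev/virtio-ports/<name> in the guest. */
    static int serve_free_page_requests(const uint8_t *bitmap, uint64_t len)
    {
        uint8_t req;
        int fd = open("/dev/virtio-ports/free_page.0", O_RDWR);

        if (fd < 0)
            return -1;
        while (read(fd, &req, 1) == 1) {   /* wait for a request byte */
            write(fd, &len, sizeof(len));  /* reply: length, then bitmap */
            write(fd, bitmap, len);
        }
        close(fd);
        return 0;
    }
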
>
>4. Construct free page bitmap
>To minimize the space for saving free page information, it’s better to
>use a bitmap to describe the free pages. There are two ways to
>construct the free page bitmap.
>
>a. Construct the free page bitmap on demand (My choice)
>The guest can allocate memory for the free page bitmap only when it
>receives the request from QEMU, and set the free page bitmap by traversing
>the free page list. The advantage of this way is that it's quite simple and
>easy to implement. The disadvantage is that the traversing operation may
>take quite a long time when there are a lot of free pages (about 20ms for
>7GB of free pages).
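
If I read 4.a correctly, the traversal would look roughly like the
guest-kernel sketch below (untested; zone->lock must be held while walking
the free lists, which is probably where most of the 20ms goes). Pages
allocated right after the walk are covered by the dirty logging you describe
in section 8, so the stale bits should be harmless.

    /* Rough sketch of 4.a: walk each populated zone's free lists and mark
       every free page in the bitmap.  Bitmap allocation and the
       request/reply plumbing are omitted. */
    static void fill_free_page_bitmap(unsigned long *bitmap)
    {
        struct zone *zone;
        struct page *page;
        unsigned int order, mt;

        for_each_populated_zone(zone) {
            spin_lock_irq(&zone->lock);
            for (order = 0; order < MAX_ORDER; order++)
                for (mt = 0; mt < MIGRATE_TYPES; mt++)
                    list_for_each_entry(page,
                            &zone->free_area[order].free_list[mt], lru)
                        bitmap_set(bitmap, page_to_pfn(page), 1U << order);
            spin_unlock_irq(&zone->lock);
        }
    }
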
>
>b. Update the free page bitmap when allocating/freeing pages
>Another choice is to allocate the memory for the free page bitmap when the
>guest boots, and then update the free page bitmap when allocating/freeing
>pages. This needs more modifications to the memory management code in the
>guest. The advantage of this way is that the guest can respond to QEMU's
>request for a free page bitmap very quickly, no matter how many free pages
>there are in the guest. Would the kernel guys like this?
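
For comparison, 4.b would mean putting something like the hooks below on the
allocator's hot paths. The hook names are invented, and the per-alloc/free
cost is exactly what the mm folks may push back on.

    /* Hypothetical hooks for 4.b: keep a boot-allocated bitmap in sync
       from the page allocator; the real hooks would live in
       mm/page_alloc.c. */
    static unsigned long *free_page_bitmap;  /* allocated at boot */

    static inline void note_pages_allocated(struct page *page,
                                            unsigned int order)
    {
        bitmap_clear(free_page_bitmap, page_to_pfn(page), 1U << order);
    }

    static inline void note_pages_freed(struct page *page, unsigned int order)
    {
        bitmap_set(free_page_bitmap, page_to_pfn(page), 1U << order);
    }
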
>
>5. Tighten the free page bitmap
>At last, the free page bitmap has to be combined with
>ramlist.dirty_memory to filter out the free pages. We should make sure

In exec.c the variable name is ram_list. Using the same name in the doc as
in the code would make it easier for the audience to follow.

>that bit N in the free page bitmap and bit N in ramlist.dirty_memory
>correspond to the same guest page.
>On some arches, like x86, there are 'holes' in the memory's physical
>address space, which means there are no actual physical RAM pages
>corresponding to some PFNs. So, some arch-specific information is needed
>to construct a proper free page bitmap.
>
>migration dirty page bitmap:
>    ---------------------
>    |a|b|c|d|e|f|g|h|i|j|
>    ---------------------
>loose free page bitmap:
>    -----------------------------  
>    |a|b|c|d|e|f| | | | |g|h|i|j|
>    -----------------------------
>tight free page bitmap:
>    ---------------------
>    |a|b|c|d|e|f|g|h|i|j|
>    ---------------------
>
>There are two places where the free page bitmap could be tightened:
>a. In the guest
>Constructing the tight free page bitmap in the guest requires adding the
>arch-specific code to the guest. The advantage of this way is that less
>memory is needed to store the free page bitmap.
>b. In QEMU (My choice)
>Constructing the free page bitmap in QEMU is more flexible: we can get a
>loose free page bitmap which contains the holes, and then filter out the
>holes in QEMU. The advantage of this way is that we can keep the kernel
>code as simple as possible; the disadvantage is that more memory is needed
>to store the loose free page bitmap. Because this is mainly a QEMU feature,
>doing all the related things in QEMU is better, if possible.
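
A sketch of what the QEMU-side tightening could look like, assuming we can
enumerate the guest's RAM ranges as start PFN plus page count (RamRange is a
made-up type here):

    typedef struct RamRange {
        unsigned long start_pfn;
        unsigned long nr_pages;
    } RamRange;

    /* Copy only the bits that cover real RAM, dropping the holes, so the
       result lines up bit-for-bit with ramlist.dirty_memory.  Bit-by-bit
       copying is for clarity; a real implementation would move whole
       words. */
    static void tighten_bitmap(const unsigned long *loose,
                               unsigned long *tight,
                               const RamRange *ranges, int nr_ranges)
    {
        unsigned long dst = 0;

        for (int i = 0; i < nr_ranges; i++) {
            unsigned long src = ranges[i].start_pfn;

            for (unsigned long j = 0; j < ranges[i].nr_pages;
                 j++, src++, dst++) {
                if (test_bit(src, loose)) {
                    set_bit(dst, tight);
                }
            }
        }
    }
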
>
>6. Handling page cache in the guest
>The amount of memory used for page cache in the guest changes depending on
>the workload; if the guest runs some block-IO-intensive workload, there will

Would this improvement still help much when the guest has only a few free
pages? Case 2 of your performance data seems to mimic this kind of case,
but there the memory-consuming task is stopped before migration. If it kept
running, would we still perform better than before?

I am wondering whether it would be possible to have a (configurable)
threshold below which the free page bitmap optimization is not used.

>be lots of pages used for page cache and only a few free pages left in the
>guest. In order to get more free pages, we can choose to ask the guest to
>drop some page caches. Because dropping the page cache may lead to
>performance degradation, only the clean cache should be dropped, and we
>should let the user decide whether to do this.
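
For reference, the guest can already drop its clean page cache through the
existing drop_caches knob; a minimal guest-side sketch:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Writing "1" to /proc/sys/vm/drop_caches releases only clean
       pagecache pages; dirty pages must be written back first, hence
       the sync(). */
    static int drop_clean_page_cache(void)
    {
        int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

        if (fd < 0) {
            perror("open /proc/sys/vm/drop_caches");
            return -1;
        }
        sync();
        if (write(fd, "1", 1) != 1) {
            perror("write drop_caches");
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }
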
>
>7. APIs for live migration
>To make things work, the following APIs should be implemented.
>
>a. Get memory info of the guest, like this:
>bool get_guest_mem_info(struct guest_mem_info *info);
>
>struct guest_mem_info is defined as below:
>
>struct guest_mem_info {
>    uint64_t free_pages_num;    /* the guest's free page count */
>    uint64_t cached_pages_num;  /* total cached page count */
>    uint64_t max_pfn;           /* the max PFN of the guest */
>};
>
>Return value:
>false, when QEMU or the guest can't support this operation.
>true, on success.
>
>b. Request the guest's current free page information.
>int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);
>
>Return value:
>-1, when QEMU or the guest can't support this operation.
>0, when the free page bitmap is still being constructed.
>1, when a valid free page bitmap is ready.
>
>c. Tighten the free page bitmap
>unsigned long *tighten_free_page_bmap(unsigned long *bitmap);
>
>This is an arch-specific function that rebuilds the loose free page
>bitmap into a tight bitmap which can easily be combined with
>ramlist.dirty_memory.
>
>8. Pseudo code
>Dirty page logging should be enabled before getting the free page
>information from the guest. This is important because, during the process
>of getting the free pages, some free pages may be allocated and written by
>the guest, and dirty page logging can track these pages. The pseudo code
>looks like below:
>
>    -----------------------------------------------
>    MigrationState *s = migrate_get_current();
>    ...
>
>    /* start tracking the pages the guest dirties from here on */
>    memory_global_dirty_log_start();
>
>    if (get_guest_mem_info(&info)) {
>        /* poll until the guest reports a valid free page bitmap */
>        while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
>               s->state != MIGRATION_STATUS_CANCELLING) {
>            usleep(1000); /* sleep for 1 ms */
>        }
>
>        /* squeeze the holes out of the loose bitmap, then clear the
>           free pages from the migration bitmap */
>        tight_free_page_bmap = tighten_free_page_bmap(free_page_bitmap);
>        filter_out_guest_free_pages(tight_free_page_bmap);
>    }
>
>    migration_bitmap_sync();
>    ...
>
>    -----------------------------------------------
>
>
>-- 
>1.9.1
>

-- 
Richard Yang
Help you, Help me
