From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <4E3FFC9E.8050300@redhat.com>
Date: Mon, 08 Aug 2011 18:11:26 +0300
From: Dor Laor
MIME-Version: 1.0
References: <20110808032438.GC24764@valinux.co.jp> <4E3FAA53.4030602@redhat.com> <4E3FD774.7010502@codemonkey.ws>
In-Reply-To: <4E3FD774.7010502@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC] postcopy livemigration proposal
Reply-To: dlaor@redhat.com
To: Anthony Liguori
Cc: kvm@vger.kernel.org, Isaku Yamahata, t.hirofuchi@aist.go.jp, satoshi.itoh@aist.go.jp, qemu-devel@nongnu.org, Orit Wasserman, Avi Kivity

On 08/08/2011 03:32 PM, Anthony Liguori wrote:
> On 08/08/2011 04:20 AM, Dor Laor wrote:
>> On 08/08/2011 06:24 AM, Isaku Yamahata wrote:
>>> This mail is about "Yabusame: Postcopy Live Migration for Qemu/KVM",
>>> on which we'll give a talk at KVM Forum.
>>> The purpose of this mail is to let developers know about it in advance,
>>> so that we can get better feedback on its design/implementation approach
>>> early, before we start implementing it.
>>>
>>>
>>> Background
>>> ==========
>>> * What is postcopy live migration?
>>> It is yet another live migration mechanism for Qemu/KVM, which
>>> implements the migration technique known as "postcopy" or "lazy"
>>> migration. Just after the "migrate" command is invoked, the execution
>>> host of a VM is instantaneously switched to the destination host.
>>>
>>> The benefit is that total migration time is shorter, because each page
>>> is transferred only once. Precopy, on the other hand, may send the same
>>> pages again and again because they can be dirtied.
>>> The switching time from the source to the destination is several
>>> hundred milliseconds, which enables quick load balancing.
>>> For details, please refer to the papers.
>>>
>>> We believe this is useful for others as well, so we'd like to merge this
>>> feature into upstream qemu/kvm. The existing implementation that
>>> we have right now is very ad-hoc because it's for academic research.
>>> For the upstream merge, we're starting to re-design/implement it and
>>> we'd like to get feedback early. Although many improvements/optimizations
>>> are possible, we should first implement/merge a simple and clean, but
>>> extensible, version and then improve/optimize it later.
>>>
>>> Postcopy live migration will be introduced as an optional feature. The
>>> existing precopy live migration remains the default behavior.
>>>
>>>
>>> * related links:
>>> project page
>>> http://sites.google.com/site/grivonhome/quick-kvm-migration
>>>
>>> Enabling Instantaneous Relocation of Virtual Machines with a
>>> Lightweight VMM Extension
>>> (proof-of-concept, ad-hoc prototype; not a new design)
>>> http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
>>> http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf
>>>
>>> Reactive consolidation of virtual machines enabled by postcopy live
>>> migration
>>> (advantage for VM consolidation)
>>> http://portal.acm.org/citation.cfm?id=1996125
>>> http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf
>>>
>>> Qemu wiki
>>> http://wiki.qemu.org/Features/PostCopyLiveMigration
>>>
>>>
>>> Design/Implementation
>>> =====================
>>> The basic idea of postcopy live migration is to use a sort of distributed
>>> shared memory between the migration source and destination.
>>>
>>> The migration procedure looks like:
>>> - start migration
>>>   stop the guest VM on the source and send the machine state, except
>>>   guest RAM, to the destination
>>> - resume the guest VM on the destination without guest RAM contents
>>> - hook guest access to pages, and pull page contents from the source
>>>   This continues until all the pages are pulled to the destination.
>>>
>>> The big picture is depicted at
>>> http://wiki.qemu.org/File:Postcopy-livemigration.png
>>
>> That's terrific (nice video also)!
>> Orit and I had the exact same idea too (now we can't patent it..).
>>
>> Advantages:
>> - No downtime due to memory copying.
>
> But non-deterministic downtime due to network latency while trying to
> satisfy a page fault.

True, but it is possible to limit it with a dedicated network or with
bandwidth reservation.

>
>> - Efficient, reduced traffic: no need to re-send pages.
>
> It's not quite that simple. Post-copy needs to introduce a protocol
> capable of requesting pages.

Just another subsection.. (kidding). Still, it shouldn't be too
complicated: just an offset+pagesize request and a page_content/error
reply.
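Roughly something like the following (only a sketch to make the point;
neither of these structs exists in qemu today, all names are made up):

#include <stdint.h>

/* destination -> source: "please send me this page" */
struct page_request {
    uint64_t offset;    /* offset into guest RAM, page aligned */
    uint32_t size;      /* normally the guest page size, e.g. 4096 */
};

/* source -> destination: the page contents, or an error */
struct page_reply {
    uint64_t offset;    /* echoed from the request */
    int32_t  error;     /* 0 on success, -errno otherwise */
    /* followed by 'size' bytes of page content when error == 0 */
};

Batching several requests and background prefetch of pages that were not
asked for yet can be layered on top later without changing this basic
shape.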
>
> I think in presenting something like this, it's important to collect
> quite a bit of performance data. I'd suggest doing runs while running
> jitterd in the guest to attempt to quantify the actual downtime
> experienced too.
>
> http://git.codemonkey.ws/cgit/jitterd.git/

And also comparing the results of various benchmarks like
iozone/netperf/linpack/..

>
> There's a lot of potential in something like this, but it's not obvious
> to me whether it's a net win. Should make for a very interesting
> presentation :-)
>
>> - Reduced overall RAM consumption on the source and destination,
>> as opposed to current live migration (where both the source and the
>> destination allocate the memory until the live migration completes).
>> We can free copied memory once the destination guest has received it
>> and save RAM.
>> - Increased parallelism for SMP guests: multiple virtual CPUs can
>> handle their own demand paging. Less time holding a global lock,
>> less thread contention.
>> - Virtual machines are using more and more memory resources; for a
>> virtual machine with a very large working set, doing live migration
>> with reasonable downtime is impossible today.
>
> This is really just a limitation of our implementation. In theory,
> pre-copy allows you to exert fine grain resource control over the guest
> which you can use to encourage convergence.

But a very large guest with a working set that changes faster than the
network bandwidth can transfer might always need a huge downtime with
the current system.

>
>> Disadvantages:
>> - During the live migration the guest will run slower than in
>> today's live migration. We need to remember that even today
>> guests suffer a performance penalty on the source during the
>> COW stage (memory copy).
>> - Failure of the source, the destination, or the network will cause
>> us to lose the running virtual machine. Those failures are very
>> rare.
>> In case there is shared storage, we can store a copy of the
>> memory there that can be recovered after such a failure.
>>
>> Overall, it looks like a better approach for the vast majority of cases.
>> Hope it will get merged into kvm and become the default way.
>
> One thing I think we need to do is put together a live migration
> roadmap. We've got a lot of invasive efforts underway with live
> migration and I fear that without some planning and serialization, some
> of this useful work will get lost.

Some of these efforts are parallel. I think all the readers here agree
that postcopy migration should be an option while we keep maintaining
the current one. In addition to that, there is the switch to a separate
migration thread, which is standalone, and the vmstate vs QOM/visitor
work. Lastly, the compression work..

>
> Regards,
>
> Anthony Liguori
>
>>>
>>>
>>> There are several design points.
>>> - who takes care of pulling page contents:
>>> an independent daemon vs a thread in qemu
>>> The daemon approach is preferable because an independent daemon makes
>>> it easy to debug the postcopy memory mechanism without qemu.
>>> If required, it wouldn't be difficult to convert the daemon into
>>> a thread in qemu.
>>>
>>> - connection between the source and the destination
>>> The connection used for live migration can be re-used after sending
>>> the machine state.
>>>
>>> - transfer protocol
>>> The existing protocol can be extended.
>>>
>>> - hooking guest RAM access
>>> Introduce a character device to handle page faults.
>>> When a page fault occurs, it queues a page request up to the user
>>> space daemon at the destination. The daemon pulls the page contents
>>> from the source and serves them into the character device. Then the
>>> page fault is resolved.
>>
>> Isn't there a simpler way, using the madvise verb to mark that the
>> destination guest RAM will need paging?
>>
>> Cheers, and looking forward to the presentation at the KVM forum,
>> Dor
>>
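Whatever kernel interface this ends up using (a dedicated character
device, or something madvise-based as I asked above), the destination
side daemon's main loop can stay very small. A sketch, purely
illustrative: /dev/postcopy, the fault record layout and
pull_page_from_source() are all made-up names, not an existing
interface.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct fault_request {              /* handed to user space by the driver */
    uint64_t offset;                /* faulting offset into guest RAM */
};

struct fault_reply {                /* written back by the daemon */
    uint64_t offset;
    uint8_t  content[4096];
};

/* In the real daemon this would send a page request over the re-used
 * migration connection and wait for the reply; stubbed out here. */
static int pull_page_from_source(uint64_t offset, uint8_t *buf, size_t len)
{
    (void)offset;
    memset(buf, 0, len);
    return 0;
}

int main(void)
{
    int dev = open("/dev/postcopy", O_RDWR);    /* hypothetical device */
    if (dev < 0) {
        perror("open /dev/postcopy");
        return 1;
    }

    for (;;) {
        struct fault_request req;
        struct fault_reply reply;

        /* blocks until a vcpu faults on a page that is not present yet */
        if (read(dev, &req, sizeof(req)) != sizeof(req))
            break;

        reply.offset = req.offset;
        if (pull_page_from_source(req.offset, reply.content,
                                  sizeof(reply.content)) < 0)
            break;

        /* the driver copies the page in and wakes up the faulting vcpu */
        if (write(dev, &reply, sizeof(reply)) != sizeof(reply))
            break;
    }

    close(dev);
    return 0;
}

In other words, the interesting part is the kernel hook itself; the user
space side of the daemon stays simple.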
>>>
>>>
>>> * More on hooking guest RAM access
>>> There are several candidates for the implementation. Our preference is
>>> the character device approach.
>>>
>>> - inserting hooks everywhere in qemu/kvm
>>> This is impractical.
>>>
>>> - backing store for guest ram
>>> A block device or a file can be used to back guest RAM,
>>> thus hooking guest RAM access.
>>>
>>> pros
>>> - no new device driver is needed
>>> cons
>>> - future improvement would be difficult
>>> - some KVM host features (KSM, THP) wouldn't work
>>>
>>> - character device
>>> qemu mmap()s the dedicated character device, and page faults are
>>> hooked there.
>>>
>>> pros
>>> - straightforward approach
>>> - future improvement would be easy
>>> cons
>>> - a new driver is needed
>>> - some KVM host features (KSM, THP) wouldn't work
>>> They check whether a given VMA is anonymous. This can be fixed.
>>>
>>> - swap device
>>> When creating the guest, it is set up as if all the guest RAM were
>>> swapped out to a dedicated swap device, which may be an nbd disk (or
>>> some kind of user space block device, BUSE?).
>>> When the VM tries to access memory, swap-in is triggered and IO to the
>>> swap device is issued. The IO to swap is then routed to the daemon in
>>> user space via the nbd protocol (or BUSE, AOE, iSCSI...). The daemon
>>> pulls pages from the migration source and services the IO request.
>>>
>>> pros
>>> - after the page transfer is complete, everything is the same as in
>>> the normal case
>>> - no new device driver is needed
>>> cons
>>> - future improvement would be difficult
>>> - administration: setting up nbd and the swap device
>>>
>>> Thanks in advance
>>
>>
>
>
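P.S. Regarding the "backing store for guest ram" candidate above: the
setup side is just an mmap() of the backing file or device instead of
anonymous memory. A sketch only, with the path and size made up:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t ram_size = 1UL << 30;                    /* 1 GB of guest RAM */
    int fd = open("/tmp/guest-ram.img", O_RDWR);    /* made-up path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* guest RAM backed by the file/device instead of anonymous memory;
     * first access to each page goes through whatever implements it */
    void *ram = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (ram == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* qemu would register this region with KVM as guest memory */

    munmap(ram, ram_size);
    close(fd);
    return 0;
}

As the cons above say, the hard part is not this mapping but that
KSM/THP expect an anonymous VMA, and that future improvements become
harder.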