* [Qemu-devel] [RFC] postcopy livemigration proposal
From: Isaku Yamahata @ 2011-08-08  3:24 UTC
To: kvm, qemu-devel; +Cc: t.hirofuchi, satoshi.itoh

This mail is about "Yabusame: Postcopy Live Migration for Qemu/KVM",
on which we'll give a talk at KVM Forum. The purpose of this mail is
to let developers know about it in advance, so that we can get better
feedback on its design/implementation approach before we start
implementing it.

Background
==========
* What is postcopy live migration?
It is yet another live migration mechanism for Qemu/KVM, which
implements the migration technique known as "postcopy" or "lazy"
migration. Just after the "migrate" command is invoked, the execution
host of a VM is instantaneously switched to the destination host.

The benefit is that total migration time is shorter, because each page
is transferred only once. Precopy, on the other hand, may send the
same pages again and again because they can be dirtied. The switching
time from the source to the destination is several hundred
milliseconds, which enables quick load balancing. For details, please
refer to the papers.

We believe this is useful for others, so we'd like to merge this
feature into upstream qemu/kvm. The existing implementation that we
have right now is very ad hoc because it was built for academic
research. For the upstream merge, we're starting to re-design and
re-implement it, and we'd like to get feedback early. Although many
improvements/optimizations are possible, we should first implement and
merge a simple, clean, but extensible version, and then
improve/optimize it later.

Postcopy live migration will be introduced as an optional feature. The
existing precopy live migration remains the default behavior.

* related links:
project page
http://sites.google.com/site/grivonhome/quick-kvm-migration

Enabling Instantaneous Relocation of Virtual Machines with a
Lightweight VMM Extension
(proof-of-concept, ad-hoc prototype; not a new design)
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf
http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf

Reactive consolidation of virtual machines enabled by postcopy live
migration (advantage for VM consolidation)
http://portal.acm.org/citation.cfm?id=1996125
http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf

Qemu wiki
http://wiki.qemu.org/Features/PostCopyLiveMigration

Design/Implementation
=====================
The basic idea of postcopy live migration is to use a sort of
distributed shared memory between the migration source and
destination.

The migration procedure looks like:
- start migration
  stop the guest VM on the source and send the machine state, except
  guest RAM, to the destination
- resume the guest VM on the destination without guest RAM contents
- hook guest access to pages, and pull page contents from the source
  This continues until all the pages are pulled to the destination.

The big picture is depicted at
http://wiki.qemu.org/File:Postcopy-livemigration.png

There are several design points.
- who takes care of pulling page contents
  an independent daemon vs. a thread in qemu
  The daemon approach is preferable, because an independent daemon
  makes it easy to debug the postcopy memory mechanism without qemu.
  If required, it wouldn't be difficult to convert the daemon into a
  thread in qemu.

- connection between the source and the destination
  The connection for live migration can be re-used after sending the
  machine state.

- transfer protocol
  The existing protocol can be extended.

- hooking guest RAM access
  Introduce a character device to handle page faults. When a page
  fault occurs, it queues a page request up to the user-space daemon
  at the destination, and the daemon pulls the page contents from the
  source and serves them into the character device. Then the page
  fault is resolved.

* More on hooking guest RAM access
There are several candidates for the implementation. Our preference is
the character device approach.

- inserting hooks everywhere in qemu/kvm
  This is impractical.

- backing store for guest RAM
  A block device or a file can be used to back guest RAM, and thus
  hook guest RAM access.

  pros
  - no new device driver is needed
  cons
  - future improvement would be difficult
  - some KVM host features (KSM, THP) wouldn't work

- character device
  qemu mmap()s the dedicated character device, which then hooks page
  faults.

  pros
  - straightforward approach
  - future improvement would be easy
  cons
  - a new driver is needed
  - some KVM host features (KSM, THP) wouldn't work
    They check whether a given VMA is anonymous. This can be fixed.

- swap device
  When creating the guest, it is set up as if all the guest RAM were
  swapped out to a dedicated swap device, which may be an nbd disk (or
  some kind of user-space block device, BUSE?). When the VM tries to
  access memory, swap-in is triggered and I/O to the swap device is
  issued. The I/O to swap is then routed to the daemon in user space
  with the nbd protocol (or BUSE, AoE, iSCSI, ...). The daemon pulls
  pages from the migration source and services the I/O requests.

  pros
  - after the page transfer is complete, everything is the same as
    the normal case
  - no new device driver is needed
  cons
  - future improvement would be difficult
  - administration: setting up nbd and the swap device

Thanks in advance
-- 
yamahata
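The pull-based procedure above (stop on the source, resume with empty RAM on the destination, resolve each fault by pulling the page exactly once) can be sketched as a small in-process simulation. This is illustrative only: the class names and the dictionary-based "fault" stand in for the proposed character device and user-space daemon, which are not implemented here.

```python
# Minimal sketch of the postcopy pull protocol described above.
# "Source" stands in for the migration source; "Destination" resolves
# page faults by pulling pages on demand, as the daemon would.
# All names here are illustrative, not from any actual patches.

PAGE_SIZE = 4096

class Source:
    """Migration source: keeps guest RAM until each page is pulled."""
    def __init__(self, ram):
        self.ram = ram          # page frame number -> page contents
        self.sent = set()

    def pull(self, pfn):
        # Each page is transferred exactly once (the postcopy benefit).
        assert pfn not in self.sent, "page requested twice"
        self.sent.add(pfn)
        return self.ram.pop(pfn)

class Destination:
    """Destination: resumes with empty RAM; faults pull from the source."""
    def __init__(self, source):
        self.source = source
        self.ram = {}           # starts empty after the switch-over

    def read(self, pfn):
        if pfn not in self.ram:                       # "page fault"
            self.ram[pfn] = self.source.pull(pfn)     # daemon pulls it
        return self.ram[pfn]

# Guest with 4 pages; execution switches before any RAM is copied.
src = Source({n: bytes([n]) * PAGE_SIZE for n in range(4)})
dst = Destination(src)
assert dst.read(2)[0] == 2      # first access faults and pulls page 2
assert dst.read(2)[0] == 2      # second access is local: no re-transfer
assert len(src.sent) == 1
```

The same skeleton extends naturally to the design points above: the `pull()` call is where the re-used migration connection and extended transfer protocol would sit.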
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal
From: Dor Laor @ 2011-08-08  9:20 UTC
To: Isaku Yamahata
Cc: kvm, satoshi.itoh, t.hirofuchi, qemu-devel, Orit Wasserman, Avi Kivity

On 08/08/2011 06:24 AM, Isaku Yamahata wrote:
> The migration procedure looks like:
> - start migration
>   stop the guest VM on the source and send the machine state, except
>   guest RAM, to the destination
> - resume the guest VM on the destination without guest RAM contents
> - hook guest access to pages, and pull page contents from the source
>   This continues until all the pages are pulled to the destination.
>
> The big picture is depicted at
> http://wiki.qemu.org/File:Postcopy-livemigration.png

That's terrific (nice video also)!
Orit and myself had the exact same idea too (now we can't patent it..).

Advantages:
 - No downtime due to memory copying.
 - Efficient: reduces the needed traffic, since there is no need to
   re-send pages.
 - Reduces overall RAM consumption of the source and destination, as
   opposed to current live migration (where both the source and the
   destination allocate the memory until the live migration
   completes). We can free copied memory once the destination guest
   has received it, and save RAM.
 - Increases parallelism for SMP guests: multiple virtual CPUs can
   handle their own demand paging. Less time holding a global lock,
   less thread contention.
 - Virtual machines are using more and more memory resources; for a
   virtual machine with a very large working set, doing live
   migration with reasonable downtime is impossible today.

Disadvantages:
 - During the live migration the guest will run slower than in
   today's live migration. We need to remember that even today guests
   suffer a performance penalty on the source during the COW stage
   (memory copy).
 - Failure of the source, the destination, or the network will cause
   us to lose the running virtual machine. Those failures are very
   rare. In case there is shared storage, we can store a copy of the
   memory there that can be recovered in case of such a failure.

Overall, it looks like a better approach for the vast majority of
cases. Hope it will get merged into kvm and become the default way.

> - hooking guest RAM access
>   Introduce a character device to handle page faults. When a page
>   fault occurs, it queues a page request up to the user-space daemon
>   at the destination [...]

Isn't there a simpler way of using the madvise verb to mark that the
destination guest RAM will need paging?

Cheers, and looking forward to the presentation at the KVM Forum,
Dor
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal
From: Yaniv Kaul @ 2011-08-08  9:40 UTC
To: dlaor
Cc: kvm, Orit Wasserman, t.hirofuchi, satoshi.itoh, qemu-devel, Isaku Yamahata, Avi Kivity

On 08/08/2011 12:20, Dor Laor wrote:
> Disadvantages:
>  - During the live migration the guest will run slower than in
>    today's live migration. [...]
>  - Failure of the source, the destination, or the network will cause
>    us to lose the running virtual machine. Those failures are very
>    rare.

I highly doubt that's acceptable in enterprise deployments.

>> There are several design points.
>> - who takes care of pulling page contents
>>   an independent daemon vs. a thread in qemu
>>   The daemon approach is preferable, because an independent daemon
>>   makes it easy to debug the postcopy memory mechanism without
>>   qemu. [...]

How about the asynchronous page fault mechanism?
Y.
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal
From: Anthony Liguori @ 2011-08-08 21:42 UTC
To: Yaniv Kaul
Cc: kvm, satoshi.itoh, t.hirofuchi, dlaor, qemu-devel, Orit Wasserman, Avi Kivity, Isaku Yamahata

On 08/08/2011 04:40 AM, Yaniv Kaul wrote:
> On 08/08/2011 12:20, Dor Laor wrote:
>> Disadvantages: [...]
>>  - Failure of the source, the destination, or the network will
>>    cause us to lose the running virtual machine. Those failures are
>>    very rare.
>
> I highly doubt that's acceptable in enterprise deployments.

I don't think you can make blanket statements about enterprise
deployments. A lot of enterprises are increasingly building fault
tolerance into their applications, expecting that the underlying
hardware will fail. With cloud environments like EC2 that experience
failure on a pretty regular basis, this is just becoming all the more
common.

So I really don't view this as a critical issue. It certainly would be
if it were the only mechanism available, but as long as we can also
support pre-copy migration it would be fine.

Regards,

Anthony Liguori
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal
From: Nadav Har'El @ 2011-08-08 10:59 UTC
To: Dor Laor
Cc: kvm, Orit Wasserman, t.hirofuchi, satoshi.itoh, qemu-devel, Isaku Yamahata, Avi Kivity

>> * What is postcopy live migration?
>> It is yet another live migration mechanism for Qemu/KVM, which
>> implements the migration technique known as "postcopy" or "lazy"
>> migration. Just after the "migrate" command is invoked, the
>> execution host of a VM is instantaneously switched to the
>> destination host.

Sounds like a cool idea.

>> The benefit is that total migration time is shorter, because each
>> page is transferred only once. [...] The switching time from the
>> source to the destination is several hundred milliseconds, which
>> enables quick load balancing. For details, please refer to the
>> papers.

While these are the obvious benefits, the possible downside (which, as
always, depends on the workload) is the amount of time that the guest
workload runs more slowly than usual, waiting for pages it needs to
continue. There is a whole spectrum between the guest pausing
completely (which would solve all the problems of migration, but is
often considered unacceptable) and running at full speed. Is it
acceptable for the guest to run at 90% speed during the migration?
50%? 10%? I guess we have nothing to lose from having both options,
and choosing the most appropriate technique for each guest!

> That's terrific (nice video also)!
> Orit and myself had the exact same idea too (now we can't patent
> it..).

I think the new implementation is not the only reason why you cannot
patent this idea :-) Demand-paged migration has actually been
discussed (and done) for nearly a quarter of a century (!) in the area
of *process* migration.

The first use I'm aware of was in CMU's Accent in 1987 - see [1].
Another paper, [2], written in 1991, discusses how process migration
is done in UCB's Sprite operating system, and evaluates the various
alternatives common at the time (20 years ago), including what it
calls "lazy copying", which is more or less the same thing as "post
copy". Mosix (a project which, in some sense, is still alive today)
also used some sort of cross between pre-copying (of dirty pages) and
copying on demand of clean pages (from their backing store on the
source machine).

References
[1] "Attacking the Process Migration Bottleneck"
http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf
[2] "Transparent Process Migration: Design Alternatives and the Sprite
Implementation"
http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf

> Advantages:
>  - Virtual machines are using more and more memory resources; for a
>    virtual machine with a very large working set, doing live
>    migration with reasonable downtime is impossible today.

If a guest actually constantly uses (as its working set) most of its
allocated memory, it will basically be unable to do any significant
amount of work on the destination VM until this large working set is
transferred to the destination. So in this scenario, "post copying"
doesn't give any significant advantage over plain old "pause guest and
send it to the destination". Or am I missing something?

> Disadvantages:
>  - During the live migration the guest will run slower than in
>    today's live migration. We need to remember that even today
>    guests suffer a performance penalty on the source during the COW
>    stage (memory copy).

I wonder if something like asynchronous page faults can help somewhat
with multi-process guest workloads (and a modified (PV) guest OS).

>  - Failure of the source, the destination, or the network will cause
>    us to lose the running virtual machine. Those failures are very
>    rare.

How is this different from a VM running on a single machine that
fails? Just that the small probability of failure (roughly) doubles
for the relatively short duration of the transfer?

-- 
Nadav Har'El                     | Monday, Aug 8 2011, 8 Av 5771
nyh@math.technion.ac.il          |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 | If glory comes after death, I'm not
http://nadav.harel.org.il        | in a hurry. (Latin proverb)
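Nadav's last question can be checked with back-of-envelope arithmetic: if host failures are rare and independent, depending on two hosts during the transfer window roughly doubles the hazard rate, but only for that short window. The failure rate and transfer duration below are assumed, illustrative numbers, not measurements.

```python
import math

# During a postcopy transfer the VM depends on *two* hosts (plus the
# link), so the failure hazard roughly doubles for the duration of the
# transfer. Both constants below are assumed for illustration.

FAILURE_RATE = 1e-6     # assumed per-host failure rate, failures/second
TRANSFER_SEC = 60.0     # assumed postcopy transfer window, seconds

# Probability of losing the VM during the window (exponential model):
p_one_host  = 1 - math.exp(-FAILURE_RATE * TRANSFER_SEC)      # normal run
p_migration = 1 - math.exp(-2 * FAILURE_RATE * TRANSFER_SEC)  # two hosts

# For small rate*time the migration-window risk is ~2x the baseline,
# and both are tiny in absolute terms.
assert abs(p_migration / p_one_host - 2) < 1e-3
assert p_migration < 1e-3
```

So the exposure is doubled, but only over the transfer window; against the failure probability of the VM's whole lifetime on one host, the added absolute risk is small, which is the point being made above.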
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 10:59 ` Nadav Har'El @ 2011-08-08 11:47 ` Dor Laor 2011-08-08 16:52 ` Cleber Rosa 0 siblings, 1 reply; 28+ messages in thread From: Dor Laor @ 2011-08-08 11:47 UTC (permalink / raw) To: Nadav Har'El Cc: kvm, Orit Wasserman, t.hirofuchi, satoshi.itoh, qemu-devel, Isaku Yamahata, Avi Kivity On 08/08/2011 01:59 PM, Nadav Har'El wrote: >>> * What's is postcopy livemigration >>> It is is yet another live migration mechanism for Qemu/KVM, which >>> implements the migration technique known as "postcopy" or "lazy" >>> migration. Just after the "migrate" command is invoked, the execution >>> host of a VM is instantaneously switched to a destination host. > > Sounds like a cool idea. > >>> The benefit is, total migration time is shorter because it transfer >>> a page only once. On the other hand precopy may repeat sending same pages >>> again and again because they can be dirtied. >>> The switching time from the source to the destination is several >>> hunderds mili seconds so that it enables quick load balancing. >>> For details, please refer to the papers. > > While these are the obvious benefits, the possible downside (that, as > always, depends on the workload) is the amount of time that the guest > workload runs more slowly than usual, waiting for pages it needs to > continue. There are a whole spectrum between the guest pausing completely > (which would solve all the problems of migration, but is often considered > unacceptible) and running at full-speed. Is it acceptable that the guest > runs at 90% speed during the migration? 50%? 10%? > I guess we could have nothing to lose from having both options, and choosing > the most appropriate technique for each guest! +1 > >> That's terrific (nice video also)! >> Orit and myself had the exact same idea too (now we can't patent it..). 
> > I think new implementation is not the only reason why you cannot patent > this idea :-) Demand-paged migration has actually been discussed (and done) > for nearly a quarter of a century (!) in the area of *process* migration. > > The first use I'm aware of was in CMU's Accent 1987 - see [1]. > Another paper, [2], written in 1991, discusses how process migration is done > in UCB's Sprite operating system, and evaluates the various alternatives > common at the time (20 years ago), including what it calls "lazy copying" > is more-or-less the same thing as "post copy". Mosix (a project which, in some > sense, is still alive to day) also used some sort of cross between pre-copying > (of dirty pages) and copying on-demand of clean pages (from their backing > store on the source machine). > > > References > [1] "Attacking the Process Migration Bottleneck" > http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf w/o reading the internals, patents enable you to implement an existing idea on a new field. Anyway, there won't be no patent in this case. Still let's have the kvm innovation merged. > [2] "Transparent Process Migration: Design Alternatives and the Sprite > Implementation" > http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf > >> Advantages: >> - Virtual machines are using more and more memory resources , >> for a virtual machine with very large working set doing live >> migration with reasonable down time is impossible today. > > If a guest actually constantly uses (working set) most of its allocated > memory, it will basically be unable to do any significant amount of work > on the destination VM until this large working set is transfered to the > destination. So in this scenario, "post copying" doesn't give any > significant advantages over plain-old "pause guest and send it to the > destination". Or am I missing something? 
There is one key advantage in this scheme/use case - if you have a guest with a very large working set, you'll need a very large downtime in order to migrate it with today's algorithm. With post copy (aka streaming/demand paging), the guest won't have any downtime but will run slower than expected. There are guests today that are impractical to really live migrate. btw: Even today, marking pages RO also carries some performance penalty. > >> Disadvantageous: >> - During the live migration the guest will run slower than in >> today's live migration. We need to remember that even today >> guests suffer from performance penalty on the source during the >> COW stage (memory copy). > > I wonder if something like asynchronous page faults can help somewhat with > multi-process guest workloads (and modified (PV) guest OS). They should come into play to some extent. Note that only newer Linux guests will benefit from them. > >> - Failure of the source or destination or the network will cause >> us to lose the running virtual machine. Those failures are very >> rare. > > How is this different from a VM running on a single machine that fails? > Just that the small probability of failure (roughly) doubles for the > relatively-short duration of the transfer? Exactly my point, this is not a major disadvantage because of this low probability. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 11:47 ` Dor Laor @ 2011-08-08 16:52 ` Cleber Rosa 2011-08-08 15:52 ` Anthony Liguori 0 siblings, 1 reply; 28+ messages in thread From: Cleber Rosa @ 2011-08-08 16:52 UTC (permalink / raw) To: qemu-devel On 08/08/2011 07:47 AM, Dor Laor wrote: > On 08/08/2011 01:59 PM, Nadav Har'El wrote: >>>> * What's is postcopy livemigration >>>> It is is yet another live migration mechanism for Qemu/KVM, which >>>> implements the migration technique known as "postcopy" or "lazy" >>>> migration. Just after the "migrate" command is invoked, the execution >>>> host of a VM is instantaneously switched to a destination host. >> >> Sounds like a cool idea. >> >>>> The benefit is, total migration time is shorter because it transfer >>>> a page only once. On the other hand precopy may repeat sending same >>>> pages >>>> again and again because they can be dirtied. >>>> The switching time from the source to the destination is several >>>> hunderds mili seconds so that it enables quick load balancing. >>>> For details, please refer to the papers. >> >> While these are the obvious benefits, the possible downside (that, as >> always, depends on the workload) is the amount of time that the guest >> workload runs more slowly than usual, waiting for pages it needs to >> continue. There are a whole spectrum between the guest pausing >> completely >> (which would solve all the problems of migration, but is often >> considered >> unacceptible) and running at full-speed. Is it acceptable that the guest >> runs at 90% speed during the migration? 50%? 10%? >> I guess we could have nothing to lose from having both options, and >> choosing >> the most appropriate technique for each guest! Not sure if it's possible to have smart heuristics on guest memory page faults, but maybe a technique that reads ahead more pages if a given pattern is detected may help to lower the impact. > > +1 > >> >>> That's terrific (nice video also)! 
>>> Orit and myself had the exact same idea too (now we can't patent it..). >> >> I think new implementation is not the only reason why you cannot patent >> this idea :-) Demand-paged migration has actually been discussed (and >> done) >> for nearly a quarter of a century (!) in the area of *process* >> migration. >> >> The first use I'm aware of was in CMU's Accent 1987 - see [1]. >> Another paper, [2], written in 1991, discusses how process migration >> is done >> in UCB's Sprite operating system, and evaluates the various alternatives >> common at the time (20 years ago), including what it calls "lazy >> copying" >> is more-or-less the same thing as "post copy". Mosix (a project >> which, in some >> sense, is still alive to day) also used some sort of cross between >> pre-copying >> (of dirty pages) and copying on-demand of clean pages (from their >> backing >> store on the source machine). >> >> >> References >> [1] "Attacking the Process Migration Bottleneck" >> >> http://www.nd.edu/~dthain/courses/cse598z/fall2004/papers/accent.pdf > > w/o reading the internals, patents enable you to implement an existing > idea on a new field. Anyway, there won't be no patent in this case. > Still let's have the kvm innovation merged. > >> [2] "Transparent Process Migration: Design Alternatives and the Sprite >> Implementation" >> >> http://nd.edu/~dthain/courses/cse598z/fall2004/papers/sprite-migration.pdf >> >>> Advantages: >>> - Virtual machines are using more and more memory resources , >>> for a virtual machine with very large working set doing live >>> migration with reasonable down time is impossible today. >> >> If a guest actually constantly uses (working set) most of its allocated >> memory, it will basically be unable to do any significant amount of work >> on the destination VM until this large working set is transfered to the >> destination. 
So in this scenario, "post copying" doesn't give any >> significant advantages over plain-old "pause guest and send it to the >> destination". Or am I missing something? > > There is one key advantage in this scheme/use case - if you have a > guest with a very large working set, you'll need a very large downtime > in order to migrate it with today's algorithm. With post copy (aka > streaming/demand paging), the guest won't have any downtime but will > run slower than expected. > > There are guests today that is impractical to really live migrate them. > > btw: Even today, marking pages RO also carries some performance penalty. > >> >>> Disadvantageous: >>> - During the live migration the guest will run slower than in >>> today's live migration. We need to remember that even today >>> guests suffer from performance penalty on the source during >>> the >>> COW stage (memory copy). >> >> I wonder if something like asynchronous page faults can help somewhat >> with >> multi-process guest workloads (and modified (PV) guest OS). > > They should come in to play for some extent. Note that only newer > Linux guest will enjoy of them. > >> >>> - Failure of the source or destination or the network will >>> cause >>> us to lose the running virtual machine. Those failures are >>> very >>> rare. >> >> How is this different from a VM running on a single machine that fails? >> Just that the small probability of failure (roughly) doubles for the >> relatively-short duration of the transfer? > > Exactly my point, this is not a major disadvantage because of this low > probability. > > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 16:52 ` Cleber Rosa @ 2011-08-08 15:52 ` Anthony Liguori 0 siblings, 0 replies; 28+ messages in thread From: Anthony Liguori @ 2011-08-08 15:52 UTC (permalink / raw) To: Cleber Rosa; +Cc: qemu-devel On 08/08/2011 11:52 AM, Cleber Rosa wrote: > On 08/08/2011 07:47 AM, Dor Laor wrote: >> On 08/08/2011 01:59 PM, Nadav Har'El wrote: >>>>> * What's is postcopy livemigration >>>>> It is is yet another live migration mechanism for Qemu/KVM, which >>>>> implements the migration technique known as "postcopy" or "lazy" >>>>> migration. Just after the "migrate" command is invoked, the execution >>>>> host of a VM is instantaneously switched to a destination host. >>> >>> Sounds like a cool idea. >>> >>>>> The benefit is, total migration time is shorter because it transfer >>>>> a page only once. On the other hand precopy may repeat sending same >>>>> pages >>>>> again and again because they can be dirtied. >>>>> The switching time from the source to the destination is several >>>>> hunderds mili seconds so that it enables quick load balancing. >>>>> For details, please refer to the papers. >>> >>> While these are the obvious benefits, the possible downside (that, as >>> always, depends on the workload) is the amount of time that the guest >>> workload runs more slowly than usual, waiting for pages it needs to >>> continue. There are a whole spectrum between the guest pausing >>> completely >>> (which would solve all the problems of migration, but is often >>> considered >>> unacceptible) and running at full-speed. Is it acceptable that the guest >>> runs at 90% speed during the migration? 50%? 10%? >>> I guess we could have nothing to lose from having both options, and >>> choosing >>> the most appropriate technique for each guest! 
> > Not sure if it's possible to have smart heuristics on guest memory page > faults, but maybe a technique that reads ahead more pages if a given > pattern is detected may help to lower the impact. It's got to be a user choice. Post-copy can mean unbounded downtime for a guest with no way to mitigate it. It's impossible to cancel a post-copy migration. I actually think the use-cases for post-copy are fairly limited in an enterprise environment. Regards, Anthony Liguori
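Cleber's read-ahead suggestion can be made concrete with a tiny heuristic: on each fault, check whether recent faults were sequential and, if so, widen the pull. A minimal sketch in Python (class and parameter names are hypothetical, not from the Yabusame prototype):

```python
PAGE = 4096  # assume 4 KiB guest pages

class SequentialPrefetcher:
    """Widen the pull window when page faults arrive sequentially."""

    def __init__(self, max_ahead=8):
        self.last_page = None
        self.streak = 0          # length of the current sequential run
        self.max_ahead = max_ahead

    def pages_to_pull(self, fault_addr):
        page = fault_addr // PAGE
        if self.last_page is not None and page == self.last_page + 1:
            self.streak += 1
        else:
            self.streak = 0      # random access: no read-ahead
        self.last_page = page
        ahead = min(self.streak, self.max_ahead)
        # always pull the faulting page, plus `ahead` pages after it
        return [(page + i) * PAGE for i in range(ahead + 1)]

p = SequentialPrefetcher()
p.pages_to_pull(0x1000)  # isolated fault: pull just that page
p.pages_to_pull(0x2000)  # sequential: start reading ahead
```

A real implementation would also have to back off when prefetched pages go unused, since each speculative page costs migration bandwidth.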
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 9:20 ` Dor Laor 2011-08-08 9:40 ` Yaniv Kaul 2011-08-08 10:59 ` Nadav Har'El @ 2011-08-08 12:32 ` Anthony Liguori 2011-08-08 15:11 ` Dor Laor 2 siblings, 1 reply; 28+ messages in thread From: Anthony Liguori @ 2011-08-08 12:32 UTC (permalink / raw) To: dlaor Cc: kvm, Orit Wasserman, t.hirofuchi, satoshi.itoh, qemu-devel, Isaku Yamahata, Avi Kivity On 08/08/2011 04:20 AM, Dor Laor wrote: > On 08/08/2011 06:24 AM, Isaku Yamahata wrote: >> This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM" >> on which we'll give a talk at KVM-forum. >> The purpose of this mail is to letting developers know it in advance >> so that we can get better feedback on its design/implementation approach >> early before our starting to implement it. >> >> >> Background >> ========== >> * What's is postcopy livemigration >> It is is yet another live migration mechanism for Qemu/KVM, which >> implements the migration technique known as "postcopy" or "lazy" >> migration. Just after the "migrate" command is invoked, the execution >> host of a VM is instantaneously switched to a destination host. >> >> The benefit is, total migration time is shorter because it transfer >> a page only once. On the other hand precopy may repeat sending same pages >> again and again because they can be dirtied. >> The switching time from the source to the destination is several >> hunderds mili seconds so that it enables quick load balancing. >> For details, please refer to the papers. >> >> We believe this is useful for others so that we'd like to merge this >> feature into the upstream qemu/kvm. The existing implementation that >> we have right now is very ad-hoc because it's for academic research. >> For the upstream merge, we're starting to re-design/implement it and >> we'd like to get feedback early. 
Although many improvements/optimizations >> are possible, we should implement/merge the simple/clean, but extensible >> as well, one at first and then improve/optimize it later. >> >> postcopy livemigration will be introduced as optional feature. The >> existing >> precopy livemigration remains as default behavior. >> >> >> * related links: >> project page >> http://sites.google.com/site/grivonhome/quick-kvm-migration >> >> Enabling Instantaneous Relocation of Virtual Machines with a >> Lightweight VMM Extension, >> (proof-of-concept, ad-hoc prototype. not a new design) >> http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf >> http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf >> >> Reactive consolidation of virtual machines enabled by postcopy live >> migration >> (advantage for VM consolidation) >> http://portal.acm.org/citation.cfm?id=1996125 >> http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf >> >> >> Qemu wiki >> http://wiki.qemu.org/Features/PostCopyLiveMigration >> >> >> Design/Implementation >> ===================== >> The basic idea of postcopy livemigration is to use a sort of distributed >> shared memory between the migration source and destination. >> >> The migration procedure looks like >> - start migration >> stop the guest VM on the source and send the machine states except >> guest RAM to the destination >> - resume the guest VM on the destination without guest RAM contents >> - Hook guest access to pages, and pull page contents from the source >> This continues until all the pages are pulled to the destination >> >> The big picture is depicted at >> http://wiki.qemu.org/File:Postcopy-livemigration.png > > That's terrific (nice video also)! > Orit and myself had the exact same idea too (now we can't patent it..). > > Advantages: > - No down time due to memory copying. But non-deterministic down time due to network latency while trying to satisfy a page fault. 
> - Efficient, reduce needed traffic no need to re-send pages. It's not quite that simple. Post-copy needs to introduce a protocol capable of requesting pages. I think in presenting something like this, it's important to collect quite a bit of performance data. I'd suggest doing runs while running jitterd in the guest to attempt to quantify the actual downtime experienced too. http://git.codemonkey.ws/cgit/jitterd.git/ There's a lot of potential in something like this, but it's not obvious to me whether it's a net win. Should make for a very interesting presentation :-) > - Reduce overall RAM consumption of the source and destination > as opposed from current live migration (both the source and the > destination allocate the memory until the live migration > completes). We can free copied memory once the destination guest > received it and save RAM. > - Increase parallelism for SMP guests we can have multiple > virtual CPU handle their demand paging . Less time to hold a > global lock, less thread contention. > - Virtual machines are using more and more memory resources , > for a virtual machine with very large working set doing live > migration with reasonable down time is impossible today. This is really just a limitation of our implementation. In theory, pre-copy allows you to exert fine grain resource control over the guest which you can use to encourage convergence. > Disadvantageous: > - During the live migration the guest will run slower than in > today's live migration. We need to remember that even today > guests suffer from performance penalty on the source during the > COW stage (memory copy). > - Failure of the source or destination or the network will cause > us to lose the running virtual machine. Those failures are very > rare. > In case there is shared storage we can store a copy of the > memory there , that can be recovered in case of such failure . > > Overall, it looks like a better approach for the vast majority of cases. 
> Hope it will get merged to kvm and become the default way. One thing I think we need to do is put together a live migration roadmap. We've got a lot of invasive efforts underway with live migration and I fear that without some planning and serialization, some of this useful work will get lost. Regards, Anthony Liguori >> >> >> There are several design points. >> - who takes care of pulling page contents. >> an independent daemon vs a thread in qemu >> The daemon approach is preferable because an independent daemon would >> easy for debug postcopy memory mechanism without qemu. >> If required, it wouldn't be difficult to convert a daemon into >> a thread in qemu >> >> - connection between the source and the destination >> The connection for live migration can be re-used after sending machine >> state. >> >> - transfer protocol >> The existing protocol that exists today can be extended. >> >> - hooking guest RAM access >> Introduce a character device to handle page fault. >> When page fault occurs, it queues page request up to user space daemon >> at the destination. And the daemon pulls page contents from the source >> and serves it into the character device. Then the page fault is >> resovlved. > > Isn't there a simpler way of using madvise verb to mark that the > destination guest RAM will need paging? > > Cheers and looking forward to the presentation over the kvm forum, > Dor > >> >> >> * More on hooking guest RAM access >> There are several candidate for the implementation. Our preference is >> character device approach. >> >> - inserting hooks into everywhere in qemu/kvm >> This is impractical >> >> - backing store for guest ram >> a block device or a file can be used to back guest RAM. >> Thus hook the guest ram access. >> >> pros >> - new device driver isn't needed.
>> cons >> - future improvement would be difficult >> - some KVM host feature(KSM, THP) wouldn't work >> >> - character device >> qemu mmap() the dedicated character device, and then hook page fault. >> >> pros >> - straght forward approach >> - future improvement would be easy >> cons >> - new driver is needed >> - some KVM host feature(KSM, THP) wouldn't work >> They checks if a given VMA is anonymous. This can be fixed. >> >> - swap device >> When creating guest, it is set up as if all the guest RAM is swapped out >> to a dedicated swap device, which may be nbd disk (or some kind of user >> space block device, BUSE?). >> When the VM tries to access memory, swap-in is triggered and IO to the >> swap device is issued. Then the IO to swap is routed to the daemon >> in user space with nbd protocol (or BUSE, AOE, iSCSI...). The daemon >> pulls >> pages from the migration source and services the IO request. >> >> pros >> - After the page transfer is complete, everything is same as normal case. >> - no new device driver isn't needed >> cons >> - future improvement would be difficult >> - administration: setting up nbd, swap device >> >> Thanks in advance > > ^ permalink raw reply [flat|nested] 28+ messages in thread
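The character-device flow quoted above — a fault queued to a userspace daemon, which pulls the page from the source and resolves the fault — can be mocked in a few lines to make the control flow concrete. The queue stands in for the proposed character device and the dict for the source's page server; nothing here is the actual driver interface:

```python
from queue import Queue

PAGE = 4096

def serve_faults(fault_queue, source_ram, dest_ram):
    """Toy daemon loop: resolve queued page faults by pulling from the source.

    fault_queue mocks the character device that queues fault addresses;
    source_ram mocks the page server on the migration source.
    """
    while True:
        addr = fault_queue.get()
        if addr is None:              # shutdown sentinel: all pages pulled
            break
        page = addr // PAGE
        # each page crosses the wire at most once -- the key postcopy property
        dest_ram.setdefault(page, source_ram[page])

faults = Queue()
src = {0: b"a" * PAGE, 1: b"b" * PAGE}
dst = {}
for addr in (0 * PAGE, 1 * PAGE, None):
    faults.put(addr)
serve_faults(faults, src, dst)        # dst now holds both pages
```

The real daemon would of course block on the device, speak the migration protocol to the source, and write the content back through the device to unblock the faulting vCPU.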
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 12:32 ` Anthony Liguori @ 2011-08-08 15:11 ` Dor Laor 2011-08-08 15:29 ` Anthony Liguori 0 siblings, 1 reply; 28+ messages in thread From: Dor Laor @ 2011-08-08 15:11 UTC (permalink / raw) To: Anthony Liguori Cc: kvm, Isaku Yamahata, t.hirofuchi, satoshi.itoh, qemu-devel, Orit Wasserman, Avi Kivity On 08/08/2011 03:32 PM, Anthony Liguori wrote: > On 08/08/2011 04:20 AM, Dor Laor wrote: >> On 08/08/2011 06:24 AM, Isaku Yamahata wrote: >>> This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM" >>> on which we'll give a talk at KVM-forum. >>> The purpose of this mail is to letting developers know it in advance >>> so that we can get better feedback on its design/implementation approach >>> early before our starting to implement it. >>> >>> >>> Background >>> ========== >>> * What's is postcopy livemigration >>> It is is yet another live migration mechanism for Qemu/KVM, which >>> implements the migration technique known as "postcopy" or "lazy" >>> migration. Just after the "migrate" command is invoked, the execution >>> host of a VM is instantaneously switched to a destination host. >>> >>> The benefit is, total migration time is shorter because it transfer >>> a page only once. On the other hand precopy may repeat sending same >>> pages >>> again and again because they can be dirtied. >>> The switching time from the source to the destination is several >>> hunderds mili seconds so that it enables quick load balancing. >>> For details, please refer to the papers. >>> >>> We believe this is useful for others so that we'd like to merge this >>> feature into the upstream qemu/kvm. The existing implementation that >>> we have right now is very ad-hoc because it's for academic research. >>> For the upstream merge, we're starting to re-design/implement it and >>> we'd like to get feedback early. 
Although many >>> improvements/optimizations >>> are possible, we should implement/merge the simple/clean, but extensible >>> as well, one at first and then improve/optimize it later. >>> >>> postcopy livemigration will be introduced as optional feature. The >>> existing >>> precopy livemigration remains as default behavior. >>> >>> >>> * related links: >>> project page >>> http://sites.google.com/site/grivonhome/quick-kvm-migration >>> >>> Enabling Instantaneous Relocation of Virtual Machines with a >>> Lightweight VMM Extension, >>> (proof-of-concept, ad-hoc prototype. not a new design) >>> http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-paper.pdf >>> http://grivon.googlecode.com/svn/pub/docs/ccgrid2010-hirofuchi-talk.pdf >>> >>> Reactive consolidation of virtual machines enabled by postcopy live >>> migration >>> (advantage for VM consolidation) >>> http://portal.acm.org/citation.cfm?id=1996125 >>> http://www.emn.fr/x-info/ascola/lib/exe/fetch.php?media=internet:vtdc-postcopy.pdf >>> >>> >>> >>> Qemu wiki >>> http://wiki.qemu.org/Features/PostCopyLiveMigration >>> >>> >>> Design/Implementation >>> ===================== >>> The basic idea of postcopy livemigration is to use a sort of distributed >>> shared memory between the migration source and destination. >>> >>> The migration procedure looks like >>> - start migration >>> stop the guest VM on the source and send the machine states except >>> guest RAM to the destination >>> - resume the guest VM on the destination without guest RAM contents >>> - Hook guest access to pages, and pull page contents from the source >>> This continues until all the pages are pulled to the destination >>> >>> The big picture is depicted at >>> http://wiki.qemu.org/File:Postcopy-livemigration.png >> >> That's terrific (nice video also)! >> Orit and myself had the exact same idea too (now we can't patent it..). >> >> Advantages: >> - No down time due to memory copying. 
> > But non-deterministic down time due to network latency while trying to > satisfy a page fault. True but it is possible to limit it with some dedicated network or bandwidth reservation. > >> - Efficient, reduce needed traffic no need to re-send pages. > > It's not quite that simple. Post-copy needs to introduce a protocol > capable of requesting pages. Just another subsection.. (kidding), still it shouldn't be too complicated, just an offset+pagesize and return page_content/error > > I think in presenting something like this, it's important to collect > quite a bit of performance data. I'd suggest doing runs while running > jitterd in the guest to attempt to quantify the actual downtime > experienced too. > > http://git.codemonkey.ws/cgit/jitterd.git/ and also comparing the speed that it takes for various benchmarks like iozone/netperf/linpack/.. > > There's a lot of potential in something like this, but it's not obvious > to me whether it's a net win. Should make for a very interesting > presentation :-) > >> - Reduce overall RAM consumption of the source and destination >> as opposed from current live migration (both the source and the >> destination allocate the memory until the live migration >> completes). We can free copied memory once the destination guest >> received it and save RAM. >> - Increase parallelism for SMP guests we can have multiple >> virtual CPU handle their demand paging . Less time to hold a >> global lock, less thread contention. >> - Virtual machines are using more and more memory resources , >> for a virtual machine with very large working set doing live >> migration with reasonable down time is impossible today. > > This is really just a limitation of our implementation. In theory, > pre-copy allows you to exert fine grain resource control over the guest > which you can use to encourage convergence. 
But a very large guest w/ a large working set that changes faster than the network bandwidth can absorb might always need a huge down time with the current system. > >> Disadvantageous: >> - During the live migration the guest will run slower than in >> today's live migration. We need to remember that even today >> guests suffer from performance penalty on the source during the >> COW stage (memory copy). >> - Failure of the source or destination or the network will cause >> us to lose the running virtual machine. Those failures are very >> rare. >> In case there is shared storage we can store a copy of the >> memory there , that can be recovered in case of such failure . >> >> Overall, it looks like a better approach for the vast majority of cases. >> Hope it will get merged to kvm and become the default way. > > One thing I think we need to do is put together a live migration > roadmap. We've got a lot of invasive efforts underway with live > migration and I fear that without some planning and serialization, some > of this useful work with get lost. Some of them are parallel. I think all the readers here agree that post copy migration should be an option while we need to maintain the current one. In addition to that there is the switch to a separate thread, which is standalone, and the vmstate vs QOM/visitor work. Lastly, the compression work. > > Regards, > > Anthony Liguori > >>> >>> >>> There are several design points. >>> - who takes care of pulling page contents. >>> an independent daemon vs a thread in qemu >>> The daemon approach is preferable because an independent daemon would >>> easy for debug postcopy memory mechanism without qemu. >>> If required, it wouldn't be difficult to convert a daemon into >>> a thread in qemu >>> >>> - connection between the source and the destination >>> The connection for live migration can be re-used after sending machine >>> state. >>> >>> - transfer protocol >>> The existing protocol that exists today can be extended.
>>> >>> - hooking guest RAM access >>> Introduce a character device to handle page fault. >>> When page fault occurs, it queues page request up to user space daemon >>> at the destination. And the daemon pulls page contents from the source >>> and serves it into the character device. Then the page fault is >>> resovlved. >> >> Isn't there a simpler way of using madvise verb to mark that the >> destination guest RAM will need paging? >> >> Cheers and looking forward to the presentation over the kvm forum, >> Dor >> >>> >>> >>> * More on hooking guest RAM access >>> There are several candidate for the implementation. Our preference is >>> character device approach. >>> >>> - inserting hooks into everywhere in qemu/kvm >>> This is impractical >>> >>> - backing store for guest ram >>> a block device or a file can be used to back guest RAM. >>> Thus hook the guest ram access. >>> >>> pros >>> - new device driver isn't needed. >>> cons >>> - future improvement would be difficult >>> - some KVM host feature(KSM, THP) wouldn't work >>> >>> - character device >>> qemu mmap() the dedicated character device, and then hook page fault. >>> >>> pros >>> - straght forward approach >>> - future improvement would be easy >>> cons >>> - new driver is needed >>> - some KVM host feature(KSM, THP) wouldn't work >>> They checks if a given VMA is anonymous. This can be fixed. >>> >>> - swap device >>> When creating guest, it is set up as if all the guest RAM is swapped out >>> to a dedicated swap device, which may be nbd disk (or some kind of user >>> space block device, BUSE?). >>> When the VM tries to access memory, swap-in is triggered and IO to the >>> swap device is issued. Then the IO to swap is routed to the daemon >>> in user space with nbd protocol (or BUSE, AOE, iSCSI...). The daemon >>> pulls >>> pages from the migration source and services the IO request. >>> >>> pros >>> - After the page transfer is complete, everything is same as normal >>> case. 
>>> - no new device driver isn't needed >>> cons >>> - future improvement would be difficult >>> - administration: setting up nbd, swap device >>> >>> Thanks in advance >> >>
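Dor's "just an offset+pagesize and return page_content/error" suggestion amounts to a very small wire format. A hypothetical encoding with Python's struct module (the field layout is illustrative only, not the protocol QEMU would actually adopt):

```python
import struct

REQ = struct.Struct("!QI")   # request: offset (u64), page size (u32)
RSP = struct.Struct("!I")    # response header: 0 = ok, nonzero = error code

def encode_request(offset, pagesize=4096):
    return REQ.pack(offset, pagesize)

def decode_request(buf):
    return REQ.unpack(buf[:REQ.size])

def encode_response(content=None, error=0):
    # on error the response is just the 4-byte status, no payload
    if error:
        return RSP.pack(error)
    return RSP.pack(0) + content

req = encode_request(0x200000)           # ask for the page at 2 MiB
assert decode_request(req) == (0x200000, 4096)
```

As Anthony points out elsewhere in the thread, the format itself is the easy part; the round trips it implies are the real cost.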
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 15:11 ` Dor Laor @ 2011-08-08 15:29 ` Anthony Liguori 2011-08-08 15:36 ` Avi Kivity 0 siblings, 1 reply; 28+ messages in thread From: Anthony Liguori @ 2011-08-08 15:29 UTC (permalink / raw) To: dlaor Cc: kvm, Orit Wasserman, t.hirofuchi, satoshi.itoh, qemu-devel, Isaku Yamahata, Avi Kivity On 08/08/2011 10:11 AM, Dor Laor wrote: > On 08/08/2011 03:32 PM, Anthony Liguori wrote: >> On 08/08/2011 04:20 AM, Dor Laor wrote: >>> >>> That's terrific (nice video also)! >>> Orit and myself had the exact same idea too (now we can't patent it..). >>> >>> Advantages: >>> - No down time due to memory copying. >> >> But non-deterministic down time due to network latency while trying to >> satisfy a page fault. > > True but it is possible to limit it with some dedicated network or > bandwidth reservation. Yup. Any technique that uses RDMA (which is basically what this is) requires dedicated network resources. >>> - Efficient, reduce needed traffic no need to re-send pages. >> >> It's not quite that simple. Post-copy needs to introduce a protocol >> capable of requesting pages. > > Just another subsection.. (kidding), still it shouldn't be too > complicated, just an offset+pagesize and return page_content/error What I meant by this is that there is potentially a lot of round trip overhead. Pre-copy migration works well with reasonable high latency network connections because the downtime is capped only by the maximum latency sending from one point to another. But with something like this, the total downtime is 2*max_latency*nb_pagefaults. That's potentially pretty high. So it may be desirable to try to reduce nb_pagefaults by prefaulting in pages, etc. Suffice to say, this ends up getting complicated and may end up burning network traffic too. >> This is really just a limitation of our implementation. 
In theory, >> pre-copy allows you to exert fine grain resource control over the guest >> which you can use to encourage convergence. > > But a very large guest w/ large working set that changes more frequent > than the network bandwidth might always need huge down time with the > current system. In theory, you can do things like reduce the guests' priority to reduce the amount of work it can do in order to encourage convergence. >> One thing I think we need to do is put together a live migration >> roadmap. We've got a lot of invasive efforts underway with live >> migration and I fear that without some planning and serialization, some >> of this useful work with get lost. > > Some of them are parallel. I think all the readers here agree that post > copy migration should be an option while we need to maintain the current > one. I actually think they need to be done mostly in sequence while cleaning up some of the current infrastructure. I don't think we really should make any major changes (beyond maybe the separate thread) until we eliminate QEMUFile. There's so much overhead involved in using QEMUFile today, I think it's hard to talk about performance data when we've got a major bottleneck sitting in the middle. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 15:29 ` Anthony Liguori @ 2011-08-08 15:36 ` Avi Kivity 2011-08-08 15:59 ` Anthony Liguori 0 siblings, 1 reply; 28+ messages in thread From: Avi Kivity @ 2011-08-08 15:36 UTC (permalink / raw) To: Anthony Liguori Cc: kvm, satoshi.itoh, t.hirofuchi, dlaor, Orit Wasserman, qemu-devel, Isaku Yamahata On 08/08/2011 06:29 PM, Anthony Liguori wrote: > >>>> - Efficient, reduce needed traffic no need to re-send pages. >>> >>> It's not quite that simple. Post-copy needs to introduce a protocol >>> capable of requesting pages. >> >> Just another subsection.. (kidding), still it shouldn't be too >> complicated, just an offset+pagesize and return page_content/error > > What I meant by this is that there is potentially a lot of round trip > overhead. Pre-copy migration works well with reasonable high latency > network connections because the downtime is capped only by the maximum > latency sending from one point to another. > > But with something like this, the total downtime is > 2*max_latency*nb_pagefaults. That's potentially pretty high. Let's be generous and assume that the latency is dominated by page copy time. So the total downtime is equal to the first live migration pass, ~20 sec for 2GB on 1GbE. It's distributed over potentially even more time, though. If the guest does a lot of I/O, it may not be noticeable (esp. if we don't copy over pages read from disk). If the guest is cpu/memory bound, it'll probably suck badly. > > So it may be desirable to try to reduce nb_pagefaults by prefaulting > in pages, etc. Suffice to say, this ends up getting complicated and > may end up burning network traffic too. Yeah, and prefaulting in the background adds latency to synchronous requests. This really needs excellent networking resources to work well. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 28+ messages in thread
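Avi's ~20 sec figure and Anthony's 2*max_latency*nb_pagefaults bound are easy to sanity-check back-of-envelope (the numbers below come from the thread, not from measurements):

```python
def first_pass_seconds(ram_bytes, link_bps):
    """Time to push all guest RAM over the wire once (protocol overhead ignored)."""
    return ram_bytes * 8 / link_bps

def postcopy_stall_seconds(rtt_seconds, nb_pagefaults):
    """Anthony's bound: each remote fault stalls the guest for about one round trip."""
    return rtt_seconds * nb_pagefaults

# 2 GiB over gigabit Ethernet -- roughly Avi's "~20 sec" estimate
print(first_pass_seconds(2 * 2**30, 10**9))      # ~17.2 s

# 100,000 remote faults at a 0.5 ms round trip: 50 s of cumulative stall,
# but spread over the migration rather than taken as one downtime window
print(postcopy_stall_seconds(0.0005, 100_000))
```

This illustrates why both sides of the argument hold: total stall can exceed precopy's first pass, yet no single pause is user-visible the way a stop-and-copy downtime is.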
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 15:36 ` Avi Kivity @ 2011-08-08 15:59 ` Anthony Liguori 2011-08-08 19:47 ` Dor Laor 0 siblings, 1 reply; 28+ messages in thread From: Anthony Liguori @ 2011-08-08 15:59 UTC (permalink / raw) To: Avi Kivity Cc: kvm, satoshi.itoh, t.hirofuchi, dlaor, qemu-devel, Orit Wasserman, Isaku Yamahata On 08/08/2011 10:36 AM, Avi Kivity wrote: > On 08/08/2011 06:29 PM, Anthony Liguori wrote: >> >>>>> - Efficient, reduce needed traffic no need to re-send pages. >>>> >>>> It's not quite that simple. Post-copy needs to introduce a protocol >>>> capable of requesting pages. >>> >>> Just another subsection.. (kidding), still it shouldn't be too >>> complicated, just an offset+pagesize and return page_content/error >> >> What I meant by this is that there is potentially a lot of round trip >> overhead. Pre-copy migration works well with reasonable high latency >> network connections because the downtime is capped only by the maximum >> latency sending from one point to another. >> >> But with something like this, the total downtime is >> 2*max_latency*nb_pagefaults. That's potentially pretty high. > > Let's be generous and assume that the latency is dominated by page copy > time. So the total downtime is equal to the first live migration pass, > ~20 sec for 2GB on 1GbE. It's distributed over potentially even more > time, though. If the guest does a lot of I/O, it may not be noticeable > (esp. if we don't copy over pages read from disk). If the guest is > cpu/memory bound, it'll probably suck badly. > >> >> So it may be desirable to try to reduce nb_pagefaults by prefaulting >> in pages, etc. Suffice to say, this ends up getting complicated and >> may end up burning network traffic too. > > Yeah, and prefaulting in the background adds latency to synchronous > requests. > > This really needs excellent networking resources to work well. 
Yup, it's very similar to other technologies using RDMA (single system image, lock step execution, etc.). Regards, Anthony Liguori > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 15:59 ` Anthony Liguori @ 2011-08-08 19:47 ` Dor Laor 2011-08-09 2:07 ` Isaku Yamahata 0 siblings, 1 reply; 28+ messages in thread From: Dor Laor @ 2011-08-08 19:47 UTC (permalink / raw) To: Anthony Liguori Cc: kvm, satoshi.itoh, t.hirofuchi, qemu-devel, Orit Wasserman, Avi Kivity, Isaku Yamahata On 08/08/2011 06:59 PM, Anthony Liguori wrote: > On 08/08/2011 10:36 AM, Avi Kivity wrote: >> On 08/08/2011 06:29 PM, Anthony Liguori wrote: >>> >>>>>> - Efficient, reduce needed traffic no need to re-send pages. >>>>> >>>>> It's not quite that simple. Post-copy needs to introduce a protocol >>>>> capable of requesting pages. >>>> >>>> Just another subsection.. (kidding), still it shouldn't be too >>>> complicated, just an offset+pagesize and return page_content/error >>> >>> What I meant by this is that there is potentially a lot of round trip >>> overhead. Pre-copy migration works well with reasonable high latency >>> network connections because the downtime is capped only by the maximum >>> latency sending from one point to another. >>> >>> But with something like this, the total downtime is >>> 2*max_latency*nb_pagefaults. That's potentially pretty high. >> >> Let's be generous and assume that the latency is dominated by page copy >> time. So the total downtime is equal to the first live migration pass, >> ~20 sec for 2GB on 1GbE. It's distributed over potentially even more >> time, though. If the guest does a lot of I/O, it may not be noticeable >> (esp. if we don't copy over pages read from disk). If the guest is >> cpu/memory bound, it'll probably suck badly. >> >>> >>> So it may be desirable to try to reduce nb_pagefaults by prefaulting >>> in pages, etc. Suffice to say, this ends up getting complicated and >>> may end up burning network traffic too. It is complicated but can help (like pre faulting working set size pages). Beyond that async page fault will help a bit. 
Lastly, if a guest has several apps, those that are memory-intensive might suffer, but lightweight apps will function nicely. It provides extra flexibility over the current protocol (which still has value for some of the loads). >> >> Yeah, and prefaulting in the background adds latency to synchronous >> requests. >> >> This really needs excellent networking resources to work well. > > Yup, it's very similar to other technologies using RDMA (single system > image, lock step execution, etc.). > > Regards, > > Anthony Liguori > >> > > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 19:47 ` Dor Laor @ 2011-08-09 2:07 ` Isaku Yamahata 0 siblings, 0 replies; 28+ messages in thread From: Isaku Yamahata @ 2011-08-09 2:07 UTC (permalink / raw) To: Dor Laor Cc: kvm, satoshi.itoh, t.hirofuchi, qemu-devel, Orit Wasserman, Avi Kivity On Mon, Aug 08, 2011 at 10:47:09PM +0300, Dor Laor wrote: > On 08/08/2011 06:59 PM, Anthony Liguori wrote: >> On 08/08/2011 10:36 AM, Avi Kivity wrote: >>> On 08/08/2011 06:29 PM, Anthony Liguori wrote: >>>> >>>>>>> - Efficient, reduce needed traffic no need to re-send pages. >>>>>> >>>>>> It's not quite that simple. Post-copy needs to introduce a protocol >>>>>> capable of requesting pages. >>>>> >>>>> Just another subsection.. (kidding), still it shouldn't be too >>>>> complicated, just an offset+pagesize and return page_content/error >>>> >>>> What I meant by this is that there is potentially a lot of round trip >>>> overhead. Pre-copy migration works well with reasonable high latency >>>> network connections because the downtime is capped only by the maximum >>>> latency sending from one point to another. >>>> >>>> But with something like this, the total downtime is >>>> 2*max_latency*nb_pagefaults. That's potentially pretty high. >>> >>> Let's be generous and assume that the latency is dominated by page copy >>> time. So the total downtime is equal to the first live migration pass, >>> ~20 sec for 2GB on 1GbE. It's distributed over potentially even more >>> time, though. If the guest does a lot of I/O, it may not be noticeable >>> (esp. if we don't copy over pages read from disk). If the guest is >>> cpu/memory bound, it'll probably suck badly. >>> >>>> >>>> So it may be desirable to try to reduce nb_pagefaults by prefaulting >>>> in pages, etc. Suffice to say, this ends up getting complicated and >>>> may end up burning network traffic too. > > It is complicated but can help (like pre faulting working set size > pages). 
Beyond that async page fault will help a bit. > Lastly, if a guest has several apps, those that are memory intensive > might suffer but light weight apps will function nicely. > It provides extra flexibility over the current protocol (that still has > value for some of the loads). We can also combine postcopy with precopy: for example, the migration starts in precopy mode and then, at some point, is switched into postcopy mode. > >>> >>> Yeah, and prefaulting in the background adds latency to synchronous >>> requests. >>> >>> This really needs excellent networking resources to work well. >> >> Yup, it's very similar to other technologies using RDMA (single system >> image, lock step execution, etc.). >> >> Regards, >> >> Anthony Liguori >> >>> >> >> > -- yamahata ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 3:24 [Qemu-devel] [RFC] postcopy livemigration proposal Isaku Yamahata 2011-08-08 9:20 ` Dor Laor @ 2011-08-08 9:38 ` Stefan Hajnoczi 2011-08-08 9:43 ` Isaku Yamahata 2011-08-08 12:38 ` Avi Kivity 2011-08-12 11:07 ` [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal) Isaku Yamahata 3 siblings, 1 reply; 28+ messages in thread From: Stefan Hajnoczi @ 2011-08-08 9:38 UTC (permalink / raw) To: Isaku Yamahata; +Cc: Gleb Natapov, t.hirofuchi, qemu-devel, kvm, satoshi.itoh On Mon, Aug 8, 2011 at 4:24 AM, Isaku Yamahata <yamahata@valinux.co.jp> wrote: > This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM" > on which we'll give a talk at KVM-forum. I'm curious if this approach is compatible with asynchronous page faults? The idea there was to tell the guest about a page fault so it can continue to do useful work in the meantime (if the fault was in guest userspace). Stefan ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 9:38 ` Stefan Hajnoczi @ 2011-08-08 9:43 ` Isaku Yamahata 0 siblings, 0 replies; 28+ messages in thread From: Isaku Yamahata @ 2011-08-08 9:43 UTC (permalink / raw) To: Stefan Hajnoczi; +Cc: Gleb Natapov, t.hirofuchi, qemu-devel, kvm, satoshi.itoh On Mon, Aug 08, 2011 at 10:38:35AM +0100, Stefan Hajnoczi wrote: > On Mon, Aug 8, 2011 at 4:24 AM, Isaku Yamahata <yamahata@valinux.co.jp> wrote: > > This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM" > > on which we'll give a talk at KVM-forum. > > I'm curious if this approach is compatible with asynchronous page > faults? The idea there was to tell the guest about a page fault so it > can continue to do useful work in the meantime (if the fault was in > guest userspace). Yes. It's quite possible to inject an async page fault into the guest when the faulted page isn't available on the destination. At the same time the page will be requested from the migration source. I don't think it would be too difficult. -- yamahata ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 3:24 [Qemu-devel] [RFC] postcopy livemigration proposal Isaku Yamahata 2011-08-08 9:20 ` Dor Laor 2011-08-08 9:38 ` Stefan Hajnoczi @ 2011-08-08 12:38 ` Avi Kivity 2011-08-09 2:33 ` Isaku Yamahata 2011-08-12 11:07 ` [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal) Isaku Yamahata 3 siblings, 1 reply; 28+ messages in thread From: Avi Kivity @ 2011-08-08 12:38 UTC (permalink / raw) To: Isaku Yamahata Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh On 08/08/2011 06:24 AM, Isaku Yamahata wrote: > This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM" > on which we'll give a talk at KVM-forum. > The purpose of this mail is to letting developers know it in advance > so that we can get better feedback on its design/implementation approach > early before our starting to implement it. Interesting; what is the impact of increased latency on memory reads? > > > There are several design points. > - who takes care of pulling page contents. > an independent daemon vs a thread in qemu > The daemon approach is preferable because an independent daemon would > easy for debug postcopy memory mechanism without qemu. > If required, it wouldn't be difficult to convert a daemon into > a thread in qemu Isn't this equivalent to touching each page in sequence? Care must be taken that we don't post too many requests, or it could affect the latency of synchronous accesses by the guest. > > - connection between the source and the destination > The connection for live migration can be re-used after sending machine > state. > > - transfer protocol > The existing protocol that exists today can be extended. > > - hooking guest RAM access > Introduce a character device to handle page fault. > When page fault occurs, it queues page request up to user space daemon > at the destination. 
And the daemon pulls page contents from the source > and serves it into the character device. Then the page fault is resovlved. This doesn't play well with host swapping, transparent hugepages, or ksm, does it? I see you note this later on. > * More on hooking guest RAM access > There are several candidate for the implementation. Our preference is > character device approach. > > - inserting hooks into everywhere in qemu/kvm > This is impractical > > - backing store for guest ram > a block device or a file can be used to back guest RAM. > Thus hook the guest ram access. > > pros > - new device driver isn't needed. > cons > - future improvement would be difficult > - some KVM host feature(KSM, THP) wouldn't work > > - character device > qemu mmap() the dedicated character device, and then hook page fault. > > pros > - straght forward approach > - future improvement would be easy > cons > - new driver is needed > - some KVM host feature(KSM, THP) wouldn't work > They checks if a given VMA is anonymous. This can be fixed. > > - swap device > When creating guest, it is set up as if all the guest RAM is swapped out > to a dedicated swap device, which may be nbd disk (or some kind of user > space block device, BUSE?). > When the VM tries to access memory, swap-in is triggered and IO to the > swap device is issued. Then the IO to swap is routed to the daemon > in user space with nbd protocol (or BUSE, AOE, iSCSI...). The daemon pulls > pages from the migration source and services the IO request. > > pros > - After the page transfer is complete, everything is same as normal case. > - no new device driver isn't needed > cons > - future improvement would be difficult > - administration: setting up nbd, swap device > Using a swap device would be my preference. We'd still be using anonymous memory so thp/ksm/ordinary swap still work. It would need to be a special kind of swap device since we only want to swap in, and never out, to that device. 
We'd also need a special way of telling the kernel that memory comes from that device. In that respect it's similar to your second option. Maybe we should use a backing file (using nbd) and have a madvise() call that converts the vma to anonymous memory once the migration is finished. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-08 12:38 ` Avi Kivity @ 2011-08-09 2:33 ` Isaku Yamahata 2011-08-10 13:55 ` Avi Kivity 0 siblings, 1 reply; 28+ messages in thread From: Isaku Yamahata @ 2011-08-09 2:33 UTC (permalink / raw) To: Avi Kivity; +Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh On Mon, Aug 08, 2011 at 03:38:54PM +0300, Avi Kivity wrote: > On 08/08/2011 06:24 AM, Isaku Yamahata wrote: >> This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM" >> on which we'll give a talk at KVM-forum. >> The purpose of this mail is to letting developers know it in advance >> so that we can get better feedback on its design/implementation approach >> early before our starting to implement it. > > Interesting; what is the impact of increased latency on memory reads? Many people have already discussed it at length in another thread. :-) That's much more than I expected. >> There are several design points. >> - who takes care of pulling page contents. >> an independent daemon vs a thread in qemu >> The daemon approach is preferable because an independent daemon would >> easy for debug postcopy memory mechanism without qemu. >> If required, it wouldn't be difficult to convert a daemon into >> a thread in qemu > > Isn't this equivalent to touching each page in sequence? No. I don't get the point of your question. > Care must be taken that we don't post too many requests, or it could > affect the latency of synchronous accesses by the guest. Yes. >> - connection between the source and the destination >> The connection for live migration can be re-used after sending machine >> state. >> >> - transfer protocol >> The existing protocol that exists today can be extended. >> >> - hooking guest RAM access >> Introduce a character device to handle page fault. >> When page fault occurs, it queues page request up to user space daemon >> at the destination.
And the daemon pulls page contents from the source >> and serves it into the character device. Then the page fault is resovlved. > > This doesn't play well with host swapping, transparent hugepages, or > ksm, does it? No. At least it wouldn't be so difficult to fix it, though I haven't looked at ksm/thp so closely. Although the vma is backed by the device, the populated page is anonymous (by MAP_PRIVATE, or by the driver returning an anonymous page). So swapping, thp and ksm should work. > I see you note this later on. > >> * More on hooking guest RAM access >> There are several candidate for the implementation. Our preference is >> character device approach. >> >> - inserting hooks into everywhere in qemu/kvm >> This is impractical >> >> - backing store for guest ram >> a block device or a file can be used to back guest RAM. >> Thus hook the guest ram access. >> >> pros >> - new device driver isn't needed. >> cons >> - future improvement would be difficult >> - some KVM host feature(KSM, THP) wouldn't work >> >> - character device >> qemu mmap() the dedicated character device, and then hook page fault. >> >> pros >> - straght forward approach >> - future improvement would be easy >> cons >> - new driver is needed >> - some KVM host feature(KSM, THP) wouldn't work >> They checks if a given VMA is anonymous. This can be fixed. >> >> - swap device >> When creating guest, it is set up as if all the guest RAM is swapped out >> to a dedicated swap device, which may be nbd disk (or some kind of user >> space block device, BUSE?). >> When the VM tries to access memory, swap-in is triggered and IO to the >> swap device is issued. Then the IO to swap is routed to the daemon >> in user space with nbd protocol (or BUSE, AOE, iSCSI...). The daemon pulls >> pages from the migration source and services the IO request. >> >> pros >> - After the page transfer is complete, everything is same as normal case.
>> - no new device driver isn't needed >> cons >> - future improvement would be difficult >> - administration: setting up nbd, swap device >> > > Using a swap device would be my preference. We'd still be using > anonymous memory so thp/ksm/ordinary swap still work. > > It would need to be a special kind of swap device since we only want to > swap in, and never out, to that device. We'd also need a special way of > telling the kernel that memory comes from that device. In that it's > similar your second option. > > Maybe we should use a backing file (using nbd) and have a madvise() call > that converts the vma to anonymous memory once the migration is finished. Whichever option we choose, I'd like to somehow convert the vma into an anonymous area after the migration completes, i.e. nulling vma->vm_ops. (The pages are already anonymous.) It seems troublesome, involving complicated races/lockings, so I'm not sure it's worthwhile. -- yamahata ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-09 2:33 ` Isaku Yamahata @ 2011-08-10 13:55 ` Avi Kivity 2011-08-11 2:19 ` Isaku Yamahata 0 siblings, 1 reply; 28+ messages in thread From: Avi Kivity @ 2011-08-10 13:55 UTC (permalink / raw) To: Isaku Yamahata Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh On 08/09/2011 05:33 AM, Isaku Yamahata wrote: > On Mon, Aug 08, 2011 at 03:38:54PM +0300, Avi Kivity wrote: > > On 08/08/2011 06:24 AM, Isaku Yamahata wrote: > >> This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM" > >> on which we'll give a talk at KVM-forum. > >> The purpose of this mail is to letting developers know it in advance > >> so that we can get better feedback on its design/implementation approach > >> early before our starting to implement it. > > > > Interesting; what is the impact of increased latency on memory reads? > > Many people has already discussed it much in another thread. :-) > That's much more than I expected. Can you point me to the discussion? > > >> There are several design points. > >> - who takes care of pulling page contents. > >> an independent daemon vs a thread in qemu > >> The daemon approach is preferable because an independent daemon would > >> easy for debug postcopy memory mechanism without qemu. > >> If required, it wouldn't be difficult to convert a daemon into > >> a thread in qemu > > > > Isn't this equivalent to touching each page in sequence? > > No. I don't get your point of this question. If you have a qemu thread that does for (each guest page) sum += *(char *)page; doesn't that effectively pull all pages from the source node? (but maybe I'm assuming that the kernel takes care of things and this isn't the case?) > >> > >> - hooking guest RAM access > >> Introduce a character device to handle page fault. > >> When page fault occurs, it queues page request up to user space daemon > >> at the destination. 
And the daemon pulls page contents from the source > >> and serves it into the character device. Then the page fault is resovlved. > > > > This doesn't play well with host swapping, transparent hugepages, or > > ksm, does it? > > No. At least it wouldn't be so difficult to fix it, I haven't looked ksm, > thp so closely though. > Although the vma is backed by the device, the populated page is > anonymous. (by MMAP_PRIVATE or the deriver returning anonymous page) > So swapping, thp, ksm should work. I'm not 100% sure, but I think that thp and ksm need the vma to be anonymous, not just the page. > > > > It would need to be a special kind of swap device since we only want to > > swap in, and never out, to that device. We'd also need a special way of > > telling the kernel that memory comes from that device. In that it's > > similar your second option. > > > > Maybe we should use a backing file (using nbd) and have a madvise() call > > that converts the vma to anonymous memory once the migration is finished. > > With whichever options, I'd like to convert the vma into anonymous area > after the migration completes somehow. i.e. nulling vma->vm_ops. > (The pages are already anonymous.) > > It seems troublesome involving complicated races/lockings. So I'm not sure > it's worthwhile. Andrea, what's your take on this? -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-10 13:55 ` Avi Kivity @ 2011-08-11 2:19 ` Isaku Yamahata 2011-08-11 16:55 ` Andrea Arcangeli 0 siblings, 1 reply; 28+ messages in thread From: Isaku Yamahata @ 2011-08-11 2:19 UTC (permalink / raw) To: Avi Kivity; +Cc: Andrea Arcangeli, t.hirofuchi, qemu-devel, kvm, satoshi.itoh On Wed, Aug 10, 2011 at 04:55:32PM +0300, Avi Kivity wrote: > On 08/09/2011 05:33 AM, Isaku Yamahata wrote: >> On Mon, Aug 08, 2011 at 03:38:54PM +0300, Avi Kivity wrote: >> > On 08/08/2011 06:24 AM, Isaku Yamahata wrote: >> >> This mail is on "Yabusame: Postcopy Live Migration for Qemu/KVM" >> >> on which we'll give a talk at KVM-forum. >> >> The purpose of this mail is to letting developers know it in advance >> >> so that we can get better feedback on its design/implementation approach >> >> early before our starting to implement it. >> > >> > Interesting; what is the impact of increased latency on memory reads? >> >> Many people has already discussed it much in another thread. :-) >> That's much more than I expected. > > Can you point me to the discussion? I misunderstood your question. Please refer to the papers, which include evaluation results covering network latency; they discuss it in detail. The presentation that we will give at the KVM forum also includes some results. >> >> There are several design points. >> >> - who takes care of pulling page contents. >> >> an independent daemon vs a thread in qemu >> >> The daemon approach is preferable because an independent daemon would >> >> easy for debug postcopy memory mechanism without qemu. >> >> If required, it wouldn't be difficult to convert a daemon into >> >> a thread in qemu >> > >> > Isn't this equivalent to touching each page in sequence? >> >> No. I don't get your point of this question. > > If you have a qemu thread that does > > for (each guest page) > sum += *(char *)page; > > doesn't that effectively pull all pages from the source node?
> > (but maybe I'm assuming that the kernel takes care of things and this > isn't the case?) Now I see your point. Right, it doesn't matter who starts the access to guest RAM. My point is, after the page fault, someone has to resolve the fault by sending the request for the page to the migration source. I think daemon vs thread isn't a big issue anyway. If nbd with a swap device is used, its IO request may be sent to the source directly. >> >> - hooking guest RAM access >> >> Introduce a character device to handle page fault. >> >> When page fault occurs, it queues page request up to user space daemon >> >> at the destination. And the daemon pulls page contents from the source >> >> and serves it into the character device. Then the page fault is resovlved. >> > >> > This doesn't play well with host swapping, transparent hugepages, or >> > ksm, does it? >> >> No. At least it wouldn't be so difficult to fix it, I haven't looked ksm, >> thp so closely though. >> Although the vma is backed by the device, the populated page is >> anonymous. (by MMAP_PRIVATE or the deriver returning anonymous page) >> So swapping, thp, ksm should work. > > I'm not 100% sure, but I think that thp and ksm need the vma to be > anonymous, not just the page. Yes, they seem to check not only whether the page is anonymous, but also the vma. I'd like to hear from Andrea before digging into the code deeply. >> > It would need to be a special kind of swap device since we only want to >> > swap in, and never out, to that device. We'd also need a special way of >> > telling the kernel that memory comes from that device. In that it's >> > similar your second option. >> > >> > Maybe we should use a backing file (using nbd) and have a madvise() call >> > that converts the vma to anonymous memory once the migration is finished. >> >> With whichever options, I'd like to convert the vma into anonymous area >> after the migration completes somehow. i.e. nulling vma->vm_ops. >> (The pages are already anonymous.)
>> >> It seems troublesome involving complicated races/lockings. So I'm not sure >> it's worthwhile. > > Andrea, what's your take on this? I'd also like to hear from those who are familiar with ksm/thp. If it is possible to convert the vma into an anonymous one, then whether it's a swap device or a device/file-backed mapping wouldn't matter with respect to ksm and thp. Would acquiring mmap_sem suffice? thanks, -- yamahata ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [RFC] postcopy livemigration proposal 2011-08-11 2:19 ` Isaku Yamahata @ 2011-08-11 16:55 ` Andrea Arcangeli 0 siblings, 0 replies; 28+ messages in thread From: Andrea Arcangeli @ 2011-08-11 16:55 UTC (permalink / raw) To: Isaku Yamahata; +Cc: t.hirofuchi, satoshi.itoh, Avi Kivity, kvm, qemu-devel Hello everyone, so basically this is a tradeoff: we avoid a long latency for the migration to succeed, reduce the total network traffic (and CPU load) on the migration source and destination, and reduce the memory footprint a bit, at the cost of adding an initial latency to the memory accesses on the destination of the migration (i.e. causing a more significant and noticeable slowdown of the guest). It's more or less as if, when the guest starts on the destination node, it finds all its memory swapped out to a network swap device, so it needs to do I/O for the first access (side note: hopefully it won't run out of memory while the memory is copied to the destination node, or the guest will crash). On Thu, Aug 11, 2011 at 11:19:19AM +0900, Isaku Yamahata wrote: > On Wed, Aug 10, 2011 at 04:55:32PM +0300, Avi Kivity wrote: > > I'm not 100% sure, but I think that thp and ksm need the vma to be > > anonymous, not just the page. > > Yes, they seems to check if not only the page is anonymous, but also the vma. > I'd like to hear from Andrea before digging into the code deeply. The vma doesn't need to be anonymous for THP; a MAP_PRIVATE mmap of /dev/zero is also backed by THP. But it must be close to anonymous and not have special VM_IO/PFNMAP flags or khugepaged/ksm will not scan it. ->vm_file itself isn't checked by THP/KSM (sure for THP because of the /dev/zero example, which I explicitly fixed as it wasn't fully handled initially). NOTE: a chardevice won't work on RHEL6 because I didn't allow /dev/zero to use it there (it wasn't an important enough feature and it was more risky), but upstream it should work already.
A chardevice doing this may work, even if it would be simpler/cleaner if this was still an anonymous vma. A chardevice could act similarly to /dev/zero MAP_PRIVATE. In theory KSM should work on /dev/zero too; you can test that if you want. But a chardevice will require dealing with permissions when we don't actually need special permissions for this. Another problem is you can't migrate the stuff using hugepages or it'd multiply the latency 512 times (with 2M contiguous access it won't make a difference, but if the guest is accessing memory randomly it would make a difference). So you will have to rely on khugepaged to collapse the hugepages later. That should work, but initially the guest will run slower even when the migration is already fully completed. > If it is possible to convert the vma into anonymous, swap device or > backed by device/file wouldn't matter in respect to ksm and thp. > Acquiring mmap_sem suffices? A swap device would require root permissions and we don't want qemu to mangle over the swapdevices automatically. It'd be bad to add new admin requirements; few people would use it. Ideally the migration API should remain the same and it should be an internal tweak in qemu to select which migration mode to use beforehand. Even if it was a swap device it'd still require special operations to set up swap entries in the process pagetables before the pages exist. A swap device may add more complication than it solves. If it was only KVM accessing the guest physical memory we could just handle it in KVM and call get_user_pages_fast; if that fails and it's the first ever invocation, we just talk with QEMU to get the page and establish it by hand. But qemu can also write to memory, and if it's a partial write and the guest reads the not-yet-written part with get_user_pages_fast+spte establishment, it'll go wrong. Maybe qemu is already doing all checks on the pages it's going to write and we could hook there too from the qemu side.
Another, more generic (not KVM-centric) way that would not require a special chardev or a special daemon could be a pair of new syscalls: sys_set_muserload(unsigned long start, unsigned long len, int signal) sys_muserload(void *from, void *to) When sys_set_muserload is called, the region start,start+len gets covered by muserload swap entries that trigger special page faults. When anything touches memory with a muserload swap entry still set, the thread gets a signal with force_sig_info_fault(si_signo = signal), and the signal handler will get the faulting address in info.si_addr. The signal handler is then responsible for calling sys_muserload, after talking to the thread that does the TCP send/recv() with the qemu source. The recv(mmap(4096), 4096) should generate a page in the destination node in some random mapping (aligned). Then muserload(tcp_received_page_address, guest_faulting_physical_address_from_info_si_addr) does get_user_pages on tcp_received_page_address, takes the page away from tcp_received_page_address (clears the pte at that address), adjusts page->index for the new vma, and maps the page zerocopy atomically into the new "guest_faulting_physical_address_from_info_si_addr" address, if and only if the pagetable at that address is still of muserload type. Then the signal handler calls munmap(tcp_received_page_address, 4096) to truncate/free the vma (the page is already moved and it's already free), the signal handler returns, and this time the guest accesses the page. A second thread in the background calls muserload(tcp_received_page_address, guest_physical_address), maxing out the network bandwidth using a second TCP socket for the streaming transfer, ignoring it if muserload fails (if the async page fault arrived first and already loaded the information there). If a double muserload fault happens before the first is resolved, I guess it'll just hang and kill -9 will solve it (or we could detect a double fault and sigsegv).
qemu had better not touch guest physical ram marked muserload by set_muserload() from signal context. The signal handler may communicate with another thread if there's only one TCP socket for the "sync muserload fault" transfers for each qemu instance. Alternatively, every vcpu thread plus the iothread could talk with the source over a different TCP socket (also to reduce the latency), and if multiple vcpus fault on the same address, the vcpus that didn't ask the source qemu for the page first will just loop a bit. For example, if the source responds "already freed", the loader returns without calling muserload() and the loop continues until the vcpu that recv()d the page from the source finally calls muserload().

That's already a lot for not having thought enough about it yet; these are just the first ideas that come to mind.

Not sure if the guaranteed slowdown on the destination node could be preferable as a "default" migration mode, but this certainly sounds a more reliable (i.e. maybe better for enterprise) way of doing migration, because there are no black magic numbers involved to decide when to stop the source node and transfer all remaining dirty pages. Magic numbers aren't good for enterprise because of the potentially enormous VM sizes or very heavy workloads. OTOH for desktop virt (small VM, usually idle, with a tiny working set) the current precopy method is probably less visible to the user, but it's less reliable too. So maybe this is a better default because of the less black magic and more reliability, even if it would most certainly perform worse for the small use case.

Especially on 100mbit networks it may be pretty bad: the equivalent of a 10MB/sec swap device will be a pretty bad initial slowdown. At least it won't be slowed down by seeking, but swapping in at 10M/sec like on a real old HD is still going to be very bad. It'll run almost as slow as a low-end laptop that suspended-to-disk by swapping out half of the RAM. So pretty noticeable.
With 1gigabit and up it'll get better. Thanks, Andrea ^ permalink raw reply [flat|nested] 28+ messages in thread
* [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal) 2011-08-08 3:24 [Qemu-devel] [RFC] postcopy livemigration proposal Isaku Yamahata ` (2 preceding siblings ...) 2011-08-08 12:38 ` Avi Kivity @ 2011-08-12 11:07 ` Isaku Yamahata 2011-08-12 11:09 ` Isaku Yamahata ` (2 more replies) 3 siblings, 3 replies; 28+ messages in thread From: Isaku Yamahata @ 2011-08-12 11:07 UTC (permalink / raw) To: kvm, qemu-devel Cc: Andrea Arcangeli, Nadav Har'El, satoshi.itoh, Stefan Hajnoczi, t.hirofuchi, Dor Laor, Yaniv Kaul, Avi Kivity

Here is what I have right now for the post copy chardevice. The sample userland program will follow. It should give you a more concrete idea and help further discussion, I hope. This is just for discussion, so it's incomplete. I'm open to other ideas and quite happy to throw away this patch and go for a better way.

thanks,

From e262979e95b3c5a095c8cb0bc178309baa861a3f Mon Sep 17 00:00:00 2001
Message-Id: <e262979e95b3c5a095c8cb0bc178309baa861a3f.1313146664.git.yamahata@valinux.co.jp>
From: Isaku Yamahata <yamahata@valinux.co.jp>
Date: Wed, 10 Aug 2011 18:28:05 +0900
Subject: [PATCH] kvm/postcopy: chardevice for postcopy

This is a character device to hook page access. A page fault in the area is reported to another user process by this chardriver. Then, that process fills the page contents and resolves the page fault.
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp> --- arch/x86/kvm/Kconfig | 1 + arch/x86/kvm/Makefile | 1 + include/linux/kvm.h | 45 +++ include/linux/kvm_host.h | 2 + mm/memcontrol.c | 1 + mm/shmem.c | 1 + virt/kvm/Kconfig | 3 + virt/kvm/kvm_main.c | 6 + virt/kvm/vmem.c | 847 ++++++++++++++++++++++++++++++++++++++++++++++ virt/kvm/vmem.h | 68 ++++ 10 files changed, 975 insertions(+), 0 deletions(-) create mode 100644 virt/kvm/vmem.c create mode 100644 virt/kvm/vmem.h diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 0a09b58..dcbd52e 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -29,6 +29,7 @@ config KVM select HAVE_KVM_EVENTFD select KVM_APIC_ARCHITECTURE select KVM_ASYNC_PF + select KVM_VMEM select USER_RETURN_NOTIFIER select KVM_MMIO select TASKSTATS diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index f15501f..6125f4c 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -10,6 +10,7 @@ kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ assigned-dev.o) kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) +kvm-$(CONFIG_KVM_VMEM) += $(addprefix ../../../virt/kvm/, vmem.o) kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \ i8254.o timer.o diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 55f5afb..623109e 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -554,6 +554,7 @@ struct kvm_ppc_pvinfo { #define KVM_CAP_PPC_SMT 64 #define KVM_CAP_PPC_RMA 65 #define KVM_CAP_MAX_VCPUS 66 /* returns max vcpus per vm */ +#define KVM_CAP_POST_COPY_MEMORY 67 #ifdef KVM_CAP_IRQ_ROUTING @@ -760,6 +761,50 @@ struct kvm_clock_data { /* Available with KVM_CAP_RMA */ #define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma) +struct kvm_vmem_create { + __u64 size; /* in bytes */ + __s32 vmem_fd; + __s32 shmem_fd; +}; + +struct kvm_vmem_page_request { + __u32 nr; + __u64 __user *pgoffs; +}; + 
+struct kvm_vmem_page_cached { + __u32 nr; + __u64 __user *pgoffs; +}; + +struct kvm_vmem_page_range { + __u64 pgoff; + __u64 nr_pages; +}; + +struct kvm_vmem_make_pages_present { + __u32 nr; + struct kvm_vmem_page_range __user *ranges; +}; + +/* Available with KVM_CAP_POST_COPY_MEMORY */ +#define KVM_CREATE_VMEM_DEV _IO(KVMIO, 0xb0) + +/* ioctl for vmem_dev fd */ +#define KVM_CREATE_VMEM _IOR(KVMIO, 0xb1, __u32) + +/* ioctl for vmem fd */ +#define KVM_VMEM_WAIT_READY _IO(KVMIO, 0xb2) +#define KVM_VMEM_READY _IO(KVMIO, 0xb3) +#define KVM_VMEM_GET_PAGE_REQUEST \ + _IOWR(KVMIO, 0xb4, struct kvm_vmem_page_request) +#define KVM_VMEM_MARK_PAGE_CACHED \ + _IOW(KVMIO, 0xb5, struct kvm_vmem_page_cached) +#define KVM_VMEM_MAKE_PAGES_PRESENT \ + _IOW(KVMIO, 0xb6, struct kvm_vmem_make_pages_present) +#define KVM_VMEM_MAKE_VMA_ANONYMOUS _IO(KVMIO, 0xb7) + + #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0) struct kvm_assigned_pci_dev { diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ff4d406..8b3dafa 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -222,6 +222,8 @@ struct kvm_irq_routing_table {}; #endif +long kvm_dev_ioctl_create_vmem_dev(void); + struct kvm_memslots { int nmemslots; u64 generation; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e013b8e..7f3fc4e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2838,6 +2838,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, return ret; } +EXPORT_SYMBOL_GPL(mem_cgroup_cache_charge); /* * While swap-in, try_charge -> commit or cancel, the page is locked. diff --git a/mm/shmem.c b/mm/shmem.c index fcedf54..ae7d61f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3035,6 +3035,7 @@ int shmem_zero_setup(struct vm_area_struct *vma) vma->vm_flags |= VM_CAN_NONLINEAR; return 0; } +EXPORT_SYMBOL_GPL(shmem_zero_setup); /** * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags. 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index f63ccb0..d3040ea 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -18,3 +18,6 @@ config KVM_MMIO config KVM_ASYNC_PF bool + +config KVM_VMEM + bool diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index aefdda3..9e47e20 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2184,6 +2184,7 @@ static long kvm_dev_ioctl_check_extension_generic(long arg) case KVM_CAP_SET_BOOT_CPU_ID: #endif case KVM_CAP_INTERNAL_ERROR_DATA: + case KVM_CAP_POST_COPY_MEMORY: return 1; #ifdef CONFIG_HAVE_KVM_IRQCHIP case KVM_CAP_IRQ_ROUTING: @@ -2233,6 +2234,11 @@ static long kvm_dev_ioctl(struct file *filp, case KVM_TRACE_DISABLE: r = -EOPNOTSUPP; break; +#ifdef CONFIG_KVM_VMEM + case KVM_CREATE_VMEM_DEV: + r = kvm_dev_ioctl_create_vmem_dev(); + break; +#endif default: return kvm_arch_dev_ioctl(filp, ioctl, arg); } diff --git a/virt/kvm/vmem.c b/virt/kvm/vmem.c new file mode 100644 index 0000000..b413663 --- /dev/null +++ b/virt/kvm/vmem.c @@ -0,0 +1,847 @@ +/* + * KVM post copy vmem + * + * Copyright (c) 2011, + * National Institute of Advanced Industrial Science and Technology + * + * https://sites.google.com/site/grivonhome/quick-kvm-migration + * Author: Isaku Yamahata <yamahata at valinux co jp> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. 
+ */ + +#include <linux/kvm_host.h> +#include <linux/kvm.h> +#include <linux/pagemap.h> +#include <linux/mm.h> +#include <linux/memcontrol.h> +#include <linux/poll.h> +#include <linux/file.h> +#include <linux/anon_inodes.h> +#include "vmem.h" + +static void kvm_vmem_release_fake_vmf(int ret, struct vm_fault *fake_vmf) +{ + if (ret & VM_FAULT_LOCKED) { + unlock_page(fake_vmf->page); + } + page_cache_release(fake_vmf->page); +} + +static int kvm_vmem_minor_fault(struct kvm_vmem *vmem, + struct vm_area_struct *vma, + struct vm_fault *vmf) +{ + struct vm_fault fake_vmf; + int ret; + struct page *page; + + BUG_ON(!test_bit(vmf->pgoff, vmem->cached)); + fake_vmf = *vmf; + fake_vmf.page = NULL; + ret = vmem->vma->vm_ops->fault(vmem->vma, &fake_vmf); + if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)) + return ret; + + /* + * TODO: pull out fake_vmf->page from shmem file and donate it + * to this vma resolving the page fault. + * vmf->page = fake_vmf->page; + */ + + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address); + if (!page) + return VM_FAULT_OOM; + if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) { + kvm_vmem_release_fake_vmf(ret, &fake_vmf); + page_cache_release(page); + return VM_FAULT_OOM; + } + + copy_highpage(page, fake_vmf.page); + kvm_vmem_release_fake_vmf(ret, &fake_vmf); + + ret |= VM_FAULT_LOCKED; + SetPageUptodate(page); + vmf->page = page; + set_bit(vmf->pgoff, vmem->faulted); + + return ret; +} + +static int kvm_vmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct file *filp = vma->vm_file; + struct kvm_vmem *vmem = filp->private_data; + + if (vmf->pgoff >= vmem->pgoff_end) { + return VM_FAULT_SIGBUS; + } + + BUG_ON(test_bit(vmf->pgoff, vmem->faulted)); + + if (!test_bit(vmf->pgoff, vmem->cached)) { + /* major fault */ + unsigned long bit; + DEFINE_WAIT(wait); + + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) { + /* async page fault */ + spin_lock(&vmem->lock); + if (vmem->async_req_nr < 
vmem->async_req_max) { + vmem->async_req[vmem->async_req_nr] = + vmf->pgoff; + vmem->async_req_nr++; + } + spin_unlock(&vmem->lock); + wake_up_poll(&vmem->req_wait, POLLIN); + + if (test_bit(vmf->pgoff, vmem->cached)) + return kvm_vmem_minor_fault(vmem, vma, vmf); + return VM_FAULT_MAJOR | VM_FAULT_RETRY; + } + + spin_lock(&vmem->lock); + bit = find_first_zero_bit(vmem->sync_wait_bitmap, + vmem->sync_req_max); + if (likely(bit < vmem->sync_req_max)) { + vmem->sync_req[bit] = vmf->pgoff; + prepare_to_wait(&vmem->page_wait[bit], &wait, + TASK_UNINTERRUPTIBLE); + set_bit(bit, vmem->sync_req_bitmap); + set_bit(bit, vmem->sync_wait_bitmap); + spin_unlock(&vmem->lock); + wake_up_poll(&vmem->req_wait, POLLIN); + + if (!test_bit(vmf->pgoff, vmem->cached)) + schedule(); + finish_wait(&vmem->page_wait[bit], &wait); + clear_bit(bit, vmem->sync_wait_bitmap); + } else { + struct kvm_vmem_page_req_list page_req_list = { + .pgoff = vmf->pgoff, + }; + vmem->req_list_nr++; + list_add_tail(&page_req_list.list, &vmem->req_list); + wake_up_poll(&vmem->req_wait, POLLIN); + for (;;) { + prepare_to_wait(&vmem->req_list_wait, &wait, + TASK_UNINTERRUPTIBLE); + if (test_bit(vmf->pgoff, vmem->cached)) { + vmem->req_list_nr--; + break; + } + spin_unlock(&vmem->lock); + schedule(); + spin_lock(&vmem->lock); + } + spin_unlock(&vmem->lock); + finish_wait(&vmem->req_list_wait, &wait); + } + + return kvm_vmem_minor_fault(vmem, vma, vmf) | VM_FAULT_MAJOR; + } + + return kvm_vmem_minor_fault(vmem, vma, vmf); +} + +/* for partial munmap */ +static void kvm_vmem_vma_open(struct vm_area_struct *vma) +{ + struct file *filp = vma->vm_file; + struct kvm_vmem *vmem = filp->private_data; + + spin_lock(&vmem->lock); + vmem->vma_nr++; + spin_unlock(&vmem->lock); +} + +static void kvm_vmem_vma_close(struct vm_area_struct *vma) +{ + struct file *filp = vma->vm_file; + struct kvm_vmem *vmem = filp->private_data; + struct task_struct *task = NULL; + + spin_lock(&vmem->lock); + vmem->vma_nr--; + if (vmem->vma_nr 
== 0) { + task = vmem->task; + vmem->task = NULL; + } + spin_unlock(&vmem->lock); + + if (task) + put_task_struct(task); +} + +static const struct vm_operations_struct kvm_vmem_vm_ops = { + .open = kvm_vmem_vma_open, + .close = kvm_vmem_vma_close, + .fault = kvm_vmem_fault, +}; + +static int kvm_vmem_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct kvm_vmem *vmem = filp->private_data; + int error; + + /* allow mmap() only once */ + spin_lock(&vmem->lock); + if (vmem->mmapped) { + error = -EBUSY; + goto out; + } + if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff > + vmem->pgoff_end) { + error = -EINVAL; + goto out; + } + + vmem->mmapped = true; + vmem->vma_nr = 1; + vmem->vm_start = vma->vm_start; + get_task_struct(current); + vmem->task = current; + spin_unlock(&vmem->lock); + + vma->vm_ops = &kvm_vmem_vm_ops; + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; + vma->vm_flags &= ~VM_SHARED; + return 0; + +out: + spin_unlock(&vmem->lock); + return error; +} + +static bool kvm_vmem_req_pending(struct kvm_vmem* vmem) +{ + return !list_empty(&vmem->req_list) || + !bitmap_empty(vmem->sync_req_bitmap, vmem->sync_req_max) || + (vmem->async_req_nr > 0); +} + +static unsigned int kvm_vmem_poll(struct file* filp, poll_table *wait) +{ + struct kvm_vmem *vmem = filp->private_data; + unsigned int events = 0; + + poll_wait(filp, &vmem->req_wait, wait); + + spin_lock(&vmem->lock); + if (kvm_vmem_req_pending(vmem)) + events |= POLLIN; + spin_unlock(&vmem->lock); + + return events; +} + +/* + * return value + * true: finished + * false: more request + */ +static bool kvm_vmem_copy_page_request(struct kvm_vmem *vmem, + pgoff_t *pgoffs, int req_max, + int *req_nr) +{ + struct kvm_vmem_page_req_list *req_list; + struct kvm_vmem_page_req_list *tmp; + + unsigned long bit; + + *req_nr = 0; + list_for_each_entry_safe(req_list, tmp, &vmem->req_list, list) { + list_del(&req_list->list); + pgoffs[*req_nr] = req_list->pgoff; + (*req_nr)++; + if (*req_nr >= 
req_max) + return false; + } + + bit = 0; + for (;;) { + bit = find_next_bit(vmem->sync_req_bitmap, vmem->sync_req_max, + bit); + if (bit >= vmem->sync_req_max) + break; + pgoffs[*req_nr] = vmem->sync_req[bit]; + (*req_nr)++; + clear_bit(bit, vmem->sync_req_bitmap); + if (*req_nr >= req_max) + return false; + bit++; + } + + if (vmem->async_req_nr > 0) { + int nr = min(req_max - *req_nr, vmem->async_req_nr); + memcpy(pgoffs + *req_nr, vmem->async_req, + sizeof(*vmem->async_req) * nr); + vmem->async_req_nr -= nr; + *req_nr += nr; + memmove(vmem->async_req, vmem->sync_req + nr, + vmem->async_req_nr * sizeof(*vmem->async_req)); + + } + return vmem->async_req_nr == 0; +} + +static int kvm_vmem_get_page_request(struct kvm_vmem *vmem, + struct kvm_vmem_page_request *page_req) +{ + DEFINE_WAIT(wait); +#define REQ_MAX ((__u32)32) + pgoff_t pgoffs[REQ_MAX]; + __u32 req_copied = 0; + int ret = 0; + + spin_lock(&vmem->lock); + for (;;) { + prepare_to_wait(&vmem->req_wait, &wait, TASK_INTERRUPTIBLE); + if (kvm_vmem_req_pending(vmem)) { + break; + } + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + spin_unlock(&vmem->lock); + schedule(); + spin_lock(&vmem->lock); + } + finish_wait(&vmem->req_wait, &wait); + if (ret) + goto out_unlock; + + while (req_copied < page_req->nr) { + int req_max; + int req_nr; + bool finished; + req_max = min(page_req->nr - req_copied, REQ_MAX); + finished = kvm_vmem_copy_page_request(vmem, pgoffs, req_max, + &req_nr); + + spin_unlock(&vmem->lock); + + if (req_nr > 0) { + ret = 0; + if (copy_to_user(page_req->pgoffs + req_copied, pgoffs, + sizeof(*pgoffs) * req_nr)) { + ret = -EFAULT; + goto out; + } + } + req_copied += req_nr; + if (finished) + goto out; + + spin_lock(&vmem->lock); + } + +out_unlock: + spin_unlock(&vmem->lock); +out: + page_req->nr = req_copied; + return ret; +} + +static int kvm_vmem_mark_page_cached(struct kvm_vmem *vmem, + struct kvm_vmem_page_cached *page_cached) +{ + int ret = 0; +#define PG_MAX ((__u32)32) + 
__u64 pgoffs[PG_MAX]; + __u32 nr; + unsigned long bit; + bool wake_up_list = false; + + nr = 0; + while (nr < page_cached->nr) { + __u32 todo = min(PG_MAX, (page_cached->nr - nr)); + int i; + + if (copy_from_user(pgoffs, page_cached->pgoffs + nr, + sizeof(*pgoffs) * todo)) { + ret = -EFAULT; + goto out; + } + for (i = 0; i < todo; ++i) { + if (pgoffs[i] >= vmem->pgoff_end) { + ret = -EINVAL; + goto out; + } + set_bit(pgoffs[i], vmem->cached); + } + nr += todo; + } + + spin_lock(&vmem->lock); + bit = 0; + for (;;) { + bit = find_next_bit(vmem->sync_wait_bitmap, vmem->sync_req_max, + bit); + if (bit >= vmem->sync_req_max) + break; + if (test_bit(vmem->sync_req[bit], vmem->cached)) + wake_up(&vmem->page_wait[bit]); + bit++; + } + + if (vmem->req_list_nr > 0) + wake_up_list = true; + spin_unlock(&vmem->lock); + + if (wake_up_list) + wake_up_all(&vmem->req_list_wait); + +out: + return ret; +} + +static bool kvm_vmem_is_vmem_vma(const struct kvm_vmem *vmem, + const struct vm_area_struct *vma) +{ + return vma->vm_file && vma->vm_file->private_data == vmem; +} + +static void kvm_vmem_make_pages_present_entry(struct kvm_vmem *vmem, + struct kvm_vmem_page_range *range, + struct task_struct *task, + struct mm_struct *mm, + unsigned long vm_start) +{ + unsigned long pgoff = range->pgoff; + unsigned long range_end = range->pgoff + range->nr_pages; + + down_read(&mm->mmap_sem); + + while (pgoff < range->pgoff + range->nr_pages) { + unsigned long pgoff_end; + struct vm_area_struct *vma; + unsigned long saddr; + unsigned long eaddr; + + /* search unfaulted range */ + spin_lock(&vmem->lock); + pgoff = find_next_zero_bit(vmem->faulted, range_end, pgoff); + if (pgoff >= range_end) { + spin_unlock(&vmem->lock); + break; + } + pgoff_end = find_next_bit(vmem->faulted, range_end, pgoff); + spin_unlock(&vmem->lock); + + saddr = vm_start + (pgoff << PAGE_SHIFT); + eaddr = vm_start + (pgoff_end << PAGE_SHIFT); + vma = find_vma(mm, saddr); + if (vma == NULL) { + break; + } + if (eaddr < 
vma->vm_start) { + pgoff = (vma->vm_start - vm_start) >> PAGE_SHIFT; + continue; + } + + if (kvm_vmem_is_vmem_vma(vmem, vma)) { + unsigned long start = max(vma->vm_start, saddr); + unsigned long end = min(vma->vm_end, eaddr); + int nr_pages = (end - start) >> PAGE_SHIFT; + get_user_pages(task, mm, start, nr_pages, + 1, 1, NULL, NULL); + pgoff = (end - vm_start) >> PAGE_SHIFT; + } else { + pgoff = (vma->vm_end - vm_start) >> PAGE_SHIFT; + } + } + + up_read(&mm->mmap_sem); +} + +static int kvm_vmem_make_pages_present( + struct kvm_vmem *vmem, + struct kvm_vmem_make_pages_present *pages_present) +{ + struct task_struct *task; + struct mm_struct *mm; + pgoff_t pgoff_end; + unsigned long vm_start; + unsigned long vm_eaddr; + +#define NUM_ENTRIES ((__u32)32) + struct kvm_vmem_page_range kranges[NUM_ENTRIES]; + __u32 nr = 0; + int ret; + + spin_lock(&vmem->lock); + task = vmem->task; + pgoff_end = vmem->pgoff_end; + vm_start = vmem->vm_start; + vm_eaddr = vm_start + vmem->size; + spin_unlock(&vmem->lock); + if (task == NULL) + return 0; + mm = get_task_mm(task); + if (mm == NULL) + return 0; + + ret = 0; + while (nr < pages_present->nr) { + int nr_ranges = min(NUM_ENTRIES, pages_present->nr - nr); + int i; + + if (copy_from_user(&kranges, pages_present->ranges + nr, + sizeof(kranges[0]) * nr_ranges)) { + ret = -EFAULT; + break; + } + for (i = 0; i < nr_ranges; ++i) { + struct kvm_vmem_page_range *range = &kranges[i]; + if (range->pgoff >= pgoff_end || + range->nr_pages >= pgoff_end || + range->pgoff + range->nr_pages >= pgoff_end) { + ret = -EINVAL; + break; + } + kvm_vmem_make_pages_present_entry(vmem, range, + task, mm, vm_start); + } + nr += nr_ranges; + } + + mmput(mm); + return ret; +} + +static int kvm_vmem_make_vma_anonymous(struct kvm_vmem *vmem) +{ +#if 1 + return -ENOSYS; +#else + unsigned long saddr; + unsigned long eaddr; + unsigned long addr; + unsigned long bit; + struct task_struct *task; + struct mm_struct *mm; + + spin_lock(&vmem->lock); + task = 
vmem->task; + saddr = vmem->vm_start; + eaddr = saddr + vmem->size; + bit = find_first_zero_bit(vmem->faulted, vmem->pgoff_end); + if (bit < vmem->pgoff_end) { + spin_unlock(&vmem->lock); + return -EBUSY; + } + spin_unlock(&vmem->lock); + if (task == NULL) + return 0; + mm = get_task_mm(task); + if (mm == NULL) + return 0; + + addr = saddr; + down_write(&mm->mmap_sem); + while (addr < eaddr) { + struct vm_area_struct *vma; + vma = find_vma(mm, addr); + if (kvm_vmem_is_vmem_vma(vmem, vma)) { + /* XXX incorrect. race/locking and more fix up */ + struct file *filp = vma->vm_file; + vma->vm_ops->close(vma); + vma->vm_ops = NULL; + vma->vm_file = NULL; + /* vma->vm_flags */ + fput(filp); + } + addr = vma->vm_end; + } + up_write(&mm->mmap_sem); + + mmput(mm); + return 0; +#endif +} + +static void kvm_vmem_ready(struct kvm_vmem *vmem) +{ + spin_lock(&vmem->lock); + vmem->ready = true; + spin_unlock(&vmem->lock); + wake_up_interruptible(&vmem->ready_wait); +} + +static int kvm_vmem_wait_ready(struct kvm_vmem *vmem) +{ + int ret = 0; + DEFINE_WAIT(wait); + + spin_lock(&vmem->lock); + for (;;) { + prepare_to_wait(&vmem->ready_wait, &wait, TASK_INTERRUPTIBLE); + if (vmem->ready) { + break; + } + if (signal_pending(current)) { + ret = -ERESTARTSYS; + break; + } + spin_unlock(&vmem->lock); + schedule(); + spin_lock(&vmem->lock); + } + spin_unlock(&vmem->lock); + finish_wait(&vmem->ready_wait, &wait); + return ret; +} + +static long kvm_vmem_ioctl(struct file *filp, unsigned int ioctl, + unsigned long arg) +{ + struct kvm_vmem *vmem = filp->private_data; + void __user *argp = (void __user *) arg; + long ret = 0; + + switch (ioctl) { + case KVM_VMEM_READY: + kvm_vmem_ready(vmem); + ret = 0; + break; + case KVM_VMEM_WAIT_READY: + ret = kvm_vmem_wait_ready(vmem); + break; + case KVM_VMEM_GET_PAGE_REQUEST: { + struct kvm_vmem_page_request page_request; + ret = -EFAULT; + if (copy_from_user(&page_request, argp, sizeof(page_request))) + break; + ret = kvm_vmem_get_page_request(vmem, 
&page_request); + if (ret == 0 && + copy_to_user(argp + + offsetof(struct kvm_vmem_page_request, nr), + &page_request.nr, + sizeof(page_request.nr))) { + ret = -EFAULT; + break; + } + break; + } + case KVM_VMEM_MARK_PAGE_CACHED: { + struct kvm_vmem_page_cached page_cached; + ret = -EFAULT; + if (copy_from_user(&page_cached, argp, sizeof(page_cached))) + break; + ret = kvm_vmem_mark_page_cached(vmem, &page_cached); + break; + } + case KVM_VMEM_MAKE_PAGES_PRESENT: { + struct kvm_vmem_make_pages_present pages_present; + ret = -EFAULT; + if (copy_from_user(&pages_present, argp, + sizeof(pages_present))) + break; + ret = kvm_vmem_make_pages_present(vmem, &pages_present); + break; + } + case KVM_VMEM_MAKE_VMA_ANONYMOUS: + ret = kvm_vmem_make_vma_anonymous(vmem); + break; + default: + ret = -EINVAL; + break; + } + return ret; +} + +static unsigned long kvm_vmem_bitmap_bytes(const struct kvm_vmem *vmem) +{ + return round_up(vmem->pgoff_end, BITS_PER_LONG) / 8; +} + +static void kvm_vmem_free(struct kvm_vmem *vmem) +{ + if (vmem->task) { + put_task_struct(vmem->task); + vmem->task = NULL; + } + + if (vmem->shmem_filp) + fput(vmem->shmem_filp); + if (kvm_vmem_bitmap_bytes(vmem) > PAGE_SIZE) { + vfree(vmem->cached); + vfree(vmem->faulted); + } else { + kfree(vmem->cached); + kfree(vmem->faulted); + } + kfree(vmem->vma); + kfree(vmem->async_req); + kfree(vmem->sync_req_bitmap); + kfree(vmem->sync_wait_bitmap); + kfree(vmem->page_wait); + kfree(vmem->sync_req); + kfree(vmem); +} + +static int kvm_vmem_release(struct inode *inode, struct file *filp) +{ + struct kvm_vmem *vmem = filp->private_data; + kvm_vmem_free(vmem); + return 0; +} + +static struct file_operations kvm_vmem_fops = { + .release = kvm_vmem_release, + .unlocked_ioctl = kvm_vmem_ioctl, + .mmap = kvm_vmem_mmap, + .poll = kvm_vmem_poll, + .llseek = noop_llseek, +}; + +static int kvm_create_vmem(struct kvm_vmem_create *create) +{ + int error = 0; + struct kvm_vmem *vmem = NULL; + struct vm_area_struct *vma = NULL; + 
int shmem_fd; + unsigned long bitmap_bytes; + unsigned long sync_bitmap_bytes; + int i; + + vmem = kzalloc(sizeof(*vmem), GFP_KERNEL); + vmem->task = NULL; + vmem->mmapped = false; + spin_lock_init(&vmem->lock); + vmem->size = roundup(create->size, PAGE_SIZE); + vmem->pgoff_end = vmem->size >> PAGE_SHIFT; + init_waitqueue_head(&vmem->req_wait); + + vma = kzalloc(sizeof(*vma), GFP_KERNEL); + vma->vm_start = 0; + vma->vm_end = vmem->size; + /* this shmem file is used for temporal buffer for pages + so it's unlikely that so many pages exists in this shmem file */ + vma->vm_flags = VM_READ | VM_SHARED | VM_NOHUGEPAGE | VM_DONTCOPY | + VM_DONTEXPAND; + vma->vm_page_prot = vm_get_page_prot(vma->vm_flags); + vma->vm_pgoff = 0; + INIT_LIST_HEAD(&vma->anon_vma_chain); + + vmem->vma = vma; + + shmem_fd = get_unused_fd(); + if (shmem_fd < 0) { + error = shmem_fd; + goto out; + } + error = shmem_zero_setup(vma); + if (error < 0) { + put_unused_fd(shmem_fd); + goto out; + } + vmem->shmem_filp = vma->vm_file; + get_file(vmem->shmem_filp); + fd_install(shmem_fd, vma->vm_file); + create->shmem_fd = shmem_fd; + + create->vmem_fd = anon_inode_getfd("kvm-vmem", + &kvm_vmem_fops, vmem, O_RDWR); + if (create->vmem_fd < 0) { + error = create->vmem_fd; + goto out; + } + + bitmap_bytes = kvm_vmem_bitmap_bytes(vmem); + if (bitmap_bytes > PAGE_SIZE) { + vmem->cached = vzalloc(bitmap_bytes); + vmem->faulted = vzalloc(bitmap_bytes); + } else { + vmem->cached = kzalloc(bitmap_bytes, GFP_KERNEL); + vmem->faulted = kzalloc(bitmap_bytes, GFP_KERNEL); + } + +#define ASYNC_REQ_MAX (ASYNC_PF_PER_VCPU * KVM_MAX_VCPUS) + vmem->async_req_max = ASYNC_REQ_MAX; + vmem->async_req_nr = 0; + vmem->async_req = kzalloc(sizeof(*vmem->async_req), GFP_KERNEL); + +#define SYNC_REQ_MAX (KVM_MAX_VCPUS) + vmem->sync_req_max = round_up(SYNC_REQ_MAX, BITS_PER_LONG); + sync_bitmap_bytes = sizeof(unsigned long) * + (vmem->sync_req_max / BITS_PER_LONG); + vmem->sync_req_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL); + 
vmem->sync_wait_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL); + vmem->page_wait = kzalloc(sizeof(*vmem->page_wait) * + vmem->sync_req_max, GFP_KERNEL); + for (i = 0; i < vmem->sync_req_max; ++i) + init_waitqueue_head(&vmem->page_wait[i]); + vmem->sync_req = kzalloc(sizeof(*vmem->sync_req) * + vmem->sync_req_max, GFP_KERNEL); + + vmem->req_list_nr = 0; + INIT_LIST_HEAD(&vmem->req_list); + init_waitqueue_head(&vmem->req_list_wait); + + init_waitqueue_head(&vmem->ready_wait); + vmem->ready = false; + + return 0; + + out: + kvm_vmem_free(vmem); + return error; +} + +static long kvm_vmem_dev_ioctl(struct file *filp, unsigned int ioctl, + unsigned long arg) +{ + void __user *argp = (void __user *) arg; + long ret; + + switch (ioctl) { + case KVM_CREATE_VMEM: { + struct kvm_vmem_create create; + if (copy_from_user(&create, argp, sizeof(create))) { + ret = -EFAULT; + break; + } + ret = kvm_create_vmem(&create); + if (copy_to_user(argp, &create, sizeof(create))) { + ret = -EFAULT; + break; + } + break; + } + default: + ret = -EINVAL; + break; + } + return ret; +} + +static int kvm_vmem_dev_release(struct inode *inode, struct file *filp) +{ + return 0; +} + +static struct file_operations kvm_vmem_dev_fops = { + .release = kvm_vmem_dev_release, + .unlocked_ioctl = kvm_vmem_dev_ioctl, +}; + +long kvm_dev_ioctl_create_vmem_dev(void) +{ + return anon_inode_getfd("kvm-vmem-dev", &kvm_vmem_dev_fops, + NULL, O_RDWR); +} diff --git a/virt/kvm/vmem.h b/virt/kvm/vmem.h new file mode 100644 index 0000000..bc7e8cf --- /dev/null +++ b/virt/kvm/vmem.h @@ -0,0 +1,68 @@ +/* + * KVM post copy vmem + * + * Copyright (c) 2011, + * National Institute of Advanced Industrial Science and Technology + * + * https://sites.google.com/site/grivonhome/quick-kvm-migration + * Author: Isaku Yamahata <yamahata at valinux co jp> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published 
by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple + * Place - Suite 330, Boston, MA 02111-1307 USA. + */ + +#ifndef __KVM_VMEM_H__ +#define __KVM_VMEM_H__ + +struct kvm_vmem_page_req_list { + struct list_head list; + pgoff_t pgoff; +}; + +struct kvm_vmem { + loff_t size; + pgoff_t pgoff_end; + spinlock_t lock; + + wait_queue_head_t req_wait; + + int async_req_max; + int async_req_nr; + pgoff_t *async_req; + + int sync_req_max; + unsigned long *sync_req_bitmap; + unsigned long *sync_wait_bitmap; + pgoff_t *sync_req; + wait_queue_head_t *page_wait; + + int req_list_nr; + struct list_head req_list; + wait_queue_head_t req_list_wait; + + unsigned long *cached; + unsigned long *faulted; + + bool mmapped; + unsigned long vm_start; + unsigned int vma_nr; + struct task_struct *task; + + wait_queue_head_t ready_wait; + bool ready; + + struct file *shmem_filp; + struct vm_area_struct *vma; +}; + +#endif /* __KVM_VMEM_H__ */ -- 1.7.1.1 -- yamahata ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal) 2011-08-12 11:07 ` [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal) Isaku Yamahata @ 2011-08-12 11:09 ` Isaku Yamahata 2011-08-12 21:26 ` Blue Swirl 2011-08-15 19:29 ` Avi Kivity 2 siblings, 0 replies; 28+ messages in thread From: Isaku Yamahata @ 2011-08-12 11:09 UTC (permalink / raw) To: kvm, qemu-devel Cc: Andrea Arcangeli, Nadav Har'El, satoshi.itoh, Stefan Hajnoczi, t.hirofuchi, Dor Laor, Yaniv Kaul, Avi Kivity Sample user land program for testing the post copy chardevice. =========================================================================== /* * sample user land for post copy vmem * * Copyright (c) 2011, * National Institute of Advanced Industrial Science and Technology * * https://sites.google.com/site/grivonhome/quick-kvm-migration * Author: Isaku Yamahata <yamahata at valinux co jp> * * This program is free software; you can redistribute it and/or modify it * under the terms and conditions of the GNU General Public License, * version 2, as published by the Free Software Foundation. * * This program is distributed in the hope it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for * more details. * * You should have received a copy of the GNU General Public License along with * this program; if not, write to the Free Software Foundation, Inc., 59 Temple * Place - Suite 330, Boston, MA 02111-1307 USA. */ #include <err.h> #include <inttypes.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <sys/ioctl.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> //#include <linux/kvm.h> #define __user #include "my-kvm.h" #if 1 #define DPRINTF(format, ...) 
        printf("%s:%d "format, __func__, __LINE__, ## __VA_ARGS__)
#else
#define DPRINTF(format, ...)    do { } while (0)
#endif

#define VMEM_NR_PAGES   8

void server(int vmem_fd, int shmem_fd, size_t size, size_t page_size)
{
    int nr_pages = size / page_size;
    void* shmem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                       shmem_fd, 0);
    if (shmem == MAP_FAILED) {
        err(EXIT_FAILURE, "server: mmap(\"shmem\")");
    }
    close(shmem_fd);

    DPRINTF("KVM_VMEM_READY\n");
    if (ioctl(vmem_fd, KVM_VMEM_READY) < 0) {
        err(EXIT_FAILURE, "server: KVM_VMEM_READY");
    }

    struct kvm_vmem_page_request page_request;
    page_request.pgoffs = malloc(sizeof(*page_request.pgoffs) * nr_pages);
    if (page_request.pgoffs == NULL) {
        err(EXIT_FAILURE, "server: malloc(\"page_request.pgoffs\")");
    }

    struct kvm_vmem_page_cached page_cached;
    page_cached.pgoffs = malloc(sizeof(*page_cached.pgoffs) * nr_pages);
    if (page_cached.pgoffs == NULL) {
        err(EXIT_FAILURE, "server: malloc(\"page_cached.pgoffs\")");
    }

    int fill = 0;
    fill++;
    memset(shmem, fill, page_size);
    page_cached.nr = 1;
    page_cached.pgoffs[0] = 0;
    DPRINTF("KVM_VMEM_MARK_PAGE_CACHED\n");
    if (ioctl(vmem_fd, KVM_VMEM_MARK_PAGE_CACHED, &page_cached)) {
        err(EXIT_FAILURE, "server: KVM_VMEM_MARK_PAGE_CACHED");
    }

    struct kvm_vmem_page_range page_range = {
        .pgoff = 0,
        .nr_pages = 1,
    };
    struct kvm_vmem_make_pages_present pages_present = {
        .nr = 1,
        .ranges = &page_range,
    };
    DPRINTF("KVM_VMEM_MAKE_PAGES_PRESENT\n");
    if (ioctl(vmem_fd, KVM_VMEM_MAKE_PAGES_PRESENT, &pages_present) < 0) {
        err(EXIT_FAILURE, "server: KVM_VMEM_MAKE_PAGES_PRESENT");
    }

    int page_served = 1;
    while (page_served < nr_pages) {
        DPRINTF("KVM_VMEM_GET_PAGE_REQUEST\n");
        page_request.nr = nr_pages;
        if (ioctl(vmem_fd, KVM_VMEM_GET_PAGE_REQUEST, &page_request)) {
            err(EXIT_FAILURE, "server: KVM_VMEM_GET_PAGE_REQUEST");
        }
        DPRINTF("request.nr %d\n", page_request.nr);
        page_cached.nr = 0;
        int i;
        for (i = 0; i < page_request.nr; ++i) {
            memset(shmem + page_size * page_request.pgoffs[i],
                   fill, page_size);
            fill++;
            page_cached.pgoffs[page_cached.nr] = page_request.pgoffs[i];
            page_cached.nr++;
            DPRINTF("request[%d] %lx fill: %d\n",
                    i, (unsigned long)page_request.pgoffs[i], fill - 1);
        }
        DPRINTF("KVM_VMEM_MARK_PAGE_CACHED\n");
        if (ioctl(vmem_fd, KVM_VMEM_MARK_PAGE_CACHED, &page_cached) < 0) {
            err(EXIT_FAILURE, "server: KVM_VMEM_MARK_PAGE_CACHED");
        }
        page_served += page_cached.nr;
    }

#if 0
    DPRINTF("KVM_VMEM_MAKE_VMA_ANONYMOUS\n");
    if (ioctl(vmem_fd, KVM_VMEM_MAKE_VMA_ANONYMOUS)) {
        err(EXIT_FAILURE, "server: KVM_VMEM_MAKE_VMA_ANONYMOUS");
    }
#endif
    munmap(shmem, size);
    close(vmem_fd);
}

void qemu(int vmem_fd, size_t size, size_t page_size)
{
    DPRINTF("mmap\n");
    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                     vmem_fd, 0);
    if (ram == MAP_FAILED) {
        err(EXIT_FAILURE, "qemu: mmap");
    }

    DPRINTF("KVM_VMEM_WAIT_READY\n");
    if (ioctl(vmem_fd, KVM_VMEM_WAIT_READY) < 0) {
        err(EXIT_FAILURE, "qemu: KVM_VMEM_WAIT_READY");
    }
    DPRINTF("close\n");
    close(vmem_fd);

    int pages[] = {7, 1, 6, 2, 0, 5, 3, 4};
    int val[VMEM_NR_PAGES];
    int i;
    for (i = 0; i < VMEM_NR_PAGES; ++i) {
        if (i == 2 || i == 6)
            sleep(1);
        DPRINTF("access to %d\n", pages[i]);
        fflush(stdout);
        val[i] = *(uint8_t*)(ram + page_size * pages[i]);
        DPRINTF("page:%d val[i=%d]=%d\n", pages[i], i, val[i]);
    }
    munmap(ram, size);
}

int main(int argc, char **argv)
{
    int kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0) {
        perror("can't open /dev/kvm");
        exit(EXIT_FAILURE);
    }

    int vmem_dev_fd = ioctl(kvm_fd, KVM_CREATE_VMEM_DEV);
    if (vmem_dev_fd < 0) {
        err(EXIT_FAILURE, "can't create vmem_dev");
    }

    long page_size = sysconf(_SC_PAGESIZE);
    struct kvm_vmem_create create = {
        .size = VMEM_NR_PAGES * page_size,
    };
    if (ioctl(vmem_dev_fd, KVM_CREATE_VMEM, &create) < 0) {
        err(EXIT_FAILURE, "KVM_CREATE_VMEM");
    }
    close(vmem_dev_fd);

    int vmem_fd = create.vmem_fd;
    int shmem_fd = create.shmem_fd;
    size_t size = create.size;
    if (ftruncate(shmem_fd, size) < 0) {
        err(EXIT_FAILURE, "truncate(\"shmem_fd\")");
    }

    printf("vmem_fd %d shmem_fd %d\n", vmem_fd,
           shmem_fd);
    fflush(stdout);

    pid_t child = fork();
    if (child < 0) {
        err(EXIT_FAILURE, "fork");
    }
    if (child == 0) {
        sleep(1);
        printf("server pid: %d\n", getpid());
        server(vmem_fd, shmem_fd, size, page_size);
        return 0;
    }

    printf("qemu pid: %d server pid: %d\n", getpid(), child);
    close(shmem_fd);
    qemu(vmem_fd, size, page_size);
    return 0;
}
===========================================================================
--
yamahata

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal)
  2011-08-12 11:07 ` [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal) Isaku Yamahata
  2011-08-12 11:09   ` Isaku Yamahata
@ 2011-08-12 21:26   ` Blue Swirl
  2011-08-15 19:29   ` Avi Kivity
  2 siblings, 0 replies; 28+ messages in thread
From: Blue Swirl @ 2011-08-12 21:26 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Andrea Arcangeli, Nadav Har'El, kvm, satoshi.itoh,
	Stefan Hajnoczi, t.hirofuchi, Dor Laor, qemu-devel, Yaniv Kaul,
	Avi Kivity

On Fri, Aug 12, 2011 at 11:07 AM, Isaku Yamahata <yamahata@valinux.co.jp> wrote:
> Here is what I have right now for the post copy chardevice.
> The sample user land program will follow.
> It should give you a more concrete idea and help further discussion, I hope.
> This is just for discussion, so it's incomplete.
>
> I'm open to other ideas and quite happy to throw away this patch and
> go for a better way.
>
> thanks,
>
> From e262979e95b3c5a095c8cb0bc178309baa861a3f Mon Sep 17 00:00:00 2001
> Message-Id: <e262979e95b3c5a095c8cb0bc178309baa861a3f.1313146664.git.yamahata@valinux.co.jp>
> From: Isaku Yamahata <yamahata@valinux.co.jp>
> Date: Wed, 10 Aug 2011 18:28:05 +0900
> Subject: [PATCH] kvm/postcopy: chardevice for postcopy
>
> This is a character device to hook page access.
> A page fault in the area is reported to another user process by
> this chardriver. That process then fills the page contents and
> resolves the page fault.
> > Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp> > --- > arch/x86/kvm/Kconfig | 1 + > arch/x86/kvm/Makefile | 1 + > include/linux/kvm.h | 45 +++ > include/linux/kvm_host.h | 2 + > mm/memcontrol.c | 1 + > mm/shmem.c | 1 + > virt/kvm/Kconfig | 3 + > virt/kvm/kvm_main.c | 6 + > virt/kvm/vmem.c | 847 ++++++++++++++++++++++++++++++++++++++++++++++ > virt/kvm/vmem.h | 68 ++++ > 10 files changed, 975 insertions(+), 0 deletions(-) > create mode 100644 virt/kvm/vmem.c > create mode 100644 virt/kvm/vmem.h > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 0a09b58..dcbd52e 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -29,6 +29,7 @@ config KVM > select HAVE_KVM_EVENTFD > select KVM_APIC_ARCHITECTURE > select KVM_ASYNC_PF > + select KVM_VMEM > select USER_RETURN_NOTIFIER > select KVM_MMIO > select TASKSTATS > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile > index f15501f..6125f4c 100644 > --- a/arch/x86/kvm/Makefile > +++ b/arch/x86/kvm/Makefile > @@ -10,6 +10,7 @@ kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ > assigned-dev.o) > kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o) > kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o) > +kvm-$(CONFIG_KVM_VMEM) += $(addprefix ../../../virt/kvm/, vmem.o) > > kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \ > i8254.o timer.o > diff --git a/include/linux/kvm.h b/include/linux/kvm.h > index 55f5afb..623109e 100644 > --- a/include/linux/kvm.h > +++ b/include/linux/kvm.h > @@ -554,6 +554,7 @@ struct kvm_ppc_pvinfo { > #define KVM_CAP_PPC_SMT 64 > #define KVM_CAP_PPC_RMA 65 > #define KVM_CAP_MAX_VCPUS 66 /* returns max vcpus per vm */ > +#define KVM_CAP_POST_COPY_MEMORY 67 > > #ifdef KVM_CAP_IRQ_ROUTING > > @@ -760,6 +761,50 @@ struct kvm_clock_data { > /* Available with KVM_CAP_RMA */ > #define KVM_ALLOCATE_RMA _IOR(KVMIO, 0xa9, struct kvm_allocate_rma) > > +struct kvm_vmem_create { > + __u64 size; /* in bytes 
*/
> +	__s32 vmem_fd;
> +	__s32 shmem_fd;
> +};
> +
> +struct kvm_vmem_page_request {
> +	__u32 nr;

Padding will be needed here on 64 bit hosts unless the order is switched.

> +	__u64 __user *pgoffs;
> +};
> +
> +struct kvm_vmem_page_cached {
> +	__u32 nr;

Also here.

> +	__u64 __user *pgoffs;
> +};
> +
> +struct kvm_vmem_page_range {
> +	__u64 pgoff;
> +	__u64 nr_pages;
> +};
> +
> +struct kvm_vmem_make_pages_present {
> +	__u32 nr;

And here.

> +	struct kvm_vmem_page_range __user *ranges;
> +};
> +
> +/* Available with KVM_CAP_POST_COPY_MEMORY */
> +#define KVM_CREATE_VMEM_DEV	_IO(KVMIO, 0xb0)
> +
> +/* ioctl for vmem_dev fd */
> +#define KVM_CREATE_VMEM		_IOR(KVMIO, 0xb1, __u32)
> +
> +/* ioctl for vmem fd */
> +#define KVM_VMEM_WAIT_READY	_IO(KVMIO, 0xb2)
> +#define KVM_VMEM_READY		_IO(KVMIO, 0xb3)
> +#define KVM_VMEM_GET_PAGE_REQUEST \
> +	_IOWR(KVMIO, 0xb4, struct kvm_vmem_page_request)
> +#define KVM_VMEM_MARK_PAGE_CACHED \
> +	_IOW(KVMIO, 0xb5, struct kvm_vmem_page_cached)
> +#define KVM_VMEM_MAKE_PAGES_PRESENT \
> +	_IOW(KVMIO, 0xb6, struct kvm_vmem_make_pages_present)
> +#define KVM_VMEM_MAKE_VMA_ANONYMOUS	_IO(KVMIO, 0xb7)
> +
> +
>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
>
>  struct kvm_assigned_pci_dev {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ff4d406..8b3dafa 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -222,6 +222,8 @@ struct kvm_irq_routing_table {};
>
>  #endif
>
> +long kvm_dev_ioctl_create_vmem_dev(void);
> +
>  struct kvm_memslots {
>  	int nmemslots;
>  	u64 generation;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e013b8e..7f3fc4e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2838,6 +2838,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(mem_cgroup_cache_charge);
>
>  /*
>   * While swap-in, try_charge -> commit or cancel, the page is locked.
> diff --git a/mm/shmem.c b/mm/shmem.c > index fcedf54..ae7d61f 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -3035,6 +3035,7 @@ int shmem_zero_setup(struct vm_area_struct *vma) > vma->vm_flags |= VM_CAN_NONLINEAR; > return 0; > } > +EXPORT_SYMBOL_GPL(shmem_zero_setup); > > /** > * shmem_read_mapping_page_gfp - read into page cache, using specified page allocation flags. > diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig > index f63ccb0..d3040ea 100644 > --- a/virt/kvm/Kconfig > +++ b/virt/kvm/Kconfig > @@ -18,3 +18,6 @@ config KVM_MMIO > > config KVM_ASYNC_PF > bool > + > +config KVM_VMEM > + bool > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index aefdda3..9e47e20 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -2184,6 +2184,7 @@ static long kvm_dev_ioctl_check_extension_generic(long arg) > case KVM_CAP_SET_BOOT_CPU_ID: > #endif > case KVM_CAP_INTERNAL_ERROR_DATA: > + case KVM_CAP_POST_COPY_MEMORY: > return 1; > #ifdef CONFIG_HAVE_KVM_IRQCHIP > case KVM_CAP_IRQ_ROUTING: > @@ -2233,6 +2234,11 @@ static long kvm_dev_ioctl(struct file *filp, > case KVM_TRACE_DISABLE: > r = -EOPNOTSUPP; > break; > +#ifdef CONFIG_KVM_VMEM > + case KVM_CREATE_VMEM_DEV: > + r = kvm_dev_ioctl_create_vmem_dev(); > + break; > +#endif > default: > return kvm_arch_dev_ioctl(filp, ioctl, arg); > } > diff --git a/virt/kvm/vmem.c b/virt/kvm/vmem.c > new file mode 100644 > index 0000000..b413663 > --- /dev/null > +++ b/virt/kvm/vmem.c > @@ -0,0 +1,847 @@ > +/* > + * KVM post copy vmem > + * > + * Copyright (c) 2011, > + * National Institute of Advanced Industrial Science and Technology > + * > + * https://sites.google.com/site/grivonhome/quick-kvm-migration > + * Author: Isaku Yamahata <yamahata at valinux co jp> > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. 
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple
> + * Place - Suite 330, Boston, MA 02111-1307 USA.

The current address is:
  51 Franklin Street, Fifth Floor
  Boston, MA 02110-1301 USA

Then there is the version used in QEMU:
  if not, see <http://www.gnu.org/licenses/>.

I don't know which one is preferred with the kernel.

> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/kvm.h>
> +#include <linux/pagemap.h>
> +#include <linux/mm.h>
> +#include <linux/memcontrol.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include "vmem.h"
> +
> +static void kvm_vmem_release_fake_vmf(int ret, struct vm_fault *fake_vmf)
> +{
> +	if (ret & VM_FAULT_LOCKED) {
> +		unlock_page(fake_vmf->page);
> +	}
> +	page_cache_release(fake_vmf->page);
> +}
> +
> +static int kvm_vmem_minor_fault(struct kvm_vmem *vmem,
> +				struct vm_area_struct *vma,
> +				struct vm_fault *vmf)
> +{
> +	struct vm_fault fake_vmf;
> +	int ret;
> +	struct page *page;
> +
> +	BUG_ON(!test_bit(vmf->pgoff, vmem->cached));
> +	fake_vmf = *vmf;
> +	fake_vmf.page = NULL;
> +	ret = vmem->vma->vm_ops->fault(vmem->vma, &fake_vmf);
> +	if (ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))
> +		return ret;
> +
> +	/*
> +	 * TODO: pull out fake_vmf->page from shmem file and donate it
> +	 * to this vma resolving the page fault.
> + * vmf->page = fake_vmf->page; > + */ > + > + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address); > + if (!page) > + return VM_FAULT_OOM; > + if (mem_cgroup_cache_charge(page, vma->vm_mm, GFP_KERNEL)) { > + kvm_vmem_release_fake_vmf(ret, &fake_vmf); > + page_cache_release(page); > + return VM_FAULT_OOM; > + } > + > + copy_highpage(page, fake_vmf.page); > + kvm_vmem_release_fake_vmf(ret, &fake_vmf); > + > + ret |= VM_FAULT_LOCKED; > + SetPageUptodate(page); > + vmf->page = page; > + set_bit(vmf->pgoff, vmem->faulted); > + > + return ret; > +} > + > +static int kvm_vmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + struct file *filp = vma->vm_file; > + struct kvm_vmem *vmem = filp->private_data; > + > + if (vmf->pgoff >= vmem->pgoff_end) { > + return VM_FAULT_SIGBUS; > + } > + > + BUG_ON(test_bit(vmf->pgoff, vmem->faulted)); > + > + if (!test_bit(vmf->pgoff, vmem->cached)) { > + /* major fault */ > + unsigned long bit; > + DEFINE_WAIT(wait); > + > + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT) { > + /* async page fault */ > + spin_lock(&vmem->lock); > + if (vmem->async_req_nr < vmem->async_req_max) { > + vmem->async_req[vmem->async_req_nr] = > + vmf->pgoff; > + vmem->async_req_nr++; > + } > + spin_unlock(&vmem->lock); > + wake_up_poll(&vmem->req_wait, POLLIN); > + > + if (test_bit(vmf->pgoff, vmem->cached)) > + return kvm_vmem_minor_fault(vmem, vma, vmf); > + return VM_FAULT_MAJOR | VM_FAULT_RETRY; > + } > + > + spin_lock(&vmem->lock); > + bit = find_first_zero_bit(vmem->sync_wait_bitmap, > + vmem->sync_req_max); > + if (likely(bit < vmem->sync_req_max)) { > + vmem->sync_req[bit] = vmf->pgoff; > + prepare_to_wait(&vmem->page_wait[bit], &wait, > + TASK_UNINTERRUPTIBLE); > + set_bit(bit, vmem->sync_req_bitmap); > + set_bit(bit, vmem->sync_wait_bitmap); > + spin_unlock(&vmem->lock); > + wake_up_poll(&vmem->req_wait, POLLIN); > + > + if (!test_bit(vmf->pgoff, vmem->cached)) > + schedule(); > + finish_wait(&vmem->page_wait[bit], &wait); > + 
clear_bit(bit, vmem->sync_wait_bitmap); > + } else { > + struct kvm_vmem_page_req_list page_req_list = { > + .pgoff = vmf->pgoff, > + }; > + vmem->req_list_nr++; > + list_add_tail(&page_req_list.list, &vmem->req_list); > + wake_up_poll(&vmem->req_wait, POLLIN); > + for (;;) { > + prepare_to_wait(&vmem->req_list_wait, &wait, > + TASK_UNINTERRUPTIBLE); > + if (test_bit(vmf->pgoff, vmem->cached)) { > + vmem->req_list_nr--; > + break; > + } > + spin_unlock(&vmem->lock); > + schedule(); > + spin_lock(&vmem->lock); > + } > + spin_unlock(&vmem->lock); > + finish_wait(&vmem->req_list_wait, &wait); > + } > + > + return kvm_vmem_minor_fault(vmem, vma, vmf) | VM_FAULT_MAJOR; > + } > + > + return kvm_vmem_minor_fault(vmem, vma, vmf); > +} > + > +/* for partial munmap */ > +static void kvm_vmem_vma_open(struct vm_area_struct *vma) > +{ > + struct file *filp = vma->vm_file; > + struct kvm_vmem *vmem = filp->private_data; > + > + spin_lock(&vmem->lock); > + vmem->vma_nr++; > + spin_unlock(&vmem->lock); > +} > + > +static void kvm_vmem_vma_close(struct vm_area_struct *vma) > +{ > + struct file *filp = vma->vm_file; > + struct kvm_vmem *vmem = filp->private_data; > + struct task_struct *task = NULL; > + > + spin_lock(&vmem->lock); > + vmem->vma_nr--; > + if (vmem->vma_nr == 0) { > + task = vmem->task; > + vmem->task = NULL; > + } > + spin_unlock(&vmem->lock); > + > + if (task) > + put_task_struct(task); > +} > + > +static const struct vm_operations_struct kvm_vmem_vm_ops = { > + .open = kvm_vmem_vma_open, > + .close = kvm_vmem_vma_close, > + .fault = kvm_vmem_fault, > +}; > + > +static int kvm_vmem_mmap(struct file *filp, struct vm_area_struct *vma) > +{ > + struct kvm_vmem *vmem = filp->private_data; > + int error; > + > + /* allow mmap() only once */ > + spin_lock(&vmem->lock); > + if (vmem->mmapped) { > + error = -EBUSY; > + goto out; > + } > + if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff > > + vmem->pgoff_end) { > + error = -EINVAL; > + goto out; > + } > + 
> + vmem->mmapped = true; > + vmem->vma_nr = 1; > + vmem->vm_start = vma->vm_start; > + get_task_struct(current); > + vmem->task = current; > + spin_unlock(&vmem->lock); > + > + vma->vm_ops = &kvm_vmem_vm_ops; > + vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND; > + vma->vm_flags &= ~VM_SHARED; > + return 0; > + > +out: > + spin_unlock(&vmem->lock); > + return error; > +} > + > +static bool kvm_vmem_req_pending(struct kvm_vmem* vmem) > +{ > + return !list_empty(&vmem->req_list) || > + !bitmap_empty(vmem->sync_req_bitmap, vmem->sync_req_max) || > + (vmem->async_req_nr > 0); > +} > + > +static unsigned int kvm_vmem_poll(struct file* filp, poll_table *wait) > +{ > + struct kvm_vmem *vmem = filp->private_data; > + unsigned int events = 0; > + > + poll_wait(filp, &vmem->req_wait, wait); > + > + spin_lock(&vmem->lock); > + if (kvm_vmem_req_pending(vmem)) > + events |= POLLIN; > + spin_unlock(&vmem->lock); > + > + return events; > +} > + > +/* > + * return value > + * true: finished > + * false: more request > + */ > +static bool kvm_vmem_copy_page_request(struct kvm_vmem *vmem, > + pgoff_t *pgoffs, int req_max, > + int *req_nr) > +{ > + struct kvm_vmem_page_req_list *req_list; > + struct kvm_vmem_page_req_list *tmp; > + > + unsigned long bit; > + > + *req_nr = 0; > + list_for_each_entry_safe(req_list, tmp, &vmem->req_list, list) { > + list_del(&req_list->list); > + pgoffs[*req_nr] = req_list->pgoff; > + (*req_nr)++; > + if (*req_nr >= req_max) > + return false; > + } > + > + bit = 0; > + for (;;) { > + bit = find_next_bit(vmem->sync_req_bitmap, vmem->sync_req_max, > + bit); > + if (bit >= vmem->sync_req_max) > + break; > + pgoffs[*req_nr] = vmem->sync_req[bit]; > + (*req_nr)++; > + clear_bit(bit, vmem->sync_req_bitmap); > + if (*req_nr >= req_max) > + return false; > + bit++; > + } > + > + if (vmem->async_req_nr > 0) { > + int nr = min(req_max - *req_nr, vmem->async_req_nr); > + memcpy(pgoffs + *req_nr, vmem->async_req, > + sizeof(*vmem->async_req) * nr); > + 
vmem->async_req_nr -= nr; > + *req_nr += nr; > + memmove(vmem->async_req, vmem->sync_req + nr, > + vmem->async_req_nr * sizeof(*vmem->async_req)); > + > + } > + return vmem->async_req_nr == 0; > +} > + > +static int kvm_vmem_get_page_request(struct kvm_vmem *vmem, > + struct kvm_vmem_page_request *page_req) > +{ > + DEFINE_WAIT(wait); > +#define REQ_MAX ((__u32)32) > + pgoff_t pgoffs[REQ_MAX]; > + __u32 req_copied = 0; > + int ret = 0; > + > + spin_lock(&vmem->lock); > + for (;;) { > + prepare_to_wait(&vmem->req_wait, &wait, TASK_INTERRUPTIBLE); > + if (kvm_vmem_req_pending(vmem)) { > + break; > + } > + if (signal_pending(current)) { > + ret = -ERESTARTSYS; > + break; > + } > + spin_unlock(&vmem->lock); > + schedule(); > + spin_lock(&vmem->lock); > + } > + finish_wait(&vmem->req_wait, &wait); > + if (ret) > + goto out_unlock; > + > + while (req_copied < page_req->nr) { > + int req_max; > + int req_nr; > + bool finished; > + req_max = min(page_req->nr - req_copied, REQ_MAX); > + finished = kvm_vmem_copy_page_request(vmem, pgoffs, req_max, > + &req_nr); > + > + spin_unlock(&vmem->lock); > + > + if (req_nr > 0) { > + ret = 0; > + if (copy_to_user(page_req->pgoffs + req_copied, pgoffs, > + sizeof(*pgoffs) * req_nr)) { > + ret = -EFAULT; > + goto out; > + } > + } > + req_copied += req_nr; > + if (finished) > + goto out; > + > + spin_lock(&vmem->lock); > + } > + > +out_unlock: > + spin_unlock(&vmem->lock); > +out: > + page_req->nr = req_copied; > + return ret; > +} > + > +static int kvm_vmem_mark_page_cached(struct kvm_vmem *vmem, > + struct kvm_vmem_page_cached *page_cached) > +{ > + int ret = 0; > +#define PG_MAX ((__u32)32) > + __u64 pgoffs[PG_MAX]; > + __u32 nr; > + unsigned long bit; > + bool wake_up_list = false; > + > + nr = 0; > + while (nr < page_cached->nr) { > + __u32 todo = min(PG_MAX, (page_cached->nr - nr)); > + int i; > + > + if (copy_from_user(pgoffs, page_cached->pgoffs + nr, > + sizeof(*pgoffs) * todo)) { > + ret = -EFAULT; > + goto out; > + } > + for 
(i = 0; i < todo; ++i) { > + if (pgoffs[i] >= vmem->pgoff_end) { > + ret = -EINVAL; > + goto out; > + } > + set_bit(pgoffs[i], vmem->cached); > + } > + nr += todo; > + } > + > + spin_lock(&vmem->lock); > + bit = 0; > + for (;;) { > + bit = find_next_bit(vmem->sync_wait_bitmap, vmem->sync_req_max, > + bit); > + if (bit >= vmem->sync_req_max) > + break; > + if (test_bit(vmem->sync_req[bit], vmem->cached)) > + wake_up(&vmem->page_wait[bit]); > + bit++; > + } > + > + if (vmem->req_list_nr > 0) > + wake_up_list = true; > + spin_unlock(&vmem->lock); > + > + if (wake_up_list) > + wake_up_all(&vmem->req_list_wait); > + > +out: > + return ret; > +} > + > +static bool kvm_vmem_is_vmem_vma(const struct kvm_vmem *vmem, > + const struct vm_area_struct *vma) > +{ > + return vma->vm_file && vma->vm_file->private_data == vmem; > +} > + > +static void kvm_vmem_make_pages_present_entry(struct kvm_vmem *vmem, > + struct kvm_vmem_page_range *range, > + struct task_struct *task, > + struct mm_struct *mm, > + unsigned long vm_start) > +{ > + unsigned long pgoff = range->pgoff; > + unsigned long range_end = range->pgoff + range->nr_pages; > + > + down_read(&mm->mmap_sem); > + > + while (pgoff < range->pgoff + range->nr_pages) { > + unsigned long pgoff_end; > + struct vm_area_struct *vma; > + unsigned long saddr; > + unsigned long eaddr; > + > + /* search unfaulted range */ > + spin_lock(&vmem->lock); > + pgoff = find_next_zero_bit(vmem->faulted, range_end, pgoff); > + if (pgoff >= range_end) { > + spin_unlock(&vmem->lock); > + break; > + } > + pgoff_end = find_next_bit(vmem->faulted, range_end, pgoff); > + spin_unlock(&vmem->lock); > + > + saddr = vm_start + (pgoff << PAGE_SHIFT); > + eaddr = vm_start + (pgoff_end << PAGE_SHIFT); > + vma = find_vma(mm, saddr); > + if (vma == NULL) { > + break; > + } > + if (eaddr < vma->vm_start) { > + pgoff = (vma->vm_start - vm_start) >> PAGE_SHIFT; > + continue; > + } > + > + if (kvm_vmem_is_vmem_vma(vmem, vma)) { > + unsigned long start = 
max(vma->vm_start, saddr); > + unsigned long end = min(vma->vm_end, eaddr); > + int nr_pages = (end - start) >> PAGE_SHIFT; > + get_user_pages(task, mm, start, nr_pages, > + 1, 1, NULL, NULL); > + pgoff = (end - vm_start) >> PAGE_SHIFT; > + } else { > + pgoff = (vma->vm_end - vm_start) >> PAGE_SHIFT; > + } > + } > + > + up_read(&mm->mmap_sem); > +} > + > +static int kvm_vmem_make_pages_present( > + struct kvm_vmem *vmem, > + struct kvm_vmem_make_pages_present *pages_present) > +{ > + struct task_struct *task; > + struct mm_struct *mm; > + pgoff_t pgoff_end; > + unsigned long vm_start; > + unsigned long vm_eaddr; > + > +#define NUM_ENTRIES ((__u32)32) > + struct kvm_vmem_page_range kranges[NUM_ENTRIES]; > + __u32 nr = 0; > + int ret; > + > + spin_lock(&vmem->lock); > + task = vmem->task; > + pgoff_end = vmem->pgoff_end; > + vm_start = vmem->vm_start; > + vm_eaddr = vm_start + vmem->size; > + spin_unlock(&vmem->lock); > + if (task == NULL) > + return 0; > + mm = get_task_mm(task); > + if (mm == NULL) > + return 0; > + > + ret = 0; > + while (nr < pages_present->nr) { > + int nr_ranges = min(NUM_ENTRIES, pages_present->nr - nr); > + int i; > + > + if (copy_from_user(&kranges, pages_present->ranges + nr, > + sizeof(kranges[0]) * nr_ranges)) { > + ret = -EFAULT; > + break; > + } > + for (i = 0; i < nr_ranges; ++i) { > + struct kvm_vmem_page_range *range = &kranges[i]; > + if (range->pgoff >= pgoff_end || > + range->nr_pages >= pgoff_end || > + range->pgoff + range->nr_pages >= pgoff_end) { > + ret = -EINVAL; > + break; > + } > + kvm_vmem_make_pages_present_entry(vmem, range, > + task, mm, vm_start); > + } > + nr += nr_ranges; > + } > + > + mmput(mm); > + return ret; > +} > + > +static int kvm_vmem_make_vma_anonymous(struct kvm_vmem *vmem) > +{ > +#if 1 > + return -ENOSYS; > +#else > + unsigned long saddr; > + unsigned long eaddr; > + unsigned long addr; > + unsigned long bit; > + struct task_struct *task; > + struct mm_struct *mm; > + > + spin_lock(&vmem->lock); > + 
task = vmem->task; > + saddr = vmem->vm_start; > + eaddr = saddr + vmem->size; > + bit = find_first_zero_bit(vmem->faulted, vmem->pgoff_end); > + if (bit < vmem->pgoff_end) { > + spin_unlock(&vmem->lock); > + return -EBUSY; > + } > + spin_unlock(&vmem->lock); > + if (task == NULL) > + return 0; > + mm = get_task_mm(task); > + if (mm == NULL) > + return 0; > + > + addr = saddr; > + down_write(&mm->mmap_sem); > + while (addr < eaddr) { > + struct vm_area_struct *vma; > + vma = find_vma(mm, addr); > + if (kvm_vmem_is_vmem_vma(vmem, vma)) { > + /* XXX incorrect. race/locking and more fix up */ > + struct file *filp = vma->vm_file; > + vma->vm_ops->close(vma); > + vma->vm_ops = NULL; > + vma->vm_file = NULL; > + /* vma->vm_flags */ > + fput(filp); > + } > + addr = vma->vm_end; > + } > + up_write(&mm->mmap_sem); > + > + mmput(mm); > + return 0; > +#endif > +} > + > +static void kvm_vmem_ready(struct kvm_vmem *vmem) > +{ > + spin_lock(&vmem->lock); > + vmem->ready = true; > + spin_unlock(&vmem->lock); > + wake_up_interruptible(&vmem->ready_wait); > +} > + > +static int kvm_vmem_wait_ready(struct kvm_vmem *vmem) > +{ > + int ret = 0; > + DEFINE_WAIT(wait); > + > + spin_lock(&vmem->lock); > + for (;;) { > + prepare_to_wait(&vmem->ready_wait, &wait, TASK_INTERRUPTIBLE); > + if (vmem->ready) { > + break; > + } > + if (signal_pending(current)) { > + ret = -ERESTARTSYS; > + break; > + } > + spin_unlock(&vmem->lock); > + schedule(); > + spin_lock(&vmem->lock); > + } > + spin_unlock(&vmem->lock); > + finish_wait(&vmem->ready_wait, &wait); > + return ret; > +} > + > +static long kvm_vmem_ioctl(struct file *filp, unsigned int ioctl, > + unsigned long arg) > +{ > + struct kvm_vmem *vmem = filp->private_data; > + void __user *argp = (void __user *) arg; > + long ret = 0; > + > + switch (ioctl) { > + case KVM_VMEM_READY: > + kvm_vmem_ready(vmem); > + ret = 0; > + break; > + case KVM_VMEM_WAIT_READY: > + ret = kvm_vmem_wait_ready(vmem); > + break; > + case KVM_VMEM_GET_PAGE_REQUEST: { 
> + struct kvm_vmem_page_request page_request; > + ret = -EFAULT; > + if (copy_from_user(&page_request, argp, sizeof(page_request))) > + break; > + ret = kvm_vmem_get_page_request(vmem, &page_request); > + if (ret == 0 && > + copy_to_user(argp + > + offsetof(struct kvm_vmem_page_request, nr), > + &page_request.nr, > + sizeof(page_request.nr))) { > + ret = -EFAULT; > + break; > + } > + break; > + } > + case KVM_VMEM_MARK_PAGE_CACHED: { > + struct kvm_vmem_page_cached page_cached; > + ret = -EFAULT; > + if (copy_from_user(&page_cached, argp, sizeof(page_cached))) > + break; > + ret = kvm_vmem_mark_page_cached(vmem, &page_cached); > + break; > + } > + case KVM_VMEM_MAKE_PAGES_PRESENT: { > + struct kvm_vmem_make_pages_present pages_present; > + ret = -EFAULT; > + if (copy_from_user(&pages_present, argp, > + sizeof(pages_present))) > + break; > + ret = kvm_vmem_make_pages_present(vmem, &pages_present); > + break; > + } > + case KVM_VMEM_MAKE_VMA_ANONYMOUS: > + ret = kvm_vmem_make_vma_anonymous(vmem); > + break; > + default: > + ret = -EINVAL; > + break; > + } > + return ret; > +} > + > +static unsigned long kvm_vmem_bitmap_bytes(const struct kvm_vmem *vmem) > +{ > + return round_up(vmem->pgoff_end, BITS_PER_LONG) / 8; > +} > + > +static void kvm_vmem_free(struct kvm_vmem *vmem) > +{ > + if (vmem->task) { > + put_task_struct(vmem->task); > + vmem->task = NULL; > + } > + > + if (vmem->shmem_filp) > + fput(vmem->shmem_filp); > + if (kvm_vmem_bitmap_bytes(vmem) > PAGE_SIZE) { > + vfree(vmem->cached); > + vfree(vmem->faulted); > + } else { > + kfree(vmem->cached); > + kfree(vmem->faulted); > + } > + kfree(vmem->vma); > + kfree(vmem->async_req); > + kfree(vmem->sync_req_bitmap); > + kfree(vmem->sync_wait_bitmap); > + kfree(vmem->page_wait); > + kfree(vmem->sync_req); > + kfree(vmem); > +} > + > +static int kvm_vmem_release(struct inode *inode, struct file *filp) > +{ > + struct kvm_vmem *vmem = filp->private_data; > + kvm_vmem_free(vmem); > + return 0; > +} > + > +static 
struct file_operations kvm_vmem_fops = {
> +	.release	= kvm_vmem_release,
> +	.unlocked_ioctl	= kvm_vmem_ioctl,
> +	.mmap		= kvm_vmem_mmap,
> +	.poll		= kvm_vmem_poll,
> +	.llseek		= noop_llseek,
> +};
> +
> +static int kvm_create_vmem(struct kvm_vmem_create *create)
> +{
> +	int error = 0;
> +	struct kvm_vmem *vmem = NULL;
> +	struct vm_area_struct *vma = NULL;
> +	int shmem_fd;
> +	unsigned long bitmap_bytes;
> +	unsigned long sync_bitmap_bytes;
> +	int i;
> +
> +	vmem = kzalloc(sizeof(*vmem), GFP_KERNEL);
> +	vmem->task = NULL;

Is this needed, doesn't kzalloc() return zeroed memory?

> +	vmem->mmapped = false;
> +	spin_lock_init(&vmem->lock);
> +	vmem->size = roundup(create->size, PAGE_SIZE);
> +	vmem->pgoff_end = vmem->size >> PAGE_SHIFT;
> +	init_waitqueue_head(&vmem->req_wait);
> +
> +	vma = kzalloc(sizeof(*vma), GFP_KERNEL);
> +	vma->vm_start = 0;

Also here.

> +	vma->vm_end = vmem->size;
> +	/* this shmem file is used for temporal buffer for pages
> +	   so it's unlikely that so many pages exists in this shmem file */
> +	vma->vm_flags = VM_READ | VM_SHARED | VM_NOHUGEPAGE | VM_DONTCOPY |
> +		VM_DONTEXPAND;
> +	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
> +	vma->vm_pgoff = 0;
> +	INIT_LIST_HEAD(&vma->anon_vma_chain);
> +
> +	vmem->vma = vma;
> +
> +	shmem_fd = get_unused_fd();
> +	if (shmem_fd < 0) {
> +		error = shmem_fd;
> +		goto out;
> +	}
> +	error = shmem_zero_setup(vma);
> +	if (error < 0) {
> +		put_unused_fd(shmem_fd);
> +		goto out;
> +	}
> +	vmem->shmem_filp = vma->vm_file;
> +	get_file(vmem->shmem_filp);
> +	fd_install(shmem_fd, vma->vm_file);
> +	create->shmem_fd = shmem_fd;
> +
> +	create->vmem_fd = anon_inode_getfd("kvm-vmem",
> +					   &kvm_vmem_fops, vmem, O_RDWR);
> +	if (create->vmem_fd < 0) {
> +		error = create->vmem_fd;
> +		goto out;
> +	}
> +
> +	bitmap_bytes = kvm_vmem_bitmap_bytes(vmem);
> +	if (bitmap_bytes > PAGE_SIZE) {
> +		vmem->cached = vzalloc(bitmap_bytes);
> +		vmem->faulted = vzalloc(bitmap_bytes);
> +	} else {
> +		vmem->cached =
kzalloc(bitmap_bytes, GFP_KERNEL); > + vmem->faulted = kzalloc(bitmap_bytes, GFP_KERNEL); > + } > + > +#define ASYNC_REQ_MAX (ASYNC_PF_PER_VCPU * KVM_MAX_VCPUS) > + vmem->async_req_max = ASYNC_REQ_MAX; > + vmem->async_req_nr = 0; > + vmem->async_req = kzalloc(sizeof(*vmem->async_req), GFP_KERNEL); > + > +#define SYNC_REQ_MAX (KVM_MAX_VCPUS) > + vmem->sync_req_max = round_up(SYNC_REQ_MAX, BITS_PER_LONG); > + sync_bitmap_bytes = sizeof(unsigned long) * > + (vmem->sync_req_max / BITS_PER_LONG); > + vmem->sync_req_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL); > + vmem->sync_wait_bitmap = kzalloc(sync_bitmap_bytes, GFP_KERNEL); > + vmem->page_wait = kzalloc(sizeof(*vmem->page_wait) * > + vmem->sync_req_max, GFP_KERNEL); > + for (i = 0; i < vmem->sync_req_max; ++i) > + init_waitqueue_head(&vmem->page_wait[i]); > + vmem->sync_req = kzalloc(sizeof(*vmem->sync_req) * > + vmem->sync_req_max, GFP_KERNEL); > + > + vmem->req_list_nr = 0; > + INIT_LIST_HEAD(&vmem->req_list); > + init_waitqueue_head(&vmem->req_list_wait); > + > + init_waitqueue_head(&vmem->ready_wait); > + vmem->ready = false; > + > + return 0; > + > + out: > + kvm_vmem_free(vmem); > + return error; > +} > + > +static long kvm_vmem_dev_ioctl(struct file *filp, unsigned int ioctl, > + unsigned long arg) > +{ > + void __user *argp = (void __user *) arg; > + long ret; > + > + switch (ioctl) { > + case KVM_CREATE_VMEM: { > + struct kvm_vmem_create create; > + if (copy_from_user(&create, argp, sizeof(create))) { > + ret = -EFAULT; > + break; > + } > + ret = kvm_create_vmem(&create); > + if (copy_to_user(argp, &create, sizeof(create))) { > + ret = -EFAULT; > + break; > + } > + break; > + } > + default: > + ret = -EINVAL; > + break; > + } > + return ret; > +} > + > +static int kvm_vmem_dev_release(struct inode *inode, struct file *filp) > +{ > + return 0; > +} > + > +static struct file_operations kvm_vmem_dev_fops = { > + .release = kvm_vmem_dev_release, > + .unlocked_ioctl = kvm_vmem_dev_ioctl, > +}; > + > +long 
kvm_dev_ioctl_create_vmem_dev(void) > +{ > + return anon_inode_getfd("kvm-vmem-dev", &kvm_vmem_dev_fops, > + NULL, O_RDWR); > +} > diff --git a/virt/kvm/vmem.h b/virt/kvm/vmem.h > new file mode 100644 > index 0000000..bc7e8cf > --- /dev/null > +++ b/virt/kvm/vmem.h > @@ -0,0 +1,68 @@ > +/* > + * KVM post copy vmem > + * > + * Copyright (c) 2011, > + * National Institute of Advanced Industrial Science and Technology > + * > + * https://sites.google.com/site/grivonhome/quick-kvm-migration > + * Author: Isaku Yamahata <yamahata at valinux co jp> > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms and conditions of the GNU General Public License, > + * version 2, as published by the Free Software Foundation. > + * > + * This program is distributed in the hope it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + * > + * You should have received a copy of the GNU General Public License along with > + * this program; if not, write to the Free Software Foundation, Inc., 59 Temple > + * Place - Suite 330, Boston, MA 02111-1307 USA. Old address also here. > + */ > + > +#ifndef __KVM_VMEM_H__ > +#define __KVM_VMEM_H__ > + > +struct kvm_vmem_page_req_list { > + struct list_head list; > + pgoff_t pgoff; > +}; > + > +struct kvm_vmem { > + loff_t size; > + pgoff_t pgoff_end; > + spinlock_t lock; > + > + wait_queue_head_t req_wait; > + > + int async_req_max; > + int async_req_nr; > + pgoff_t *async_req; > + > + int sync_req_max; 'int' between pointers would mean 4 bytes of structure padding on 64 bit hosts. 
> + unsigned long *sync_req_bitmap; > + unsigned long *sync_wait_bitmap; > + pgoff_t *sync_req; > + wait_queue_head_t *page_wait; > + > + int req_list_nr; > + struct list_head req_list; > + wait_queue_head_t req_list_wait; > + > + unsigned long *cached; > + unsigned long *faulted; > + > + bool mmapped; > + unsigned long vm_start; > + unsigned int vma_nr; > + struct task_struct *task; > + > + wait_queue_head_t ready_wait; > + bool ready; > + > + struct file *shmem_filp; > + struct vm_area_struct *vma; > +}; > + > +#endif /* __KVM_VMEM_H__ */ > -- > 1.7.1.1 > > > -- > yamahata > > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal)
  2011-08-12 11:07 ` [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal) Isaku Yamahata
  2011-08-12 11:09 ` Isaku Yamahata
  2011-08-12 21:26 ` Blue Swirl
@ 2011-08-15 19:29 ` Avi Kivity
  2011-08-16  1:42 ` Isaku Yamahata
  2 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2011-08-15 19:29 UTC (permalink / raw)
To: Isaku Yamahata
Cc: Andrea Arcangeli, Nadav Har'El, kvm, satoshi.itoh, Stefan Hajnoczi,
    t.hirofuchi, Dor Laor, qemu-devel, Yaniv Kaul

On 08/12/2011 04:07 AM, Isaku Yamahata wrote:
> This is a character device to hook page access.
> The page fault in the area is reported to another user process by
> this chardriver. Then, the process fills the page contents and
> resolves the page fault.

Have you considered CUSE (character device in userspace, fs/fuse/cuse.c)?

> index 55f5afb..623109e 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -554,6 +554,7 @@ struct kvm_ppc_pvinfo {
>  #define KVM_CAP_PPC_SMT 64
>  #define KVM_CAP_PPC_RMA 65
>  #define KVM_CAP_MAX_VCPUS 66	/* returns max vcpus per vm */
> +#define KVM_CAP_POST_COPY_MEMORY 67
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -760,6 +761,50 @@ struct kvm_clock_data {
>  /* Available with KVM_CAP_RMA */
>  #define KVM_ALLOCATE_RMA	_IOR(KVMIO, 0xa9, struct kvm_allocate_rma)
>
> +struct kvm_vmem_create {
> +	__u64 size;	/* in bytes */
> +	__s32 vmem_fd;
> +	__s32 shmem_fd;
> +};

Should really be outside kvm.h (and virt/kvm), since it's not kvm specific.

> +
> +struct kvm_vmem_page_request {
> +	__u32 nr;
> +	__u64 __user *pgoffs;
> +};
> +
> +struct kvm_vmem_page_cached {
> +	__u32 nr;
> +	__u64 __user *pgoffs;
> +};
> +
> +struct kvm_vmem_page_range {
> +	__u64 pgoff;
> +	__u64 nr_pages;
> +};
> +
> +struct kvm_vmem_make_pages_present {
> +	__u32 nr;
> +	struct kvm_vmem_page_range __user *ranges;
> +};

This is madvise(MADV_WILLNEED), is it not?

> +
> +/* Available with KVM_CAP_POST_COPY_MEMORY */
> +#define KVM_CREATE_VMEM_DEV	_IO(KVMIO, 0xb0)
> +
> +/* ioctl for vmem_dev fd */
> +#define KVM_CREATE_VMEM		_IOR(KVMIO, 0xb1, __u32)
> +
> +/* ioctl for vmem fd */
> +#define KVM_VMEM_WAIT_READY	_IO(KVMIO, 0xb2)
> +#define KVM_VMEM_READY		_IO(KVMIO, 0xb3)
> +#define KVM_VMEM_GET_PAGE_REQUEST \
> +	_IOWR(KVMIO, 0xb4, struct kvm_vmem_page_request)
> +#define KVM_VMEM_MARK_PAGE_CACHED \
> +	_IOW(KVMIO, 0xb5, struct kvm_vmem_page_cached)
> +#define KVM_VMEM_MAKE_PAGES_PRESENT \
> +	_IOW(KVMIO, 0xb6, struct kvm_vmem_make_pages_present)
> +#define KVM_VMEM_MAKE_VMA_ANONYMOUS _IO(KVMIO, 0xb7)

Can you explain these in some more detail?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal)
  2011-08-15 19:29 ` Avi Kivity
@ 2011-08-16  1:42 ` Isaku Yamahata
  2011-08-16 13:40 ` Avi Kivity
  0 siblings, 1 reply; 28+ messages in thread
From: Isaku Yamahata @ 2011-08-16 1:42 UTC (permalink / raw)
To: Avi Kivity
Cc: Andrea Arcangeli, Nadav Har'El, kvm, satoshi.itoh, Stefan Hajnoczi,
    t.hirofuchi, Dor Laor, qemu-devel, Yaniv Kaul

On Mon, Aug 15, 2011 at 12:29:37PM -0700, Avi Kivity wrote:
> On 08/12/2011 04:07 AM, Isaku Yamahata wrote:
>> This is a character device to hook page access.
>> The page fault in the area is reported to another user process by
>> this chardriver. Then, the process fills the page contents and
>> resolves the page fault.
>
> Have you considered CUSE (character device in userspace, fs/fuse/cuse.c)?

From looking at dev.c and cuse.c, CUSE doesn't seem to support mmap or
a fault handler.

>
>> index 55f5afb..623109e 100644
>> --- a/include/linux/kvm.h
>> +++ b/include/linux/kvm.h
>> @@ -554,6 +554,7 @@ struct kvm_ppc_pvinfo {
>>  #define KVM_CAP_PPC_SMT 64
>>  #define KVM_CAP_PPC_RMA 65
>>  #define KVM_CAP_MAX_VCPUS 66	/* returns max vcpus per vm */
>> +#define KVM_CAP_POST_COPY_MEMORY 67
>>
>>  #ifdef KVM_CAP_IRQ_ROUTING
>>
>> @@ -760,6 +761,50 @@ struct kvm_clock_data {
>>  /* Available with KVM_CAP_RMA */
>>  #define KVM_ALLOCATE_RMA	_IOR(KVMIO, 0xa9, struct kvm_allocate_rma)
>>
>> +struct kvm_vmem_create {
>> +	__u64 size;	/* in bytes */
>> +	__s32 vmem_fd;
>> +	__s32 shmem_fd;
>> +};
>
> Should really be outside kvm.h (and virt/kvm), since it's not kvm specific.

Okay. I'll un-kvm it.

>> +
>> +struct kvm_vmem_page_request {
>> +	__u32 nr;
>> +	__u64 __user *pgoffs;
>> +};
>> +
>> +struct kvm_vmem_page_cached {
>> +	__u32 nr;
>> +	__u64 __user *pgoffs;
>> +};
>> +
>> +struct kvm_vmem_page_range {
>> +	__u64 pgoff;
>> +	__u64 nr_pages;
>> +};
>> +
>> +struct kvm_vmem_make_pages_present {
>> +	__u32 nr;
>> +	struct kvm_vmem_page_range __user *ranges;
>> +};
>
> This is madvise(MADV_WILLNEED), is it not?

Another process, not the qemu process, issues it, and it makes the pages
present in the qemu process address space.

>> +
>> +/* Available with KVM_CAP_POST_COPY_MEMORY */
>> +#define KVM_CREATE_VMEM_DEV	_IO(KVMIO, 0xb0)
>> +
>> +/* ioctl for vmem_dev fd */
>> +#define KVM_CREATE_VMEM		_IOR(KVMIO, 0xb1, __u32)
>> +
>> +/* ioctl for vmem fd */
>> +#define KVM_VMEM_WAIT_READY	_IO(KVMIO, 0xb2)
>> +#define KVM_VMEM_READY		_IO(KVMIO, 0xb3)
>> +#define KVM_VMEM_GET_PAGE_REQUEST \
>> +	_IOWR(KVMIO, 0xb4, struct kvm_vmem_page_request)
>> +#define KVM_VMEM_MARK_PAGE_CACHED \
>> +	_IOW(KVMIO, 0xb5, struct kvm_vmem_page_cached)
>> +#define KVM_VMEM_MAKE_PAGES_PRESENT \
>> +	_IOW(KVMIO, 0xb6, struct kvm_vmem_make_pages_present)
>> +#define KVM_VMEM_MAKE_VMA_ANONYMOUS _IO(KVMIO, 0xb7)
>
> Can you explain these in some more detail?

KVM_CREATE_VMEM_DEV: create a vmem-dev device from the kvm device
                     (for qemu)
KVM_CREATE_VMEM: create a vmem device from the vmem-dev device
                 (note: qemu creates more than one memory region)

KVM_VMEM_WAIT_READY: wait for KVM_VMEM_READY (for qemu)
KVM_VMEM_READY: unblock KVM_VMEM_WAIT_READY (for daemon use)
These are for qemu and the daemon to synchronise entering the postcopy
stage.

KVM_VMEM_GET_PAGE_REQUEST: retrieve page faults of the qemu process
                           (for daemon use)
KVM_VMEM_MARK_PAGE_CACHED: mark the specified pages as pulled from the
                           source (for daemon use)
KVM_VMEM_MAKE_PAGES_PRESENT: make the specified pages present in the qemu
                             virtual address space (for daemon use)
KVM_VMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
                             anonymous. I'm not sure whether this can be
                             implemented or not.

I think the following work flow on the destination helps.

qemu on the destination
        |
        V
  open(/dev/kvm)
        |
        V
  KVM_CREATE_VMEM_DEV
        |
        V
  Here we have two file descriptors to
  vmem device and shmem file
        |
        |                                  daemon on the destination
        V
  fork()---------------------------------------,
        |                                      |
        V                                      |
  close(socket)                                V
  close(shmem)                           mmap(shmem file)
        |                                      |
        V                                      V
  mmap(vmem device) for guest RAM        close(shmem file)
        |                                      |
        V                                      |
  KVM_VMEM_WAIT_READY <------------------ KVM_VMEM_READY
        |                                      |
        V                                      |
  close(vmem device)                     Here the daemon takes over
        |                                the ownership of the socket
  entering post copy stage               to the source
  start guest execution                        |
        |                                      |
        V                                      V
  access guest RAM                       KVM_VMEM_GET_PAGE_REQUEST
        |                                      |
        V                                      V
  page fault --------------------------> page offset is returned
  block                                        |
                                               V
                                         pull page from the source
                                         write the page contents
                                         to the shmem
                                               |
                                               V
  unblock <----------------------------- KVM_VMEM_MARK_PAGE_CACHED
  the fault handler returns the page
  page fault is resolved
        |
        |                                pages can be pulled
        |                                in the background
        |                                      |
        |                                      V
        |                                KVM_VMEM_MARK_PAGE_CACHED
        |                                      |
        V                                      V
  the specified pages <----------------- KVM_VMEM_MAKE_PAGES_PRESENT
  are made present                             |
  so future page faults are avoided            |
        |                                      |
        V                                      V
  all the pages are pulled from the source
        |                                      |
        V                                      V
  the vma becomes anonymous <----------- KVM_VMEM_MAKE_VMA_ANONYMOUS
  (note: I'm not sure if this can be implemented or not)
        |                                      |
        V                                      V
  migration completes                    exit()

thanks,
-- 
yamahata
* Re: [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal)
  2011-08-16  1:42 ` Isaku Yamahata
@ 2011-08-16 13:40 ` Avi Kivity
  0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2011-08-16 13:40 UTC (permalink / raw)
To: Isaku Yamahata
Cc: Andrea Arcangeli, Nadav Har'El, kvm, satoshi.itoh, Stefan Hajnoczi,
    t.hirofuchi, Dor Laor, qemu-devel, Yaniv Kaul

On 08/15/2011 06:42 PM, Isaku Yamahata wrote:
> On Mon, Aug 15, 2011 at 12:29:37PM -0700, Avi Kivity wrote:
> > On 08/12/2011 04:07 AM, Isaku Yamahata wrote:
> >> This is a character device to hook page access.
> >> The page fault in the area is reported to another user process by
> >> this chardriver. Then, the process fills the page contents and
> >> resolves the page fault.
> >
> > Have you considered CUSE (character device in userspace, fs/fuse/cuse.c)?
>
> From looking at dev.c and cuse.c, CUSE doesn't seem to support mmap or
> a fault handler.

If performance is sufficient, this would be the preferred path. Enhance
an existing API which can be useful to others, rather than add a new one.

> >> +
> >> +struct kvm_vmem_make_pages_present {
> >> +	__u32 nr;
> >> +	struct kvm_vmem_page_range __user *ranges;
> >> +};
> >
> > This is madvise(MADV_WILLNEED), is it not?
>
> Another process, not the qemu process, issues it, and it makes the pages
> present in the qemu process address space.

That process just issues these calls in a loop until all memory is
present, yes? It seems those few lines could easily be added to qemu.

> >
> > Can you explain these in some more detail?
>
> KVM_CREATE_VMEM_DEV: create a vmem-dev device from the kvm device
>                      (for qemu)
> KVM_CREATE_VMEM: create a vmem device from the vmem-dev device
>                  (note: qemu creates more than one memory region)
>
> KVM_VMEM_WAIT_READY: wait for KVM_VMEM_READY (for qemu)
> KVM_VMEM_READY: unblock KVM_VMEM_WAIT_READY (for daemon use)
> These are for qemu and the daemon to synchronise entering the postcopy
> stage.

These are eliminated if we fold the daemon into qemu. Also, it could be
just a semaphore or other synchronization mechanism.

> KVM_VMEM_GET_PAGE_REQUEST: retrieve page faults of the qemu process

Equivalent to the fault callback of CUSE (if we add it)?

> KVM_VMEM_MARK_PAGE_CACHED: mark the specified pages as pulled from the
>                            source (for daemon use)

Equivalent to returning from that callback with a new page?

> KVM_VMEM_MAKE_PAGES_PRESENT: make the specified pages present in the qemu
>                              virtual address space (for daemon use)
> KVM_VMEM_MAKE_VMA_ANONYMOUS: make the specified vma in the qemu process
>                              anonymous. I'm not sure whether this can be
>                              implemented or not.
>
> I think the following work flow on the destination helps.
>
> qemu on the destination
>         |
>         V
>   open(/dev/kvm)
>         |
>         V
>   KVM_CREATE_VMEM_DEV
>         |
>         V
>   Here we have two file descriptors to
>   vmem device and shmem file
>         |
>         |                                  daemon on the destination
>         V
>   fork()---------------------------------------,
>         |                                      |
>         V                                      |
>   close(socket)                                V
>   close(shmem)                           mmap(shmem file)
>         |                                      |
>         V                                      V
>   mmap(vmem device) for guest RAM        close(shmem file)
>         |                                      |
>         V                                      |
>   KVM_VMEM_WAIT_READY <------------------ KVM_VMEM_READY
>         |                                      |
>         V                                      |
>   close(vmem device)                     Here the daemon takes over
>         |                                the ownership of the socket
>   entering post copy stage               to the source
>   start guest execution                        |
>         |                                      |
>         V                                      V
>   access guest RAM                       KVM_VMEM_GET_PAGE_REQUEST
>         |                                      |
>         V                                      V
>   page fault --------------------------> page offset is returned
>   block                                        |
>                                                V
>                                          pull page from the source
>                                          write the page contents
>                                          to the shmem
>                                                |
>                                                V
>   unblock <----------------------------- KVM_VMEM_MARK_PAGE_CACHED
>   the fault handler returns the page
>   page fault is resolved
>         |
>         |                                pages can be pulled
>         |                                in the background
>         |                                      |
>         |                                      V
>         |                                KVM_VMEM_MARK_PAGE_CACHED
>         |                                      |
>         V                                      V
>   the specified pages <----------------- KVM_VMEM_MAKE_PAGES_PRESENT
>   are made present                             |
>   so future page faults are avoided            |
>         |                                      |
>         V                                      V
>   all the pages are pulled from the source
>         |                                      |
>         V                                      V
>   the vma becomes anonymous <----------- KVM_VMEM_MAKE_VMA_ANONYMOUS
>   (note: I'm not sure if this can be implemented or not)
>         |                                      |
>         V                                      V
>   migration completes                    exit()

Yes, thanks, this was very helpful.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
end of thread, other threads:[~2011-08-16 13:41 UTC | newest]

Thread overview: 28+ messages:
2011-08-08  3:24 [Qemu-devel] [RFC] postcopy livemigration proposal Isaku Yamahata
2011-08-08  9:20 ` Dor Laor
2011-08-08  9:40 ` Yaniv Kaul
2011-08-08 21:42 ` Anthony Liguori
2011-08-08 10:59 ` Nadav Har'El
2011-08-08 11:47 ` Dor Laor
2011-08-08 16:52 ` Cleber Rosa
2011-08-08 15:52 ` Anthony Liguori
2011-08-08 12:32 ` Anthony Liguori
2011-08-08 15:11 ` Dor Laor
2011-08-08 15:29 ` Anthony Liguori
2011-08-08 15:36 ` Avi Kivity
2011-08-08 15:59 ` Anthony Liguori
2011-08-08 19:47 ` Dor Laor
2011-08-09  2:07 ` Isaku Yamahata
2011-08-08  9:38 ` Stefan Hajnoczi
2011-08-08  9:43 ` Isaku Yamahata
2011-08-08 12:38 ` Avi Kivity
2011-08-09  2:33 ` Isaku Yamahata
2011-08-10 13:55 ` Avi Kivity
2011-08-11  2:19 ` Isaku Yamahata
2011-08-11 16:55 ` Andrea Arcangeli
2011-08-12 11:07 ` [Qemu-devel] [PATCH][RFC] post copy chardevice (was Re: [RFC] postcopy livemigration proposal) Isaku Yamahata
2011-08-12 11:09 ` Isaku Yamahata
2011-08-12 21:26 ` Blue Swirl
2011-08-15 19:29 ` Avi Kivity
2011-08-16  1:42 ` Isaku Yamahata
2011-08-16 13:40 ` Avi Kivity