qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH v6 00/11] rdma: migration support
@ 2013-04-24 19:00 mrhines
  2013-04-24 21:50 ` Paolo Bonzini
  0 siblings, 1 reply; 11+ messages in thread
From: mrhines @ 2013-04-24 19:00 UTC (permalink / raw)
  To: quintela; +Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Please pull.

Changes since v5:

- Removed max_size hook.
- Waiting for Signed-Off bys....

Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git

Here is a brief summary of total migration time and downtime using RDMA:

Worst-case stress test over a 40 Gbps infiniband link, with an 8GB RAM
virtual machine, using the following command:

$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep

RESULTS:

1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.

EFFECTS of memory registration on bulk phase round:

For example, in the same 8GB RAM scenario, with all 8GB of memory in
active use but the VM itself otherwise idle, over the same 40 Gbps
infiniband link:

1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps

These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.

Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, even without this feature, all of
the memory will have already been registered in advance during the bulk
round, so it does not need to be re-registered during the successive
iteration rounds.
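
For reference, a minimal sketch of driving this from the monitor, using
the capability name and URI syntax of this series (the destination side
is started with "-incoming x-rdma:<host>:<port>"):

(qemu) migrate_set_capability x-rdma-pin-all on
(qemu) migrate -d x-rdma:<destination host>:<port>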

The following changes since commit f3aa844bbb2922a5b8393d17620eca7d7e921ab3:

  build: include config-{, all-}devices.mak after defining CONFIG_SOFTMMU and CONFIG_USER_ONLY (2013-04-24 12:18:41 -0500)

are available in the git repository at:

  git@github.com:hinesmr/qemu.git rdma_patch_v6

for you to fetch changes up to 75e6fac1f642885b93cefe6e1874d648e9850f8f:

  rdma: send pc.ram (2013-04-24 14:55:01 -0400)

----------------------------------------------------------------
Michael R. Hines (11):
      rdma: add documentation
      rdma: export yield_until_fd_readable()
      rdma: export throughput w/ MigrationStats QMP
      rdma: introduce qemu_file_mode_is_not_valid()
      rdma: export qemu_fflush()
      rdma: introduce  ram_handle_compressed()
      rdma: introduce qemu_ram_foreach_block()
      rdma: new QEMUFileOps hooks
      rdma: introduce capability x-rdma-pin-all
      rdma: core logic
      rdma: send pc.ram

 Makefile.objs                 |    1 +
 arch_init.c                   |   59 +-
 configure                     |   29 +
 docs/rdma.txt                 |  404 ++++++
 exec.c                        |    9 +
 hmp.c                         |    2 +
 include/block/coroutine.h     |    6 +
 include/exec/cpu-common.h     |    5 +
 include/migration/migration.h |   25 +
 include/migration/qemu-file.h |   30 +
 migration-rdma.c              | 2707 +++++++++++++++++++++++++++++++++++++++++
 migration.c                   |   27 +
 qapi-schema.json              |   12 +-
 qemu-coroutine-io.c           |   23 +
 savevm.c                      |  107 +-
 15 files changed, 3398 insertions(+), 48 deletions(-)
 create mode 100644 docs/rdma.txt
 create mode 100644 migration-rdma.c

-- 
1.7.10.4


* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-04-24 19:00 mrhines
@ 2013-04-24 21:50 ` Paolo Bonzini
  2013-04-24 23:48   ` Michael R. Hines
  0 siblings, 1 reply; 11+ messages in thread
From: Paolo Bonzini @ 2013-04-24 21:50 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, quintela, qemu-devel, owasserm, abali, mrhines, gokul

On 24/04/2013 21:00, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Changes since v5:
> 
> - Removed max_size hook.

The patches look good.  I will not be very available in the next few
days due to a public holiday here, but I believe that it's okay for 1.5.
 It's clearly marked as experimental, and the changes to the internals
are safe and ok.

The one small nit is that patch 11 should come before patch 10.  It can
be fixed by whoever applies the patch.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo

> 
> Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
> Github: git@github.com:hinesmr/qemu.git
> 
> Here is a brief summary of total migration time and downtime using RDMA:
> 
> Worst-case stress test over a 40 Gbps infiniband link, with an 8GB RAM
> virtual machine, using the following command:
> 
> $ apt-get install stress
> $ stress --vm-bytes 7500M --vm 1 --vm-keep
> 
> RESULTS:
> 
> 1. Migration throughput: 26 gigabits/second.
> 2. Downtime (stop time) varies between 15 and 100 milliseconds.
> 
> EFFECTS of memory registration on bulk phase round:
> 
> For example, in the same 8GB RAM scenario, with all 8GB of memory in 
> active use but the VM itself otherwise idle, over the same 40 Gbps 
> infiniband link:
> 
> 1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
> 2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
> 
> These numbers would of course scale up to whatever size virtual machine
> you have to migrate using RDMA.
> 
> Enabling this feature does *not* have any measurable effect on 
> migration *downtime*. This is because, even without this feature, all of 
> the memory will have already been registered in advance during the bulk 
> round, so it does not need to be re-registered during the successive 
> iteration rounds.
> 
> The following changes since commit f3aa844bbb2922a5b8393d17620eca7d7e921ab3:
> 
>   build: include config-{, all-}devices.mak after defining CONFIG_SOFTMMU and CONFIG_USER_ONLY (2013-04-24 12:18:41 -0500)
> 
> are available in the git repository at:
> 
>   git@github.com:hinesmr/qemu.git rdma_patch_v6
> 
> for you to fetch changes up to 75e6fac1f642885b93cefe6e1874d648e9850f8f:
> 
>   rdma: send pc.ram (2013-04-24 14:55:01 -0400)
> 
> ----------------------------------------------------------------
> Michael R. Hines (11):
>       rdma: add documentation
>       rdma: export yield_until_fd_readable()
>       rdma: export throughput w/ MigrationStats QMP
>       rdma: introduce qemu_file_mode_is_not_valid()
>       rdma: export qemu_fflush()
>       rdma: introduce  ram_handle_compressed()
>       rdma: introduce qemu_ram_foreach_block()
>       rdma: new QEMUFileOps hooks
>       rdma: introduce capability x-rdma-pin-all
>       rdma: core logic
>       rdma: send pc.ram
> 
>  Makefile.objs                 |    1 +
>  arch_init.c                   |   59 +-
>  configure                     |   29 +
>  docs/rdma.txt                 |  404 ++++++
>  exec.c                        |    9 +
>  hmp.c                         |    2 +
>  include/block/coroutine.h     |    6 +
>  include/exec/cpu-common.h     |    5 +
>  include/migration/migration.h |   25 +
>  include/migration/qemu-file.h |   30 +
>  migration-rdma.c              | 2707 +++++++++++++++++++++++++++++++++++++++++
>  migration.c                   |   27 +
>  qapi-schema.json              |   12 +-
>  qemu-coroutine-io.c           |   23 +
>  savevm.c                      |  107 +-
>  15 files changed, 3398 insertions(+), 48 deletions(-)
>  create mode 100644 docs/rdma.txt
>  create mode 100644 migration-rdma.c
> 


* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-04-24 21:50 ` Paolo Bonzini
@ 2013-04-24 23:48   ` Michael R. Hines
  0 siblings, 0 replies; 11+ messages in thread
From: Michael R. Hines @ 2013-04-24 23:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, quintela, qemu-devel, owasserm, abali, mrhines, gokul

On 04/24/2013 05:50 PM, Paolo Bonzini wrote:
> On 24/04/2013 21:00, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Changes since v5:
>>
>> - Removed max_size hook.
> The patches look good.  I will not be very available in the next few
> days due to a public holiday here, but I believe that it's okay for 1.5.
>   It's clearly marked as experimental, and the changes to the internals
> are safe and ok.
>
> The one small nit is that patch 11 should come before patch 10.  It can
> be fixed by whoever applies the patch.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Paolo

Acknowledged. Thank you.

- Michael


* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
@ 2013-05-03 23:28 Chegu Vinod
  2013-05-09 17:20 ` Michael R. Hines
  0 siblings, 1 reply; 11+ messages in thread
From: Chegu Vinod @ 2013-05-03 23:28 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin,
	qemu-devel qemu-devel, Orit Wasserman, Anthony Liguori,
	Paolo Bonzini



Hi Michael,

I picked up the qemu bits from your github branch and gave it a try.
(BTW, the setup I was given temporary access to has a pair of MLX IB QDR
cards connected back to back via QSFP cables.)

Observed a couple of things and wanted to share... perhaps you may be
aware of them already, or perhaps these are unrelated to your specific
changes? (Note: I still haven't finished the review of your changes.)

a) x-rdma-pin-all off case

Seems to work sometimes but fails at other times. Here is an example...

(qemu) rdma: Accepting rdma connection...
rdma: Memory pin all: disabled
rdma: verbs context after listen: 0x555556757d50
rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID: 
fe80::2:c903:9:5855
rdma: Accepted migration
qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent with
Host index 0x4ec: delta 0xffe6
qemu: warning: error while loading state for instance 0x0 of device 
'virtio-net'
load of migration failed


b) x-rdma-pin-all on case:

The guest is not resuming on the target host, i.e. the source host's
qemu states that migration is complete but the guest is not responsive
anymore... (it doesn't seem to have crashed, but it's stuck somewhere).
Have you seen this behavior before? Any tips on how I could extract
additional info?

Besides the list of noted restrictions/issues around having to pin all
of guest memory... if the pinning is done as part of starting the
migration, it ends up taking a noticeably long time for larger guests.
Wonder whether that should be counted as part of the total migration
time?

Also, the act of pinning all the memory seems to "freeze" the guest.
E.g.: for larger enterprise-sized guests (say 128GB and higher), the
guest is "frozen" for anywhere from nearly a minute (~50 seconds) to
multiple minutes as the guest size increases... which IMO kind of
defeats the purpose of live guest migration.

Would like to hear if you have already thought about any other
alternatives to address this issue? E.g., would it be better to pin all
of the guest's memory as part of starting the guest itself? Yes, there
are restrictions when we do pinning... but it can help with performance.
---
BTW, a different (yet sort of related) topic... recently a patch went
into upstream that provided an option to qemu to mlock all of guest
memory:

https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .

but when attempting to do the mlock for larger guests, a lot of time is
spent bringing each page into cache and clearing/zeroing it, etc.:

https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html


----

Note: The basic TCP-based live guest migration in the same qemu version
still works fine on the same hosts over a pair of non-RDMA 10Gb NICs
connected back-to-back.

Thanks
Vinod






* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-05-03 23:28 [Qemu-devel] [PATCH v6 00/11] rdma: migration support Chegu Vinod
@ 2013-05-09 17:20 ` Michael R. Hines
  2013-05-09 22:20   ` Chegu Vinod
  0 siblings, 1 reply; 11+ messages in thread
From: Michael R. Hines @ 2013-05-09 17:20 UTC (permalink / raw)
  To: Chegu Vinod
  Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira,
	qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
	Anthony Liguori, Paolo Bonzini


Comments inline. FYI: please CC mrhines@us.ibm.com,
because it helps me know when to scroll through the bazillion qemu-devel
emails.

I have things separated out into folders and rules, but a direct CC is 
better =)


On 05/03/2013 07:28 PM, Chegu Vinod wrote:
>
> Hi Michael,
>
> I picked up the qemu bits from your github branch and gave it a try.   
> (BTW the setup I was given temporary access to has a pair of MLX's  IB 
> QDR cards connected back to back via QSFP cables)
>
> Observed a couple of things and wanted to share..perhaps you may be 
> aware of them already or perhaps these are unrelated to your specific 
> changes ? (Note: Still haven't finished the review of your changes ).
>
> a) x-rdma-pin-all off case
>
> Seem to only work sometimes but fails at other times. Here is an 
> example...
>
> (qemu) rdma: Accepting rdma connection...
> rdma: Memory pin all: disabled
> rdma: verbs context after listen: 0x555556757d50
> rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID: 
> fe80::2:c903:9:5855
> rdma: Accepted migration
> qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent 
> with Host index 0x4ec: delta 0xffe6
> qemu: warning: error while loading state for instance 0x0 of device 
> 'virtio-net'
> load of migration failed
>

Can you give me more details about the configuration of your VM?

>
> b) x-rdma-pin-all on case :
>
> The guest is not resuming on the target host. i.e. the source host's 
> qemu states that migration is complete but the guest is not responsive 
> anymore... (doesn't seem to have crashed but its stuck somewhere).    
> Have you seen this behavior before ? Any tips on how I could extract 
> additional info ?

Is the QEMU monitor still responsive?
Can you capture a screenshot of the guest's console to see if there is a 
panic?
What kind of storage is attached to the VM?


>
> Besides the list of noted restrictions/issues around having to pin all 
> of guest memory....if the pinning is done as part of starting of the 
> migration it ends up taking noticeably long time for larger guests. 
> Wonder whether that should be counted as part of the total migration 
> time ?.
>

That's a good question: the pin-all option should not be slowing down 
your VM too much, as the VM should still be running before the 
migration_thread() actually kicks in and starts the migration.
I need more information on the configuration of your VM, guest operating 
system, architecture and so forth...
And, as before, whether QEMU is unresponsive or whether it's the guest 
that has panicked...

> Also the act of pinning all the memory seems to "freeze" the guest. 
> e.g. : For larger enterprise sized guests (say 128GB and higher) the 
> guest is "frozen" is anywhere from nearly a minute (~50seconds) to 
> multiple minutes as the guest size increases...which imo kind of 
> defeats the purpose of live guest migration.

That's bad =) There must be a bug somewhere........ the largest VM I can 
create on my hardware is ~16GB - so let me give that a try and try to 
track down the problem.

>
> Would like to hear if you have already thought about any other 
> alternatives to address this issue ? for e.g. would it be better to 
> pin all of the guest's memory as part of starting the guest itself ? 
> Yes there are restrictions when we do pinning...but it can help with 
> performance.

For such a large VM, I would definitely recommend pinning, because I'm 
assuming you have enough processors or a large enough application to 
actually *use* that much memory, which would suggest that even after the 
bulk phase round of the migration has completed, your VM is probably 
going to remain pretty busy.

It's just a matter of me tracking down what's causing the freeze and 
fixing it........ I'll look into it right now on my machine.

> ---
> BTW, a different (yet sort of related) topic... recently a patch went 
> into upstream that provided an option to qemu to mlock all of guest 
> memory :
>
> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .

I had no idea.......very interesting.

>
> but when attempting to do the mlock for larger guests a lot of time is 
> spent bringing each page into cache and clearing/zeron'g it etc.etc.
>
> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html
>

Wow, I didn't know that either. Perhaps this is what's causing the 
entire QEMU process and its threads to seize up.

It may be necessary to run the pinning command *outside* of QEMU's I/O 
lock in a separate thread if it's really that much overhead.
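
A rough sketch of what I have in mind (illustrative only;
register_all_guest_ram() is a placeholder, not an existing function):

/* Do the expensive pin/registration work without holding QEMU's
 * I/O lock (the BQL), so the vCPU threads are not frozen while
 * every page is touched and pinned. */
qemu_mutex_unlock_iothread();
register_all_guest_ram();   /* placeholder: ibv_reg_mr() per RAMBlock, or mlock() */
qemu_mutex_lock_iothread();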

Thanks a lot for pointing this out.........



>
> ----
>
> Note: The basic tcp based live guest migration in the same qemu 
> version still works fine on the same hosts over a pair of non-RDMA 
> cards 10Gb NICs connected back-to-back.
>

Acknowledged.




* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-05-09 17:20 ` Michael R. Hines
@ 2013-05-09 22:20   ` Chegu Vinod
  2013-05-09 22:45     ` Michael R. Hines
  2013-05-10  7:58     ` Paolo Bonzini
  0 siblings, 2 replies; 11+ messages in thread
From: Chegu Vinod @ 2013-05-09 22:20 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira,
	qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
	Anthony Liguori, Paolo Bonzini


On 5/9/2013 10:20 AM, Michael R. Hines wrote:
> Comments inline. FYI: please CC mrhines@us.ibm.com,
> because it helps me know when to scroll through the bazillion qemu-devel 
> emails.
>
> I have things separated out into folders and rules, but a direct CC is 
> better =)
>

Sure will do.

>
> On 05/03/2013 07:28 PM, Chegu Vinod wrote:
>>
>> Hi Michael,
>>
>> I picked up the qemu bits from your github branch and gave it a 
>> try.   (BTW the setup I was given temporary access to has a pair of 
>> MLX's  IB QDR cards connected back to back via QSFP cables)
>>
>> Observed a couple of things and wanted to share..perhaps you may be 
>> aware of them already or perhaps these are unrelated to your specific 
>> changes ? (Note: Still haven't finished the review of your changes ).
>>
>> a) x-rdma-pin-all off case
>>
>> Seem to only work sometimes but fails at other times. Here is an 
>> example...
>>
>> (qemu) rdma: Accepting rdma connection...
>> rdma: Memory pin all: disabled
>> rdma: verbs context after listen: 0x555556757d50
>> rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID: 
>> fe80::2:c903:9:5855
>> rdma: Accepted migration
>> qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent 
>> with Host index 0x4ec: delta 0xffe6
>> qemu: warning: error while loading state for instance 0x0 of device 
>> 'virtio-net'
>> load of migration failed
>>
>
> Can you give me more details about the configuration of your VM?

The guest is a 10-VCPU/128GB ...and nothing really that fancy with 
respect to storage or networking.

Hosted on a large Westmere-EX box (the target is a similarly configured 
Westmere-X system). There is a shared SAN disk between the two hosts.  
Both hosts have the 3.9-rc7 kernel that I got at that time from the 
kvm.git tree. The guest was also running the same kernel.

Since I was just trying it out I was not running any workload either.

On the source host, the qemu command line:


/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-name vm1 \
-m 131072 -smp 10,sockets=1,cores=10,threads=1 \
-mem-path /dev/hugepages \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive file=/dev/libvirt_lvm3/vm1,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,model=virtio,macaddr=52:54:00:71:01:01,netdev=nic-0 \
-netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \
-vnc :4


On the destination host, the command line was the same as the above with
the following additional arg:

-incoming x-rdma:<static private ipaddr of the IB>:<port #>


>
>>
>> b) x-rdma-pin-all on case :
>>
>> The guest is not resuming on the target host. i.e. the source host's 
>> qemu states that migration is complete but the guest is not 
>> responsive anymore... (doesn't seem to have crashed but its stuck 
>> somewhere).    Have you seen this behavior before ? Any tips on how I 
>> could extract additional info ?
>
> Is the QEMU monitor still responsive?

They were responsive.

> Can you capture a screenshot of the guest's console to see if there is 
> a panic?

No panic on the guest's console :(

> What kind of storage is attached to the VM?
>

Simple virtio disk hosted on a SAN disk (see the qemu command line).

>
>>
>> Besides the list of noted restrictions/issues around having to pin 
>> all of guest memory....if the pinning is done as part of starting of 
>> the migration it ends up taking noticeably long time for larger 
>> guests. Wonder whether that should be counted as part of the total 
>> migration time ?.
>>
>
> That's a good question: The pin-all option should not be slowing down 
> your VM to much as the VM should still be running before the 
> migration_thread() actually kicks in and starts the migration.

Well, I had hoped that it would not have any serious impact, but it 
ended up freezing the guest...



> I need more information on the configuration of your VM, guest 
> operating system, architecture and so forth.......

Pl. see above.

> And similarly as before whether or not QEMU is not responsive or 
> whether or not it's the guest that's panicked.......

The guest just freezes... it doesn't panic while the pinning is in 
progress (i.e. after I set the capability and start the migration). 
After the pinning completes, the guest continues to run and the 
migration continues... till it "completes" (as per the source host's 
qemu)... but I never see it resume on the target host.
>
>> Also the act of pinning all the memory seems to "freeze" the guest. 
>> e.g. : For larger enterprise sized guests (say 128GB and higher) the 
>> guest is "frozen" is anywhere from nearly a minute (~50seconds) to 
>> multiple minutes as the guest size increases...which imo kind of 
>> defeats the purpose of live guest migration.
>
> That's bad =) There must be a bug somewhere........ the largest VM I 
> can create on my hardware is ~16GB - so let me give that a try and try 
> to track down the problem.

OK. Perhaps running a simple test inside the guest can help observe any 
scheduling delays even when you are attempting to pin a 16GB guest?

>
>>
>> Would like to hear if you have already thought about any other 
>> alternatives to address this issue ? for e.g. would it be better to 
>> pin all of the guest's memory as part of starting the guest itself ? 
>> Yes there are restrictions when we do pinning...but it can help with 
>> performance.
>
> For such a large VM, I would definitely recommend pinning because I'm 
> assuming you have enough processors or a large enough application to 
> actually *use* that much memory, which would suggest that even after 
> the bulk phase round of the migration has already completed that your 
> VM is probably going to remain to be pretty busy.
>
> It's just a matter of me tracking down what's causing the freeze and 
> fixing it........ I'll look into it right now on my machine.
>

Ok
>> ---
>> BTW, a different (yet sort of related) topic... recently a patch went 
>> into upstream that provided an option to qemu to mlock all of guest 
>> memory :
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .
>
> I had no idea.......very interesting.
>
>>
>> but when attempting to do the mlock for larger guests a lot of time 
>> is spent bringing each page into cache and clearing/zeron'g it etc.etc.
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html
>>
>
> Wow, I didn't know that either. Perhaps this must be causing the 
> entire QEMU process and its threads to seize up.
>
> It may be necessary to run the pinning command *outside* of QEMU's I/O 
> lock in a separate thread if it's really that much overhead.

Not really sure if the BQL is causing the freeze... but in general, 
pinning all memory when the guest is running is perhaps not the best 
choice for large enterprise-class guests, i.e. it's better to do it as 
part of starting the guest.

>
> Thanks a lot for pointing this out.........
>
>

BTW, a good thing to try out is to see if we can mlock the memory of a 
large guest (i.e. on the source and target qemu's) and migrate the guest 
using basic TCP over a regular 10Gig NIC.

Thanks,
Vinod
>
>>
>> ----
>>
>> Note: The basic tcp based live guest migration in the same qemu 
>> version still works fine on the same hosts over a pair of non-RDMA 
>> cards 10Gb NICs connected back-to-back.
>>
>
> Acknowledged.
>




* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-05-09 22:20   ` Chegu Vinod
@ 2013-05-09 22:45     ` Michael R. Hines
  2013-06-02  4:09       ` Michael R. Hines
  2013-05-10  7:58     ` Paolo Bonzini
  1 sibling, 1 reply; 11+ messages in thread
From: Michael R. Hines @ 2013-05-09 22:45 UTC (permalink / raw)
  To: Chegu Vinod
  Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin,
	qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
	Anthony Liguori, Paolo Bonzini


Some more followup questions below to help me debug before I start 
digging in.......

On 05/09/2013 06:20 PM, Chegu Vinod wrote:

Setting aside the mlock() freezes for the moment, let's first fix your 
crashing problem on the destination side; let's make that a priority 
before we fix the mlock problem.

When the migration "completes", can you provide me with more detailed 
information
about the state of QEMU on the destination?

Is it responding?
What's on the VNC console?
Is QEMU responding?
Is the network responding?
Was the VM idle? Or running an application?
Can you attach GDB to QEMU after the migration?


> /usr/local/bin/qemu-system-x86_64 \
> -enable-kvm \
> -cpu host \
> -name vm1 \
> -m 131072 -smp 10,sockets=1,cores=10,threads=1 \
> -mem-path /dev/hugepages \

Can you disable hugepages and re-test?

I'll get back to the other mlock() issues later after we at least first 
make sure the migration itself is working.....


* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-05-09 22:20   ` Chegu Vinod
  2013-05-09 22:45     ` Michael R. Hines
@ 2013-05-10  7:58     ` Paolo Bonzini
  1 sibling, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2013-05-10  7:58 UTC (permalink / raw)
  To: Chegu Vinod
  Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira,
	qemu-devel qemu-devel, Michael R. Hines, Orit Wasserman,
	Michael R. Hines, Anthony Liguori

On 10/05/2013 00:20, Chegu Vinod wrote:
>>
>> Wow, I didn't know that either. Perhaps this must be causing the
>> entire QEMU process and its threads to seize up.
>>
>> It may be necessary to run the pinning command *outside* of QEMU's I/O
>> lock in a separate thread if it's really that much overhead.
> 
> Not really sure if the BQL is causing the freeze...but in general
> pinning of all memory when the guest is run is perhaps not the best
> choice for large enterprise class guests...i.e. its better to do it as
> part of the start of the guest.

If pinning is done in the setup phase, it should run outside the BQL.

Paolo


* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-05-09 22:45     ` Michael R. Hines
@ 2013-06-02  4:09       ` Michael R. Hines
  2013-06-06 23:51         ` Chegu Vinod
  0 siblings, 1 reply; 11+ messages in thread
From: Michael R. Hines @ 2013-06-02  4:09 UTC (permalink / raw)
  To: Chegu Vinod
  Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin,
	qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
	Anthony Liguori, Paolo Bonzini

All,

I have successfully performed 1000+ back-to-back RDMA migrations, 
automatically looped *in a row*, using a heavy-weight memory-stress 
benchmark here at IBM.
Migration success is verified by capturing the actual serial console 
output of the virtual machine while the benchmark is running and 
redirecting each migration's output to a file, to check that the output 
matches the expected output of a successful migration. For half of the 
1000 migrations I used a 14GB virtual machine (the largest VM I can 
create), and for the remaining 500 migrations a 2GB virtual machine (to 
make sure I was testing both 32-bit and 64-bit address boundaries). The 
benchmark is configured for 75% stores and 25% loads, and to use 80% of 
the allocatable free memory of the VM (i.e. no swapping allowed).

I have defined a successful migration per the output file as follows:

1. The memory benchmark is still running and active (CPU near 100% and 
memory usage is high)
2. There are no kernel panics in the console output (regex keywords 
"panic", "BUG", "oom", etc...)
3. The VM is still responding to network activity (pings)
4. The console is still responsive, verified by printing periodic 
messages to the console from inside the VM throughout the life of the 
VM, using the 'write' command in an infinite loop. (A rough sketch of 
these checks appears after this list.)
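
As a rough illustration only (not the actual harness; console.log and
<guest-ip> are placeholders), checks 2 and 3 amount to something like:

$ grep -Ei 'panic|BUG|oom' console.log && echo "FAIL: kernel trouble on console"
$ ping -c 3 <guest-ip> > /dev/null || echo "FAIL: guest not answering pings"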

With this method in a loop, I believe I've ironed out all the 
regression-testing bugs that I can find. You all may find the following 
bugs interesting. The original version of this patch was written in 2010 
(Before my time @ IBM).

Bug #1: In the original 2010 patch, each write operation used the same 
"identifier" (a "Work Request ID" in infiniband terminology).
This is not typical (but is allowed by the hardware) - instead, each 
operation should have its own unique identifier so that the write 
operation can be tracked properly as it completes.

Bug #2: Also in the original 2010 patch, write operations were grouped 
into separate "signaled" and "unsignaled" work requests, which is also 
not typical (but allowed by the hardware). "Signalling" is infiniband 
terminology which means to activate/deactivate notifying the sender 
whether or not the RDMA operation has already completed. (Note: the 
receiver is never notified - which is what a DMA is supposed to be). In 
normal operation per infiniband specifications, "unsignaled" operations 
(which indicate to the hardware *not* to notify the sender of 
completion) are *supposed* to be paired simultaneously with a signaled 
operation using the *same* work request identifier. Instead, the 
original patch was using *different* work requests for 
signaled/unsignaled writes, which means that most of the writes would be 
transmitted without ever being tracked for completion whatsoever. (Per 
infiniband specifications, signaled and unsignaled writes must be grouped 
together because the hardware ensures that completion notification is 
not given until *all* of the writes of the same request have actually 
completed).
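
To make that concrete, here is an illustrative sketch only (not the
actual patch code; for simplicity it signals every write so each one can
be tracked, whereas the real code may batch unsignaled writes under a
signaled one as described above):

#include <infiniband/verbs.h>

static uint64_t next_wr_id = 1;

static int post_tracked_write(struct ibv_qp *qp, struct ibv_sge *sge,
                              uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr = { 0 };
    struct ibv_send_wr *bad_wr = NULL;

    wr.wr_id      = next_wr_id++;       /* unique identifier (bug #1) */
    wr.opcode     = IBV_WR_RDMA_WRITE;
    wr.send_flags = IBV_SEND_SIGNALED;  /* request a completion so this
                                           write can be tracked (bug #2) */
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Completions are later matched back by wc.wr_id via ibv_poll_cq(). */
    return ibv_post_send(qp, &wr, &bad_wr);
}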

Bug #3: Finally, in the original 2010 patch, ordering was not being 
handled. Per infiniband specifications, writes can happen completely out 
of order. Not only that, but PCI-express itself can change the order of 
the writes as well. It was only after the first 2 bugs were fixed that 
I could actually manifest this bug *in code*: what was happening 
was that a very large group of requests would "burst" from the QEMU 
migration thread. At which point, not all of the requests would finish. 
Then a short time later, the next iteration would start and the virtual 
machine's writable working set was still "hovering" somewhere in the 
same vicinity of the address space as the previous burst of writes that 
had not yet completed. When this happens, the new writes were much 
smaller (not a part of a larger "chunk" per our algorithms). Since the 
new writes were smaller they would complete faster than the larger, 
older writes in the same address range. Since they complete out of 
order, the newer writes would then get clobbered by the older writes - 
resulting in an inconsistent virtual machine. So, to solve this: during 
each new write, we now do a "search" to see if the address of the next 
requested write matches or overlaps with the address range of any of the 
previous "outstanding" writes that were still in transit, and I found 
several hits. This was easily solved by blocking until the conflicting 
write has completed before proceeding to issue a new write to the hardware.
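
Again as an illustrative sketch only (invented names, not the patch
code), the overlap check amounts to something like:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t remote_addr;
    uint64_t len;
    bool     in_flight;
} OutstandingWrite;

static bool ranges_overlap(uint64_t a, uint64_t alen,
                           uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Before posting a new write, block until every in-flight write whose
 * remote address range overlaps the new one has completed, so an older
 * (larger, slower) write can never clobber newer data. */
static void wait_for_conflicting_writes(OutstandingWrite *ws, int nb,
                                        uint64_t addr, uint64_t len)
{
    for (int i = 0; i < nb; i++) {
        while (ws[i].in_flight &&
               ranges_overlap(addr, len, ws[i].remote_addr, ws[i].len)) {
            /* placeholder: ibv_poll_cq() and clear in_flight by wr_id */
            poll_one_completion(ws, nb);
        }
    }
}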

- Michael


On 05/09/2013 06:45 PM, Michael R. Hines wrote:
>
> Some more followup questions below to help me debug before I start 
> digging in.......
>
> On 05/09/2013 06:20 PM, Chegu Vinod wrote:
>
> Setting aside the mlock() freezes for the moment, let's first fix your 
> crashing
> problem on the destination-side. Let's make that a priority before we fix
> the mlock problem.
>
> When the migration "completes", can you provide me with more detailed 
> information
> about the state of QEMU on the destination?
>
> Is it responding?
> What's on the VNC console?
> Is QEMU responding?
> Is the network responding?
> Was the VM idle? Or running an application?
> Can you attach GDB to QEMU after the migration?
>
>
>> /usr/local/bin/qemu-system-x86_64 \
>> -enable-kvm \
>> -cpu host \
>> -name vm1 \
>> -m 131072 -smp 10,sockets=1,cores=10,threads=1 \
>> -mem-path /dev/hugepages \
>
> Can you disable hugepages and re-test?
>
> I'll get back to the other mlock() issues later after we at least 
> first make sure the migration itself is working.....


* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-06-02  4:09       ` Michael R. Hines
@ 2013-06-06 23:51         ` Chegu Vinod
  2013-06-07  5:38           ` Michael R. Hines
  0 siblings, 1 reply; 11+ messages in thread
From: Chegu Vinod @ 2013-06-06 23:51 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin,
	qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
	Anthony Liguori, Paolo Bonzini

On 6/1/2013 9:09 PM, Michael R. Hines wrote:
> All,
>
> I have successfully performed over 1000+ back-to-back RDMA migrations 
> automatically looped *in a row* using a heavy-weight memory-stress 
> benchmark here at IBM.
> Migration success is done by capturing the actual serial console 
> output of the virtual machine while the benchmark is running and 
> redirecting each migration output to a file to verify that the output 
> matches the expected output of a successful migration. For half of the 
> 1000 migrations, I used a 14GB virtual machine size (largest VM I can 
> create) and the remaining 500 migrations I used a 2GB virtual machine 
> (to make sure I was testing both 32-bit and 64-bit address 
> boundaries). The benchmark is configured to have 75% stores and 25% 
> loads and is configured to use 80% of the allocatable free memory of 
> the VM (i.e. no swapping allowed).
>
> I have defined a successful migration per the output file as follows:
>
> 1. The memory benchmark is still running and active (CPU near 100% and 
> memory usage is high)
> 2. There are no kernel panics in the console output (regex keywords 
> "panic", "BUG", "oom", etc...)
> 3. The VM is still responding to network activity (pings)
> 4. The console is still responsive by printing periodic messages 
> throughout the life of the VM to the console from inside the VM using 
> the 'write' command in infinite loop.
>
> With this method in a loop, I believe I've ironed out all the 
> regression-testing bugs that I can find. You all may find the 
> following bugs interesting. The original version of this patch was 
> written in 2010 (Before my time @ IBM).
>
> Bug #1: In the original 2010 patch, each write operation uses the same 
> "identifier". (A "Work Request ID" in infiniband terminology).
> This is not typical (but allowed by the hardware) - and instead each 
> operation should have its own unique identifier so that the write 
> operation can be tracked properly as it completes.
>
> Bug #2: Also in the original 2010 patch, write operations were grouped 
> into separate "signaled" and "unsignaled" work requests, which is also 
> not typical (but allowed by the hardware). "Signalling" is infiniband 
> terminology which means to activate/deactivate notifying the sender 
> whether or not the RDMA operation has already completed. (Note: the 
> receiver is never notified - which is what a DMA is supposed to be). 
> In normal operation per infiniband specifications, "unsignaled" 
> operations (which indicate to the hardware *not* to notify the sender 
> of completion) are *supposed* to be paired simultaneously with a 
> signaled operation using the *same* work request identifier. Instead, 
> the original patch was using *different* work requests for 
> signaled/unsignaled writes, which means that most of the writes would 
> be transmitted without ever being tracked for completion whatsoever. 
> (Per infiniband specifications, signaled and unsignaled writes must be 
> grouped together because the hardware ensures that completion 
> notification is not given until *all* of the writes of the same 
> request have actually completed).
>
> Bug #3: Finally, in the original 2010 patch, ordering was not being 
> handled. Per infiniband specifications, writes can happen completely 
> out of order. Not only that, but PCI-express itself can change the 
> order of the writes as well. It was only until after the first 2 bugs 
> were fixed that I could actually manifest this bug *in code*: What was 
> happening was that a very large group of requests would "burst" from 
> the QEMU migration thread. At which point, not all of the requests 
> would finish. Then a short time later, the next iteration would start 
> and the virtual machine's writable working set was still "hovering" 
> somewhere in the same vicinity of the address space as the previous 
> burst of writes that had not yet completed. When this happens, the new 
> writes were much smaller (not a part of a larger "chunk" per our 
> algorithms). Since the new writes were smaller they would complete 
> faster than the larger, older writes in the same address range. Since 
> they complete out of order, the newer writes would then get clobbered 
> by the older writes - resulting in an inconsistent virtual machine. 
> So, to solve this: during each new write, we now do a "search" to see 
> if the address of the next requested write matches or overlaps with 
> the address range of any of the previous "outstanding" writes that 
> were still in transit, and I found several hits. This was easily 
> solved by blocking until the conflicting write has completed before 
> proceeding to issue a new write to the hardware.
>
> - Michael
>
>
Hi Michael,

Got some limited time on the systems so gave your latest bits a quick 
try today (with the default no pinning) and it seems to be better than 
before.

Ran a Java warehouse workload where the guest was 85-90% busy...

For both cases
(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:<ip>:<port>

...

20VCPU/256G guest

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes

----

40VCPU/512G guest         <- I had more warehouse threads with higher 
heap size etc. to make the guest busy...and hence it seems to have taken 
a while to converge.

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes


<..>


* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
  2013-06-06 23:51         ` Chegu Vinod
@ 2013-06-07  5:38           ` Michael R. Hines
  0 siblings, 0 replies; 11+ messages in thread
From: Michael R. Hines @ 2013-06-07  5:38 UTC (permalink / raw)
  To: Chegu Vinod
  Cc: Karen Noel, Juan Jose Quintela Carreira, qemu-devel qemu-devel,
	Orit Wasserman, Michael R. Hines, Anthony Liguori, Paolo Bonzini


OK, this looks excellent. I think we're ready for a PULL request now - I 
will submit to Juan - I've already got a signed-off from Paolo & Eric.
I think we've got sufficient testing now.

I'm expecting to get access to a big machine in the next week or two 
(256G machine) - and I should be able to reproduce your mlock() delay 
issue at that time.
With such a big VM - I think pinning will help you significantly.

Stay tuned,

- Michael

On 06/06/2013 07:51 PM, Chegu Vinod wrote:
>
>>
> Hi Michael,
>
> Got some limited time on the systems so gave your latest bits a quick 
> try today (with the default no pinning) and it seems to be better than 
> before.
>
> Ran a Java warehouse workload where the guest was 85-90% busy...
>
> For both cases
> (qemu) migrate_set_speed 40G
> (qemu) migrate_set_downtime 2
> (qemu) migrate -d x-rdma:<ip>:<port>
>
> ...
>
> 20VCPU/256G guest
>
> (qemu) info migrate
> capabilities: xbzrle: off x-rdma-pin-all: off
> Migration status: completed
> total time: 106994 milliseconds
> downtime: 3795 milliseconds
> transferred ram: 15425453 kbytes
> throughput: 20418.27 mbps
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64707112 pages
> skipped: 0 pages
> normal: 3839625 pages
> normal bytes: 15358500 kbytes
>
> ----
>
> 40VCPU/512G guest         <- I had more warehouse threads with higher 
> heap size etc. to make the guest busy...and hence it seems to have 
> taken a while to converge.
>
> (qemu) info migrate
> capabilities: xbzrle: off x-rdma-pin-all: off
> Migration status: completed
> total time: 2470056 milliseconds
> downtime: 6254 milliseconds
> transferred ram: 3230142002 kbytes
> throughput: 22118.67 mbps
> remaining ram: 0 kbytes
> total ram: 536879680 kbytes
> duplicate: 127436402 pages
> skipped: 0 pages
> normal: 807307274 pages
> normal bytes: 3229229096 kbytes
>
>
> <..>
>

