* [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: mrhines @ 2013-04-24 19:00 UTC (permalink / raw)
To: quintela; +Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini
From: "Michael R. Hines" <mrhines@us.ibm.com>
Please pull.
Changes since v5:
- Removed max_size hook.
- Waiting for Signed-off-bys.
Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git
Here is a brief summary of total migration time and downtime using RDMA:
Worst-case stress test over a 40 Gbps InfiniBand link with an 8GB RAM
virtual machine, generated with the following commands:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep
RESULTS:
1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.
EFFECTS of memory registration on the bulk phase round:
Using the same 8GB VM with all 8GB of memory in active use, the VM
itself otherwise completely idle, over the same 40 Gbps InfiniBand link:
1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.
Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, even without this feature, all of
the memory will already have been registered in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.
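For reference, the capability can be toggled from the QEMU monitor before
starting the migration; a minimal sketch, assuming the usual
migrate_set_capability HMP command (the host address and port below are
placeholders — the x-rdma: URI form matches the -incoming example later
in this thread):

```
(qemu) migrate_set_capability x-rdma-pin-all on
(qemu) migrate -d x-rdma:192.168.0.2:4444
```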
The following changes since commit f3aa844bbb2922a5b8393d17620eca7d7e921ab3:
build: include config-{, all-}devices.mak after defining CONFIG_SOFTMMU and CONFIG_USER_ONLY (2013-04-24 12:18:41 -0500)
are available in the git repository at:
git@github.com:hinesmr/qemu.git rdma_patch_v6
for you to fetch changes up to 75e6fac1f642885b93cefe6e1874d648e9850f8f:
rdma: send pc.ram (2013-04-24 14:55:01 -0400)
----------------------------------------------------------------
Michael R. Hines (11):
rdma: add documentation
rdma: export yield_until_fd_readable()
rdma: export throughput w/ MigrationStats QMP
rdma: introduce qemu_file_mode_is_not_valid()
rdma: export qemu_fflush()
rdma: introduce ram_handle_compressed()
rdma: introduce qemu_ram_foreach_block()
rdma: new QEMUFileOps hooks
rdma: introduce capability x-rdma-pin-all
rdma: core logic
rdma: send pc.ram
Makefile.objs | 1 +
arch_init.c | 59 +-
configure | 29 +
docs/rdma.txt | 404 ++++++
exec.c | 9 +
hmp.c | 2 +
include/block/coroutine.h | 6 +
include/exec/cpu-common.h | 5 +
include/migration/migration.h | 25 +
include/migration/qemu-file.h | 30 +
migration-rdma.c | 2707 +++++++++++++++++++++++++++++++++++++++++
migration.c | 27 +
qapi-schema.json | 12 +-
qemu-coroutine-io.c | 23 +
savevm.c | 107 +-
15 files changed, 3398 insertions(+), 48 deletions(-)
create mode 100644 docs/rdma.txt
create mode 100644 migration-rdma.c
--
1.7.10.4
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Paolo Bonzini @ 2013-04-24 21:50 UTC (permalink / raw)
To: mrhines; +Cc: aliguori, quintela, qemu-devel, owasserm, abali, mrhines, gokul
On 24/04/2013 21:00, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Changes since v5:
>
> - Removed max_size hook.
The patches look good. I will not be very available in the next few
days due to a public holiday here, but I believe that it's okay for 1.5.
It's clearly marked as experimental, and the changes to the internals
are safe and ok.
The one small nit is that patch 11 should come before patch 10. It can
be fixed by whoever applies the patch.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo
>
> Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
> Github: git@github.com:hinesmr/qemu.git
>
> Here is a brief summary of total migration time and downtime using RDMA:
>
> Using a 40gbps infiniband link performing a worst-case stress test,
> using an 8GB RAM virtual machine:
> Using the following command:
>
> $ apt-get install stress
> $ stress --vm-bytes 7500M --vm 1 --vm-keep
>
> RESULTS:
>
> 1. Migration throughput: 26 gigabits/second.
> 2. Downtime (stop time) varies between 15 and 100 milliseconds.
>
> EFFECTS of memory registration on bulk phase round:
>
> For example, in the same 8GB RAM example with all 8GB of memory in
> active use and the VM itself is completely idle using the same 40 gbps
> infiniband link:
>
> 1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
> 2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
>
> These numbers would of course scale up to whatever size virtual machine
> you have to migrate using RDMA.
>
> Enabling this feature does *not* have any measurable effect on
> migration *downtime*. This is because, even without this feature, all of
> the memory will already have been registered in advance during
> the bulk round and does not need to be re-registered during the successive
> iteration rounds.
>
> The following changes since commit f3aa844bbb2922a5b8393d17620eca7d7e921ab3:
>
> build: include config-{, all-}devices.mak after defining CONFIG_SOFTMMU and CONFIG_USER_ONLY (2013-04-24 12:18:41 -0500)
>
> are available in the git repository at:
>
> git@github.com:hinesmr/qemu.git rdma_patch_v6
>
> for you to fetch changes up to 75e6fac1f642885b93cefe6e1874d648e9850f8f:
>
> rdma: send pc.ram (2013-04-24 14:55:01 -0400)
>
> ----------------------------------------------------------------
> Michael R. Hines (11):
> rdma: add documentation
> rdma: export yield_until_fd_readable()
> rdma: export throughput w/ MigrationStats QMP
> rdma: introduce qemu_file_mode_is_not_valid()
> rdma: export qemu_fflush()
> rdma: introduce ram_handle_compressed()
> rdma: introduce qemu_ram_foreach_block()
> rdma: new QEMUFileOps hooks
> rdma: introduce capability x-rdma-pin-all
> rdma: core logic
> rdma: send pc.ram
>
> Makefile.objs | 1 +
> arch_init.c | 59 +-
> configure | 29 +
> docs/rdma.txt | 404 ++++++
> exec.c | 9 +
> hmp.c | 2 +
> include/block/coroutine.h | 6 +
> include/exec/cpu-common.h | 5 +
> include/migration/migration.h | 25 +
> include/migration/qemu-file.h | 30 +
> migration-rdma.c | 2707 +++++++++++++++++++++++++++++++++++++++++
> migration.c | 27 +
> qapi-schema.json | 12 +-
> qemu-coroutine-io.c | 23 +
> savevm.c | 107 +-
> 15 files changed, 3398 insertions(+), 48 deletions(-)
> create mode 100644 docs/rdma.txt
> create mode 100644 migration-rdma.c
>
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Michael R. Hines @ 2013-04-24 23:48 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, quintela, qemu-devel, owasserm, abali, mrhines, gokul
On 04/24/2013 05:50 PM, Paolo Bonzini wrote:
> On 24/04/2013 21:00, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Changes since v5:
>>
>> - Removed max_size hook.
> The patches look good. I will not be very available in the next few
> days due to a public holiday here, but I believe that it's okay for 1.5.
> It's clearly marked as experimental, and the changes to the internals
> are safe and ok.
>
> The one small nit is that patch 11 should come before patch 10. It can
> be fixed by whoever applies the patch.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Paolo
Acknowledged. Thank you.
- Michael
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Chegu Vinod @ 2013-05-03 23:28 UTC (permalink / raw)
To: Michael R. Hines
Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin,
qemu-devel qemu-devel, Orit Wasserman, Anthony Liguori,
Paolo Bonzini
Hi Michael,
I picked up the qemu bits from your github branch and gave it a try.
(BTW the setup I was given temporary access to has a pair of MLX's IB
QDR cards connected back to back via QSFP cables)
Observed a couple of things and wanted to share. Perhaps you are
already aware of them, or perhaps they are unrelated to your specific
changes? (Note: I still haven't finished reviewing your changes.)
a) x-rdma-pin-all off case
It seems to work sometimes but fails at other times. Here is an example...
(qemu) rdma: Accepting rdma connection...
rdma: Memory pin all: disabled
rdma: verbs context after listen: 0x555556757d50
rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID:
fe80::2:c903:9:5855
rdma: Accepted migration
qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent with Host index 0x4ec: delta 0xffe6
qemu: warning: error while loading state for instance 0x0 of device
'virtio-net'
load of migration failed
b) x-rdma-pin-all on case :
The guest is not resuming on the target host. i.e. the source host's
qemu states that migration is complete but the guest is not responsive
anymore... (doesn't seem to have crashed but its stuck somewhere).
Have you seen this behavior before ? Any tips on how I could extract
additional info ?
Besides the noted restrictions/issues around having to pin all
of guest memory: if the pinning is done as part of starting the
migration, it ends up taking a noticeably long time for larger guests.
I wonder whether that should be counted as part of the total migration
time?
Also, the act of pinning all the memory seems to "freeze" the guest.
E.g., for larger enterprise-sized guests (say 128GB and higher), the
guest is "frozen" for anywhere from nearly a minute (~50 seconds) to
multiple minutes as the guest size increases... which IMO kind of defeats
the purpose of live guest migration.
I would like to hear if you have already thought about any other
alternatives to address this issue. E.g., would it be better to pin
all of the guest's memory as part of starting the guest itself? Yes,
there are restrictions when we do pinning... but it can help with
performance.
---
BTW, a different (yet sort of related) topic... recently a patch went
into upstream that provided an option to qemu to mlock all of guest
memory :
https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .
but when attempting to do the mlock for larger guests, a lot of time is
spent bringing each page into cache and clearing/zeroing it, etc.
https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html
----
Note: Basic TCP-based live guest migration in the same qemu version
still works fine on the same hosts over a pair of non-RDMA 10Gb
NICs connected back-to-back.
Thanks
Vinod
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Michael R. Hines @ 2013-05-09 17:20 UTC (permalink / raw)
To: Chegu Vinod
Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira,
qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
Anthony Liguori, Paolo Bonzini
Comments inline. FYI: please CC mrhines@us.ibm.com,
because it helps me know when to scroll through the bazillion qemu-devel
emails.
I have things separated out into folders and rules, but a direct CC is
better =)
On 05/03/2013 07:28 PM, Chegu Vinod wrote:
>
> Hi Michael,
>
> I picked up the qemu bits from your github branch and gave it a try.
> (BTW the setup I was given temporary access to has a pair of MLX's IB
> QDR cards connected back to back via QSFP cables)
>
> Observed a couple of things and wanted to share..perhaps you may be
> aware of them already or perhaps these are unrelated to your specific
> changes ? (Note: Still haven't finished the review of your changes ).
>
> a) x-rdma-pin-all off case
>
> Seem to only work sometimes but fails at other times. Here is an
> example...
>
> (qemu) rdma: Accepting rdma connection...
> rdma: Memory pin all: disabled
> rdma: verbs context after listen: 0x555556757d50
> rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID:
> fe80::2:c903:9:5855
> rdma: Accepted migration
> qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent
> with Host index 0x4ec: delta 0xffe6
> qemu: warning: error while loading state for instance 0x0 of device
> 'virtio-net'
> load of migration failed
>
Can you give me more details about the configuration of your VM?
>
> b) x-rdma-pin-all on case :
>
> The guest is not resuming on the target host. i.e. the source host's
> qemu states that migration is complete but the guest is not responsive
> anymore... (doesn't seem to have crashed but its stuck somewhere).
> Have you seen this behavior before ? Any tips on how I could extract
> additional info ?
Is the QEMU monitor still responsive?
Can you capture a screenshot of the guest's console to see if there is a
panic?
What kind of storage is attached to the VM?
>
> Besides the list of noted restrictions/issues around having to pin all
> of guest memory....if the pinning is done as part of starting of the
> migration it ends up taking noticeably long time for larger guests.
> Wonder whether that should be counted as part of the total migration
> time ?.
>
That's a good question: the pin-all option should not be slowing down
your VM too much, as the VM should still be running before the
migration_thread() actually kicks in and starts the migration.
I need more information on the configuration of your VM, guest operating
system, architecture and so forth.......
And similarly as before whether or not QEMU is not responsive or whether
or not it's the guest that's panicked.......
> Also the act of pinning all the memory seems to "freeze" the guest.
> e.g. : For larger enterprise sized guests (say 128GB and higher) the
> guest is "frozen" is anywhere from nearly a minute (~50seconds) to
> multiple minutes as the guest size increases...which imo kind of
> defeats the purpose of live guest migration.
That's bad =) There must be a bug somewhere........ the largest VM I can
create on my hardware is ~16GB - so let me give that a try and try to
track down the problem.
>
> Would like to hear if you have already thought about any other
> alternatives to address this issue ? for e.g. would it be better to
> pin all of the guest's memory as part of starting the guest itself ?
> Yes there are restrictions when we do pinning...but it can help with
> performance.
For such a large VM, I would definitely recommend pinning, because I'm
assuming you have enough processors or a large enough application to
actually *use* that much memory, which would suggest that even after the
bulk phase round of the migration has completed, your VM is
probably going to remain pretty busy.
It's just a matter of me tracking down what's causing the freeze and
fixing it........ I'll look into it right now on my machine.
> ---
> BTW, a different (yet sort of related) topic... recently a patch went
> into upstream that provided an option to qemu to mlock all of guest
> memory :
>
> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .
I had no idea.......very interesting.
>
> but when attempting to do the mlock for larger guests a lot of time is
> spent bringing each page into cache and clearing/zeron'g it etc.etc.
>
> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html
>
Wow, I didn't know that either. Perhaps this is causing the entire
QEMU process and its threads to seize up.
It may be necessary to run the pinning command *outside* of QEMU's I/O
lock in a separate thread if it's really that much overhead.
Thanks a lot for pointing this out.........
>
> ----
>
> Note: The basic tcp based live guest migration in the same qemu
> version still works fine on the same hosts over a pair of non-RDMA
> cards 10Gb NICs connected back-to-back.
>
Acknowledged.
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Chegu Vinod @ 2013-05-09 22:20 UTC (permalink / raw)
To: Michael R. Hines
Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira,
qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
Anthony Liguori, Paolo Bonzini
On 5/9/2013 10:20 AM, Michael R. Hines wrote:
> Comments inline. FYI: please CC mrhines@us.ibm.com,
> because it helps me know when to scroll through the bazillion qemu-devel
> emails.
>
> I have things separated out into folders and rules, but a direct CC is
> better =)
>
Sure will do.
>
> On 05/03/2013 07:28 PM, Chegu Vinod wrote:
>>
>> Hi Michael,
>>
>> I picked up the qemu bits from your github branch and gave it a
>> try. (BTW the setup I was given temporary access to has a pair of
>> MLX's IB QDR cards connected back to back via QSFP cables)
>>
>> Observed a couple of things and wanted to share..perhaps you may be
>> aware of them already or perhaps these are unrelated to your specific
>> changes ? (Note: Still haven't finished the review of your changes ).
>>
>> a) x-rdma-pin-all off case
>>
>> Seem to only work sometimes but fails at other times. Here is an
>> example...
>>
>> (qemu) rdma: Accepting rdma connection...
>> rdma: Memory pin all: disabled
>> rdma: verbs context after listen: 0x555556757d50
>> rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID:
>> fe80::2:c903:9:5855
>> rdma: Accepted migration
>> qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent
>> with Host index 0x4ec: delta 0xffe6
>> qemu: warning: error while loading state for instance 0x0 of device
>> 'virtio-net'
>> load of migration failed
>>
>
> Can you give me more details about the configuration of your VM?
The guest is a 10-VCPU/128GB ...and nothing really that fancy with
respect to storage or networking.
Hosted on a large Westmere-EX box (the target is a similarly configured
Westmere-EX system). There is a shared SAN disk between the two hosts.
Both hosts have 3.9-rc7 kernel that I got at that time from kvm.git
tree. The guest was also running the same kernel.
Since I was just trying it out I was not running any workload either.
On the source host the qemu command line :
/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-name vm1 \
-m 131072 -smp 10,sockets=1,cores=10,threads=1 \
-mem-path /dev/hugepages \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive file=/dev/libvirt_lvm3/vm1,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,model=virtio,macaddr=52:54:00:71:01:01,netdev=nic-0 \
-netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \
-vnc :4
On the destination host the command line was same as the above with the
following additional arg...
-incoming x-rdma:<static private ipaddr of the IB>:<port #>
>
>>
>> b) x-rdma-pin-all on case :
>>
>> The guest is not resuming on the target host. i.e. the source host's
>> qemu states that migration is complete but the guest is not
>> responsive anymore... (doesn't seem to have crashed but its stuck
>> somewhere). Have you seen this behavior before ? Any tips on how I
>> could extract additional info ?
>
> Is the QEMU monitor still responsive?
They were responsive.
> Can you capture a screenshot of the guest's console to see if there is
> a panic?
No panic on the guest's console :(
> What kind of storage is attached to the VM?
>
Simple virtio disk hosted on a SAN disk (see the qemu command line).
>
>>
>> Besides the list of noted restrictions/issues around having to pin
>> all of guest memory....if the pinning is done as part of starting of
>> the migration it ends up taking noticeably long time for larger
>> guests. Wonder whether that should be counted as part of the total
>> migration time ?.
>>
>
> That's a good question: the pin-all option should not be slowing down
> your VM too much, as the VM should still be running before the
> migration_thread() actually kicks in and starts the migration.
Well I had hoped that it would not have any serious impacts but it ended
up freezing the guest...
> I need more information on the configuration of your VM, guest
> operating system, architecture and so forth.......
Pl. see above.
> And similarly as before whether or not QEMU is not responsive or
> whether or not it's the guest that's panicked.......
The guest just freezes... it doesn't panic while the pinning is in
progress (i.e. after I set the capability and start the migration).
After the pinning completes, the guest continues to run and the migration
continues... till it "completes" (per the source host's qemu)... but I
never see it resume on the target host.
>
>> Also the act of pinning all the memory seems to "freeze" the guest.
>> e.g. : For larger enterprise sized guests (say 128GB and higher) the
>> guest is "frozen" is anywhere from nearly a minute (~50seconds) to
>> multiple minutes as the guest size increases...which imo kind of
>> defeats the purpose of live guest migration.
>
> That's bad =) There must be a bug somewhere........ the largest VM I
> can create on my hardware is ~16GB - so let me give that a try and try
> to track down the problem.
Ok. Perhaps running a simple test inside the guest can help observe any
scheduling delays, even when you are attempting to pin a 16GB guest?
>
>>
>> Would like to hear if you have already thought about any other
>> alternatives to address this issue ? for e.g. would it be better to
>> pin all of the guest's memory as part of starting the guest itself ?
>> Yes there are restrictions when we do pinning...but it can help with
>> performance.
>
> For such a large VM, I would definitely recommend pinning because I'm
> assuming you have enough processors or a large enough application to
> actually *use* that much memory, which would suggest that even after
> the bulk phase round of the migration has already completed that your
> VM is probably going to remain to be pretty busy.
>
> It's just a matter of me tracking down what's causing the freeze and
> fixing it........ I'll look into it right now on my machine.
>
Ok
>> ---
>> BTW, a different (yet sort of related) topic... recently a patch went
>> into upstream that provided an option to qemu to mlock all of guest
>> memory :
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .
>
> I had no idea.......very interesting.
>
>>
>> but when attempting to do the mlock for larger guests a lot of time
>> is spent bringing each page into cache and clearing/zeron'g it etc.etc.
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html
>>
>
> Wow, I didn't know that either. Perhaps this must be causing the
> entire QEMU process and its threads to seize up.
>
> It may be necessary to run the pinning command *outside* of QEMU's I/O
> lock in a separate thread if it's really that much overhead.
Not really sure if the BQL is causing the freeze... but in general,
pinning all memory while the guest is running is perhaps not the best
choice for large enterprise-class guests; i.e. it's better to do it as
part of starting the guest.
>
> Thanks a lot for pointing this out.........
>
>
BTW, a good thing to try out is to see if we can mlock the memory of a
large guest (i.e. in the source and target qemu instances) and migrate
the guest using basic TCP over a regular 10Gig NIC.
Thanks,
Vinod
>
>>
>> ----
>>
>> Note: The basic tcp based live guest migration in the same qemu
>> version still works fine on the same hosts over a pair of non-RDMA
>> cards 10Gb NICs connected back-to-back.
>>
>
> Acknowledged.
>
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Michael R. Hines @ 2013-05-09 22:45 UTC (permalink / raw)
To: Chegu Vinod
Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin,
qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
Anthony Liguori, Paolo Bonzini
Some more followup questions below to help me debug before I start
digging in.......
On 05/09/2013 06:20 PM, Chegu Vinod wrote:
Setting aside the mlock() freezes for the moment, let's first fix your
crashing
problem on the destination-side. Let's make that a priority before we fix
the mlock problem.
When the migration "completes", can you provide me with more detailed
information
about the state of QEMU on the destination?
Is it responding?
What's on the VNC console?
Is QEMU responding?
Is the network responding?
Was the VM idle? Or running an application?
Can you attach GDB to QEMU after the migration?
> /usr/local/bin/qemu-system-x86_64 \
> -enable-kvm \
> -cpu host \
> -name vm1 \
> -m 131072 -smp 10,sockets=1,cores=10,threads=1 \
> -mem-path /dev/hugepages \
Can you disable hugepages and re-test?
I'll get back to the other mlock() issues later after we at least first
make sure the migration itself is working.....
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Paolo Bonzini @ 2013-05-10 7:58 UTC (permalink / raw)
To: Chegu Vinod
Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira,
qemu-devel qemu-devel, Michael R. Hines, Orit Wasserman,
Michael R. Hines, Anthony Liguori
On 10/05/2013 00:20, Chegu Vinod wrote:
>>
>> Wow, I didn't know that either. Perhaps this must be causing the
>> entire QEMU process and its threads to seize up.
>>
>> It may be necessary to run the pinning command *outside* of QEMU's I/O
>> lock in a separate thread if it's really that much overhead.
>
> Not really sure if the BQL is causing the freeze...but in general
> pinning of all memory when the guest is run is perhaps not the best
> choice for large enterprise class guests...i.e. its better to do it as
> part of the start of the guest.
If pinning is done in the setup phase, it should run outside the BQL.
Paolo
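Paolo's suggestion could be sketched roughly as follows (a simplified
model only; `pin_all_memory` and `start_pinning_async` are hypothetical
names, not QEMU's actual API — the point is just that the expensive
registration runs in a worker thread rather than under the BQL):

```c
#include <pthread.h>

/* Hypothetical helper that registers/pins all guest RAM; on large
 * guests this can take minutes, so it must not run under the BQL. */
extern void pin_all_memory(void *opaque);

static void *pin_worker(void *opaque)
{
    pin_all_memory(opaque); /* runs without the BQL held */
    return NULL;
}

/* Called from the migration setup phase after dropping the BQL:
 * kick pinning off to a worker so the vCPUs keep running meanwhile. */
static pthread_t start_pinning_async(void *opaque)
{
    pthread_t t;
    pthread_create(&t, NULL, pin_worker, opaque);
    return t; /* caller joins once setup needs to synchronize */
}
```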
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
From: Michael R. Hines @ 2013-06-02 4:09 UTC (permalink / raw)
To: Chegu Vinod
Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin,
qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines,
Anthony Liguori, Paolo Bonzini
All,
I have successfully performed over 1000+ back-to-back RDMA migrations
automatically looped *in a row* using a heavy-weight memory-stress
benchmark here at IBM.
Migration success is verified by capturing the actual serial console output
of the virtual machine while the benchmark is running and redirecting
each migration's output to a file to verify that the output matches the
expected output of a successful migration.
migrations, I used a 14GB virtual machine size (largest VM I can create)
and the remaining 500 migrations I used a 2GB virtual machine (to make
sure I was testing both 32-bit and 64-bit address boundaries). The
benchmark is configured to have 75% stores and 25% loads and is
configured to use 80% of the allocatable free memory of the VM (i.e. no
swapping allowed).
I have defined a successful migration per the output file as follows:
1. The memory benchmark is still running and active (CPU near 100% and
memory usage is high)
2. There are no kernel panics in the console output (regex keywords
"panic", "BUG", "oom", etc...)
3. The VM is still responding to network activity (pings)
4. The console is still responsive by printing periodic messages
throughout the life of the VM to the console from inside the VM using
the 'write' command in infinite loop.
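The console-scan step of criterion 2 can be sketched as a small check
(the function name is illustrative and the keyword list is taken from
the regex keywords mentioned above, not from the actual test harness):

```c
#include <stdbool.h>
#include <string.h>

/* Scan a captured console log for strings that indicate a guest
 * crash; mirrors the "panic"/"BUG"/"oom" keywords described above. */
static bool console_log_clean(const char *log)
{
    static const char *bad[] = { "panic", "BUG", "oom", NULL };
    for (int i = 0; bad[i]; i++) {
        if (strstr(log, bad[i])) {
            return false;   /* crash signature found */
        }
    }
    return true;            /* no crash signatures in this log */
}
```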
With this method in a loop, I believe I've ironed out all the
regression-testing bugs that I can find. You may find the following
bugs interesting. The original version of this patch was written in 2010
(before my time at IBM).
Bug #1: In the original 2010 patch, each write operation used the same
"identifier" (a "work request ID" in InfiniBand terminology).
This is not typical (though allowed by the hardware); instead, each
operation should have its own unique identifier so that the write
operation can be tracked properly as it completes.
Bug #2: Also in the original 2010 patch, write operations were grouped
into separate "signaled" and "unsignaled" work requests, which is also
not typical (but allowed by the hardware). "Signalling" is infiniband
terminology which means to activate/deactivate notifying the sender
whether or not the RDMA operation has already completed. (Note: the
receiver is never notified - which is what a DMA is supposed to be). In
normal operation per infiniband specifications, "unsignaled" operations
(which indicate to the hardware *not* to notify the sender of
completion) are *supposed* to be paired simultaneously with a signaled
operation using the *same* work request identifier. Instead, the
original patch was using *different* work requests for
signaled/unsignaled writes, which means that most of the writes would be
transmitted without ever being tracked for completion whatsoever. (Per
infiniband specifications, signaled and unsignaled writes must be grouped
together because the hardware ensures that completion notification is
not given until *all* of the writes of the same request have actually
completed).
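The fixed bookkeeping for Bugs #1 and #2 can be modeled in plain C
(a simplified sketch only — the verbs calls are abstracted away, and
`write_req`/`make_write` are illustrative names, not QEMU's actual API):
every write gets a unique work request ID, and the last write of each
batch is posted signaled so that its completion also covers the
unsignaled writes queued before it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of one RDMA write request; in the real code these
 * fields would feed ibv_post_send(). */
struct write_req {
    uint64_t wr_id;    /* unique per operation (Bug #1 fix) */
    bool     signaled; /* request a completion notification? */
};

static uint64_t next_wr_id;

/* Assign a fresh ID to every write; signal only the last write of a
 * batch, whose completion confirms the earlier unsignaled writes
 * posted on the same queue pair (Bug #2 fix). */
static struct write_req make_write(bool last_in_batch)
{
    struct write_req r = {
        .wr_id    = next_wr_id++,
        .signaled = last_in_batch,
    };
    return r;
}
```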
Bug #3: Finally, in the original 2010 patch, ordering was not being
handled. Per infiniband specifications, writes can happen completely out
of order. Not only that, but PCI-express itself can change the order of
the writes as well. It was only after the first two bugs were fixed
that I could actually manifest this bug *in code*: what was happening
was that a very large group of requests would "burst" from the QEMU
migration thread. At which point, not all of the requests would finish.
Then a short time later, the next iteration would start and the virtual
machine's writable working set was still "hovering" somewhere in the
same vicinity of the address space as the previous burst of writes that
had not yet completed. When this happens, the new writes were much
smaller (not a part of a larger "chunk" per our algorithms). Since the
new writes were smaller they would complete faster than the larger,
older writes in the same address range. Since they complete out of
order, the newer writes would then get clobbered by the older writes -
resulting in an inconsistent virtual machine. So, to solve this: during
each new write, we now do a "search" to see if the address of the next
requested write matches or overlaps with the address range of any of the
previous "outstanding" writes that were still in transit, and I found
several hits. This was easily solved by blocking until the conflicting
write has completed before proceeding to issue a new write to the hardware.
- Michael
On 05/09/2013 06:45 PM, Michael R. Hines wrote:
>
> Some more followup questions below to help me debug before I start
> digging in.......
>
> On 05/09/2013 06:20 PM, Chegu Vinod wrote:
>
> Setting aside the mlock() freezes for the moment, let's first fix your
> crashing
> problem on the destination-side. Let's make that a priority before we fix
> the mlock problem.
>
> When the migration "completes", can you provide me with more detailed
> information
> about the state of QEMU on the destination?
>
> Is it responding?
> What's on the VNC console?
> Is QEMU responding?
> Is the network responding?
> Was the VM idle? Or running an application?
> Can you attach GDB to QEMU after the migration?
>
>
>> /usr/local/bin/qemu-system-x86_64 \
>> -enable-kvm \
>> -cpu host \
>> -name vm1 \
>> -m 131072 -smp 10,sockets=1,cores=10,threads=1 \
>> -mem-path /dev/hugepages \
>
> Can you disable hugepages and re-test?
>
> I'll get back to the other mlock() issues later after we at least
> first make sure the migration itself is working.....
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-06-02 4:09 ` Michael R. Hines
@ 2013-06-06 23:51 ` Chegu Vinod
2013-06-07 5:38 ` Michael R. Hines
0 siblings, 1 reply; 11+ messages in thread
From: Chegu Vinod @ 2013-06-06 23:51 UTC (permalink / raw)
To: Michael R. Hines
Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin,
qemu-devel, Orit Wasserman, Michael R. Hines,
Anthony Liguori, Paolo Bonzini
On 6/1/2013 9:09 PM, Michael R. Hines wrote:
> All,
>
> I have successfully performed over 1000+ back-to-back RDMA migrations
> automatically looped *in a row* using a heavy-weight memory-stress
> benchmark here at IBM.
> Migration success is verified by capturing the actual serial console
> output of the virtual machine while the benchmark is running and
> redirecting each migration output to a file to verify that the output
> matches the expected output of a successful migration. For half of the
> 1000 migrations, I used a 14GB virtual machine size (largest VM I can
> create) and the remaining 500 migrations I used a 2GB virtual machine
> (to make sure I was testing both 32-bit and 64-bit address
> boundaries). The benchmark is configured to have 75% stores and 25%
> loads and is configured to use 80% of the allocatable free memory of
> the VM (i.e. no swapping allowed).
>
> I have defined a successful migration per the output file as follows:
>
> 1. The memory benchmark is still running and active (CPU near 100% and
> memory usage is high)
> 2. There are no kernel panics in the console output (regex keywords
> "panic", "BUG", "oom", etc...)
> 3. The VM is still responding to network activity (pings)
> 4. The console is still responsive by printing periodic messages
> throughout the life of the VM to the console from inside the VM using
> the 'write' command in an infinite loop.
>
> With this method in a loop, I believe I've ironed out all the
> regression-testing bugs that I can find. You all may find the
> following bugs interesting. The original version of this patch was
> written in 2010 (Before my time @ IBM).
>
> Bug #1: In the original 2010 patch, each write operation uses the same
> "identifier". (A "Work Request ID" in infiniband terminology).
> This is not typical (but allowed by the hardware) - and instead each
> operation should have its own unique identifier so that the write
> operation can be tracked properly as it completes.
>
> Bug #2: Also in the original 2010 patch, write operations were grouped
> into separate "signaled" and "unsignaled" work requests, which is also
> not typical (but allowed by the hardware). "Signalling" is infiniband
> terminology which means to activate/deactivate notifying the sender
> whether or not the RDMA operation has already completed. (Note: the
> receiver is never notified - which is how DMA is supposed to work).
> In normal operation per infiniband specifications, "unsignaled"
> operations (which indicate to the hardware *not* to notify the sender
> of completion) are *supposed* to be paired simultaneously with a
> signaled operation using the *same* work request identifier. Instead,
> the original patch was using *different* work requests for
> signaled/unsignaled writes, which means that most of the writes would
> be transmitted without ever being tracked for completion whatsoever.
> (Per infiniband specifications, signaled and unsignaled writes must be
> grouped together because the hardware ensures that completion
> notification is not given until *all* of the writes of the same
> request have actually completed).
>
> Bug #3: Finally, in the original 2010 patch, ordering was not being
> handled. Per infiniband specifications, writes can happen completely
> out of order. Not only that, but PCI-express itself can change the
> order of the writes as well. It was only after the first 2 bugs
> were fixed that I could actually manifest this bug *in code*: What was
> happening was that a very large group of requests would "burst" from
> the QEMU migration thread. At which point, not all of the requests
> would finish. Then a short time later, the next iteration would start
> and the virtual machine's writable working set was still "hovering"
> somewhere in the same vicinity of the address space as the previous
> burst of writes that had not yet completed. When this happens, the new
> writes were much smaller (not a part of a larger "chunk" per our
> algorithms). Since the new writes were smaller they would complete
> faster than the larger, older writes in the same address range. Since
> they complete out of order, the newer writes would then get clobbered
> by the older writes - resulting in an inconsistent virtual machine.
> So, to solve this: during each new write, we now do a "search" to see
> if the address of the next requested write matches or overlaps with
> the address range of any of the previous "outstanding" writes that
> were still in transit, and I found several hits. This was easily
> solved by blocking until the conflicting write has completed before
> proceeding to issue a new write to the hardware.
>
> - Michael
>
>
Hi Michael,
Got some limited time on the systems so gave your latest bits a quick
try today (with the default no pinning) and it seems to be better than
before.
Ran a Java warehouse workload where the guest was 85-90% busy...
For both cases
(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:<ip>:<port>
...
20VCPU/256G guest
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes
----
40VCPU/512G guest <- I had more warehouse threads with higher
heap size etc. to make the guest busy...and hence it seems to have taken
a while to converge.
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes
<..>
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-06-06 23:51 ` Chegu Vinod
@ 2013-06-07 5:38 ` Michael R. Hines
0 siblings, 0 replies; 11+ messages in thread
From: Michael R. Hines @ 2013-06-07 5:38 UTC (permalink / raw)
To: Chegu Vinod
Cc: Karen Noel, Juan Jose Quintela Carreira, qemu-devel,
Orit Wasserman, Michael R. Hines, Anthony Liguori, Paolo Bonzini
OK, this looks excellent. I think we're ready for a PULL request now - I
will submit to Juan - I've already got a signed-off from Paolo & Eric.
I think we've got sufficient testing now.
I'm expecting to get access to a big machine in the next week or two
(256G machine) - and I should be able to reproduce your mlock() delay
issue at that time.
With such a big VM - I think pinning will help you significantly.
Stay tuned,
- Michael
On 06/06/2013 07:51 PM, Chegu Vinod wrote:
>
>>
> Hi Michael,
>
> Got some limited time on the systems so gave your latest bits a quick
> try today (with the default no pinning) and it seems to be better than
> before.
>
> Ran a Java warehouse workload where the guest was 85-90% busy...
>
> For both cases
> (qemu) migrate_set_speed 40G
> (qemu) migrate_set_downtime 2
> (qemu) migrate -d x-rdma:<ip>:<port>
>
> ...
>
> 20VCPU/256G guest
>
> (qemu) info migrate
> capabilities: xbzrle: off x-rdma-pin-all: off
> Migration status: completed
> total time: 106994 milliseconds
> downtime: 3795 milliseconds
> transferred ram: 15425453 kbytes
> throughput: 20418.27 mbps
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64707112 pages
> skipped: 0 pages
> normal: 3839625 pages
> normal bytes: 15358500 kbytes
>
> ----
>
> 40VCPU/512G guest <- I had more warehouse threads with higher
> heap size etc. to make the guest busy...and hence it seems to have
> taken a while to converge.
>
> (qemu) info migrate
> capabilities: xbzrle: off x-rdma-pin-all: off
> Migration status: completed
> total time: 2470056 milliseconds
> downtime: 6254 milliseconds
> transferred ram: 3230142002 kbytes
> throughput: 22118.67 mbps
> remaining ram: 0 kbytes
> total ram: 536879680 kbytes
> duplicate: 127436402 pages
> skipped: 0 pages
> normal: 807307274 pages
> normal bytes: 3229229096 kbytes
>
>
> <..>
>
Thread overview: 11+ messages
2013-05-03 23:28 [Qemu-devel] [PATCH v6 00/11] rdma: migration support Chegu Vinod
2013-05-09 17:20 ` Michael R. Hines
2013-05-09 22:20 ` Chegu Vinod
2013-05-09 22:45 ` Michael R. Hines
2013-06-02 4:09 ` Michael R. Hines
2013-06-06 23:51 ` Chegu Vinod
2013-06-07 5:38 ` Michael R. Hines
2013-05-10 7:58 ` Paolo Bonzini
-- strict thread matches above, loose matches on Subject: below --
2013-04-24 19:00 mrhines
2013-04-24 21:50 ` Paolo Bonzini
2013-04-24 23:48 ` Michael R. Hines