* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
@ 2013-05-03 23:28 Chegu Vinod
2013-05-09 17:20 ` Michael R. Hines
0 siblings, 1 reply; 11+ messages in thread
From: Chegu Vinod @ 2013-05-03 23:28 UTC
To: Michael R. Hines
Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin, qemu-devel qemu-devel, Orit Wasserman, Anthony Liguori, Paolo Bonzini

Hi Michael,

I picked up the qemu bits from your github branch and gave it a try.
(BTW the setup I was given temporary access to has a pair of MLX's IB
QDR cards connected back to back via QSFP cables.)

Observed a couple of things and wanted to share. Perhaps you are aware
of them already, or perhaps these are unrelated to your specific
changes? (Note: Still haven't finished the review of your changes.)

a) x-rdma-pin-all off case

Seems to work only sometimes and fails at other times. Here is an
example...

(qemu) rdma: Accepting rdma connection...
rdma: Memory pin all: disabled
rdma: verbs context after listen: 0x555556757d50
rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID: fe80::2:c903:9:5855
rdma: Accepted migration
qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent with Host index 0x4ec: delta 0xffe6
qemu: warning: error while loading state for instance 0x0 of device 'virtio-net'
load of migration failed

b) x-rdma-pin-all on case

The guest is not resuming on the target host, i.e. the source host's
qemu states that migration is complete but the guest is not responsive
anymore (it doesn't seem to have crashed, but it's stuck somewhere).
Have you seen this behavior before? Any tips on how I could extract
additional info?

Besides the list of noted restrictions/issues around having to pin all
of guest memory: if the pinning is done as part of starting the
migration, it ends up taking a noticeably long time for larger guests.
Wonder whether that should be counted as part of the total migration
time?

Also the act of pinning all the memory seems to "freeze" the guest.
e.g.: For larger enterprise sized guests (say 128GB and higher) the
guest is "frozen" for anywhere from nearly a minute (~50 seconds) to
multiple minutes as the guest size increases, which imo kind of
defeats the purpose of live guest migration.

Would like to hear if you have already thought about any other
alternatives to address this issue? For e.g. would it be better to
pin all of the guest's memory as part of starting the guest itself?
Yes, there are restrictions when we do pinning, but it can help with
performance.

---
BTW, a different (yet sort of related) topic... recently a patch went
into upstream that provided an option to qemu to mlock all of guest
memory:

https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html

but when attempting to do the mlock for larger guests a lot of time is
spent bringing each page into cache and clearing/zeroing it etc.

https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html

----

Note: The basic tcp based live guest migration in the same qemu
version still works fine on the same hosts over a pair of non-RDMA
10Gb NICs connected back-to-back.

Thanks
Vinod
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-05-03 23:28 [Qemu-devel] [PATCH v6 00/11] rdma: migration support Chegu Vinod
@ 2013-05-09 17:20 ` Michael R. Hines
2013-05-09 22:20 ` Chegu Vinod
0 siblings, 1 reply; 11+ messages in thread
From: Michael R. Hines @ 2013-05-09 17:20 UTC
To: Chegu Vinod
Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira, qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines, Anthony Liguori, Paolo Bonzini

Comments inline. FYI: please CC mrhines@us.ibm.com, because it helps
me know when to scroll through the bazillion qemu-devel emails.

I have things separated out into folders and rules, but a direct CC is
better =)

On 05/03/2013 07:28 PM, Chegu Vinod wrote:
>
> Hi Michael,
>
> I picked up the qemu bits from your github branch and gave it a try.
> (BTW the setup I was given temporary access to has a pair of MLX's IB
> QDR cards connected back to back via QSFP cables)
>
> Observed a couple of things and wanted to share..perhaps you may be
> aware of them already or perhaps these are unrelated to your specific
> changes ? (Note: Still haven't finished the review of your changes ).
>
> a) x-rdma-pin-all off case
>
> Seem to only work sometimes but fails at other times. Here is an
> example...
>
> (qemu) rdma: Accepting rdma connection...
> rdma: Memory pin all: disabled
> rdma: verbs context after listen: 0x555556757d50
> rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID:
> fe80::2:c903:9:5855
> rdma: Accepted migration
> qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent
> with Host index 0x4ec: delta 0xffe6
> qemu: warning: error while loading state for instance 0x0 of device
> 'virtio-net'
> load of migration failed
>

Can you give me more details about the configuration of your VM?

>
> b) x-rdma-pin-all on case :
>
> The guest is not resuming on the target host. i.e. the source host's
> qemu states that migration is complete but the guest is not responsive
> anymore... (doesn't seem to have crashed but it's stuck somewhere).
> Have you seen this behavior before ? Any tips on how I could extract
> additional info ?

Is the QEMU monitor still responsive?
Can you capture a screenshot of the guest's console to see if there is
a panic?
What kind of storage is attached to the VM?

>
> Besides the list of noted restrictions/issues around having to pin all
> of guest memory....if the pinning is done as part of starting of the
> migration it ends up taking noticeably long time for larger guests.
> Wonder whether that should be counted as part of the total migration
> time ?.
>

That's a good question: The pin-all option should not be slowing down
your VM too much, as the VM should still be running before the
migration_thread() actually kicks in and starts the migration.

I need more information on the configuration of your VM, guest
operating system, architecture and so forth...

And, as before, whether it's QEMU that is not responsive or the guest
that has panicked...

> Also the act of pinning all the memory seems to "freeze" the guest.
> e.g. : For larger enterprise sized guests (say 128GB and higher) the
> guest is "frozen" is anywhere from nearly a minute (~50seconds) to
> multiple minutes as the guest size increases...which imo kind of
> defeats the purpose of live guest migration.

That's bad =) There must be a bug somewhere... the largest VM I can
create on my hardware is ~16GB - so let me give that a try and try to
track down the problem.

>
> Would like to hear if you have already thought about any other
> alternatives to address this issue ? for e.g. would it be better to
> pin all of the guest's memory as part of starting the guest itself ?
> Yes there are restrictions when we do pinning...but it can help with
> performance.

For such a large VM, I would definitely recommend pinning, because I'm
assuming you have enough processors or a large enough application to
actually *use* that much memory, which would suggest that even after
the bulk-phase round of the migration has completed, your VM is
probably going to remain pretty busy.

It's just a matter of me tracking down what's causing the freeze and
fixing it... I'll look into it right now on my machine.

> ---
> BTW, a different (yet sort of related) topic... recently a patch went
> into upstream that provided an option to qemu to mlock all of guest
> memory :
>
> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .

I had no idea... very interesting.

>
> but when attempting to do the mlock for larger guests a lot of time is
> spent bringing each page into cache and clearing/zeroing it etc.etc.
>
> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html
>

Wow, I didn't know that either. Perhaps this is what's causing the
entire QEMU process and its threads to seize up.

It may be necessary to run the pinning command *outside* of QEMU's I/O
lock in a separate thread if it's really that much overhead.

Thanks a lot for pointing this out...

>
> ----
>
> Note: The basic tcp based live guest migration in the same qemu
> version still works fine on the same hosts over a pair of non-RDMA
> cards 10Gb NICs connected back-to-back.
>

Acknowledged.
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-05-09 17:20 ` Michael R. Hines
@ 2013-05-09 22:20 ` Chegu Vinod
2013-05-09 22:45 ` Michael R. Hines
2013-05-10 7:58 ` Paolo Bonzini
0 siblings, 2 replies; 11+ messages in thread
From: Chegu Vinod @ 2013-05-09 22:20 UTC
To: Michael R. Hines
Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira, qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines, Anthony Liguori, Paolo Bonzini

On 5/9/2013 10:20 AM, Michael R. Hines wrote:
> Comments inline. FYI: please CC mrhines@us.ibm.com,
> because it helps me know when to scroll through the bazillion
> qemu-devel emails.
>
> I have things separated out into folders and rules, but a direct CC is
> better =)
>

Sure will do.

>
> On 05/03/2013 07:28 PM, Chegu Vinod wrote:
>>
>> Hi Michael,
>>
>> I picked up the qemu bits from your github branch and gave it a
>> try. (BTW the setup I was given temporary access to has a pair of
>> MLX's IB QDR cards connected back to back via QSFP cables)
>>
>> Observed a couple of things and wanted to share..perhaps you may be
>> aware of them already or perhaps these are unrelated to your specific
>> changes ? (Note: Still haven't finished the review of your changes ).
>>
>> a) x-rdma-pin-all off case
>>
>> Seem to only work sometimes but fails at other times. Here is an
>> example...
>>
>> (qemu) rdma: Accepting rdma connection...
>> rdma: Memory pin all: disabled
>> rdma: verbs context after listen: 0x555556757d50
>> rdma: dest_connect Source GID: fe80::2:c903:9:53a5, Dest GID:
>> fe80::2:c903:9:5855
>> rdma: Accepted migration
>> qemu-system-x86_64: VQ 1 size 0x100 Guest index 0x4d2 inconsistent
>> with Host index 0x4ec: delta 0xffe6
>> qemu: warning: error while loading state for instance 0x0 of device
>> 'virtio-net'
>> load of migration failed
>>
>
> Can you give me more details about the configuration of your VM?

The guest is a 10-VCPU/128GB guest, with nothing really fancy with
respect to storage or networking.

Hosted on a large Westmere-EX box (target is a similarly configured
Westmere-EX system). There is a shared SAN disk between the two hosts.
Both hosts have the 3.9-rc7 kernel that I got at that time from the
kvm.git tree. The guest was also running the same kernel.

Since I was just trying it out I was not running any workload either.

On the source host the qemu command line:

/usr/local/bin/qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -name vm1 \
    -m 131072 -smp 10,sockets=1,cores=10,threads=1 \
    -mem-path /dev/hugepages \
    -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
    -drive file=/dev/libvirt_lvm3/vm1,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
    -monitor stdio \
    -net nic,model=virtio,macaddr=52:54:00:71:01:01,netdev=nic-0 \
    -netdev tap,id=nic-0,ifname=tap0,script=no,downscript=no,vhost=on \
    -vnc :4

On the destination host the command line was the same as the above
with the following additional arg:

    -incoming x-rdma:<static private ipaddr of the IB>:<port #>

>
>>
>> b) x-rdma-pin-all on case :
>>
>> The guest is not resuming on the target host. i.e. the source host's
>> qemu states that migration is complete but the guest is not
>> responsive anymore... (doesn't seem to have crashed but it's stuck
>> somewhere). Have you seen this behavior before ? Any tips on how I
>> could extract additional info ?
>
> Is the QEMU monitor still responsive?

They were responsive.

> Can you capture a screenshot of the guest's console to see if there is
> a panic?

No panic on the guest's console :(

> What kind of storage is attached to the VM?
>

Simple virtio disk hosted on a SAN disk (see the qemu command line).

>
>>
>> Besides the list of noted restrictions/issues around having to pin
>> all of guest memory....if the pinning is done as part of starting of
>> the migration it ends up taking noticeably long time for larger
>> guests. Wonder whether that should be counted as part of the total
>> migration time ?.
>>
>
> That's a good question: The pin-all option should not be slowing down
> your VM too much as the VM should still be running before the
> migration_thread() actually kicks in and starts the migration.

Well, I had hoped that it would not have any serious impact, but it
ended up freezing the guest...

> I need more information on the configuration of your VM, guest
> operating system, architecture and so forth...

Pl. see above.

> And similarly as before whether or not QEMU is not responsive or
> whether or not it's the guest that's panicked...

The guest just freezes; it doesn't panic while this pinning is in
progress (i.e. after I set the capability and start the migration).
After the pinning completes the guest continues to run and the
migration continues till it "completes" (as per the source host's
qemu), but I never see it resume on the target host.

>
>> Also the act of pinning all the memory seems to "freeze" the guest.
>> e.g. : For larger enterprise sized guests (say 128GB and higher) the
>> guest is "frozen" is anywhere from nearly a minute (~50seconds) to
>> multiple minutes as the guest size increases...which imo kind of
>> defeats the purpose of live guest migration.
>
> That's bad =) There must be a bug somewhere... the largest VM I
> can create on my hardware is ~16GB - so let me give that a try and try
> to track down the problem.

Ok. Perhaps running a simple test inside the guest can help you observe
any scheduling delays, even when you are attempting to pin a 16GB
guest?

>
>>
>> Would like to hear if you have already thought about any other
>> alternatives to address this issue ? for e.g. would it be better to
>> pin all of the guest's memory as part of starting the guest itself ?
>> Yes there are restrictions when we do pinning...but it can help with
>> performance.
>
> For such a large VM, I would definitely recommend pinning because I'm
> assuming you have enough processors or a large enough application to
> actually *use* that much memory, which would suggest that even after
> the bulk phase round of the migration has already completed that your
> VM is probably going to remain to be pretty busy.
>
> It's just a matter of me tracking down what's causing the freeze and
> fixing it... I'll look into it right now on my machine.
>

Ok

>> ---
>> BTW, a different (yet sort of related) topic... recently a patch went
>> into upstream that provided an option to qemu to mlock all of guest
>> memory :
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg03947.html .
>
> I had no idea... very interesting.
>
>>
>> but when attempting to do the mlock for larger guests a lot of time
>> is spent bringing each page into cache and clearing/zeroing it etc.etc.
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2013-04/msg04161.html
>>
>
> Wow, I didn't know that either. Perhaps this must be causing the
> entire QEMU process and its threads to seize up.
>
> It may be necessary to run the pinning command *outside* of QEMU's I/O
> lock in a separate thread if it's really that much overhead.

Not really sure if the BQL is causing the freeze, but in general
pinning all memory while the guest is running is perhaps not the best
choice for large enterprise class guests, i.e. it's better to do it as
part of the start of the guest.

>
> Thanks a lot for pointing this out...
>
>

BTW, a good thing to try out is to see if we can mlock the memory of a
large guest (i.e. on the source and target qemu's) and migrate the
guest using basic TCP over a regular 10Gig NIC.

Thanks,
Vinod

>
>>
>> ----
>>
>> Note: The basic tcp based live guest migration in the same qemu
>> version still works fine on the same hosts over a pair of non-RDMA
>> cards 10Gb NICs connected back-to-back.
>>
>
> Acknowledged.
>
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-05-09 22:20 ` Chegu Vinod
@ 2013-05-09 22:45 ` Michael R. Hines
2013-06-02 4:09 ` Michael R. Hines
2013-05-10 7:58 ` Paolo Bonzini
1 sibling, 1 reply; 11+ messages in thread
From: Michael R. Hines @ 2013-05-09 22:45 UTC
To: Chegu Vinod
Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin, qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines, Anthony Liguori, Paolo Bonzini

Some more followup questions below to help me debug before I start
digging in...

On 05/09/2013 06:20 PM, Chegu Vinod wrote:

Setting aside the mlock() freezes for the moment, let's first fix your
crashing problem on the destination side. Let's make that a priority
before we fix the mlock problem.

When the migration "completes", can you provide me with more detailed
information about the state of QEMU on the destination?

Is it responding?
What's on the VNC console?
Is QEMU responding?
Is the network responding?
Was the VM idle? Or running an application?
Can you attach GDB to QEMU after the migration?

> /usr/local/bin/qemu-system-x86_64 \
> -enable-kvm \
> -cpu host \
> -name vm1 \
> -m 131072 -smp 10,sockets=1,cores=10,threads=1 \
> -mem-path /dev/hugepages \

Can you disable hugepages and re-test?

I'll get back to the other mlock() issues later, after we at least
first make sure the migration itself is working...
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-05-09 22:45 ` Michael R. Hines
@ 2013-06-02 4:09 ` Michael R. Hines
2013-06-06 23:51 ` Chegu Vinod
0 siblings, 1 reply; 11+ messages in thread
From: Michael R. Hines @ 2013-06-02 4:09 UTC
To: Chegu Vinod
Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin, qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines, Anthony Liguori, Paolo Bonzini

All,

I have successfully performed over 1000 back-to-back RDMA migrations,
automatically looped *in a row*, using a heavy-weight memory-stress
benchmark here at IBM.

Migration success is verified by capturing the actual serial console
output of the virtual machine while the benchmark is running and
redirecting each migration's output to a file, to check that the
output matches the expected output of a successful migration. For half
of the 1000 migrations, I used a 14GB virtual machine size (largest VM
I can create) and for the remaining 500 migrations I used a 2GB
virtual machine (to make sure I was testing both 32-bit and 64-bit
address boundaries). The benchmark is configured to have 75% stores
and 25% loads and is configured to use 80% of the allocatable free
memory of the VM (i.e. no swapping allowed).

I have defined a successful migration per the output file as follows:

1. The memory benchmark is still running and active (CPU near 100% and
   memory usage is high)
2. There are no kernel panics in the console output (regex keywords
   "panic", "BUG", "oom", etc...)
3. The VM is still responding to network activity (pings)
4. The console is still responsive, checked by printing periodic
   messages to the console from inside the VM throughout its life
   using the 'write' command in an infinite loop.

With this method in a loop, I believe I've ironed out all the
regression-testing bugs that I can find. You all may find the
following bugs interesting. The original version of this patch was
written in 2010 (Before my time @ IBM).
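
As an illustration only, the four success criteria listed above could be
checked along the following lines. This is a minimal sketch and not the
harness that was actually used: the function name, its parameters, the
heartbeat string and the exact regular expression are all assumptions;
only the keywords "panic", "BUG" and "oom" come from the message.

```python
import re

# Keywords from criterion 2. The word-boundary prefix is an assumption,
# added so that words such as "debug" or "room" do not count as hits.
PANIC_PATTERN = re.compile(r"\b(?:panic|BUG|oom)")


def migration_succeeded(console_log, heartbeat, min_heartbeats,
                        benchmark_running, ping_ok):
    """Apply the four criteria to one migration's captured console output.

    console_log       -- text captured from the guest's serial console
    heartbeat         -- hypothetical marker printed periodically in the guest
    min_heartbeats    -- how many markers must be present (criterion 4)
    benchmark_running -- result of an external check (criterion 1)
    ping_ok           -- result of an external ping check (criterion 3)
    """
    no_panic = PANIC_PATTERN.search(console_log) is None
    console_alive = console_log.count(heartbeat) >= min_heartbeats
    return bool(benchmark_running and no_panic and ping_ok and console_alive)
```

Criteria 1 and 3 are passed in as booleans because how they were measured
is not described in the message.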
Bug #1: In the original 2010 patch, each write operation uses the same
"identifier" (a "Work Request ID" in InfiniBand terminology). This is
not typical (but allowed by the hardware) - instead, each operation
should have its own unique identifier so that the write operation can
be tracked properly as it completes.

Bug #2: Also in the original 2010 patch, write operations were grouped
into separate "signaled" and "unsignaled" work requests, which is also
not typical (but allowed by the hardware). "Signaling" is InfiniBand
terminology for enabling or disabling the notification that tells the
sender the RDMA operation has completed. (Note: the receiver is never
notified - which is what a DMA is supposed to be.) In normal operation
per InfiniBand specifications, "unsignaled" operations (which tell the
hardware *not* to notify the sender of completion) are *supposed* to
be paired with a signaled operation using the *same* work request
identifier. Instead, the original patch was using *different* work
requests for signaled/unsignaled writes, which means that most of the
writes would be transmitted without ever being tracked for completion
at all. (Per InfiniBand specifications, signaled and unsignaled writes
must be grouped together, because the hardware ensures that completion
notification is not given until *all* of the writes of the same
request have actually completed.)

Bug #3: Finally, in the original 2010 patch, ordering was not being
handled. Per InfiniBand specifications, writes can happen completely
out of order. Not only that, but PCI-express itself can change the
order of the writes as well. It was only after the first 2 bugs were
fixed that I could actually reproduce this bug *in code*: What was
happening was that a very large group of requests would "burst" from
the QEMU migration thread, at which point not all of the requests
would finish.
Then, a short time later, the next iteration would start while the
virtual machine's writable working set was still "hovering" somewhere
in the same vicinity of the address space as the previous burst of
writes that had not yet completed. When this happened, the new writes
were much smaller (not a part of a larger "chunk" per our algorithms).
Since the new writes were smaller, they would complete faster than the
larger, older writes in the same address range. Since they completed
out of order, the newer writes would then get clobbered by the older
writes - resulting in an inconsistent virtual machine. So, to solve
this: during each new write, we now do a "search" to see if the
address of the next requested write matches or overlaps with the
address range of any of the previous "outstanding" writes that are
still in transit, and I found several hits. This was easily solved by
blocking until the conflicting write has completed before proceeding
to issue a new write to the hardware.

- Michael

On 05/09/2013 06:45 PM, Michael R. Hines wrote:
>
> Some more followup questions below to help me debug before I start
> digging in...
>
> On 05/09/2013 06:20 PM, Chegu Vinod wrote:
>
> Setting aside the mlock() freezes for the moment, let's first fix your
> crashing
> problem on the destination-side. Let's make that a priority before we fix
> the mlock problem.
>
> When the migration "completes", can you provide me with more detailed
> information
> about the state of QEMU on the destination?
>
> Is it responding?
> What's on the VNC console?
> Is QEMU responding?
> Is the network responding?
> Was the VM idle? Or running an application?
> Can you attach GDB to QEMU after the migration?
>
>
>> /usr/local/bin/qemu-system-x86_64 \
>> -enable-kvm \
>> -cpu host \
>> -name vm1 \
>> -m 131072 -smp 10,sockets=1,cores=10,threads=1 \
>> -mem-path /dev/hugepages \
>
> Can you disable hugepages and re-test?
>
> I'll get back to the other mlock() issues later after we at least
> first make sure the migration itself is working...
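
The fixes described for Bug #1 and Bug #3 in the message above (a unique
identifier per write, plus a search of the outstanding writes for address
overlap before posting a new one) can be modelled roughly as follows.
This is a simplified sketch under stated assumptions, not the code from
the patch series: the class and method names are hypothetical, and
`wait_for_completion` stands in for whatever blocks on the hardware
completion of a given work request.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Write:
    wr_id: int    # unique per write (the fix for Bug #1)
    start: int    # first byte address covered by the write
    length: int   # number of bytes

    @property
    def end(self):
        return self.start + self.length  # exclusive end address


class OutstandingWrites:
    """Tracks writes that were posted but have not completed yet."""

    def __init__(self):
        self._ids = count(1)
        self._in_flight = {}

    def conflicts(self, start, length):
        """Return in-flight writes whose range overlaps [start, start+length)."""
        end = start + length
        return [w for w in self._in_flight.values()
                if w.start < end and start < w.end]

    def complete(self, wr_id):
        """Forget a write once its completion has been observed."""
        self._in_flight.pop(wr_id, None)

    def post(self, start, length, wait_for_completion):
        """Block on every overlapping older write, then record the new one."""
        for old in self.conflicts(start, length):
            wait_for_completion(old.wr_id)  # hypothetical blocking callback
            self.complete(old.wr_id)
        new = Write(next(self._ids), start, length)
        self._in_flight[new.wr_id] = new
        return new.wr_id
```

A linear scan is used here for brevity; the message does not say which
data structure the real search uses.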
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-06-02 4:09 ` Michael R. Hines
@ 2013-06-06 23:51 ` Chegu Vinod
2013-06-07 5:38 ` Michael R. Hines
0 siblings, 1 reply; 11+ messages in thread
From: Chegu Vinod @ 2013-06-06 23:51 UTC
To: Michael R. Hines
Cc: Karen Noel, Juan Jose Quintela Carreira, Michael S. Tsirkin, qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines, Anthony Liguori, Paolo Bonzini

On 6/1/2013 9:09 PM, Michael R. Hines wrote:
> All,
>
> I have successfully performed over 1000+ back-to-back RDMA migrations
> automatically looped *in a row* using a heavy-weight memory-stress
> benchmark here at IBM.
> Migration success is done by capturing the actual serial console
> output of the virtual machine while the benchmark is running and
> redirecting each migration output to a file to verify that the output
> matches the expected output of a successful migration. For half of the
> 1000 migrations, I used a 14GB virtual machine size (largest VM I can
> create) and the remaining 500 migrations I used a 2GB virtual machine
> (to make sure I was testing both 32-bit and 64-bit address
> boundaries). The benchmark is configured to have 75% stores and 25%
> loads and is configured to use 80% of the allocatable free memory of
> the VM (i.e. no swapping allowed).
>
> I have defined a successful migration per the output file as follows:
>
> 1. The memory benchmark is still running and active (CPU near 100% and
> memory usage is high)
> 2. There are no kernel panics in the console output (regex keywords
> "panic", "BUG", "oom", etc...)
> 3. The VM is still responding to network activity (pings)
> 4. The console is still responsive by printing periodic messages
> throughout the life of the VM to the console from inside the VM using
> the 'write' command in infinite loop.
>
> With this method in a loop, I believe I've ironed out all the
> regression-testing bugs that I can find. You all may find the
> following bugs interesting. The original version of this patch was
> written in 2010 (Before my time @ IBM).
>
> Bug #1: In the original 2010 patch, each write operation uses the same
> "identifier". (A "Work Request ID" in infiniband terminology).
> This is not typical (but allowed by the hardware) - and instead each
> operation should have its own unique identifier so that the write
> operation can be tracked properly as it completes.
>
> Bug #2: Also in the original 2010 patch, write operations were grouped
> into separate "signaled" and "unsignaled" work requests, which is also
> not typical (but allowed by the hardware). "Signalling" is infiniband
> terminology which means to activate/deactivate notifying the sender
> whether or not the RDMA operation has already completed. (Note: the
> receiver is never notified - which is what a DMA is supposed to be).
> In normal operation per infiniband specifications, "unsignaled"
> operations (which indicate to the hardware *not* to notify the sender
> of completion) are *supposed* to be paired simultaneously with a
> signaled operation using the *same* work request identifier. Instead,
> the original patch was using *different* work requests for
> signaled/unsignaled writes, which means that most of the writes would
> be transmitted without ever being tracked for completion whatsoever.
> (Per InfiniBand specifications, signaled and unsignaled writes must be
> grouped together because the hardware ensures that completion
> notification is not given until *all* of the writes of the same
> request have actually completed).
>
> Bug #3: Finally, in the original 2010 patch, ordering was not being
> handled. Per infiniband specifications, writes can happen completely
> out of order. Not only that, but PCI-express itself can change the
> order of the writes as well. It was only until after the first 2 bugs
> were fixed that I could actually manifest this bug *in code*: What was
> happening was that a very large group of requests would "burst" from
> the QEMU migration thread. At which point, not all of the requests
> would finish. Then a short time later, the next iteration would start
> and the virtual machine's writable working set was still "hovering"
> somewhere in the same vicinity of the address space as the previous
> burst of writes that had not yet completed. When this happens, the new
> writes were much smaller (not a part of a larger "chunk" per our
> algorithms). Since the new writes were smaller they would complete
> faster than the larger, older writes in the same address range. Since
> they complete out of order, the newer writes would then get clobbered
> by the older writes - resulting in an inconsistent virtual machine.
> So, to solve this: during each new write, we now do a "search" to see
> if the address of the next requested write matches or overlaps with
> the address range of any of the previous "outstanding" writes that
> were still in transit, and I found several hits. This was easily
> solved by blocking until the conflicting write has completed before
> proceeding to issue a new write to the hardware.
>
> - Michael
>
>

Hi Michael,

Got some limited time on the systems, so I gave your latest bits a
quick try today (with the default no pinning) and it seems to be
better than before.

Ran a Java warehouse workload where the guest was 85-90% busy...

For both cases:

(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:<ip>:<port>

...

20VCPU/256G guest

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes

----

40VCPU/512G guest <- I had more warehouse threads with a higher heap
size etc. to make the guest busy, and hence it seems to have taken a
while to converge.

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes

<..>
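
For readers cross-checking the two `info migrate` dumps above, the
following sketch redoes some of the arithmetic. It rests on assumptions
that the message does not state: a 4 KiB page size, "kbytes" meaning
1024 bytes, and megabits counted as 10^6 bits. The helper names are
hypothetical.

```python
PAGE_KBYTES = 4  # assumption: 4 KiB pages


def normal_kbytes(normal_pages):
    """Size in kbytes of the pages reported on the 'normal' line."""
    return normal_pages * PAGE_KBYTES


def average_mbps(transferred_kbytes, total_time_ms):
    """Mean transfer rate over the whole migration, in megabits per second."""
    bits = transferred_kbytes * 1024 * 8
    seconds = total_time_ms / 1000
    return bits / seconds / 1e6
```

With those assumptions, the "normal" page counts reproduce the reported
"normal bytes" exactly (3839625 pages give 15358500 kbytes, and
807307274 pages give 3229229096 kbytes). The whole-run averages come out
at roughly 1.2 Gbit/s and 10.7 Gbit/s, well below the reported
"throughput" values, so that field is presumably measured over a
different interval than the total time.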
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-06-06 23:51 ` Chegu Vinod
@ 2013-06-07 5:38 ` Michael R. Hines
0 siblings, 0 replies; 11+ messages in thread
From: Michael R. Hines @ 2013-06-07 5:38 UTC
To: Chegu Vinod
Cc: Karen Noel, Juan Jose Quintela Carreira, qemu-devel qemu-devel, Orit Wasserman, Michael R. Hines, Anthony Liguori, Paolo Bonzini

OK, this looks excellent. I think we're ready for a PULL request now -
I will submit to Juan - I've already got a Signed-off-by from Paolo &
Eric.

I think we've got sufficient testing now.

I'm expecting to get access to a big machine in the next week or two
(256G machine) - and I should be able to reproduce your mlock() delay
issue at that time.

With such a big VM, I think pinning will help you significantly.

Stay tuned,

- Michael

On 06/06/2013 07:51 PM, Chegu Vinod wrote:
>
>>
> Hi Michael,
>
> Got some limited time on the systems so gave your latest bits a quick
> try today (with the default no pinning) and it seems to be better than
> before.
>
> Ran a Java warehouse workload where the guest was 85-90% busy...
>
> For both cases
> (qemu) migrate_set_speed 40G
> (qemu) migrate_set_downtime 2
> (qemu) migrate -d x-rdma:<ip>:<port>
>
> ...
>
> 20VCPU/256G guest
>
> (qemu) info migrate
> capabilities: xbzrle: off x-rdma-pin-all: off
> Migration status: completed
> total time: 106994 milliseconds
> downtime: 3795 milliseconds
> transferred ram: 15425453 kbytes
> throughput: 20418.27 mbps
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64707112 pages
> skipped: 0 pages
> normal: 3839625 pages
> normal bytes: 15358500 kbytes
>
> ----
>
> 40VCPU/512G guest <- I had more warehouse threads with higher
> heap size etc. to make the guest busy...and hence it seems to have
> taken a while to converge.
>
> (qemu) info migrate
> capabilities: xbzrle: off x-rdma-pin-all: off
> Migration status: completed
> total time: 2470056 milliseconds
> downtime: 6254 milliseconds
> transferred ram: 3230142002 kbytes
> throughput: 22118.67 mbps
> remaining ram: 0 kbytes
> total ram: 536879680 kbytes
> duplicate: 127436402 pages
> skipped: 0 pages
> normal: 807307274 pages
> normal bytes: 3229229096 kbytes
>
>
> <..>
>
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-05-09 22:20 ` Chegu Vinod
2013-05-09 22:45 ` Michael R. Hines
@ 2013-05-10 7:58 ` Paolo Bonzini
1 sibling, 0 replies; 11+ messages in thread
From: Paolo Bonzini @ 2013-05-10 7:58 UTC (permalink / raw)
To: Chegu Vinod
Cc: Karen Noel, Michael S. Tsirkin, Juan Jose Quintela Carreira, qemu-devel qemu-devel, Michael R. Hines, Orit Wasserman, Michael R. Hines, Anthony Liguori
On 10/05/2013 00:20, Chegu Vinod wrote:
>> Wow, I didn't know that either. Perhaps this must be causing the
>> entire QEMU process and its threads to seize up.
>>
>> It may be necessary to run the pinning command *outside* of QEMU's I/O
>> lock in a separate thread if it's really that much overhead.
>
> Not really sure if the BQL is causing the freeze...but in general
> pinning of all memory when the guest is run is perhaps not the best
> choice for large enterprise class guests...i.e. it's better to do it as
> part of the start of the guest.
If pinning is done in the setup phase, it should run outside the BQL.
Paolo
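The point about running a slow setup step outside the BQL can be illustrated in miniature. The following is a Python sketch only and is not QEMU code: `big_lock` stands in for a global lock such as the BQL, `slow_setup` stands in for a long operation such as pinning guest memory, and every name here is hypothetical.

```python
import threading

big_lock = threading.Lock()  # stand-in for a global lock such as the BQL


def lock_is_free_during_setup(hold_lock):
    """Run a slow setup step in a thread; report whether big_lock stays free."""
    started = threading.Event()
    finish = threading.Event()

    def slow_setup():
        # Placeholder for a long operation such as pinning guest memory.
        if hold_lock:
            with big_lock:
                started.set()
                finish.wait()
        else:
            started.set()
            finish.wait()

    worker = threading.Thread(target=slow_setup)
    worker.start()
    started.wait()

    # Could another thread take the lock while setup is still running?
    free = big_lock.acquire(blocking=False)
    if free:
        big_lock.release()

    finish.set()
    worker.join()
    return free


print(lock_is_free_during_setup(hold_lock=True))   # False
print(lock_is_free_during_setup(hold_lock=False))  # True
```

When the slow step holds the lock, no other thread can take it until the step ends, which is one way a guest could appear frozen. When the step runs without the lock, other threads can keep acquiring it in the meantime.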
* [Qemu-devel] [PATCH v6 00/11] rdma: migration support
@ 2013-04-24 19:00 mrhines
2013-04-24 21:50 ` Paolo Bonzini
0 siblings, 1 reply; 11+ messages in thread
From: mrhines @ 2013-04-24 19:00 UTC (permalink / raw)
To: quintela; +Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini
From: "Michael R. Hines" <mrhines@us.ibm.com>
Please pull.
Changes since v5:
- Removed max_size hook.
- Waiting for Signed-Off bys....
Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git
Here is a brief summary of total migration time and downtime using RDMA:
The test ran over a 40 Gbps InfiniBand link and performed a worst-case stress test
on an 8GB RAM virtual machine,
using the following commands:
$ apt-get install stress
$ stress --vm-bytes 7500M --vm 1 --vm-keep
RESULTS:
1. Migration throughput: 26 gigabits/second.
2. Downtime (stop time) varies between 15 and 100 milliseconds.
EFFECTS of memory registration on bulk phase round:
For example, take the same 8GB RAM virtual machine with all 8GB of memory in
active use but the VM itself completely idle, over the same 40 Gbps
InfiniBand link:
1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
These numbers would of course scale up to whatever size virtual machine
you have to migrate using RDMA.
Enabling this feature does *not* have any measurable effect on
migration *downtime*. This is because, without this feature, all of the
memory will already have been registered in advance during
the bulk round and does not need to be re-registered during the successive
iteration rounds.
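As a rough sanity check on the bulk-round figures above, the time needed to move 8GB at each reported rate can be worked out directly. This Python sketch assumes that "8GB" means 8 GiB and that all of RAM is sent exactly once; both are assumptions, and `bulk_seconds` is a hypothetical helper.

```python
GIB = 1024 ** 3


def bulk_seconds(ram_bytes, gbps):
    """Seconds to send ram_bytes once at gbps (10**9 bits per second)."""
    return ram_bytes * 8 / (gbps * 1e9)


ram = 8 * GIB  # assumption: "8GB" read as 8 GiB

for label, gbps in (("x-rdma-pin-all disabled", 9.5),
                    ("x-rdma-pin-all enabled", 26)):
    print(f"{label}: {bulk_seconds(ram, gbps):.1f} s at {gbps} Gbps")
```

Under these assumptions the disabled case lands close to the reported 7.5 seconds, while the enabled case comes out below the reported 4 seconds. This summary does not say what accounts for the difference; time spent registering memory up front is one possibility, but that is a guess.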
The following changes since commit f3aa844bbb2922a5b8393d17620eca7d7e921ab3:
build: include config-{, all-}devices.mak after defining CONFIG_SOFTMMU and CONFIG_USER_ONLY (2013-04-24 12:18:41 -0500)
are available in the git repository at:
git@github.com:hinesmr/qemu.git rdma_patch_v6
for you to fetch changes up to 75e6fac1f642885b93cefe6e1874d648e9850f8f:
rdma: send pc.ram (2013-04-24 14:55:01 -0400)
----------------------------------------------------------------
Michael R. Hines (11):
rdma: add documentation
rdma: export yield_until_fd_readable()
rdma: export throughput w/ MigrationStats QMP
rdma: introduce qemu_file_mode_is_not_valid()
rdma: export qemu_fflush()
rdma: introduce ram_handle_compressed()
rdma: introduce qemu_ram_foreach_block()
rdma: new QEMUFileOps hooks
rdma: introduce capability x-rdma-pin-all
rdma: core logic
rdma: send pc.ram
Makefile.objs | 1 +
arch_init.c | 59 +-
configure | 29 +
docs/rdma.txt | 404 ++++++
exec.c | 9 +
hmp.c | 2 +
include/block/coroutine.h | 6 +
include/exec/cpu-common.h | 5 +
include/migration/migration.h | 25 +
include/migration/qemu-file.h | 30 +
migration-rdma.c | 2707 +++++++++++++++++++++++++++++++++++++++++
migration.c | 27 +
qapi-schema.json | 12 +-
qemu-coroutine-io.c | 23 +
savevm.c | 107 +-
15 files changed, 3398 insertions(+), 48 deletions(-)
create mode 100644 docs/rdma.txt
create mode 100644 migration-rdma.c
--
1.7.10.4
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-04-24 19:00 mrhines
@ 2013-04-24 21:50 ` Paolo Bonzini
2013-04-24 23:48 ` Michael R. Hines
0 siblings, 1 reply; 11+ messages in thread
From: Paolo Bonzini @ 2013-04-24 21:50 UTC (permalink / raw)
To: mrhines; +Cc: aliguori, quintela, qemu-devel, owasserm, abali, mrhines, gokul
On 24/04/2013 21:00, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Changes since v5:
>
> - Removed max_size hook.
The patches look good. I will not be very available in the next few days due to a public holiday here, but I believe that it's okay for 1.5. It's clearly marked as experimental, and the changes to the internals are safe and ok.
The one small nit is that patch 11 should come before patch 10. It can be fixed by whoever applies the patch.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo
>
> Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
> Github: git@github.com:hinesmr/qemu.git
>
> Here is a brief summary of total migration time and downtime using RDMA:
>
> The test ran over a 40 Gbps InfiniBand link and performed a worst-case stress test
> on an 8GB RAM virtual machine,
> using the following commands:
>
> $ apt-get install stress
> $ stress --vm-bytes 7500M --vm 1 --vm-keep
>
> RESULTS:
>
> 1. Migration throughput: 26 gigabits/second.
> 2. Downtime (stop time) varies between 15 and 100 milliseconds.
>
> EFFECTS of memory registration on bulk phase round:
>
> For example, take the same 8GB RAM virtual machine with all 8GB of memory in
> active use but the VM itself completely idle, over the same 40 Gbps
> InfiniBand link:
>
> 1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
> 2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
>
> These numbers would of course scale up to whatever size virtual machine
> you have to migrate using RDMA.
>
> Enabling this feature does *not* have any measurable effect on
> migration *downtime*. This is because, without this feature, all of the
> memory will already have been registered in advance during
> the bulk round and does not need to be re-registered during the successive
> iteration rounds.
>
> The following changes since commit f3aa844bbb2922a5b8393d17620eca7d7e921ab3:
>
> build: include config-{, all-}devices.mak after defining CONFIG_SOFTMMU and CONFIG_USER_ONLY (2013-04-24 12:18:41 -0500)
>
> are available in the git repository at:
>
> git@github.com:hinesmr/qemu.git rdma_patch_v6
>
> for you to fetch changes up to 75e6fac1f642885b93cefe6e1874d648e9850f8f:
>
> rdma: send pc.ram (2013-04-24 14:55:01 -0400)
>
> ----------------------------------------------------------------
> Michael R. Hines (11):
> rdma: add documentation
> rdma: export yield_until_fd_readable()
> rdma: export throughput w/ MigrationStats QMP
> rdma: introduce qemu_file_mode_is_not_valid()
> rdma: export qemu_fflush()
> rdma: introduce ram_handle_compressed()
> rdma: introduce qemu_ram_foreach_block()
> rdma: new QEMUFileOps hooks
> rdma: introduce capability x-rdma-pin-all
> rdma: core logic
> rdma: send pc.ram
>
> Makefile.objs | 1 +
> arch_init.c | 59 +-
> configure | 29 +
> docs/rdma.txt | 404 ++++++
> exec.c | 9 +
> hmp.c | 2 +
> include/block/coroutine.h | 6 +
> include/exec/cpu-common.h | 5 +
> include/migration/migration.h | 25 +
> include/migration/qemu-file.h | 30 +
> migration-rdma.c | 2707 +++++++++++++++++++++++++++++++++++++++++
> migration.c | 27 +
> qapi-schema.json | 12 +-
> qemu-coroutine-io.c | 23 +
> savevm.c | 107 +-
> 15 files changed, 3398 insertions(+), 48 deletions(-)
> create mode 100644 docs/rdma.txt
> create mode 100644 migration-rdma.c
* Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
2013-04-24 21:50 ` Paolo Bonzini
@ 2013-04-24 23:48 ` Michael R. Hines
0 siblings, 0 replies; 11+ messages in thread
From: Michael R. Hines @ 2013-04-24 23:48 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, quintela, qemu-devel, owasserm, abali, mrhines, gokul
On 04/24/2013 05:50 PM, Paolo Bonzini wrote:
> On 24/04/2013 21:00, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Changes since v5:
>>
>> - Removed max_size hook.
> The patches look good. I will not be very available in the next few
> days due to a public holiday here, but I believe that it's okay for 1.5.
> It's clearly marked as experimental, and the changes to the internals
> are safe and ok.
>
> The one small nit is that patch 11 should come before patch 10. It can
> be fixed by whoever applies the patch.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Paolo
Acknowledged. Thank you.
- Michael
end of thread, other threads:[~2013-06-07 5:38 UTC | newest]
Thread overview: 11+ messages
2013-05-03 23:28 [Qemu-devel] [PATCH v6 00/11] rdma: migration support Chegu Vinod
2013-05-09 17:20 ` Michael R. Hines
2013-05-09 22:20 ` Chegu Vinod
2013-05-09 22:45 ` Michael R. Hines
2013-06-02 4:09 ` Michael R. Hines
2013-06-06 23:51 ` Chegu Vinod
2013-06-07 5:38 ` Michael R. Hines
2013-05-10 7:58 ` Paolo Bonzini
-- strict thread matches above, loose matches on Subject: below --
2013-04-24 19:00 mrhines
2013-04-24 21:50 ` Paolo Bonzini
2013-04-24 23:48 ` Michael R. Hines