From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58158)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1UkjyA-00057K-SA for qemu-devel@nongnu.org;
	Thu, 06 Jun 2013 19:51:52 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1Ukjy9-0005kc-JZ for qemu-devel@nongnu.org;
	Thu, 06 Jun 2013 19:51:50 -0400
Received: from g1t0027.austin.hp.com ([15.216.28.34]:40528)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1Ukjy9-0005kM-CG for qemu-devel@nongnu.org;
	Thu, 06 Jun 2013 19:51:49 -0400
Message-ID: <51B1208C.2080406@hp.com>
Date: Thu, 06 Jun 2013 16:51:40 -0700
From: Chegu Vinod
MIME-Version: 1.0
References: <51844811.4030001@hp.com> <518BDADB.1070705@linux.vnet.ibm.com>
	<518C2135.306@hp.com> <518C26F6.5080009@linux.vnet.ibm.com>
	<51AAC589.3080302@linux.vnet.ibm.com>
In-Reply-To: <51AAC589.3080302@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
To: "Michael R. Hines"
Cc: Karen Noel, Juan Jose Quintela Carreira, "Michael S. Tsirkin",
	qemu-devel qemu-devel, Orit Wasserman, "Michael R. Hines",
	Anthony Liguori, Paolo Bonzini

On 6/1/2013 9:09 PM, Michael R. Hines wrote:
> All,
>
> I have successfully performed over 1000 back-to-back RDMA migrations
> automatically looped *in a row* using a heavy-weight memory-stress
> benchmark here at IBM.
> Migration success is determined by capturing the actual serial console
> output of the virtual machine while the benchmark is running and
> redirecting each migration output to a file to verify that the output
> matches the expected output of a successful migration.
> For half of the 1000 migrations, I used a 14GB virtual machine size
> (largest VM I can create) and for the remaining 500 migrations I used
> a 2GB virtual machine (to make sure I was testing both 32-bit and
> 64-bit address boundaries). The benchmark is configured to have 75%
> stores and 25% loads and is configured to use 80% of the allocatable
> free memory of the VM (i.e. no swapping allowed).
>
> I have defined a successful migration per the output file as follows:
>
> 1. The memory benchmark is still running and active (CPU near 100% and
> memory usage is high)
> 2. There are no kernel panics in the console output (regex keywords
> "panic", "BUG", "oom", etc...)
> 3. The VM is still responding to network activity (pings)
> 4. The console is still responsive by printing periodic messages
> throughout the life of the VM to the console from inside the VM using
> the 'write' command in an infinite loop.
>
> With this method in a loop, I believe I've ironed out all the
> regression-testing bugs that I can find. You all may find the
> following bugs interesting. The original version of this patch was
> written in 2010 (Before my time @ IBM).
>
> Bug #1: In the original 2010 patch, each write operation uses the same
> "identifier". (A "Work Request ID" in infiniband terminology).
> This is not typical (but allowed by the hardware) - and instead each
> operation should have its own unique identifier so that the write
> operation can be tracked properly as it completes.
>
> Bug #2: Also in the original 2010 patch, write operations were grouped
> into separate "signaled" and "unsignaled" work requests, which is also
> not typical (but allowed by the hardware). "Signalling" is infiniband
> terminology which means to activate/deactivate notifying the sender
> whether or not the RDMA operation has already completed. (Note: the
> receiver is never notified - which is what a DMA is supposed to be).
> In normal operation per infiniband specifications, "unsignaled"
> operations (which indicate to the hardware *not* to notify the sender
> of completion) are *supposed* to be paired simultaneously with a
> signaled operation using the *same* work request identifier. Instead,
> the original patch was using *different* work requests for
> signaled/unsignaled writes, which means that most of the writes would
> be transmitted without ever being tracked for completion whatsoever.
> (Per infiniband specifications, signaled and unsignaled writes must be
> grouped together because the hardware ensures that completion
> notification is not given until *all* of the writes of the same
> request have actually completed).
>
> Bug #3: Finally, in the original 2010 patch, ordering was not being
> handled. Per infiniband specifications, writes can happen completely
> out of order. Not only that, but PCI-express itself can change the
> order of the writes as well. It was only after the first 2 bugs
> were fixed that I could actually manifest this bug *in code*: What was
> happening was that a very large group of requests would "burst" from
> the QEMU migration thread. At which point, not all of the requests
> would finish. Then a short time later, the next iteration would start
> and the virtual machine's writable working set was still "hovering"
> somewhere in the same vicinity of the address space as the previous
> burst of writes that had not yet completed. When this happens, the new
> writes were much smaller (not a part of a larger "chunk" per our
> algorithms). Since the new writes were smaller they would complete
> faster than the larger, older writes in the same address range. Since
> they complete out of order, the newer writes would then get clobbered
> by the older writes - resulting in an inconsistent virtual machine.
> So, to solve this: during each new write, we now do a "search" to see
> if the address of the next requested write matches or overlaps with
> the address range of any of the previous "outstanding" writes that
> were still in transit, and I found several hits. This was easily
> solved by blocking until the conflicting write has completed before
> proceeding to issue a new write to the hardware.
>
> - Michael
>
>

Hi Michael,

Got some limited time on the systems so gave your latest bits a quick
try today (with the default no pinning) and it seems to be better than
before.

Ran a Java warehouse workload where the guest was 85-90% busy...

For both cases

(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:: ...

20VCPU/256G guest

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes

----

40VCPU/512G guest <- I had more warehouse threads with higher heap size
etc. to make the guest busy...and hence it seems to have taken a while
to converge.

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes

<..>
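
For reference, the overlap check that Michael describes as the fix for
Bug #3 could be sketched roughly as below. This is only an illustration
in Python, not code from the patch series: the OutstandingWrites class,
its method names and the wait_for_completion callback are all made up
for the example, and a real implementation would presumably block on
the RDMA completion queue where the callback is invoked here.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class OutstandingWrites:
    """Hypothetical model of writes that were posted but have not completed."""
    _next_id: count = field(default_factory=count)
    _in_flight: dict = field(default_factory=dict)  # wr_id -> (start, end)

    def conflicts(self, addr, length):
        """Return IDs of in-flight writes overlapping [addr, addr + length)."""
        end = addr + length
        return [wr_id for wr_id, (s, e) in self._in_flight.items()
                if addr < e and s < end]

    def post(self, addr, length, wait_for_completion):
        """Issue a write, first draining any in-flight write it overlaps."""
        for wr_id in self.conflicts(addr, length):
            wait_for_completion(wr_id)  # stand-in for blocking on completion
            self.complete(wr_id)
        wr_id = next(self._next_id)     # unique ID per write, as in Bug #1
        self._in_flight[wr_id] = (addr, addr + length)
        return wr_id

    def complete(self, wr_id):
        """Mark a write as finished so it no longer blocks later writes."""
        self._in_flight.pop(wr_id, None)


# Usage: a large chunk is still in flight when a small overlapping write
# arrives, so the small write waits for the large one before being posted.
tracker = OutstandingWrites()
waited = []
big = tracker.post(0x1000, 0x4000, waited.append)
small = tracker.post(0x2000, 0x100, waited.append)
```

The point of waiting before posting is that the older, larger write can
then no longer land after the newer one and clobber it. A linear scan
over the in-flight set is the simplest way to show the idea; whether
the actual patch searches this way is not something the thread states.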