From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:58158)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <chegu_vinod@hp.com>) id 1UkjyA-00057K-SA
	for qemu-devel@nongnu.org; Thu, 06 Jun 2013 19:51:52 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <chegu_vinod@hp.com>) id 1Ukjy9-0005kc-JZ
	for qemu-devel@nongnu.org; Thu, 06 Jun 2013 19:51:50 -0400
Received: from g1t0027.austin.hp.com ([15.216.28.34]:40528)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <chegu_vinod@hp.com>) id 1Ukjy9-0005kM-CG
	for qemu-devel@nongnu.org; Thu, 06 Jun 2013 19:51:49 -0400
Message-ID: <51B1208C.2080406@hp.com>
Date: Thu, 06 Jun 2013 16:51:40 -0700
From: Chegu Vinod <chegu_vinod@hp.com>
MIME-Version: 1.0
References: <51844811.4030001@hp.com> <518BDADB.1070705@linux.vnet.ibm.com>
	<518C2135.306@hp.com> <518C26F6.5080009@linux.vnet.ibm.com>
	<51AAC589.3080302@linux.vnet.ibm.com>
In-Reply-To: <51AAC589.3080302@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
Cc: Karen Noel <knoel@redhat.com>, Juan Jose Quintela Carreira <quintela@redhat.com>, "Michael S. Tsirkin" <mst@redhat.com>, qemu-devel qemu-devel <qemu-devel@nongnu.org>, Orit Wasserman <owasserm@redhat.com>, "Michael R. Hines" <mrhines@us.ibm.com>, Anthony Liguori <anthony@codemonkey.ws>, Paolo Bonzini <pbonzini@redhat.com>

On 6/1/2013 9:09 PM, Michael R. Hines wrote:
> All,
>
> I have successfully performed more than 1000 back-to-back RDMA migrations 
> automatically looped *in a row* using a heavy-weight memory-stress 
> benchmark here at IBM.
> Migration success is verified by capturing the actual serial console 
> output of the virtual machine while the benchmark is running and 
> redirecting each migration output to a file to verify that the output 
> matches the expected output of a successful migration. For half of the 
> 1000 migrations, I used a 14GB virtual machine (the largest VM I can 
> create) and for the remaining 500 migrations I used a 2GB virtual machine 
> (to make sure I was testing both 32-bit and 64-bit address 
> boundaries). The benchmark is configured to have 75% stores and 25% 
> loads and is configured to use 80% of the allocatable free memory of 
> the VM (i.e. no swapping allowed).
>
> I have defined a successful migration per the output file as follows:
>
> 1. The memory benchmark is still running and active (CPU near 100% and 
> memory usage is high)
> 2. There are no kernel panics in the console output (regex keywords 
> "panic", "BUG", "oom", etc...)
> 3. The VM is still responding to network activity (pings)
> 4. The console is still responsive, verified by printing periodic 
> messages to the console from inside the VM throughout its life, using 
> the 'write' command in an infinite loop.
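Criteria 2 and 4 above could be checked from the captured console file with something like the rough Python sketch below. The keyword pattern only covers the three keywords quoted in the message, and the "heartbeat" marker string and the `console_looks_healthy` name are assumptions made for illustration, not anything taken from the actual test harness.

```python
import re

# Keywords named in the message; real kernels print more variants than this.
PANIC_RE = re.compile(r"\b(?:panic|BUG|oom)\b")

def console_looks_healthy(console_text, heartbeat="heartbeat", min_heartbeats=1):
    """Return True if the log has no panic keywords and enough heartbeat lines.

    `heartbeat` is a hypothetical marker string that the in-guest loop
    is assumed to print periodically.
    """
    lines = console_text.splitlines()
    # Criterion 2: no kernel panic / BUG / oom messages anywhere in the log.
    if any(PANIC_RE.search(line) for line in lines):
        return False
    # Criterion 4: the periodic console message kept appearing.
    beats = sum(1 for line in lines if heartbeat in line)
    return beats >= min_heartbeats
```

Criteria 1 and 3 (CPU/memory usage and ping) need access to the running guest and are not covered here.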
>
> With this method in a loop, I believe I've ironed out all the 
> regression-testing bugs that I can find. You all may find the 
> following bugs interesting. The original version of this patch was 
> written in 2010 (Before my time @ IBM).
>
> Bug #1: In the original 2010 patch, each write operation uses the same 
> "identifier". (A "Work Request ID" in infiniband terminology).
> This is not typical (but allowed by the hardware) - and instead each 
> operation should have its own unique identifier so that the write 
> operation can be tracked properly as it completes.
>
> Bug #2: Also in the original 2010 patch, write operations were grouped 
> into separate "signaled" and "unsignaled" work requests, which is also 
> not typical (but allowed by the hardware). "Signalling" is infiniband 
> terminology which means to activate/deactivate notifying the sender 
> whether or not the RDMA operation has already completed. (Note: the 
> receiver is never notified - which is what a DMA is supposed to be). 
> In normal operation per infiniband specifications, "unsignaled" 
> operations (which indicate to the hardware *not* to notify the sender 
> of completion) are *supposed* to be paired simultaneously with a 
> signaled operation using the *same* work request identifier. Instead, 
> the original patch was using *different* work requests for 
> signaled/unsignaled writes, which means that most of the writes would 
> be transmitted without ever being tracked for completion whatsoever. 
> (Per infiniband specifications, signaled and unsignaled writes must be 
> grouped together because the hardware ensures that completion 
> notification is not given until *all* of the writes of the same 
> request have actually completed).
>
> Bug #3: Finally, in the original 2010 patch, ordering was not being 
> handled. Per infiniband specifications, writes can happen completely 
> out of order. Not only that, but PCI-express itself can change the 
> order of the writes as well. It was only after the first 2 bugs 
> were fixed that I could actually manifest this bug *in code*: What was 
> happening was that a very large group of requests would "burst" from 
> the QEMU migration thread. At which point, not all of the requests 
> would finish. Then a short time later, the next iteration would start 
> and the virtual machine's writable working set was still "hovering" 
> somewhere in the same vicinity of the address space as the previous 
> burst of writes that had not yet completed. When this happened, the new 
> writes were much smaller (not a part of a larger "chunk" per our 
> algorithms). Since the new writes were smaller they would complete 
> faster than the larger, older writes in the same address range. Since 
> they complete out of order, the newer writes would then get clobbered 
> by the older writes - resulting in an inconsistent virtual machine. 
> So, to solve this: during each new write, we now do a "search" to see 
> if the address of the next requested write matches or overlaps with 
> the address range of any of the previous "outstanding" writes that 
> were still in transit, and I found several hits. This was easily 
> solved by blocking until the conflicting write has completed before 
> proceeding to issue a new write to the hardware.
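The overlap search described for Bug #3 can be illustrated with a minimal Python sketch. This is not the code from the patch: the `OutstandingWrites` class, its method names, and the caller-supplied `wait_for` callback are all hypothetical, and a linear scan over the in-flight writes stands in for whatever lookup the real implementation uses. Each write also gets its own identifier, in the spirit of the Bug #1 fix.

```python
class OutstandingWrites:
    """Tracks in-flight writes as half-open [start, end) address ranges."""

    def __init__(self):
        self._inflight = {}   # work request id -> (start, end)
        self._next_id = 0

    def conflicts(self, start, length):
        """Return ids of in-flight writes overlapping [start, start + length)."""
        end = start + length
        return [wid for wid, (s, e) in self._inflight.items()
                if s < end and start < e]

    def post(self, start, length, wait_for):
        """Record a new write, first draining any overlapping in-flight write.

        `wait_for(wid)` is a hypothetical blocking wait for completion of
        write `wid`; it stands in for polling the hardware.
        """
        for wid in self.conflicts(start, length):
            wait_for(wid)
            self.complete(wid)
        wid = self._next_id          # unique identifier per write
        self._next_id += 1
        self._inflight[wid] = (start, start + length)
        return wid

    def complete(self, wid):
        """Forget a write once its completion has been seen."""
        self._inflight.pop(wid, None)
```

Blocking only on the conflicting writes keeps non-overlapping writes free to complete in any order, which is the property the description above relies on.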
>
> - Michael
>
>
Hi Michael,

Got some limited time on the systems so gave your latest bits a quick 
try today (with the default no pinning) and it seems to be better than 
before.

Ran a Java warehouse workload where the guest was 85-90% busy...

For both cases:
(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:<ip>:<port>

...

20VCPU/256G guest

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes

----

40VCPU/512G guest         <- I had more warehouse threads with a larger 
heap size etc. to make the guest busier, and hence it seems to have taken 
a while to converge.

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes
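
To compare runs like the two above, a few derived figures can be pulled out of the `info migrate` text. The Python sketch below is only an illustration: the function names are made up, "kbytes" is assumed to mean 1024 bytes, and the average rate it computes is simply transferred ram divided by total time, which is not necessarily how the monitor's own "throughput" field is measured.

```python
import re

def parse_info_migrate(text):
    """Collect 'name: <integer> <unit>' lines from `info migrate` output."""
    stats = {}
    for line in text.splitlines():
        # Lines such as 'capabilities: ...' or the fractional 'throughput'
        # value do not match this pattern and are skipped.
        m = re.match(r"^([a-z ]+):\s+(\d+)\s+\w+\s*$", line.strip())
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def derived(stats):
    """Derive per-run figures; assumes 1 kbyte = 1024 bytes."""
    secs = stats["total time"] / 1000.0
    return {
        # Size of one 'normal' page, in kbytes.
        "kbytes_per_normal_page": stats["normal bytes"] / stats["normal"],
        # How many times the guest's RAM size was sent over the wire.
        "passes_over_ram": stats["transferred ram"] / stats["total ram"],
        # Average rate over the whole migration, in Gbit/s.
        "avg_gbit_per_s": stats["transferred ram"] * 1024 * 8 / secs / 1e9,
    }
```

For the 512G run this gives 4 kbytes per normal page and a transferred total of roughly 6 times the guest's RAM size, which matches the impression that it took a while to converge.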


<..>