Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support

From: Chegu Vinod <chegu_vinod@hp.com>
To: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
Cc: Karen Noel <knoel@redhat.com>,
	Juan Jose Quintela Carreira <quintela@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel qemu-devel <qemu-devel@nongnu.org>,
	Orit Wasserman <owasserm@redhat.com>,
	"Michael R. Hines" <mrhines@us.ibm.com>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
Date: Thu, 06 Jun 2013 16:51:40 -0700
Message-ID: <51B1208C.2080406@hp.com>
In-Reply-To: <51AAC589.3080302@linux.vnet.ibm.com>

On 6/1/2013 9:09 PM, Michael R. Hines wrote:
> All,
>
> I have successfully performed 1000+ back-to-back RDMA migrations,
> automatically looped *in a row*, using a heavy-weight memory-stress
> benchmark here at IBM.
> Migration success is determined by capturing the actual serial console
> output of the virtual machine while the benchmark is running and 
> redirecting each migration output to a file to verify that the output 
> matches the expected output of a successful migration. For half of the 
> 1000 migrations, I used a 14GB virtual machine size (largest VM I can 
> create) and for the remaining 500 migrations I used a 2GB virtual machine
> (to make sure I was testing both 32-bit and 64-bit address 
> boundaries). The benchmark is configured to have 75% stores and 25% 
> loads and is configured to use 80% of the allocatable free memory of 
> the VM (i.e. no swapping allowed).
>
> I have defined a successful migration per the output file as follows:
>
> 1. The memory benchmark is still running and active (CPU near 100% and 
> memory usage is high)
> 2. There are no kernel panics in the console output (regex keywords 
> "panic", "BUG", "oom", etc...)
> 3. The VM is still responding to network activity (pings)
> 4. The console is still responsive by printing periodic messages 
> throughout the life of the VM to the console from inside the VM using 
> the 'write' command in an infinite loop.
>
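
A minimal sketch of how criteria 2 and 4 above could be checked against a captured console file. This is not the script used for the tests quoted here: the function name, the heartbeat marker, and the `min_heartbeats` threshold are hypothetical, and only the three keywords actually listed are matched (the "etc..." is left unfilled).

```python
import re

# Keywords from the quoted list. The leading \b keeps "oom" from matching
# inside words such as "room"; matching is case-sensitive so "BUG" does
# not fire on "debug".
FAILURE_PATTERN = re.compile(r"\b(?:panic|BUG|oom)")


def console_looks_healthy(console_text, heartbeat, min_heartbeats=2):
    """Return True if the captured console shows no failure keyword and
    the periodic heartbeat marker appears at least min_heartbeats times.

    heartbeat is whatever string the in-guest loop prints (an assumption
    of this sketch, not something specified in the quoted text).
    """
    if FAILURE_PATTERN.search(console_text):
        return False
    return console_text.count(heartbeat) >= min_heartbeats
```

Criteria 1 and 3 (benchmark load and ping response) need information from outside the console file, so they are not covered here.
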
> With this method in a loop, I believe I've ironed out all the 
> regression-testing bugs that I can find. You all may find the 
> following bugs interesting. The original version of this patch was 
> written in 2010 (Before my time @ IBM).
>
> Bug #1: In the original 2010 patch, each write operation uses the same 
> "identifier". (A "Work Request ID" in infiniband terminology).
> This is not typical (but allowed by the hardware) - and instead each 
> operation should have its own unique identifier so that the write 
> operation can be tracked properly as it completes.
>
> Bug #2: Also in the original 2010 patch, write operations were grouped 
> into separate "signaled" and "unsignaled" work requests, which is also 
> not typical (but allowed by the hardware). "Signalling" is infiniband 
> terminology which means to activate/deactivate notifying the sender 
> whether or not the RDMA operation has already completed. (Note: the 
> receiver is never notified - which is what a DMA is supposed to be). 
> In normal operation per infiniband specifications, "unsignaled" 
> operations (which indicate to the hardware *not* to notify the sender 
> of completion) are *supposed* to be paired simultaneously with a 
> signaled operation using the *same* work request identifier. Instead, 
> the original patch was using *different* work requests for 
> signaled/unsignaled writes, which means that most of the writes would 
> be transmitted without ever being tracked for completion whatsoever. 
> (Per infiniband specifications, signaled and unsignaled writes must be
> grouped together because the hardware ensures that completion 
> notification is not given until *all* of the writes of the same 
> request have actually completed).
>
> Bug #3: Finally, in the original 2010 patch, ordering was not being 
> handled. Per infiniband specifications, writes can happen completely 
> out of order. Not only that, but PCI-express itself can change the 
> order of the writes as well. It was only after the first 2 bugs
> were fixed that I could actually manifest this bug *in code*: What was 
> happening was that a very large group of requests would "burst" from 
> the QEMU migration thread. At which point, not all of the requests 
> would finish. Then a short time later, the next iteration would start 
> and the virtual machine's writable working set was still "hovering" 
> somewhere in the same vicinity of the address space as the previous 
> burst of writes that had not yet completed. When this happened, the new
> writes were much smaller (not a part of a larger "chunk" per our 
> algorithms). Since the new writes were smaller they would complete 
> faster than the larger, older writes in the same address range. Since 
> they complete out of order, the newer writes would then get clobbered 
> by the older writes - resulting in an inconsistent virtual machine. 
> So, to solve this: during each new write, we now do a "search" to see 
> if the address of the next requested write matches or overlaps with 
> the address range of any of the previous "outstanding" writes that 
> were still in transit, and I found several hits. This was easily 
> solved by blocking until the conflicting write has completed before 
> proceeding to issue a new write to the hardware.
>
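
The overlap search described for Bug #3 can be sketched as follows. This is a simplified model in Python, not the code from the patch: `WriteTracker`, `OutstandingWrite`, and the `wait_for_completion` callback are hypothetical names, and the real work of posting to and polling the hardware is left to the caller. Each write also gets its own identifier, as Bug #1 calls for.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class OutstandingWrite:
    wr_id: int    # unique per write
    addr: int     # start address of the write
    length: int   # number of bytes written


class WriteTracker:
    """Tracks writes still in transit, keyed by a unique identifier."""

    def __init__(self):
        self._next_id = count(1)
        self._in_flight = {}

    def conflicts(self, addr, length):
        """Return ids of in-flight writes overlapping [addr, addr + length)."""
        end = addr + length
        return [w.wr_id for w in self._in_flight.values()
                if addr < w.addr + w.length and w.addr < end]

    def complete(self, wr_id):
        """Forget a write once its completion has been observed."""
        self._in_flight.pop(wr_id, None)

    def post(self, addr, length, wait_for_completion):
        """Block on every conflicting write, then record the new one.

        wait_for_completion(wr_id) is supplied by the caller and must not
        return until that write has completed.
        """
        for wr_id in self.conflicts(addr, length):
            wait_for_completion(wr_id)
            self.complete(wr_id)
        wr_id = next(self._next_id)
        self._in_flight[wr_id] = OutstandingWrite(wr_id, addr, length)
        return wr_id
```

The linear scan keeps the sketch short; with many outstanding writes an interval structure would be the natural replacement.
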
> - Michael
>
>
Hi Michael,

Got some limited time on the systems so gave your latest bits a quick 
try today (with the default no pinning) and it seems to be better than 
before.

Ran a Java warehouse workload where the guest was 85-90% busy...

For both cases
(qemu) migrate_set_speed 40G
(qemu) migrate_set_downtime 2
(qemu) migrate -d x-rdma:<ip>:<port>

...

20VCPU/256G guest

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 106994 milliseconds
downtime: 3795 milliseconds
transferred ram: 15425453 kbytes
throughput: 20418.27 mbps
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64707112 pages
skipped: 0 pages
normal: 3839625 pages
normal bytes: 15358500 kbytes

----

40VCPU/512G guest         <- I had more warehouse threads with higher 
heap size etc. to make the guest busy...and hence it seems to have taken 
a while to converge.

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off
Migration status: completed
total time: 2470056 milliseconds
downtime: 6254 milliseconds
transferred ram: 3230142002 kbytes
throughput: 22118.67 mbps
remaining ram: 0 kbytes
total ram: 536879680 kbytes
duplicate: 127436402 pages
skipped: 0 pages
normal: 807307274 pages
normal bytes: 3229229096 kbytes
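
For comparing the two runs, a few derived figures can be pulled from the `info migrate` text above. This is a rough sketch: the helper names are made up, "kbytes" is assumed to mean units of 1024 bytes, and the page size is assumed to be 4 KiB. The whole-run average it computes is not an attempt to reproduce the `throughput` line printed by QEMU, which may be measured over a different interval.

```python
import re

PAGE_KIB = 4  # assumed guest page size in KiB

FIELD = re.compile(r"^([a-z ]+): (\d+) (?:milliseconds|kbytes|pages)$")


def parse_info_migrate(text):
    """Collect the integer-valued fields of an 'info migrate' dump."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def summarize(fields):
    """Derive whole-run figures from the parsed fields."""
    seconds = fields["total time"] / 1000
    transferred_bits = fields["transferred ram"] * 1024 * 8
    return {
        # Average over the entire migration, in 10^6 bits per second.
        "avg_mbps": transferred_bits / seconds / 1e6,
        # Should equal the reported "normal bytes" if pages are 4 KiB.
        "normal_kib_from_pages": fields["normal"] * PAGE_KIB,
        # How many times the guest's RAM size was sent over the wire.
        "transferred_over_total": fields["transferred ram"] / fields["total ram"],
    }
```

Under those assumptions, `normal` pages times 4 matches the reported `normal bytes` in both runs, and the 512G guest transferred about six times its total RAM versus well under one-tenth for the 256G guest, which fits the note that the larger run took a while to converge.
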


<..>
