From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: Chegu Vinod <chegu_vinod@hp.com>
Cc: Karen Noel <knoel@redhat.com>,
	Juan Jose Quintela Carreira <quintela@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	qemu-devel qemu-devel <qemu-devel@nongnu.org>,
	Orit Wasserman <owasserm@redhat.com>,
	"Michael R. Hines" <mrhines@us.ibm.com>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v6 00/11] rdma: migration support
Date: Sun, 02 Jun 2013 00:09:45 -0400
Message-ID: <51AAC589.3080302@linux.vnet.ibm.com>
In-Reply-To: <518C26F6.5080009@linux.vnet.ibm.com>

All,

I have successfully performed more than 1000 back-to-back RDMA migrations,
looped automatically *in a row*, using a heavy-weight memory-stress
benchmark here at IBM.
Migration success is verified by capturing the actual serial console output
of the virtual machine while the benchmark is running, redirecting the
output of each migration to a file, and checking that it matches the
expected output of a successful migration. For half of the 1000 migrations
I used a 14GB virtual machine (the largest VM I can create), and for the
remaining 500 migrations I used a 2GB virtual machine (to make sure I was
testing both 32-bit and 64-bit address boundaries). The benchmark is
configured for 75% stores and 25% loads and uses 80% of the allocatable
free memory of the VM (i.e. no swapping allowed).
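Purely to illustrate the shape of that workload (this is *not* the actual
benchmark; the working-set size and the 75/25 split below are hard-coded
assumptions for the example), the stress loop looks roughly like this:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical working set; in the real runs this was ~80% of the VM's
 * allocatable free memory. */
#define WORKING_SET_BYTES ((size_t)1 << 30)

int main(void)
{
    volatile uint8_t *buf = malloc(WORKING_SET_BYTES);
    uint64_t sum = 0, i = 0;

    if (!buf) {
        return 1;
    }
    memset((void *)buf, 0, WORKING_SET_BYTES);   /* fault every page in */

    for (;;) {
        size_t idx = (size_t)rand() % WORKING_SET_BYTES;
        if (++i % 4 != 0) {
            buf[idx] = (uint8_t)i;               /* ~75% stores */
        } else {
            sum += buf[idx];                     /* ~25% loads */
        }
    }

    return (int)sum;                             /* never reached */
}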

I have defined a successful migration per the output file as follows:

1. The memory benchmark is still running and active (CPU near 100% and 
memory usage is high)
2. There are no kernel panics in the console output (regex keywords
"panic", "BUG", "oom", etc.; see the sketch after this list)
3. The VM is still responding to network activity (pings)
4. The console is still responsive: a loop inside the VM uses the 'write'
command to print periodic messages to the console throughout the life of
the VM.
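As a rough illustration of check #2 only (the function name, the exact
pattern, and the log handling here are assumptions, not the harness I
actually ran), the console log scan amounts to something like:

#include <regex.h>
#include <stdio.h>

/* Returns 1 if no failure keyword appears in the captured console log,
 * 0 if one does, -1 on error. */
int console_log_is_clean(const char *path)
{
    regex_t re;
    char line[4096];
    FILE *f = fopen(path, "r");
    int clean = 1;

    if (!f) {
        return -1;
    }
    if (regcomp(&re, "panic|BUG|oom", REG_EXTENDED | REG_NOSUB) != 0) {
        fclose(f);
        return -1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (regexec(&re, line, 0, NULL, 0) == 0) {
            clean = 0;                  /* a failure keyword was printed */
            break;
        }
    }
    regfree(&re);
    fclose(f);
    return clean;
}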

With this method running in a loop, I believe I've ironed out all of the
bugs that this kind of regression testing can find. You may find the
following bugs interesting. The original version of this patch was written
in 2010 (before my time at IBM).

Bug #1: In the original 2010 patch, every write operation used the same
"identifier" (a "Work Request ID" in InfiniBand terminology). This is not
typical (although allowed by the hardware); instead, each operation should
carry its own unique identifier so that the write can be tracked properly
as it completes.
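For those not familiar with the verbs API, a minimal sketch of "one unique
Work Request ID per write" with libibverbs looks roughly like this (the
helper name and the counter are made up for illustration; this is not code
from the patch):

#include <infiniband/verbs.h>
#include <stdint.h>

static uint64_t next_wr_id;   /* monotonically increasing identifier */

static int post_one_rdma_write(struct ibv_qp *qp,
                               uint64_t laddr, uint32_t lkey,
                               uint64_t raddr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = laddr,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id               = ++next_wr_id,  /* unique per write */
        .opcode              = IBV_WR_RDMA_WRITE,
        .send_flags          = IBV_SEND_SIGNALED,
        .sg_list             = &sge,
        .num_sge             = 1,
        .wr.rdma.remote_addr = raddr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr = NULL;

    /* The wr_id comes back in the ibv_wc when this write completes,
     * so the completion can be matched to exactly one request. */
    return ibv_post_send(qp, &wr, &bad_wr);
}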

Bug #2: Also in the original 2010 patch, write operations were grouped into
separate "signaled" and "unsignaled" work requests, which is also not
typical (but allowed by the hardware). "Signaling" is InfiniBand terminology
for whether or not the sender is notified when an RDMA operation has
completed. (Note: the receiver is never notified - which is what a DMA is
supposed to be.) In normal operation, per the InfiniBand specifications,
"unsignaled" operations (which tell the hardware *not* to notify the sender
of completion) are supposed to be grouped together with a signaled operation
using the *same* work request identifier. Instead, the original patch was
using *different* work requests for signaled and unsignaled writes, which
means that most of the writes were transmitted without ever being tracked
for completion at all. (Per the InfiniBand specifications, signaled and
unsignaled writes must be grouped together because the hardware does not
deliver the completion notification until *all* of the writes belonging to
the same request have actually completed.)
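Again only as a sketch (the helper and its arguments are hypothetical, not
the patch's data structures), grouping unsignaled writes with a signaled
one under a single identifier in one posted chain looks roughly like this
with libibverbs:

#include <infiniband/verbs.h>
#include <stdint.h>

/* Post 'nwrites' already-prepared RDMA write WRs as one group: all of them
 * share one identifier, only the last one is signaled, so the single
 * completion covers the whole group. */
static int post_rdma_write_group(struct ibv_qp *qp,
                                 struct ibv_send_wr *writes, int nwrites,
                                 uint64_t group_id)
{
    struct ibv_send_wr *bad_wr = NULL;
    int i;

    if (nwrites <= 0) {
        return -1;
    }
    for (i = 0; i < nwrites; i++) {
        writes[i].wr_id      = group_id;   /* same identifier for the group */
        writes[i].send_flags = 0;          /* unsignaled */
        writes[i].next       = (i + 1 < nwrites) ? &writes[i + 1] : NULL;
    }
    /* Only the tail is signaled: its completion is not delivered until all
     * earlier writes in the chain have completed. */
    writes[nwrites - 1].send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, writes, &bad_wr);
}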

Bug #3: Finally, in the original 2010 patch, ordering was not being handled.
Per the InfiniBand specifications, writes can complete completely out of
order. Not only that, but PCI Express itself can reorder the writes as
well. It was only after the first two bugs were fixed that I could actually
manifest this bug *in code*: a very large group of requests would "burst"
from the QEMU migration thread, and not all of the requests would finish.
A short time later the next iteration would start, and the virtual
machine's writable working set was still "hovering" somewhere in the same
vicinity of the address space as the previous burst of writes that had not
yet completed. When this happened, the new writes were much smaller (not
part of a larger "chunk" per our algorithms). Since the new writes were
smaller, they would complete faster than the larger, older writes covering
the same address range, and since completion is out of order, the newer
writes would then get clobbered by the older ones - resulting in an
inconsistent virtual machine. So, to solve this: on each new write, we now
do a "search" to see whether the address of the requested write matches or
overlaps the address range of any of the previous "outstanding" writes
still in transit, and I found several hits. This was easily solved by
blocking until the conflicting write has completed before issuing the new
write to the hardware.
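A rough sketch of that overlap check follows (the OutstandingWrite list and
the busy-poll loop are illustrative assumptions, not the actual QEMU
structures or completion handling):

#include <infiniband/verbs.h>
#include <stdint.h>

struct OutstandingWrite {
    uint64_t wr_id;
    uint64_t addr;
    uint64_t len;
    int      complete;
    struct OutstandingWrite *next;
};

static int ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Before posting a new write to [addr, addr + len), block until every
 * in-flight write that overlaps that range has completed. */
static int wait_for_conflicting_writes(struct ibv_cq *cq,
                                       struct OutstandingWrite *pending,
                                       uint64_t addr, uint64_t len)
{
    struct OutstandingWrite *w, *p;

    for (w = pending; w; w = w->next) {
        while (!w->complete && ranges_overlap(addr, len, w->addr, w->len)) {
            struct ibv_wc wc;
            int n = ibv_poll_cq(cq, 1, &wc);   /* spin until something completes */

            if (n < 0 || (n > 0 && wc.status != IBV_WC_SUCCESS)) {
                return -1;
            }
            if (n > 0) {
                /* Mark the outstanding write this completion belongs to. */
                for (p = pending; p; p = p->next) {
                    if (p->wr_id == wc.wr_id) {
                        p->complete = 1;
                    }
                }
            }
        }
    }
    return 0;
}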

- Michael


On 05/09/2013 06:45 PM, Michael R. Hines wrote:
>
> Some more followup questions below to help me debug before I start 
> digging in.......
>
> On 05/09/2013 06:20 PM, Chegu Vinod wrote:
>
> Setting aside the mlock() freezes for the moment, let's first fix your 
> crashing
> problem on the destination-side. Let's make that a priority before we fix
> the mlock problem.
>
> When the migration "completes", can you provide me with more detailed 
> information
> about the state of QEMU on the destination?
>
> Is it responding?
> What's on the VNC console?
> Is QEMU responding?
> Is the network responding?
> Was the VM idle? Or running an application?
> Can you attach GDB to QEMU after the migration?
>
>
>> /usr/local/bin/qemu-system-x86_64 \
>> -enable-kvm \
>> -cpu host \
>> -name vm1 \
>> -m 131072 -smp 10,sockets=1,cores=10,threads=1 \
>> -mem-path /dev/hugepages \
>
> Can you disable hugepages and re-test?
>
> I'll get back to the other mlock() issues later after we at least 
> first make sure the migration itself is working.....
