All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] Long QEMU main loop pauses during migration (to file) under heavy load
@ 2011-11-11 13:03 Daniel P. Berrange
  2011-11-11 13:13 ` Daniel P. Berrange
  2011-11-11 13:31 ` Paolo Bonzini
  0 siblings, 2 replies; 3+ messages in thread
From: Daniel P. Berrange @ 2011-11-11 13:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: libvir-list

Libvirt recently introduced a change to the way it does 'save to file'
with QEMU. Historically QEMU has a 32MB/s I/O limit on migration by
default. When saving to file, we didn't want any artificial limit,
but rather to max out the underlying storage. So when doing save to
file, we set a large bandwidth limit (INT64_MAX / (1024 * 1024)) so
it is effectively unlimited.

After doing this, we discovered that the QEMU monitor was becoming
entirely blocked. It did not even return from the 'migrate' command
until migration was complete despite the 'detach' flag being set.
This was a bug in libvirt, because we passed a plain file descriptor
which does not support EAGAIN. Thank you POSIX.

Libvirt has another mode where it uses an I/O helper command so get
O_DIRECT, and in this mode we pass a pipe() FD to QEMU. After ensuring
that this pipe FD really does have O_NONBLOCK set, we still saw some
odd behaviour.

I'm not sure whether what I describe can neccessarily be called a QEMU
bug, but I wanted to raise it for discussion anyway....

The sequence of steps is

  - libvirt sets qemu migration bandwidth to "unlimited"
  - libvirt opens a pipe() and sets O_NONBLOCK on the write end
  - libvirt spawns  libvirt-iohelper giving it the target file
    on disk, and the read end of the pipe
  - libvirt does 'getfd migfile' monitor command to give QEMU
    the write end of the pipe
  - libvirt does 'migrate fd:migfile -d' to run migration
  - In parallel
       - QEMU is writing to the pipe (which is non-blocking)
       - libvirt_helper reading the pipe & writing to disk with O_DIRECT

The initial 'migrate' command detaches into the background OK, and
libvirt can enter its loop doing 'query-migrate' frequently to
monitor progress. Initially this works fine, but at some points
during the migration, QEMU will get "stuck" for a very long time
and not respond to the monitor (or indeed the mainloop at all).
These blackouts are anywhere from 10 to 20 seconds long.

Using a combination of systemtap, gdb and strace I managed to determine
out the following

 - Most of the qemu_savevm_state_iterate() calls complete in 10-20 ms

 - Reasonably often a qemu_savevm_state_iterate() call takes 300-400 ms

 - Fairly rarely a qemu_savevm_state_iterate() call takes 10-20 *seconds*

 - I can see EAGAIN from the FD QEMU is migrating from - hence most
   of the iterations are quite short.

 - In the 10-20 second long calls, no EAGAIN is seen for the entire
   period.

 - The host OS in general is fairly "laggy", presumably due to the high
   rate of direct I/O being performed by the I/O helper, and bad schedular
   tuning


IIUC, there are two things which will cause a qemu_savevm_state_iterate()
call to return

   - Getting EAGAIN on the migration FD
   - Hitting the max bandwidth limit

We have set effectively unlimited bandwidth, so everything relies on
the EAGAIN behaviour.

If the OS is doing a good job at scheduling processes & I/O, then this
seems to work reasonably well. If the OS is highly loaded and becoming
less responsive to scheduling apps, then QEMU gets itself into a bad
way.

What I think is happening is that the OS is giving too much time to
the I/O helper process that is reading the other end of the pipe
given to QEMU, and then doing the O_DIRECT to disk.

Thus in the shorter-than-normal windows of time when QEMU itself is
scheduled by the OS, the pipe is fairly empty, so QEMU does not see
EAGAIN for a very long period of wallclock time.

So we get into a case where QEMU sees 10-20 second gaps betweeen
iterations of the main loop. Having a non-inifinite max-bandwidth
for the migration, would likely mitigate this to some extent, but
I still think it'd be possible to get QEMU into these pathelogical
conditions under high load for a host.

Is this a scenario we need to worry about for QEMU ? On the one
hand it seems like it is a rare edge case in OS behaviour overall.
On the other hand, times when a host is highly loaded and non-responsive
are exactly the times when a mgmt app might want to save a guest to
disk, or migrate it elsewhere.  Which would mean we need QEMU to
behave as well as possible in these adverse conditions.

Thus should we consider having an absolute bound on the execution time
of qemu_savevm_state_iterate(), independant of EAGAIN & bandwidth
limits, to ensure the main loop doesn't get starved ?

Or perhaps moving migration to a separate thread, out of the mainloop
is what we need to strive for ?

Regards,
Daniel
--  
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Qemu-devel] Long QEMU main loop pauses during migration (to file) under heavy load
  2011-11-11 13:03 [Qemu-devel] Long QEMU main loop pauses during migration (to file) under heavy load Daniel P. Berrange
@ 2011-11-11 13:13 ` Daniel P. Berrange
  2011-11-11 13:31 ` Paolo Bonzini
  1 sibling, 0 replies; 3+ messages in thread
From: Daniel P. Berrange @ 2011-11-11 13:13 UTC (permalink / raw)
  To: qemu-devel; +Cc: libvir-list

[-- Attachment #1: Type: text/plain, Size: 5040 bytes --]

On Fri, Nov 11, 2011 at 01:03:20PM +0000, Daniel P. Berrange wrote:
> Libvirt recently introduced a change to the way it does 'save to file'
> with QEMU. Historically QEMU has a 32MB/s I/O limit on migration by
> default. When saving to file, we didn't want any artificial limit,
> but rather to max out the underlying storage. So when doing save to
> file, we set a large bandwidth limit (INT64_MAX / (1024 * 1024)) so
> it is effectively unlimited.
> 
> After doing this, we discovered that the QEMU monitor was becoming
> entirely blocked. It did not even return from the 'migrate' command
> until migration was complete despite the 'detach' flag being set.
> This was a bug in libvirt, because we passed a plain file descriptor
> which does not support EAGAIN. Thank you POSIX.
> 
> Libvirt has another mode where it uses an I/O helper command so get
> O_DIRECT, and in this mode we pass a pipe() FD to QEMU. After ensuring
> that this pipe FD really does have O_NONBLOCK set, we still saw some
> odd behaviour.
> 
> I'm not sure whether what I describe can neccessarily be called a QEMU
> bug, but I wanted to raise it for discussion anyway....
> 
> The sequence of steps is
> 
>   - libvirt sets qemu migration bandwidth to "unlimited"
>   - libvirt opens a pipe() and sets O_NONBLOCK on the write end
>   - libvirt spawns  libvirt-iohelper giving it the target file
>     on disk, and the read end of the pipe
>   - libvirt does 'getfd migfile' monitor command to give QEMU
>     the write end of the pipe
>   - libvirt does 'migrate fd:migfile -d' to run migration
>   - In parallel
>        - QEMU is writing to the pipe (which is non-blocking)
>        - libvirt_helper reading the pipe & writing to disk with O_DIRECT


I should have mentioned that the way I'm testing this is with
libvirt 0.9.7, with both QEMU 0.14 and QEMU GIT master, using
a guest with 2 GB of RAM:


    $ virsh start l3
    Domain l3 started

    $ virsh dominfo l3
    Id:             17
    Name:           l3
    UUID:           c7a3edbd-edaf-9455-926a-d65c16db1803
    OS Type:        hvm
    State:          running
    CPU(s):         1
    CPU time:       1.1s
    Max memory:     2292000 kB
    Used memory:    2292736 kB
    Persistent:     yes
    Autostart:      disable
    Managed save:   no
    Security model: selinux
    Security DOI:   0
    Security label: system_u:system_r:unconfined_t:s0:c94,c700 (permissive)

To actually perform the save-to-file, I use the '--bypass-cache' flag
for libvirt, which ensures we pass a pipe to QEMU and run our I/O
helper for O_DIRECT, instead of directly giving QEMU a plain file

     $ virsh save --bypass-cache l3 l3.image 
     Domain l3 saved to l3.image

>  - Most of the qemu_savevm_state_iterate() calls complete in 10-20 ms
> 
>  - Reasonably often a qemu_savevm_state_iterate() call takes 300-400 ms
> 
>  - Fairly rarely a qemu_savevm_state_iterate() call takes 10-20 *seconds*

I use the attached systemtap script for determining these

eg run this before starting the migration to disk:

# stap qemu-mig.stp
Begin
  0.000 Start
  5.198 > Begin
  5.220 < Begin   0.022
  5.220 > Iterate
  5.224 < Iterate   0.004

  ...snip..

  6.299 > Iterate
  6.314 < Iterate   0.015
  6.314 > Iterate
  6.319 < Iterate   0.005
  6.409 > Iterate
  8.139 < Iterate   1.730         <<< very slow iteration
  8.152 > Iterate
 13.078 < Iterate   4.926         <<< very slow iteration
 13.963 > Iterate
 14.248 < Iterate   0.285
 14.441 > Iterate
 14.448 < Iterate   0.007

 ...snip...

 24.171 > Iterate
 24.178 < Iterate   0.007
 24.178 > Complete
 24.588 < Complete   0.410

 <Ctrl-C>

avg 79 = sum 8033 / count 101; min 3 max 4926
value |-------------------------------------------------- count
    0 |                                                    0
    1 |                                                    0
    2 |                                                    1
    4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              74
    8 |@@@@@@@@@                                          19
   16 |@                                                   3
   32 |                                                    0
   64 |                                                    0
  128 |                                                    0
  256 |@                                                   2
  512 |                                                    0
 1024 |                                                    1
 2048 |                                                    0
 4096 |                                                    1
 8192 |                                                    0
16384 |                                                    0

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

[-- Attachment #2: qemu-mig.stp --]
[-- Type: text/plain, Size: 2023 bytes --]



global then;
global deltas;

global start;

function print_ts(str) {
  now = gettimeofday_ns() / (1000*1000)

  delta = (now - start)

  printf("%3d.%03d %s\n",
         (delta / 1000), (delta % 1000), str);
}


probe begin {
   printf("Begin\n");
   then = 0;
  start = gettimeofday_ns() / (1000*1000);
  print_ts("Start");

}


probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_begin") {
   then = gettimeofday_ns() / (1000*1000);
   print_ts("> Begin");   
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_begin").return {
   now = gettimeofday_ns() / (1000*1000);
   if (then != 0) {
     delta = now - then;
     deltas <<< delta;
     print_ts(sprintf("< Begin %3d.%03d",
         (delta / 1000), (delta % 1000)));
   }
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_iterate") {
   then = gettimeofday_ns() / (1000*1000);
   print_ts("> Iterate");   
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_iterate").return {
   now = gettimeofday_ns() / (1000*1000);
   if (then != 0) {
     delta = now - then;
     deltas <<< delta;
     print_ts(sprintf("< Iterate %3d.%03d",
         (delta / 1000), (delta % 1000)));
   }
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_complete") {
   then = gettimeofday_ns() / (1000*1000);
   print_ts("> Complete");   
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_complete").return {
   now = gettimeofday_ns() / (1000*1000);
   if (then != 0) {
     delta = now - then;
     deltas <<< delta;
     print_ts(sprintf("< Complete %3d.%03d",
         (delta / 1000), (delta % 1000)));
   }
}


probe end {
   printf ("avg %d = sum %d / count %d; min %d max %d\n",
           @avg(deltas), @sum(deltas), @count(deltas), @min(deltas), @max(deltas));
   print (@hist_log(deltas));
}

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Qemu-devel] Long QEMU main loop pauses during migration (to file) under heavy load
  2011-11-11 13:03 [Qemu-devel] Long QEMU main loop pauses during migration (to file) under heavy load Daniel P. Berrange
  2011-11-11 13:13 ` Daniel P. Berrange
@ 2011-11-11 13:31 ` Paolo Bonzini
  1 sibling, 0 replies; 3+ messages in thread
From: Paolo Bonzini @ 2011-11-11 13:31 UTC (permalink / raw)
  To: qemu-devel; +Cc: libvir-list

On 11/11/2011 02:03 PM, Daniel P. Berrange wrote:
> Or perhaps moving migration to a separate thread, out of the mainloop
> is what we need to strive for ?

Yes, this is planned.  This way the I/O does not even need to be 
non-blocking.

Patches have been posted (and I have up-to-date versions) that should 
work except for block migration.

Paolo

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2011-11-11 13:32 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-11 13:03 [Qemu-devel] Long QEMU main loop pauses during migration (to file) under heavy load Daniel P. Berrange
2011-11-11 13:13 ` Daniel P. Berrange
2011-11-11 13:31 ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.