qemu-devel.nongnu.org archive mirror
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: pbonzini@redhat.com, jdenemar@redhat.com, qemu-devel@nongnu.org,
	"Jason J. Herne" <jjherne@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] [PATCH] use broadcast on qemu_pause_cond
Date: Tue, 26 Jan 2016 20:07:23 +0000
Message-ID: <20160126200723.GA13904@work-vm>
In-Reply-To: <56A7CBCF.2070004@de.ibm.com>

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 01/25/2016 11:08 AM, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > Jiri saw a hang on pause_all_vcpus called from postcopy_start,
> > where the cpus are all apparently stopped ('stopped' flag set)
> > but pause_all_vcpus is still stuck on a cond_wait on qemu_pause_cond.
> > We suspect this is happening if a qmp_stop is called at about the
> > same time as the postcopy code calls that pause_all_vcpus;
> > although they both should have the main lock held, Paolo spotted
> > the cond_wait unlocks the global lock so perhaps they both
> > could end up waiting at the same time?
> 
> We have been chasing a similar problem with many guests with lots of CPUs, where
> thread 1 sometimes waits like this:
> Thread 1 (Thread 0x3fffa670c00 (LWP 15652)):
> #0  0x000003fffcdf21b2 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x000000008023f8f2 in qemu_cond_wait ()
> #2  0x0000000080060332 in pause_all_vcpus ()
> #3  0x00000000800603e8 in vm_stop ()
> #4  0x00000000800f9b04 in qmp_marshal_input_stop ()
> #5  0x0000000080063154 in handle_qmp_command ()
> #6  0x000000008023b77e in json_message_process_token ()
> #7  0x000000008024ef98 in json_lexer_feed_char ()
> #8  0x000000008024f056 in json_lexer_feed ()
> #9  0x0000000080061756 in monitor_qmp_read ()
> #10 0x00000000800e4966 in tcp_chr_read ()
> #11 0x000003fffcce3fb6 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #12 0x00000000801bd18e in main_loop_wait ()
> #13 0x000000008002e244 in main ()
> (gdb) 
> 
> One thread was still running inside KVM and was not being kicked out into userspace.
> Now, this might actually be the same problem. I was chasing the still-running
> CPU (why it does not exit; I was able to make progress with killall -SIGUSR1
> qemu), but in fact the problem might have been that thread 1 did not get
> notified by the last CPUs (the notify getting lost) and therefore never kicked
> this CPU out.

I think the patch should only have helped if something else was trying to do
a stop at the same time; with only a single waiter, signal and broadcast
should behave identically. In my case it's a race between a 'stop' issued on
the monitor and a 'stop' from migration; where's the second stop in your
case?
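
To illustrate the signal-versus-broadcast point, here is a minimal sketch in
plain pthreads; it is not QEMU code. The names demo, waiter and count_woken
are hypothetical, the two waiters stand in for two concurrent callers blocked
on the pause condition, and the timed wait is there only so that the demo
reports a missed wakeup instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>

/* Hypothetical stand-ins for the global mutex, the pause condition and the
 * "all vcpus stopped" predicate. */
struct demo {
    pthread_mutex_t lock;
    pthread_cond_t pause_cond;
    int all_stopped;   /* predicate the waiters check */
    int waiting;       /* waiters that have entered the wait loop */
    int woken;         /* waiters that left the loop without timing out */
};

static void *waiter(void *opaque)
{
    struct demo *d = opaque;
    struct timespec deadline;
    int rc = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 1;   /* give up after about 1s instead of hanging */

    pthread_mutex_lock(&d->lock);
    d->waiting++;
    while (!d->all_stopped && rc != ETIMEDOUT) {
        /* Like any cond_wait, this drops the lock while blocked. */
        rc = pthread_cond_timedwait(&d->pause_cond, &d->lock, &deadline);
    }
    if (rc != ETIMEDOUT) {
        d->woken++;
    }
    pthread_mutex_unlock(&d->lock);
    return NULL;
}

/* Returns how many of the two waiters were woken by the notification. */
int count_woken(int use_broadcast)
{
    struct demo d = { .all_stopped = 0 };
    pthread_t t[2];
    int i, rc;

    pthread_mutex_init(&d.lock, NULL);
    pthread_cond_init(&d.pause_cond, NULL);
    for (i = 0; i < 2; i++) {
        rc = pthread_create(&t[i], NULL, waiter, &d);
        assert(rc == 0);
        (void)rc;
    }

    /* Wait until both waiters are blocked; leave the loop with lock held. */
    for (;;) {
        pthread_mutex_lock(&d.lock);
        if (d.waiting == 2) {
            break;
        }
        pthread_mutex_unlock(&d.lock);
        sched_yield();
    }

    d.all_stopped = 1;
    if (use_broadcast) {
        pthread_cond_broadcast(&d.pause_cond);
    } else {
        pthread_cond_signal(&d.pause_cond);
    }
    pthread_mutex_unlock(&d.lock);

    for (i = 0; i < 2; i++) {
        pthread_join(t[i], NULL);
    }
    pthread_cond_destroy(&d.pause_cond);
    pthread_mutex_destroy(&d.lock);
    return d.woken;
}
```

Called from a small main(), count_woken(1) should report both waiters woken.
count_woken(0) will typically report only one, since pthread_cond_signal is
only required to wake at least one waiter; the other waiter times out here,
which would correspond to a caller left stuck in cond_wait.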

Dave

> The problem was never reproducible with QEMU 2.3, so maybe the BQL avoided the
> issue?
> We will test if this fixes our problem as well.
> 
> > 
> > Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > Reported-by: Jiri Denemark <jdenemar@redhat.com>
> > ---
> >  cpus.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/cpus.c b/cpus.c
> > index 3efff6b..1e97cc4 100644
> > --- a/cpus.c
> > +++ b/cpus.c
> > @@ -986,7 +986,7 @@ static void qemu_wait_io_event_common(CPUState *cpu)
> >      if (cpu->stop) {
> >          cpu->stop = false;
> >          cpu->stopped = true;
> > -        qemu_cond_signal(&qemu_pause_cond);
> > +        qemu_cond_broadcast(&qemu_pause_cond);
> >      }
> >      flush_queued_work(cpu);
> >      cpu->thread_kicked = false;
> > @@ -1396,7 +1396,7 @@ void cpu_stop_current(void)
> >          current_cpu->stop = false;
> >          current_cpu->stopped = true;
> >          cpu_exit(current_cpu);
> > -        qemu_cond_signal(&qemu_pause_cond);
> > +        qemu_cond_broadcast(&qemu_pause_cond);
> >      }
> >  }
> > 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Thread overview: 4+ messages
2016-01-25 10:08 [Qemu-devel] [PATCH] use broadcast on qemu_pause_cond Dr. David Alan Gilbert (git)
2016-01-25 13:18 ` Paolo Bonzini
2016-01-26 19:41 ` Christian Borntraeger
2016-01-26 20:07   ` Dr. David Alan Gilbert [this message]