From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:47495)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from )
	id 1aO9tj-0007wA-GZ
	for qemu-devel@nongnu.org; Tue, 26 Jan 2016 15:07:35 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from )
	id 1aO9tg-0002mi-9k
	for qemu-devel@nongnu.org; Tue, 26 Jan 2016 15:07:31 -0500
Received: from mx1.redhat.com ([209.132.183.28]:59364)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from )
	id 1aO9tg-0002mR-2O
	for qemu-devel@nongnu.org; Tue, 26 Jan 2016 15:07:28 -0500
Date: Tue, 26 Jan 2016 20:07:23 +0000
From: "Dr. David Alan Gilbert"
Message-ID: <20160126200723.GA13904@work-vm>
References: <1453716498-27238-1-git-send-email-dgilbert@redhat.com>
	<56A7CBCF.2070004@de.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <56A7CBCF.2070004@de.ibm.com>
Subject: Re: [Qemu-devel] [PATCH] use broadcast on qemu_pause_cond
To: Christian Borntraeger
Cc: pbonzini@redhat.com, jdenemar@redhat.com, qemu-devel@nongnu.org,
	"Jason J. Herne"

* Christian Borntraeger (borntraeger@de.ibm.com) wrote:
> On 01/25/2016 11:08 AM, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert"
> >
> > Jiri saw a hang on pause_all_vcpus called from postcopy_start,
> > where the cpus are all apparently stopped ('stopped' flag set)
> > but pause_all_vcpus is still stuck on a cond_wait on qemu_pause_cond.
> > We suspect this is happening if a qmp_stop is called at about the
> > same time as the postcopy code calls that pause_all_vcpus;
> > although they both should have the main lock held, Paolo spotted
> > the cond_wait unlocks the global lock so perhaps they both
> > could end up waiting at the same time?
> >
> We have been chasing a similar problem, with many guests with lots of cpus, that
> sometimes thread 1 waits like
> Thread 1 (Thread 0x3fffa670c00 (LWP 15652)):
> #0  0x000003fffcdf21b2 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x000000008023f8f2 in qemu_cond_wait ()
> #2  0x0000000080060332 in pause_all_vcpus ()
> #3  0x00000000800603e8 in vm_stop ()
> #4  0x00000000800f9b04 in qmp_marshal_input_stop ()
> #5  0x0000000080063154 in handle_qmp_command ()
> #6  0x000000008023b77e in json_message_process_token ()
> #7  0x000000008024ef98 in json_lexer_feed_char ()
> #8  0x000000008024f056 in json_lexer_feed ()
> #9  0x0000000080061756 in monitor_qmp_read ()
> #10 0x00000000800e4966 in tcp_chr_read ()
> #11 0x000003fffcce3fb6 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #12 0x00000000801bd18e in main_loop_wait ()
> #13 0x000000008002e244 in main ()
> (gdb)
>
> One thread was still running inside KVM, not being kicked out into userspace.
> Now: This might actually be the same problem. I was chasing the still running
> CPU (why it does not exit, and I was able to make progress with killall -SIGUSR1
> qemu), but in fact, the problem might have been that thread 1 did not get
> notified by the LAST CPUs (notify getting lost), therefore, never kicked this
> CPU out.

I think the patch would only have helped if something else was trying to
do a stop at the same time; with only one waiter, the signal and the
broadcast should behave identically. In my case it's a race between a
'stop' issued on the monitor and a 'stop' from migration; where's the
second stop in your case?

Dave

> The problem was never reproducible with qemu 2.3, so maybe the BQL avoided the
> issue?
> We will test if this fixes our problem as well.
>
> >
> > Signed-off-by: Dr. David Alan Gilbert
> > Reported-by: Jiri Denemark
> > ---
> >  cpus.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/cpus.c b/cpus.c
> > index 3efff6b..1e97cc4 100644
> > --- a/cpus.c
> > +++ b/cpus.c
> > @@ -986,7 +986,7 @@ static void qemu_wait_io_event_common(CPUState *cpu)
> >      if (cpu->stop) {
> >          cpu->stop = false;
> >          cpu->stopped = true;
> > -        qemu_cond_signal(&qemu_pause_cond);
> > +        qemu_cond_broadcast(&qemu_pause_cond);
> >      }
> >      flush_queued_work(cpu);
> >      cpu->thread_kicked = false;
> > @@ -1396,7 +1396,7 @@ void cpu_stop_current(void)
> >          current_cpu->stop = false;
> >          current_cpu->stopped = true;
> >          cpu_exit(current_cpu);
> > -        qemu_cond_signal(&qemu_pause_cond);
> > +        qemu_cond_broadcast(&qemu_pause_cond);
> >      }
> >  }
> >
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK