* Lockup with --accel tcg,thread=single
@ 2019-09-30 13:15 Doug Gale
2019-09-30 15:37 ` Peter Maydell
0 siblings, 1 reply; 11+ messages in thread
From: Doug Gale @ 2019-09-30 13:15 UTC (permalink / raw)
To: qemu-devel
I found a lockup in single-threaded TCG, with the OVMF BIOS and 16 CPUs.
qemu-system-x86_64 --accel tcg,thread=single -smp cpus=16 -bios
/usr/share/ovmf/OVMF.fd
Using Ubuntu 18.04 LTS with the default GNOME desktop. There is no guest OS
and no hard drive; the OVMF firmware alone locks it up. (I narrowed it
down to this from a much larger repro.)
Let that run for about 10 seconds or so, then attempt to Ctrl-C in the
terminal that launched qemu. You should see the hung-program dialog
(Wait or Force Quit) appear over the QEMU window, and the terminal just
prints ^C without interrupting QEMU. DO NOT click Force Quit or Wait.
Use killall -9 qemu-system-x86_64 to kill it if you must; gdb's kill is
also okay. If you force quit, you will likely freeze the entire GNOME
desktop. This is what kmsg reports:
https://gist.githubusercontent.com/doug65536/810e0471c11008c97edb4aa0d86ddd66/raw/40038a355542f0edb1f14a7c7fb7c2db60da0025/end%2520of%2520syslog%2520after%2520desktop%2520lockup,
this is what syslog reports:
https://gist.githubusercontent.com/doug65536/810e0471c11008c97edb4aa0d86ddd66/raw/40038a355542f0edb1f14a7c7fb7c2db60da0025/the%2520end%2520of%2520dmesg%2520after%2520desktop%2520lockup.
These are probably bugs in GNOME, but be warned that you may lock up your
machine when reproducing this.
Peter Maydell helped me bisect it in IRC.
Works fine at commit 1e8a98b53867f61
Fails at commit 9458a9a1df1a4c7
MTTCG works fine all the way up to master.
Configure command line:
../qemu/configure --target-list=x86_64-softmmu,i386-softmmu
--audio-drv-list=pa --enable-libusb --disable-libssh
--enable-virglrenderer --enable-opengl --enable-debug
The issue also occurs without --enable-debug. I didn't strip the configure
options down, though, so it may not need all of them exactly.
OVMF is from the Ubuntu package manager (package name: ovmf); the exact
version is 0~20180205.c0d9813c-2ubuntu0.1
Stack trace at the time of the lockup:
https://gist.githubusercontent.com/doug65536/810e0471c11008c97edb4aa0d86ddd66/raw/40038a355542f0edb1f14a7c7fb7c2db60da0025/backtrace%2520all
* Re: Lockup with --accel tcg,thread=single
2019-09-30 13:15 Lockup with --accel tcg,thread=single Doug Gale
@ 2019-09-30 15:37 ` Peter Maydell
2019-09-30 16:38 ` Alex Bennée
2019-09-30 17:47 ` Paolo Bonzini
0 siblings, 2 replies; 11+ messages in thread
From: Peter Maydell @ 2019-09-30 15:37 UTC (permalink / raw)
To: Doug Gale; +Cc: Paolo Bonzini, Alex Bennée, qemu-devel
On Mon, 30 Sep 2019 at 14:17, Doug Gale <doug16k@gmail.com> wrote:
>
> I found a lockup in single threaded TCG, with OVMF bios, 16 CPUs.
>
> qemu-system-x86_64 --accel tcg,thread=single -smp cpus=16 -bios
> /usr/share/ovmf/OVMF.fd
>
> Using Ubuntu 18.04 LTS, default gnome desktop. There is no guest OS,
> there is no hard drive, just the OVMF firmware locks it up. (I
> narrowed it down to this from a much larger repro)
> Peter Maydell helped me bisect it in IRC.
> Works fine at commit 1e8a98b53867f61
> Fails at commit 9458a9a1df1a4c7
>
> MTTCG works fine all the way up to master.
Thanks for this bug report. I've reproduced it and think
I have figured out what is going on here.
Commit 9458a9a1df1a4c719e245 is Paolo's commit that fixes the
TCG-vs-dirty-bitmap race by having the thread which is
doing a memory access wait for the vCPU thread to finish
its current TB using a no-op run_on_cpu() operation.
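For reference, the mechanism that commit introduces looks roughly like this
(a paraphrased sketch of the exec.c change, not the exact source):

/* Sketch: before reading the dirty bitmap, make the thread doing the read
 * wait for the vCPU to finish its current TB by scheduling a no-op work
 * item on it and waiting for completion. (The real code also special-cases
 * record/replay mode.) */
static void do_nothing(CPUState *cpu, run_on_cpu_data d)
{
}

static void tcg_log_global_after_sync(MemoryListener *listener)
{
    CPUAddressSpace *cpuas;

    cpuas = container_of(listener, CPUAddressSpace, tcg_as_listener);
    run_on_cpu(cpuas->cpu, do_nothing, RUN_ON_CPU_NULL);
}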
In the case of the hang the thread doing the memory access
is the iothread, like this:
#14 0x000055c150c0a98c in run_on_cpu (cpu=0x55c153801c60,
func=0x55c150bbb542 <do_nothing>, data=...)
at /home/petmay01/linaro/qemu-from-laptop/qemu/cpus.c:1205
#15 0x000055c150bbb58c in tcg_log_global_after_sync
(listener=0x55c1538410c8) at
/home/petmay01/linaro/qemu-from-laptop/qemu/exec.c:2963
#16 0x000055c150c1fe18 in memory_global_after_dirty_log_sync () at
/home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2598
#17 0x000055c150c1e6b8 in memory_region_snapshot_and_clear_dirty
(mr=0x55c1541e82b0, addr=0, size=1920000, client=0)
at /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2106
#18 0x000055c150c76c05 in vga_draw_graphic (s=0x55c1541e82a0, full_update=0)
at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1661
#19 0x000055c150c771c4 in vga_update_display (opaque=0x55c1541e82a0)
at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1785
#20 0x000055c151052a83 in graphic_hw_update (con=0x55c1536dfaa0) at
/home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:268
#21 0x000055c151091490 in gd_refresh (dcl=0x55c1549af090) at
/home/petmay01/linaro/qemu-from-laptop/qemu/ui/gtk.c:484
#22 0x000055c151056571 in dpy_refresh (s=0x55c1542f9d90) at
/home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:1629
#23 0x000055c1510527f0 in gui_update (opaque=0x55c1542f9d90) at
/home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:206
#24 0x000055c1511ee67c in timerlist_run_timers
(timer_list=0x55c15370c280) at
/home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:592
#25 0x000055c1511ee726 in qemu_clock_run_timers
(type=QEMU_CLOCK_REALTIME) at
/home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:606
#26 0x000055c1511ee9e5 in qemu_clock_run_all_timers () at
/home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:692
#27 0x000055c1511ef181 in main_loop_wait (nonblocking=0) at
/home/petmay01/linaro/qemu-from-laptop/qemu/util/main-loop.c:524
#28 0x000055c150db03fe in main_loop () at
/home/petmay01/linaro/qemu-from-laptop/qemu/vl.c:1806
run_on_cpu() adds the do_nothing function to a cpu queued-work list.
However, the single-threaded TCG runloop qemu_tcg_rr_cpu_thread_fn()
has the basic structure:
while (1) {
    for (each vCPU) {
        run this vCPU for a timeslice;
    }
    qemu_tcg_rr_wait_io_event();
}
and it processes cpu work queues only in qemu_tcg_rr_wait_io_event().
This is a problem, because the thing which causes us to stop running
one vCPU when its timeslice ends and move on to the next is the
tcg_kick_vcpu_timer -- and the iothread will never process that timer
and kick the vcpu because it is currently blocked in run_on_cpu() !
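To make the blocking concrete, this is roughly where the iothread is stuck,
an abridged sketch of do_run_on_cpu() in cpus-common.c (the work-item setup
and the "is this my own CPU" fast path are elided):

void do_run_on_cpu(CPUState *cpu, run_on_cpu_func func, run_on_cpu_data data,
                   QemuMutex *mutex)
{
    struct qemu_work_item wi;

    /* ... wi is filled in ... */
    queue_work_on_cpu(cpu, &wi);

    /* The caller sleeps on qemu_work_cond (dropping 'mutex' -- the BQL in
     * this case) until the target vCPU runs the item and flags it done,
     * which the single-threaded RR loop only does from
     * qemu_tcg_rr_wait_io_event(). */
    while (!atomic_mb_read(&wi.done)) {
        CPUState *self_cpu = current_cpu;

        qemu_cond_wait(&qemu_work_cond, mutex);
        current_cpu = self_cpu;
    }
}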
Not sure currently what the best fix is here.
Side note -- this use of run_on_cpu() means that we now drop
the iothread lock within memory_region_snapshot_and_clear_dirty(),
which we didn't before. This means that a vCPU thread can now
get in and execute an access to the device registers of
hw/display/vga.c, updating its state in a way I suspect that the
device model code is not expecting... So maybe the right answer
here should be to come up with a fix for the race that 9458a9a1
addresses that doesn't require us to drop the iothread lock in
memory_region_snapshot_and_clear_dirty() ? Alternatively we need
to audit the callers and flag in the documentation that this
function has the unexpected side effect of briefly dropping the
iothread lock.
thanks
-- PMM
* Re: Lockup with --accel tcg,thread=single
2019-09-30 15:37 ` Peter Maydell
@ 2019-09-30 16:38 ` Alex Bennée
2019-09-30 17:47 ` Paolo Bonzini
1 sibling, 0 replies; 11+ messages in thread
From: Alex Bennée @ 2019-09-30 16:38 UTC (permalink / raw)
To: Peter Maydell; +Cc: Paolo Bonzini, Doug Gale, qemu-devel
Peter Maydell <peter.maydell@linaro.org> writes:
> On Mon, 30 Sep 2019 at 14:17, Doug Gale <doug16k@gmail.com> wrote:
>>
>> I found a lockup in single threaded TCG, with OVMF bios, 16 CPUs.
>>
>> qemu-system-x86_64 --accel tcg,thread=single -smp cpus=16 -bios
>> /usr/share/ovmf/OVMF.fd
>>
>> Using Ubuntu 18.04 LTS, default gnome desktop. There is no guest OS,
>> there is no hard drive, just the OVMF firmware locks it up. (I
>> narrowed it down to this from a much larger repro)
>
>> Peter Maydell helped me bisect it in IRC.
>> Works fine at commit 1e8a98b53867f61
>> Fails at commit 9458a9a1df1a4c7
>>
>> MTTCG works fine all the way up to master.
>
> Thanks for this bug report. I've reproduced it and think
> I have figured out what is going on here.
>
> Commit 9458a9a1df1a4c719e245 is Paolo's commit that fixes the
> TCG-vs-dirty-bitmap race by having the thread which is
> doing a memory access wait for the vCPU thread to finish
> its current TB using a no-op run_on_cpu() operation.
>
> In the case of the hang the thread doing the memory access
> is the iothread, like this:
>
> #14 0x000055c150c0a98c in run_on_cpu (cpu=0x55c153801c60,
> func=0x55c150bbb542 <do_nothing>, data=...)
> at /home/petmay01/linaro/qemu-from-laptop/qemu/cpus.c:1205
> #15 0x000055c150bbb58c in tcg_log_global_after_sync
> (listener=0x55c1538410c8) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/exec.c:2963
> #16 0x000055c150c1fe18 in memory_global_after_dirty_log_sync () at
> /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2598
> #17 0x000055c150c1e6b8 in memory_region_snapshot_and_clear_dirty
> (mr=0x55c1541e82b0, addr=0, size=1920000, client=0)
> at /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2106
> #18 0x000055c150c76c05 in vga_draw_graphic (s=0x55c1541e82a0, full_update=0)
> at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1661
> #19 0x000055c150c771c4 in vga_update_display (opaque=0x55c1541e82a0)
> at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1785
> #20 0x000055c151052a83 in graphic_hw_update (con=0x55c1536dfaa0) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:268
> #21 0x000055c151091490 in gd_refresh (dcl=0x55c1549af090) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/ui/gtk.c:484
> #22 0x000055c151056571 in dpy_refresh (s=0x55c1542f9d90) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:1629
> #23 0x000055c1510527f0 in gui_update (opaque=0x55c1542f9d90) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:206
> #24 0x000055c1511ee67c in timerlist_run_timers
> (timer_list=0x55c15370c280) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:592
> #25 0x000055c1511ee726 in qemu_clock_run_timers
> (type=QEMU_CLOCK_REALTIME) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:606
> #26 0x000055c1511ee9e5 in qemu_clock_run_all_timers () at
> /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:692
> #27 0x000055c1511ef181 in main_loop_wait (nonblocking=0) at
> /home/petmay01/linaro/qemu-from-laptop/qemu/util/main-loop.c:524
> #28 0x000055c150db03fe in main_loop () at
> /home/petmay01/linaro/qemu-from-laptop/qemu/vl.c:1806
>
> run_on_cpu() adds the do_nothing function to a cpu queued-work list.
>
> However, the single-threaded TCG runloop qemu_tcg_rr_cpu_thread_fn()
> has the basic structure:
>
> while (1) {
> for (each vCPU) {
> run this vCPU for a timeslice;
> }
> qemu_tcg_rr_wait_io_event();
> }
>
> and it processes cpu work queues only in qemu_tcg_rr_wait_io_event().
> This is a problem, because the thing which causes us to stop running
> one vCPU when its timeslice ends and move on to the next is the
> tcg_kick_vcpu_timer -- and the iothread will never process that timer
> and kick the vcpu because it is currently blocked in run_on_cpu() !
>
> Not sure currently what the best fix is here.
We seem to be repeating ourselves because:
a8efa60633575a2ee4dbf807a71cb44d44b0e0f8
Author: Paolo Bonzini <pbonzini@redhat.com>
AuthorDate: Wed Nov 14 12:36:57 2018 +0100
cpus: run work items for all vCPUs if single-threaded
However, looking at the commit I can see we still have the problem of not
advancing to the next vCPU if the kick timer (or some other event)
doesn't bring execution to an exit. I suspect you could hit this with
Linux too, but it is probably busy enough that vCPUs are always exiting
for one reason or another.
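(For reference, a paraphrased sketch of what that commit does: whenever the
RR thread does reach its wait path, it drains the work queue of every vCPU,
not just the current one.)

static void qemu_tcg_rr_wait_io_event(void)
{
    CPUState *cpu;

    while (all_cpu_threads_idle()) {
        stop_tcg_kick_timer();
        qemu_cond_wait(first_cpu->halt_cond, &qemu_global_mutex);
    }

    start_tcg_kick_timer();

    CPU_FOREACH(cpu) {
        qemu_wait_io_event_common(cpu);  /* runs process_queued_cpu_work() */
    }
}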
So options I can think of so far are:
1. Kick 'em all when not inter-vCPU
Something like this untested patch...
--8<---------------cut here---------------start------------->8---
1 file changed, 17 insertions(+), 5 deletions(-)
cpus.c | 22 +++++++++++++++++-----
modified cpus.c
@@ -949,8 +949,8 @@ static inline int64_t qemu_tcg_next_kick(void)
     return qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + TCG_KICK_PERIOD;
 }
 
-/* Kick the currently round-robin scheduled vCPU */
-static void qemu_cpu_kick_rr_cpu(void)
+/* Kick the currently round-robin scheduled vCPU to next */
+static void qemu_cpu_kick_rr_next_cpu(void)
 {
     CPUState *cpu;
     do {
@@ -961,6 +961,16 @@ static void qemu_cpu_kick_rr_cpu(void)
     } while (cpu != atomic_mb_read(&tcg_current_rr_cpu));
 }
 
+/* Kick all RR vCPUs */
+static void qemu_cpu_kick_rr_cpus(void)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        cpu_exit(cpu);
+    };
+}
+
 static void do_nothing(CPUState *cpu, run_on_cpu_data unused)
 {
 }
@@ -993,7 +1003,7 @@ void qemu_timer_notify_cb(void *opaque, QEMUClockType type)
 static void kick_tcg_thread(void *opaque)
 {
     timer_mod(tcg_kick_vcpu_timer, qemu_tcg_next_kick());
-    qemu_cpu_kick_rr_cpu();
+    qemu_cpu_kick_rr_next_cpu();
 }
 
 static void start_tcg_kick_timer(void)
@@ -1828,9 +1838,11 @@ void qemu_cpu_kick(CPUState *cpu)
 {
     qemu_cond_broadcast(cpu->halt_cond);
     if (tcg_enabled()) {
+        if (qemu_tcg_mttcg_enabled()) {
         cpu_exit(cpu);
-        /* NOP unless doing single-thread RR */
-        qemu_cpu_kick_rr_cpu();
+        } else {
+            qemu_cpu_kick_rr_cpus();
+        }
     } else {
         if (hax_enabled()) {
             /*
--8<---------------cut here---------------end--------------->8---
2. Add handling to do_run_on_cpu to kick all vCPUs when current_cpu == NULL
Which would basically kick all vCPUs when events come from outside the
emulation (rough sketch after option 3 below).
3. Figure out multi-threaded icount and record/replay and drop the
special RR case.
This might take a while.
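For option 2, a rough and untested sketch, as a fragment of do_run_on_cpu()
(exactly where the check would live, and how that code would know about the
accel mode, is hand-waved here):

    queue_work_on_cpu(cpu, &wi);

    /* Untested sketch: if the work was queued by something other than a
     * vCPU thread, kick every vCPU so the single-threaded RR loop is
     * guaranteed to leave its TB-execution loop and process queued work. */
    if (!current_cpu) {
        CPUState *it;

        CPU_FOREACH(it) {
            qemu_cpu_kick(it);
        }
    }

    while (!atomic_mb_read(&wi.done)) {
        ...
    }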
> Side note -- this use of run_on_cpu() means that we now drop
> the iothread lock within memory_region_snapshot_and_clear_dirty(),
> which we didn't before. This means that a vCPU thread can now
> get in and execute an access to the device registers of
> hw/display/vga.c, updating its state in a way I suspect that the
> device model code is not expecting... So maybe the right answer
> here should be to come up with a fix for the race that 9458a9a1
> addresses that doesn't require us to drop the iothread lock in
> memory_region_snapshot_and_clear_dirty() ? Alternatively we need
> to audit the callers and flag in the documentation that this
> function has the unexpected side effect of briefly dropping the
> iothread lock.
There was a series Emilio posted to get rid of more of the BQL in favour of
per-CPU locks, which IIRC also stopped some of the bouncing we do in
the *_on_cpu functions. Every time we have to do a lock shuffle to get
things moving is a bit of a red flag.
>
> thanks
> -- PMM
--
Alex Bennée
* Re: Lockup with --accel tcg,thread=single
2019-09-30 15:37 ` Peter Maydell
2019-09-30 16:38 ` Alex Bennée
@ 2019-09-30 17:47 ` Paolo Bonzini
2019-09-30 19:20 ` Alex Bennée
2019-10-01 12:10 ` Peter Maydell
1 sibling, 2 replies; 11+ messages in thread
From: Paolo Bonzini @ 2019-09-30 17:47 UTC (permalink / raw)
To: Peter Maydell, Doug Gale; +Cc: Alex Bennée, qemu-devel
On 30/09/19 17:37, Peter Maydell wrote:
> Not sure currently what the best fix is here.
Since thread=single uses just one queued work list, could it be as simple as
diff --git a/cpus.c b/cpus.c
index d2c61ff..314f9aa 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1539,7 +1539,7 @@ static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
             cpu = first_cpu;
         }
 
-        while (cpu && !cpu->queued_work_first && !cpu->exit_request) {
+        while (cpu && !first_cpu->queued_work_first && !cpu->exit_request) {
 
             atomic_mb_set(&tcg_current_rr_cpu, cpu);
             current_cpu = cpu;
or something like that?
> Side note -- this use of run_on_cpu() means that we now drop
> the iothread lock within memory_region_snapshot_and_clear_dirty(),
> which we didn't before. This means that a vCPU thread can now
> get in and execute an access to the device registers of
> hw/display/vga.c, updating its state in a way I suspect that the
> device model code is not expecting... So maybe the right answer
> here should be to come up with a fix for the race that 9458a9a1
> addresses that doesn't require us to drop the iothread lock in
> memory_region_snapshot_and_clear_dirty() ? Alternatively we need
> to audit the callers and flag in the documentation that this
> function has the unexpected side effect of briefly dropping the
> iothread lock.
Yes, this is intended. There shouldn't be side effects other than
possibly a one-frame flash for all callers.
Paolo
* Re: Lockup with --accel tcg,thread=single
2019-09-30 17:47 ` Paolo Bonzini
@ 2019-09-30 19:20 ` Alex Bennée
2019-09-30 20:40 ` Paolo Bonzini
2019-10-01 12:10 ` Peter Maydell
1 sibling, 1 reply; 11+ messages in thread
From: Alex Bennée @ 2019-09-30 19:20 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Peter Maydell, Doug Gale, qemu-devel
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 30/09/19 17:37, Peter Maydell wrote:
>> Not sure currently what the best fix is here.
>
> Since thread=single uses just one queued work list, could it be as
> simple as
Does it? I thought this was the case but:
static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
{
    qemu_mutex_lock(&cpu->work_mutex);
    if (cpu->queued_work_first == NULL) {
        cpu->queued_work_first = wi;
    } else {
        cpu->queued_work_last->next = wi;
    }
    cpu->queued_work_last = wi;
    wi->next = NULL;
    wi->done = false;
    qemu_mutex_unlock(&cpu->work_mutex);

    qemu_cpu_kick(cpu);
}
Does seem to imply the vCPU CPUState is where the queue is. That's not
to say there shouldn't be a single work queue for thread=single.
>
> diff --git a/cpus.c b/cpus.c
> index d2c61ff..314f9aa 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1539,7 +1539,7 @@ static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
> cpu = first_cpu;
> }
>
> - while (cpu && !cpu->queued_work_first && !cpu->exit_request) {
> + while (cpu && !first_cpu->queued_work_first && !cpu->exit_request) {
>
> atomic_mb_set(&tcg_current_rr_cpu, cpu);
> current_cpu = cpu;
>
> or something like that?
>
>> Side note -- this use of run_on_cpu() means that we now drop
>> the iothread lock within memory_region_snapshot_and_clear_dirty(),
>> which we didn't before. This means that a vCPU thread can now
>> get in and execute an access to the device registers of
>> hw/display/vga.c, updating its state in a way I suspect that the
>> device model code is not expecting... So maybe the right answer
>> here should be to come up with a fix for the race that 9458a9a1
>> addresses that doesn't require us to drop the iothread lock in
>> memory_region_snapshot_and_clear_dirty() ? Alternatively we need
>> to audit the callers and flag in the documentation that this
>> function has the unexpected side effect of briefly dropping the
>> iothread lock.
>
> Yes, this is intended. There shouldn't be side effects other than
> possibly a one-frame flash for all callers.
>
> Paolo
--
Alex Bennée
* Re: Lockup with --accel tcg,thread=single
2019-09-30 19:20 ` Alex Bennée
@ 2019-09-30 20:40 ` Paolo Bonzini
2019-10-01 8:42 ` Alex Bennée
0 siblings, 1 reply; 11+ messages in thread
From: Paolo Bonzini @ 2019-09-30 20:40 UTC (permalink / raw)
To: Alex Bennée; +Cc: Peter Maydell, Doug Gale, qemu-devel
On 30/09/19 21:20, Alex Bennée wrote:
> Does seem to imply the vCPU CPUState is where the queue is. That's not
> to say there shouldn't be a single work queue for thread=single.
Indeed it doesn't. I confused this with commit a8efa60633 ("cpus: run
work items for all vCPUs if single-threaded", 2018-11-27).
Are you going to make a patch to have a single work queue, or should I?
Paolo
* Re: Lockup with --accel tcg,thread=single
2019-09-30 20:40 ` Paolo Bonzini
@ 2019-10-01 8:42 ` Alex Bennée
2019-10-01 9:16 ` Paolo Bonzini
0 siblings, 1 reply; 11+ messages in thread
From: Alex Bennée @ 2019-10-01 8:42 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Peter Maydell, Doug Gale, qemu-devel
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 30/09/19 21:20, Alex Bennée wrote:
>> Does seem to imply the vCPU CPUState is where the queue is. That's not
>> to say there shouldn't be a single work queue for thread=single.
>
> Indeed it doesn't. I confused this with commit a8efa60633 ("cpus: run
> work items for all vCPUs if single-threaded", 2018-11-27).
>
> Are you going to make a patch to have a single work queue, or should
> I?
What's the neatest way to do it? Are you thinking we just special-case
queue_work_on_cpu to use first_cpu when MTTCG is not enabled?
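Something like this, perhaps (untested sketch):

static void queue_work_on_cpu(CPUState *cpu, struct qemu_work_item *wi)
{
    /* Untested sketch: with single-threaded TCG one thread runs all vCPUs,
     * so funnel every work item onto first_cpu's queue and only that queue
     * needs checking in the RR loop. */
    if (tcg_enabled() && !qemu_tcg_mttcg_enabled()) {
        cpu = first_cpu;
    }

    /* ... rest unchanged: take cpu->work_mutex, append wi, kick cpu ... */
}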
--
Alex Bennée
* Re: Lockup with --accel tcg,thread=single
2019-10-01 8:42 ` Alex Bennée
@ 2019-10-01 9:16 ` Paolo Bonzini
2019-10-01 15:28 ` Alex Bennée
0 siblings, 1 reply; 11+ messages in thread
From: Paolo Bonzini @ 2019-10-01 9:16 UTC (permalink / raw)
To: Alex Bennée; +Cc: Peter Maydell, Doug Gale, qemu-devel
On 01/10/19 10:42, Alex Bennée wrote:
>
> Paolo Bonzini <pbonzini@redhat.com> writes:
>
>> On 30/09/19 21:20, Alex Bennée wrote:
>>> Does seem to imply the vCPU CPUState is where the queue is. That's not
>>> to say there shouldn't be a single work queue for thread=single.
>>
>> Indeed it doesn't. I confused this with commit a8efa60633 ("cpus: run
>> work items for all vCPUs if single-threaded", 2018-11-27).
>>
>> Are you going to make a patch to have a single work queue, or should
>> I?
>
> What's the neatest way to do it? Are you thinking just special case
> queue_work_on_cpu to special case first_cpu when mttcg is not enabled?
Yes, I cannot think of anything better.
Paolo
* Re: Lockup with --accel tcg,thread=single
2019-10-01 9:16 ` Paolo Bonzini
@ 2019-10-01 15:28 ` Alex Bennée
0 siblings, 0 replies; 11+ messages in thread
From: Alex Bennée @ 2019-10-01 15:28 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Peter Maydell, Doug Gale, qemu-devel
Paolo Bonzini <pbonzini@redhat.com> writes:
> On 01/10/19 10:42, Alex Bennée wrote:
>>
>> Paolo Bonzini <pbonzini@redhat.com> writes:
>>
>>> On 30/09/19 21:20, Alex Bennée wrote:
>>>> Does seem to imply the vCPU CPUState is where the queue is. That's not
>>>> to say there shouldn't be a single work queue for thread=single.
>>>
>>> Indeed it doesn't. I confused this with commit a8efa60633 ("cpus: run
>>> work items for all vCPUs if single-threaded", 2018-11-27).
>>>
>>> Are you going to make a patch to have a single work queue, or should
>>> I?
>>
>> What's the neatest way to do it? Are you thinking just special case
>> queue_work_on_cpu to special case first_cpu when mttcg is not enabled?
>
> Yes, I cannot think of anything better.
And I am immediately stymied by the fact that cpus-common is a common
blob so can't have differentiation for SOFTMMU cases in it.
Did you have a look at:
Date: Mon, 30 Sep 2019 17:38:15 +0100
Message-ID: <87h84tloy0.fsf@linaro.org>
Is that too ugly?
>
> Paolo
--
Alex Bennée
* Re: Lockup with --accel tcg,thread=single
2019-09-30 17:47 ` Paolo Bonzini
2019-09-30 19:20 ` Alex Bennée
@ 2019-10-01 12:10 ` Peter Maydell
1 sibling, 0 replies; 11+ messages in thread
From: Peter Maydell @ 2019-10-01 12:10 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Doug Gale, Alex Bennée, qemu-devel
On Mon, 30 Sep 2019 at 18:47, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 30/09/19 17:37, Peter Maydell wrote:
> > Side note -- this use of run_on_cpu() means that we now drop
> > the iothread lock within memory_region_snapshot_and_clear_dirty(),
> > which we didn't before. This means that a vCPU thread can now
> > get in and execute an access to the device registers of
> > hw/display/vga.c, updating its state in a way I suspect that the
> > device model code is not expecting... So maybe the right answer
> > here should be to come up with a fix for the race that 9458a9a1
> > addresses that doesn't require us to drop the iothread lock in
> > memory_region_snapshot_and_clear_dirty() ? Alternatively we need
> > to audit the callers and flag in the documentation that this
> > function has the unexpected side effect of briefly dropping the
> > iothread lock.
>
> Yes, this is intended. There shouldn't be side effects other than
> possibly a one-frame flash for all callers.
This seems quite tricky to audit to me -- for instance is the code
in vga_draw_graphic() really designed for and expecting to cope with
races where information it reads from s->foo after the call to
memory_region_snapshot_and_clear_dirty() might have been updated
by the guest and not be consistent with the information it read
from s->bar before the call and cached in a local variable?
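Concretely, something like this (a hypothetical illustration, not actual
vga.c code):

    /* Hypothetical illustration: a value cached before the call may no
     * longer match state re-read after it, because the BQL is briefly
     * dropped inside the call and a vCPU can write the VGA registers in
     * the meantime. */
    int bpp = s->get_bpp(s);               /* read before the call */

    snap = memory_region_snapshot_and_clear_dirty(&s->vram, addr, size,
                                                  DIRTY_MEMORY_VGA);

    /* BQL dropped and re-taken above: line_offset may no longer be
     * consistent with the bpp cached above. */
    int stride = s->line_offset;           /* read after the call */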
I think most people write device code assuming that while execution
is within a function in the device model the device won't have to
deal with possible reentrancy where another thread comes in and
alters the device state underfoot.
thanks
-- PMM