* [Qemu-devel] [PATCH v2] target/ppc: Fix system lockups caused by interrupt_request state corruption
@ 2017-12-01 15:49 Richard Purdie
  2017-12-04  1:00 ` [Qemu-devel] [Qemu-ppc] " David Gibson
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Purdie @ 2017-12-01 15:49 UTC (permalink / raw)
  To: qemu-devel, david; +Cc: qemu-ppc, Richard Purdie

Occasionally, with Linux PowerPC guests running on x86_64 hosts, we're seeing logs like:

ppc_set_irq: 0x55b4e0d562f0 n_IRQ 8 level 1 => pending 00000100req 00000004

when they should read:

ppc_set_irq: 0x55b4e0d562f0 n_IRQ 8 level 1 => pending 00000100req 00000002

The "00000004" is CPU_INTERRUPT_EXITTB yet the code calls
cpu_interrupt(cs, CPU_INTERRUPT_HARD) ("00000002") in this function
just before the log message. Something is causing the HARD bit setting
to get lost.

The knock-on effect of losing that bit is that decrementer timer interrupts
are not delivered, so the guest sits in its idle handler and appears to 'hang'.

The bit is lost in races with code which sets CPU_INTERRUPT_EXITTB using an
unlocked read-modify-write of cs->interrupt_request.

Rather than poking directly into cs->interrupt_request, that code needs to:

a) hold BQL
b) use the cpu_interrupt() helper

This patch fixes the call sites to do this, fixing the hang.

Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
---
 target/ppc/excp_helper.c | 16 +++++++++++++---
 target/ppc/helper_regs.h | 10 ++++++++--
 2 files changed, 21 insertions(+), 5 deletions(-)

v2: Fixes a compile issue with master and ensures BQL is held in one case
where it potentially wasn't.
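
For illustration only, the lost update described above can be replayed
single-threaded. This is a sketch, not QEMU code: the function names are
made up, the two bit values are the ones quoted in the log discussion, and
the interleaving is written out by hand rather than produced by real threads.

```c
#include <stdint.h>

/* Bit values as quoted in the commit message above. */
#define CPU_INTERRUPT_HARD   0x0002u
#define CPU_INTERRUPT_EXITTB 0x0004u

/*
 * Hypothetical replay of one bad interleaving. The vCPU side performs
 * "interrupt_request |= CPU_INTERRUPT_EXITTB" without the lock, which is
 * really a load, an OR and a store. The other side sets HARD in between.
 */
uint32_t racy_interleaving(uint32_t interrupt_request)
{
    uint32_t vcpu_tmp = interrupt_request;      /* vCPU: load old value  */

    interrupt_request |= CPU_INTERRUPT_HARD;    /* other thread: set HARD */

    vcpu_tmp |= CPU_INTERRUPT_EXITTB;           /* vCPU: OR into stale copy */
    interrupt_request = vcpu_tmp;               /* vCPU: store, HARD is gone */

    return interrupt_request;
}

/* The same two updates when they cannot interleave (e.g. both under BQL). */
uint32_t serialized_updates(uint32_t interrupt_request)
{
    interrupt_request |= CPU_INTERRUPT_HARD;
    interrupt_request |= CPU_INTERRUPT_EXITTB;
    return interrupt_request;
}
```

Starting from 0, racy_interleaving() ends with only the EXITTB bit set
(00000004), matching the unexpected "req" value in the log, while
serialized_updates() keeps both bits.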

diff --git a/target/ppc/excp_helper.c b/target/ppc/excp_helper.c
index e6009e7..8040277 100644
--- a/target/ppc/excp_helper.c
+++ b/target/ppc/excp_helper.c
@@ -207,7 +207,13 @@ static inline void powerpc_excp(PowerPCCPU *cpu, int excp_model, int excp)
                         "Entering checkstop state\n");
             }
             cs->halted = 1;
-            cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+            if (!qemu_mutex_iothread_locked()) {
+                qemu_mutex_lock_iothread();
+                cpu_interrupt(cs, CPU_INTERRUPT_EXITTB);
+                qemu_mutex_unlock_iothread();
+            } else {
+                cpu_interrupt(cs, CPU_INTERRUPT_EXITTB);
+            }
         }
         if (env->msr_mask & MSR_HVB) {
             /* ISA specifies HV, but can be delivered to guest with HV clear
@@ -940,7 +946,9 @@ void helper_store_msr(CPUPPCState *env, target_ulong val)
 
     if (excp != 0) {
         CPUState *cs = CPU(ppc_env_get_cpu(env));
-        cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+        qemu_mutex_lock_iothread();
+        cpu_interrupt(cs, CPU_INTERRUPT_EXITTB);
+        qemu_mutex_unlock_iothread();
         raise_exception(env, excp);
     }
 }
@@ -995,7 +1003,9 @@ static inline void do_rfi(CPUPPCState *env, target_ulong nip, target_ulong msr)
     /* No need to raise an exception here,
      * as rfi is always the last insn of a TB
      */
-    cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+    qemu_mutex_lock_iothread();
+    cpu_interrupt(cs, CPU_INTERRUPT_EXITTB);
+    qemu_mutex_unlock_iothread();
 
     /* Reset the reservation */
     env->reserve_addr = -1;
diff --git a/target/ppc/helper_regs.h b/target/ppc/helper_regs.h
index 2627a70..0beaad5 100644
--- a/target/ppc/helper_regs.h
+++ b/target/ppc/helper_regs.h
@@ -20,6 +20,8 @@
 #ifndef HELPER_REGS_H
 #define HELPER_REGS_H
 
+#include "qemu/main-loop.h"
+
 /* Swap temporary saved registers with GPRs */
 static inline void hreg_swap_gpr_tgpr(CPUPPCState *env)
 {
@@ -114,11 +116,15 @@ static inline int hreg_store_msr(CPUPPCState *env, target_ulong value,
     }
     if (((value >> MSR_IR) & 1) != msr_ir ||
         ((value >> MSR_DR) & 1) != msr_dr) {
-        cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+        qemu_mutex_lock_iothread();
+        cpu_interrupt(cs, CPU_INTERRUPT_EXITTB);
+        qemu_mutex_unlock_iothread();
     }
     if ((env->mmu_model & POWERPC_MMU_BOOKE) &&
         ((value >> MSR_GS) & 1) != msr_gs) {
-        cs->interrupt_request |= CPU_INTERRUPT_EXITTB;
+        qemu_mutex_lock_iothread();
+        cpu_interrupt(cs, CPU_INTERRUPT_EXITTB);
+        qemu_mutex_unlock_iothread();
     }
     if (unlikely((env->flags & POWERPC_FLAG_TGPR) &&
                  ((value ^ env->msr) & (1 << MSR_TGPR)))) {
-- 
2.7.4

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v2] target/ppc: Fix system lockups caused by interrupt_request state corruption
  2017-12-01 15:49 [Qemu-devel] [PATCH v2] target/ppc: Fix system lockups caused by interrupt_request state corruption Richard Purdie
@ 2017-12-04  1:00 ` David Gibson
  2017-12-04  1:44   ` David Gibson
  0 siblings, 1 reply; 4+ messages in thread
From: David Gibson @ 2017-12-04  1:00 UTC (permalink / raw)
  To: Richard Purdie; +Cc: qemu-devel, qemu-ppc

On Fri, Dec 01, 2017 at 03:49:07PM +0000, Richard Purdie wrote:
> Occasionally in Linux guests on x86_64 we're seeing logs like:
> 
> ppc_set_irq: 0x55b4e0d562f0 n_IRQ 8 level 1 => pending 00000100req 00000004
> 
> when they should read:
> 
> ppc_set_irq: 0x55b4e0d562f0 n_IRQ 8 level 1 => pending 00000100req 00000002
> 
> The "00000004" is CPU_INTERRUPT_EXITTB yet the code calls
> cpu_interrupt(cs, CPU_INTERRUPT_HARD) ("00000002") in this function
> just before the log message. Something is causing the HARD bit setting
> to get lost.
> 
> The knock on effect of losing that bit is the decrementer timer interrupts
> don't get delivered which causes the guest to sit idle in its idle handler
> and 'hang'.
> 
> The issue occurs due to races from code which sets CPU_INTERRUPT_EXITTB.
> 
> Rather than poking directly into cs->interrupt_request, that code needs to:
> 
> a) hold BQL
> b) use the cpu_interrupt() helper
> 
> This patch fixes the call sites to do this, fixing the hang.
> 
> Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>

I strongly suspect there's a better way to do this long term - a lot
of that old ppc TCG code is really crufty.  But as best I can tell,
this is certainly a fix over what we had.  So, applied to ppc-for-2.11.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v2] target/ppc: Fix system lockups caused by interrupt_request state corruption
  2017-12-04  1:00 ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2017-12-04  1:44   ` David Gibson
  2017-12-04 22:28     ` Richard Purdie
  0 siblings, 1 reply; 4+ messages in thread
From: David Gibson @ 2017-12-04  1:44 UTC (permalink / raw)
  To: Richard Purdie; +Cc: qemu-devel, qemu-ppc

On Mon, Dec 04, 2017 at 12:00:40PM +1100, David Gibson wrote:
> On Fri, Dec 01, 2017 at 03:49:07PM +0000, Richard Purdie wrote:
> > Occasionally in Linux guests on x86_64 we're seeing logs like:
> > 
> > ppc_set_irq: 0x55b4e0d562f0 n_IRQ 8 level 1 => pending 00000100req 00000004
> > 
> > when they should read:
> > 
> > ppc_set_irq: 0x55b4e0d562f0 n_IRQ 8 level 1 => pending 00000100req 00000002
> > 
> > The "00000004" is CPU_INTERRUPT_EXITTB yet the code calls
> > cpu_interrupt(cs, CPU_INTERRUPT_HARD) ("00000002") in this function
> > just before the log message. Something is causing the HARD bit setting
> > to get lost.
> > 
> > The knock on effect of losing that bit is the decrementer timer interrupts
> > don't get delivered which causes the guest to sit idle in its idle handler
> > and 'hang'.
> > 
> > The issue occurs due to races from code which sets CPU_INTERRUPT_EXITTB.
> > 
> > Rather than poking directly into cs->interrupt_request, that code needs to:
> > 
> > a) hold BQL
> > b) use the cpu_interrupt() helper
> > 
> > This patch fixes the call sites to do this, fixing the hang.
> > 
> > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> 
> I strongly suspect there's a better way to do this long term - a lot
> of that old ppc TCG code is really crufty.  But as best I can tell,
> this is certainly a fix over what we had.  So, applied to
> ppc-for-2.11.

I take that back.  Running make check with this patch results in:

  GTESTER check-qtest-ppc64
**
ERROR:/home/dwg/src/qemu/cpus.c:1582:qemu_mutex_lock_iothread: assertion failed: (!qemu_mutex_iothread_locked())
Broken pipe
qemu-system-ppc64: RP: Received invalid message 0x0000 length 0x0000
GTester: last random seed: R02S895b0f4813776bf68c147bf987e73f7b
make: *** [/home/dwg/src/qemu/tests/Makefile.include:852: check-qtest-ppc64] Error 1

So, I've reverted it.
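
For context, the assertion fires because qemu_mutex_lock_iothread() refuses
to be called while the lock is already held, and some of the patched call
sites can evidently be reached with it held. A minimal sketch of the two
calling patterns follows. Everything prefixed mock_ is a hypothetical
stand-in (a plain flag, not a real mutex), and the double-lock case is
counted instead of aborting so the behaviour can be observed.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPU_INTERRUPT_EXITTB 0x0004u

/* Hypothetical stand-ins for the BQL and for cs->interrupt_request. */
static bool bql_held;
static int double_lock_attempts;   /* where the real lock call would assert */
static uint32_t interrupt_request;

bool mock_bql_locked(void) { return bql_held; }

void mock_bql_lock(void)
{
    if (bql_held) {
        double_lock_attempts++;    /* non-recursive lock taken twice */
        return;
    }
    bql_held = true;
}

void mock_bql_unlock(void) { bql_held = false; }

/* Pattern used at most v2 call sites: always lock, set the bit, unlock. */
void set_exittb_unconditional(void)
{
    mock_bql_lock();
    interrupt_request |= CPU_INTERRUPT_EXITTB;
    mock_bql_unlock();
}

/* Pattern used in the v2 powerpc_excp() hunk: lock only if not yet held. */
void set_exittb_conditional(void)
{
    bool took_lock = !mock_bql_locked();

    if (took_lock) {
        mock_bql_lock();
    }
    interrupt_request |= CPU_INTERRUPT_EXITTB;
    if (took_lock) {
        mock_bql_unlock();
    }
}
```

With the flag already set, the unconditional form records a double-lock
attempt and then drops a lock its caller still expects to hold. The
conditional form sets the bit and leaves the lock state as it found it in
both cases. Whether that is the right long-term shape for this code is a
separate question.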

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v2] target/ppc: Fix system lockups caused by interrupt_request state corruption
  2017-12-04  1:44   ` David Gibson
@ 2017-12-04 22:28     ` Richard Purdie
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Purdie @ 2017-12-04 22:28 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, qemu-ppc

On Mon, 2017-12-04 at 12:44 +1100, David Gibson wrote:
> On Mon, Dec 04, 2017 at 12:00:40PM +1100, David Gibson wrote:
> > 
> > On Fri, Dec 01, 2017 at 03:49:07PM +0000, Richard Purdie wrote:
> > > 
> > > Occasionally in Linux guests on x86_64 we're seeing logs like:
> > > 
> > > > ppc_set_irq: 0x55b4e0d562f0 n_IRQ 8 level 1 => pending 00000100req 00000004
> > > > 
> > > > when they should read:
> > > > 
> > > > ppc_set_irq: 0x55b4e0d562f0 n_IRQ 8 level 1 => pending 00000100req 00000002
> > > > 
> > > > The "00000004" is CPU_INTERRUPT_EXITTB yet the code calls
> > > > cpu_interrupt(cs, CPU_INTERRUPT_HARD) ("00000002") in this function
> > > > just before the log message. Something is causing the HARD bit setting
> > > > to get lost.
> > > > 
> > > > The knock on effect of losing that bit is the decrementer timer interrupts
> > > > don't get delivered which causes the guest to sit idle in its idle handler
> > > > and 'hang'.
> > > > 
> > > > The issue occurs due to races from code which sets CPU_INTERRUPT_EXITTB.
> > > > 
> > > > Rather than poking directly into cs->interrupt_request, that code needs to:
> > > > 
> > > > a) hold BQL
> > > > b) use the cpu_interrupt() helper
> > > > 
> > > > This patch fixes the call sites to do this, fixing the hang.
> > > > 
> > > > Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
> > I strongly suspect there's a better way to do this long term - a lot
> > of that old ppc TCG code is really crufty.  But as best I can tell,
> > this is certainly a fix over what we had.  So, applied to
> > ppc-for-2.11.
> I take that back.  Running make check with this patch results in:
> 
>   GTESTER check-qtest-ppc64
> **
> ERROR:/home/dwg/src/qemu/cpus.c:1582:qemu_mutex_lock_iothread: assertion failed: (!qemu_mutex_iothread_locked())
> Broken pipe
> qemu-system-ppc64: RP: Received invalid message 0x0000 length 0x0000
> GTester: last random seed: R02S895b0f4813776bf68c147bf987e73f7b
> make: *** [/home/dwg/src/qemu/tests/Makefile.include:852: check-qtest-ppc64] Error 1
> 
> So, I've reverted it.

Sorry about that. I tried to stress the code paths and no issues showed
up in our testing, but I hadn't realised those tests existed.

Given there do seem to be mixed locked and unlocked code paths, I've
taken a different approach and sent a v3 which passes those tests.

I do agree with you that there is probably a better way, but that would
need someone with a better understanding of the bigger picture than I
have. This patch does stop our image tests locking up, so it does seem to
fix a real problem which people can run into.

Cheers,

Richard
