* [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG)
@ 2022-02-24 18:58 Fabiano Rosas
2022-02-24 18:58 ` [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr Fabiano Rosas
` (4 more replies)
0 siblings, 5 replies; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-24 18:58 UTC (permalink / raw)
To: qemu-devel; +Cc: aik, danielhb413, npiggin, qemu-ppc, clg, david
This series implements the migration for a TCG pseries guest running a
nested KVM guest. This is just like migrating a pseries TCG guest, but
with some extra state to allow a nested guest to continue to run on
the destination.
Unfortunately the regular TCG migration scenario (not nested) is not
fully working so I cannot be entirely sure the nested migration is
correct. I have included a couple of patches for the general migration
case that (I think?) improve the situation a bit, but I'm still seeing
hard lockups and other issues with more than 1 vcpu.
This is more of an early RFC to see if anyone spots something right
away. I haven't made much progress in debugging the general TCG
migration case so if anyone has any input there as well I'd appreciate
it.
Thanks
Fabiano Rosas (4):
target/ppc: TCG: Migrate tb_offset and decr
spapr: TCG: Migrate spapr_cpu->prod
hw/ppc: Take nested guest into account when saving timebase
spapr: Add KVM-on-TCG migration support
hw/ppc/ppc.c | 17 +++++++-
hw/ppc/spapr.c | 19 ++++++++
hw/ppc/spapr_cpu_core.c | 77 +++++++++++++++++++++++++++++++++
include/hw/ppc/spapr_cpu_core.h | 2 +-
target/ppc/machine.c | 61 ++++++++++++++++++++++++++
5 files changed, 174 insertions(+), 2 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 20+ messages in thread
* [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr
2022-02-24 18:58 [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Fabiano Rosas
@ 2022-02-24 18:58 ` Fabiano Rosas
2022-02-24 20:06 ` Richard Henderson
2022-02-25 3:15 ` David Gibson
2022-02-24 18:58 ` [RFC PATCH 2/4] spapr: TCG: Migrate spapr_cpu->prod Fabiano Rosas
` (3 subsequent siblings)
4 siblings, 2 replies; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-24 18:58 UTC (permalink / raw)
To: qemu-devel; +Cc: aik, danielhb413, npiggin, qemu-ppc, clg, david
These two were not migrated so the remote end was starting with the
decrementer expired.
I am seeing less frequent crashes with this patch (tested with -smp 4
and -smp 32). It certainly doesn't fix all issues but it looks like it
helps.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---
target/ppc/machine.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/target/ppc/machine.c b/target/ppc/machine.c
index 1b63146ed1..7ee1984500 100644
--- a/target/ppc/machine.c
+++ b/target/ppc/machine.c
@@ -9,6 +9,7 @@
#include "qemu/main-loop.h"
#include "kvm_ppc.h"
#include "power8-pmu.h"
+#include "hw/ppc/ppc.h"
static void post_load_update_msr(CPUPPCState *env)
{
@@ -666,6 +667,18 @@ static const VMStateDescription vmstate_compat = {
}
};
+static const VMStateDescription vmstate_tb_env = {
+ .name = "cpu/env/tb_env",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (VMStateField[]) {
+ VMSTATE_INT64(tb_offset, ppc_tb_t),
+ VMSTATE_UINT64(decr_next, ppc_tb_t),
+ VMSTATE_TIMER_PTR(decr_timer, ppc_tb_t),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
const VMStateDescription vmstate_ppc_cpu = {
.name = "cpu",
.version_id = 5,
@@ -696,12 +709,16 @@ const VMStateDescription vmstate_ppc_cpu = {
/* Backward compatible internal state */
VMSTATE_UINTTL(env.hflags_compat_nmsr, PowerPCCPU),
+ VMSTATE_STRUCT_POINTER_V(env.tb_env, PowerPCCPU, 1,
+ vmstate_tb_env, ppc_tb_t),
+
/* Sanity checking */
VMSTATE_UINTTL_TEST(mig_msr_mask, PowerPCCPU, cpu_pre_2_8_migration),
VMSTATE_UINT64_TEST(mig_insns_flags, PowerPCCPU, cpu_pre_2_8_migration),
VMSTATE_UINT64_TEST(mig_insns_flags2, PowerPCCPU,
cpu_pre_2_8_migration),
VMSTATE_UINT32_TEST(mig_nb_BATs, PowerPCCPU, cpu_pre_2_8_migration),
+
VMSTATE_END_OF_LIST()
},
.subsections = (const VMStateDescription*[]) {
--
2.34.1
* [RFC PATCH 2/4] spapr: TCG: Migrate spapr_cpu->prod
2022-02-24 18:58 [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Fabiano Rosas
2022-02-24 18:58 ` [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr Fabiano Rosas
@ 2022-02-24 18:58 ` Fabiano Rosas
2022-02-25 3:17 ` David Gibson
2022-02-24 18:58 ` [RFC PATCH 3/4] hw/ppc: Take nested guest into account when saving timebase Fabiano Rosas
` (2 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-24 18:58 UTC (permalink / raw)
To: qemu-devel; +Cc: aik, danielhb413, npiggin, qemu-ppc, clg, david
I'm seeing some stack traces in the migrated guest going through cede
and some hangs at the plpar_hcall_norets so let's make sure everything
related to cede/prod is being migrated just in case.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---
hw/ppc/spapr_cpu_core.c | 1 +
include/hw/ppc/spapr_cpu_core.h | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index ed84713960..efda7730f1 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -179,6 +179,7 @@ static const VMStateDescription vmstate_spapr_cpu_state = {
.version_id = 1,
.minimum_version_id = 1,
.fields = (VMStateField[]) {
+ VMSTATE_BOOL(prod, SpaprCpuState),
VMSTATE_END_OF_LIST()
},
.subsections = (const VMStateDescription * []) {
diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
index b560514560..2772689c84 100644
--- a/include/hw/ppc/spapr_cpu_core.h
+++ b/include/hw/ppc/spapr_cpu_core.h
@@ -45,7 +45,7 @@ typedef struct SpaprCpuState {
uint64_t vpa_addr;
uint64_t slb_shadow_addr, slb_shadow_size;
uint64_t dtl_addr, dtl_size;
- bool prod; /* not migrated, only used to improve dispatch latencies */
+ bool prod;
struct ICPState *icp;
struct XiveTCTX *tctx;
--
2.34.1
* [RFC PATCH 3/4] hw/ppc: Take nested guest into account when saving timebase
2022-02-24 18:58 [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Fabiano Rosas
2022-02-24 18:58 ` [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr Fabiano Rosas
2022-02-24 18:58 ` [RFC PATCH 2/4] spapr: TCG: Migrate spapr_cpu->prod Fabiano Rosas
@ 2022-02-24 18:58 ` Fabiano Rosas
2022-02-25 3:21 ` David Gibson
2022-02-24 18:58 ` [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support Fabiano Rosas
2022-02-24 21:00 ` [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Mark Cave-Ayland
4 siblings, 1 reply; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-24 18:58 UTC (permalink / raw)
To: qemu-devel; +Cc: aik, danielhb413, npiggin, qemu-ppc, clg, david
When saving the guest "timebase" we look to the first_cpu for its
tb_offset. If that CPU happens to be running a nested guest at this
time, the tb_offset will have the nested guest value.
This was caught by code inspection.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---
hw/ppc/ppc.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
index 9e99625ea9..093cd87014 100644
--- a/hw/ppc/ppc.c
+++ b/hw/ppc/ppc.c
@@ -36,6 +36,7 @@
#include "kvm_ppc.h"
#include "migration/vmstate.h"
#include "trace.h"
+#include "hw/ppc/spapr_cpu_core.h"
static void cpu_ppc_tb_stop (CPUPPCState *env);
static void cpu_ppc_tb_start (CPUPPCState *env);
@@ -961,19 +962,33 @@ static void timebase_save(PPCTimebase *tb)
{
uint64_t ticks = cpu_get_host_ticks();
PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
+ int64_t tb_offset;
if (!first_ppc_cpu->env.tb_env) {
error_report("No timebase object");
return;
}
+ tb_offset = first_ppc_cpu->env.tb_env->tb_offset;
+
+ if (first_ppc_cpu->vhyp && vhyp_cpu_in_nested(first_ppc_cpu)) {
+ SpaprCpuState *spapr_cpu = spapr_cpu_state(first_ppc_cpu);
+
+ /*
+ * If the first_cpu happens to be running a nested guest at
+ * this time, tb_env->tb_offset will contain the nested guest
+ * offset.
+ */
+ tb_offset -= spapr_cpu->nested_tb_offset;
+ }
+
/* not used anymore, we keep it for compatibility */
tb->time_of_the_day_ns = qemu_clock_get_ns(QEMU_CLOCK_HOST);
/*
* tb_offset is only expected to be changed by QEMU so
* there is no need to update it from KVM here
*/
- tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset;
+ tb->guest_timebase = ticks + tb_offset;
tb->runstate_paused =
runstate_check(RUN_STATE_PAUSED) || runstate_check(RUN_STATE_SAVE_VM);
--
2.34.1
* [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support
2022-02-24 18:58 [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Fabiano Rosas
` (2 preceding siblings ...)
2022-02-24 18:58 ` [RFC PATCH 3/4] hw/ppc: Take nested guest into account when saving timebase Fabiano Rosas
@ 2022-02-24 18:58 ` Fabiano Rosas
2022-02-25 0:51 ` Nicholas Piggin
2022-02-25 3:42 ` David Gibson
2022-02-24 21:00 ` [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Mark Cave-Ayland
4 siblings, 2 replies; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-24 18:58 UTC (permalink / raw)
To: qemu-devel; +Cc: aik, danielhb413, npiggin, qemu-ppc, clg, david
This adds migration support for TCG pseries machines running a KVM-HV
guest.
The state that needs to be migrated is:
- the nested PTCR value;
- the in_nested flag;
- the nested_tb_offset.
- the saved host CPUPPCState structure;
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
---
(this migrates just fine with L2 running stress and 1 VCPU in L1. With
32 VCPUs in L1 there's crashes which I still don't understand. They might
be related to L1 migration being flaky right now)
---
hw/ppc/spapr.c | 19 +++++++++++
hw/ppc/spapr_cpu_core.c | 76 +++++++++++++++++++++++++++++++++++++++++
target/ppc/machine.c | 44 ++++++++++++++++++++++++
3 files changed, 139 insertions(+)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index f0b75b22bb..6e87c515db 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1934,6 +1934,13 @@ static bool spapr_patb_entry_needed(void *opaque)
return !!spapr->patb_entry;
}
+static bool spapr_nested_ptcr_needed(void *opaque)
+{
+ SpaprMachineState *spapr = opaque;
+
+ return !!spapr->nested_ptcr;
+}
+
static const VMStateDescription vmstate_spapr_patb_entry = {
.name = "spapr_patb_entry",
.version_id = 1,
@@ -1945,6 +1952,17 @@ static const VMStateDescription vmstate_spapr_patb_entry = {
},
};
+static const VMStateDescription vmstate_spapr_nested_ptcr = {
+ .name = "spapr_nested_ptcr",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .needed = spapr_nested_ptcr_needed,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT64(nested_ptcr, SpaprMachineState),
+ VMSTATE_END_OF_LIST()
+ },
+};
+
static bool spapr_irq_map_needed(void *opaque)
{
SpaprMachineState *spapr = opaque;
@@ -2069,6 +2087,7 @@ static const VMStateDescription vmstate_spapr = {
&vmstate_spapr_cap_fwnmi,
&vmstate_spapr_fwnmi,
&vmstate_spapr_cap_rpt_invalidate,
+ &vmstate_spapr_nested_ptcr,
NULL
}
};
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index efda7730f1..3ec13c0660 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -25,6 +25,7 @@
#include "sysemu/reset.h"
#include "sysemu/hw_accel.h"
#include "qemu/error-report.h"
+#include "migration/cpu.h"
static void spapr_reset_vcpu(PowerPCCPU *cpu)
{
@@ -174,6 +175,80 @@ static const VMStateDescription vmstate_spapr_cpu_vpa = {
}
};
+static bool nested_needed(void *opaque)
+{
+ SpaprCpuState *spapr_cpu = opaque;
+
+ return spapr_cpu->in_nested;
+}
+
+static int nested_state_pre_save(void *opaque)
+{
+ CPUPPCState *env = opaque;
+
+ env->spr[SPR_LR] = env->lr;
+ env->spr[SPR_CTR] = env->ctr;
+ env->spr[SPR_XER] = cpu_read_xer(env);
+ env->spr[SPR_CFAR] = env->cfar;
+
+ return 0;
+}
+
+static int nested_state_post_load(void *opaque, int version_id)
+{
+ CPUPPCState *env = opaque;
+
+ env->lr = env->spr[SPR_LR];
+ env->ctr = env->spr[SPR_CTR];
+ cpu_write_xer(env, env->spr[SPR_XER]);
+ env->cfar = env->spr[SPR_CFAR];
+
+ return 0;
+}
+
+static const VMStateDescription vmstate_nested_host_state = {
+ .name = "spapr_nested_host_state",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .pre_save = nested_state_pre_save,
+ .post_load = nested_state_post_load,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINTTL_ARRAY(gpr, CPUPPCState, 32),
+ VMSTATE_UINTTL_ARRAY(spr, CPUPPCState, 1024),
+ VMSTATE_UINT32_ARRAY(crf, CPUPPCState, 8),
+ VMSTATE_UINTTL(nip, CPUPPCState),
+ VMSTATE_UINTTL(msr, CPUPPCState),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+static int nested_cpu_pre_load(void *opaque)
+{
+ SpaprCpuState *spapr_cpu = opaque;
+
+ spapr_cpu->nested_host_state = g_try_malloc(sizeof(CPUPPCState));
+ if (!spapr_cpu->nested_host_state) {
+ return -1;
+ }
+
+ return 0;
+}
+
+static const VMStateDescription vmstate_spapr_cpu_nested = {
+ .name = "spapr_cpu/nested",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .needed = nested_needed,
+ .pre_load = nested_cpu_pre_load,
+ .fields = (VMStateField[]) {
+ VMSTATE_BOOL(in_nested, SpaprCpuState),
+ VMSTATE_INT64(nested_tb_offset, SpaprCpuState),
+ VMSTATE_STRUCT_POINTER_V(nested_host_state, SpaprCpuState, 1,
+ vmstate_nested_host_state, CPUPPCState),
+ VMSTATE_END_OF_LIST()
+ },
+};
+
static const VMStateDescription vmstate_spapr_cpu_state = {
.name = "spapr_cpu",
.version_id = 1,
@@ -184,6 +259,7 @@ static const VMStateDescription vmstate_spapr_cpu_state = {
},
.subsections = (const VMStateDescription * []) {
&vmstate_spapr_cpu_vpa,
+ &vmstate_spapr_cpu_nested,
NULL
}
};
diff --git a/target/ppc/machine.c b/target/ppc/machine.c
index 7ee1984500..ae09b1bcfe 100644
--- a/target/ppc/machine.c
+++ b/target/ppc/machine.c
@@ -10,6 +10,7 @@
#include "kvm_ppc.h"
#include "power8-pmu.h"
#include "hw/ppc/ppc.h"
+#include "hw/ppc/spapr_cpu_core.h"
static void post_load_update_msr(CPUPPCState *env)
{
@@ -679,6 +680,48 @@ static const VMStateDescription vmstate_tb_env = {
}
};
+static const VMStateDescription vmstate_hdecr = {
+ .name = "cpu/hdecr",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT64(hdecr_next, ppc_tb_t),
+ VMSTATE_TIMER_PTR(hdecr_timer, ppc_tb_t),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+static bool nested_needed(void *opaque)
+{
+ PowerPCCPU *cpu = opaque;
+ SpaprCpuState *spapr_cpu = spapr_cpu_state(cpu);
+
+ return spapr_cpu->in_nested;
+}
+
+static int nested_pre_load(void *opaque)
+{
+ PowerPCCPU *cpu = opaque;
+ CPUPPCState *env = &cpu->env;
+
+ cpu_ppc_hdecr_init(env);
+
+ return 0;
+}
+
+static const VMStateDescription vmstate_nested = {
+ .name = "cpu/nested-guest",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .needed = nested_needed,
+ .pre_load = nested_pre_load,
+ .fields = (VMStateField[]) {
+ VMSTATE_STRUCT_POINTER_V(env.tb_env, PowerPCCPU, 1,
+ vmstate_hdecr, ppc_tb_t),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
const VMStateDescription vmstate_ppc_cpu = {
.name = "cpu",
.version_id = 5,
@@ -734,6 +777,7 @@ const VMStateDescription vmstate_ppc_cpu = {
&vmstate_tlbemb,
&vmstate_tlbmas,
&vmstate_compat,
+ &vmstate_nested,
NULL
}
};
--
2.34.1
* Re: [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr
2022-02-24 18:58 ` [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr Fabiano Rosas
@ 2022-02-24 20:06 ` Richard Henderson
2022-02-25 3:15 ` David Gibson
1 sibling, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2022-02-24 20:06 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel; +Cc: aik, danielhb413, npiggin, qemu-ppc, clg, david
On 2/24/22 08:58, Fabiano Rosas wrote:
> These two were not migrated so the remote end was starting with the
> decrementer expired.
>
> I am seeing less frequent crashes with this patch (tested with -smp 4
> and -smp 32). It certainly doesn't fix all issues but it looks like it
> helps.
>
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> ---
> target/ppc/machine.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/target/ppc/machine.c b/target/ppc/machine.c
> index 1b63146ed1..7ee1984500 100644
> --- a/target/ppc/machine.c
> +++ b/target/ppc/machine.c
> @@ -9,6 +9,7 @@
> #include "qemu/main-loop.h"
> #include "kvm_ppc.h"
> #include "power8-pmu.h"
> +#include "hw/ppc/ppc.h"
>
> static void post_load_update_msr(CPUPPCState *env)
> {
> @@ -666,6 +667,18 @@ static const VMStateDescription vmstate_compat = {
> }
> };
>
> +static const VMStateDescription vmstate_tb_env = {
> + .name = "cpu/env/tb_env",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .fields = (VMStateField[]) {
> + VMSTATE_INT64(tb_offset, ppc_tb_t),
> + VMSTATE_UINT64(decr_next, ppc_tb_t),
> + VMSTATE_TIMER_PTR(decr_timer, ppc_tb_t),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> const VMStateDescription vmstate_ppc_cpu = {
> .name = "cpu",
> .version_id = 5,
> @@ -696,12 +709,16 @@ const VMStateDescription vmstate_ppc_cpu = {
> /* Backward compatible internal state */
> VMSTATE_UINTTL(env.hflags_compat_nmsr, PowerPCCPU),
>
> + VMSTATE_STRUCT_POINTER_V(env.tb_env, PowerPCCPU, 1,
> + vmstate_tb_env, ppc_tb_t),
> +
> /* Sanity checking */
> VMSTATE_UINTTL_TEST(mig_msr_mask, PowerPCCPU, cpu_pre_2_8_migration),
> VMSTATE_UINT64_TEST(mig_insns_flags, PowerPCCPU, cpu_pre_2_8_migration),
> VMSTATE_UINT64_TEST(mig_insns_flags2, PowerPCCPU,
> cpu_pre_2_8_migration),
> VMSTATE_UINT32_TEST(mig_nb_BATs, PowerPCCPU, cpu_pre_2_8_migration),
> +
> VMSTATE_END_OF_LIST()
> },
> .subsections = (const VMStateDescription*[]) {
I think the new struct should go into a subsection.
r~
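A hedged sketch of what Richard's suggestion could look like, reusing patch 1's field
list — the subsection name and the .needed predicate here are illustrative, not a
finished patch:

```c
/* Sketch only: the .needed callback is what keeps the stream compatible --
 * an old destination never sees the subsection when it is not sent, and a
 * stricter predicate could suppress it for old machine types. */
static bool tb_env_needed(void *opaque)
{
    return true; /* illustrative; could key off machine-type compat flags */
}

static const VMStateDescription vmstate_tb_env_subsection = {
    .name = "cpu/tb_env",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = tb_env_needed,
    .fields = (VMStateField[]) {
        VMSTATE_STRUCT_POINTER_V(env.tb_env, PowerPCCPU, 1,
                                 vmstate_tb_env, ppc_tb_t),
        VMSTATE_END_OF_LIST()
    }
};
```

It would then be listed in vmstate_ppc_cpu's .subsections array (next to
&vmstate_compat) instead of being added to the .fields array.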
* Re: [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG)
2022-02-24 18:58 [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Fabiano Rosas
` (3 preceding siblings ...)
2022-02-24 18:58 ` [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support Fabiano Rosas
@ 2022-02-24 21:00 ` Mark Cave-Ayland
2022-02-25 3:54 ` David Gibson
2022-02-25 16:11 ` Fabiano Rosas
4 siblings, 2 replies; 20+ messages in thread
From: Mark Cave-Ayland @ 2022-02-24 21:00 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel; +Cc: danielhb413, qemu-ppc, clg, npiggin, david
On 24/02/2022 18:58, Fabiano Rosas wrote:
> This series implements the migration for a TCG pseries guest running a
> nested KVM guest. This is just like migrating a pseries TCG guest, but
> with some extra state to allow a nested guest to continue to run on
> the destination.
>
> Unfortunately the regular TCG migration scenario (not nested) is not
> fully working so I cannot be entirely sure the nested migration is
> correct. I have included a couple of patches for the general migration
> case that (I think?) improve the situation a bit, but I'm still seeing
> hard lockups and other issues with more than 1 vcpu.
>
> This is more of an early RFC to see if anyone spots something right
> away. I haven't made much progress in debugging the general TCG
> migration case so if anyone has any input there as well I'd appreciate
> it.
>
> Thanks
>
> Fabiano Rosas (4):
> target/ppc: TCG: Migrate tb_offset and decr
> spapr: TCG: Migrate spapr_cpu->prod
> hw/ppc: Take nested guest into account when saving timebase
> spapr: Add KVM-on-TCG migration support
>
> hw/ppc/ppc.c | 17 +++++++-
> hw/ppc/spapr.c | 19 ++++++++
> hw/ppc/spapr_cpu_core.c | 77 +++++++++++++++++++++++++++++++++
> include/hw/ppc/spapr_cpu_core.h | 2 +-
> target/ppc/machine.c | 61 ++++++++++++++++++++++++++
> 5 files changed, 174 insertions(+), 2 deletions(-)
FWIW I noticed a while ago that there were some issues with migrating the decrementer on Mac
machines which cause a hang on the destination with TCG (for MacOS on an x86 host
in my case). Have a look at the following threads for reference:
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00546.html
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg04622.html
IIRC there is code that assumes any migration in PPC is being done live, and so
adjusts the timebase on the destination to reflect wall clock time by recalculating
tb_offset. I haven't looked at the code for a while but I think the outcome was that
there need to be two phases in migration: the first is to migrate the timebase as-is
for guests that are paused during migration, whilst the second is to notify
hypervisor-aware guest OSs such as Linux to make the timebase adjustment, if
required, when the guest is running.
ATB,
Mark.
* Re: [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support
2022-02-24 18:58 ` [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support Fabiano Rosas
@ 2022-02-25 0:51 ` Nicholas Piggin
2022-02-25 3:42 ` David Gibson
1 sibling, 0 replies; 20+ messages in thread
From: Nicholas Piggin @ 2022-02-25 0:51 UTC (permalink / raw)
To: Fabiano Rosas, qemu-devel; +Cc: aik, danielhb413, qemu-ppc, clg, david
Excerpts from Fabiano Rosas's message of February 25, 2022 4:58 am:
> This adds migration support for TCG pseries machines running a KVM-HV
> guest.
>
> The state that needs to be migrated is:
>
> - the nested PTCR value;
> - the in_nested flag;
> - the nested_tb_offset.
> - the saved host CPUPPCState structure;
>
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
The series generally looks good to me; I guess patches 1 and 2 are
fixes that could go ahead.
Main thing about this is I was thinking of cutting down the CPUPPCState
structure for saving the host state when in the L2, and making a
specific structure for that which only contains what is required.
This patch could easily switch to that so it's no big deal AFAIKS.
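As a purely illustrative sketch of the direction Nick describes — the field list
below is invented for the example and is not the eventual structure:

```c
/* Hypothetical replacement for saving a full CPUPPCState: only the L1
 * ("host") state that the nested-exit path actually restores. */
typedef struct SpaprNestedHostState {
    uint64_t gpr[32];
    uint64_t lr, ctr, cfar;
    uint64_t xer;
    uint32_t cr;
    uint64_t msr, nip;
    uint64_t sprs[64];   /* only the handful of SPRs the hcall swaps */
} SpaprNestedHostState;
```

A dedicated struct like this would also give the nested subsection a stable
migration format independent of future CPUPPCState layout changes.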
> diff --git a/target/ppc/machine.c b/target/ppc/machine.c
> index 7ee1984500..ae09b1bcfe 100644
> --- a/target/ppc/machine.c
> +++ b/target/ppc/machine.c
> @@ -10,6 +10,7 @@
> #include "kvm_ppc.h"
> #include "power8-pmu.h"
> #include "hw/ppc/ppc.h"
> +#include "hw/ppc/spapr_cpu_core.h"
>
> static void post_load_update_msr(CPUPPCState *env)
> {
> @@ -679,6 +680,48 @@ static const VMStateDescription vmstate_tb_env = {
> }
> };
>
> +static const VMStateDescription vmstate_hdecr = {
> + .name = "cpu/hdecr",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .fields = (VMStateField[]) {
> + VMSTATE_UINT64(hdecr_next, ppc_tb_t),
> + VMSTATE_TIMER_PTR(hdecr_timer, ppc_tb_t),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +static bool nested_needed(void *opaque)
> +{
> + PowerPCCPU *cpu = opaque;
> + SpaprCpuState *spapr_cpu = spapr_cpu_state(cpu);
> +
> + return spapr_cpu->in_nested;
> +}
I don't know the migration code -- are you assured of having a
spapr CPU here?
Maybe this could call a helper function located near the spapr/nested
code like 'return ppc_cpu_need_hdec_migrate(cpu)' ?
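Such a helper — purely a sketch of Nick's naming suggestion, not existing code —
might look like:

```c
/* Sketch: placed near the spapr nested-guest code so that the generic
 * target/ppc/machine.c never dereferences spapr-specific state for a
 * CPU that is not on a spapr machine. */
bool ppc_cpu_need_hdec_migrate(PowerPCCPU *cpu)
{
    /* vhyp is only set for pseries; guard before touching spapr state. */
    if (!cpu->vhyp) {
        return false;
    }
    return vhyp_cpu_in_nested(cpu);
}
```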
Thanks,
Nick
* Re: [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr
2022-02-24 18:58 ` [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr Fabiano Rosas
2022-02-24 20:06 ` Richard Henderson
@ 2022-02-25 3:15 ` David Gibson
2022-02-25 16:08 ` Fabiano Rosas
1 sibling, 1 reply; 20+ messages in thread
From: David Gibson @ 2022-02-25 3:15 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: aik, danielhb413, qemu-devel, npiggin, qemu-ppc, clg
On Thu, Feb 24, 2022 at 03:58:14PM -0300, Fabiano Rosas wrote:
> These two were not migrated so the remote end was starting with the
> decrementer expired.
>
> I am seeing less frequent crashes with this patch (tested with -smp 4
> and -smp 32). It certainly doesn't fix all issues but it looks like it
> helps.
>
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Nack. This completely breaks migration compatibility for all ppc
machines. In order to maintain compatibility, as Richard says, new info
has to go into a subsection with a needed function that allows
migration of existing machine types both to and from a new QEMU
version.
There are also some other problems.
> ---
> target/ppc/machine.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/target/ppc/machine.c b/target/ppc/machine.c
> index 1b63146ed1..7ee1984500 100644
> --- a/target/ppc/machine.c
> +++ b/target/ppc/machine.c
> @@ -9,6 +9,7 @@
> #include "qemu/main-loop.h"
> #include "kvm_ppc.h"
> #include "power8-pmu.h"
> +#include "hw/ppc/ppc.h"
>
> static void post_load_update_msr(CPUPPCState *env)
> {
> @@ -666,6 +667,18 @@ static const VMStateDescription vmstate_compat = {
> }
> };
>
> +static const VMStateDescription vmstate_tb_env = {
> + .name = "cpu/env/tb_env",
Because spapr requires that all cpus have synchronized timebases, we
migrate the timebase state from vmstate_spapr, not from each cpu
individually, to make sure it stays synchronized across migration. If
that's not working right, that's a bug, but it needs to be fixed
there, not just plastered over with extra information transmitted at
cpu level.
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .fields = (VMStateField[]) {
> + VMSTATE_INT64(tb_offset, ppc_tb_t),
tb_offset isn't a good thing to put directly in the migration stream.
tb_offset has kind of non-obvious semantics to begin with: when we're
dealing with TCG (which is your explicit case), there is no host TB,
so what's this an offset from? That's why vmstate_ppc_timebase uses
an explicit guest timebase value matched with a time of day in real
units. Again, if there's a bug, that needs fixing there.
> + VMSTATE_UINT64(decr_next, ppc_tb_t),
> + VMSTATE_TIMER_PTR(decr_timer, ppc_tb_t),
You're attempting to migrate decr_next and decr_timer, but not the
actual DECR value, which makes me suspect that *is* being migrated.
In which case you should be able to reconstruct the next tick and
timer state in a post_load function on receive. If that's buggy, it
needs to be fixed there.
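Read that way, one hedged sketch of the alternative — assuming the migrated DECR
value reaches the destination via the existing env.spr array, which may need an
explicit pre_save — is a post_load hook that re-arms the timer rather than
migrating it:

```c
/* Sketch: instead of sending decr_next/decr_timer, re-derive them on the
 * destination. cpu_ppc_store_decr() recomputes the next tick and re-arms
 * decr_timer as a side effect of writing the register. */
static int decr_post_load(void *opaque, int version_id)
{
    PowerPCCPU *cpu = opaque;
    CPUPPCState *env = &cpu->env;

    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);
    return 0;
}
```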
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> const VMStateDescription vmstate_ppc_cpu = {
> .name = "cpu",
> .version_id = 5,
> @@ -696,12 +709,16 @@ const VMStateDescription vmstate_ppc_cpu = {
> /* Backward compatible internal state */
> VMSTATE_UINTTL(env.hflags_compat_nmsr, PowerPCCPU),
>
> + VMSTATE_STRUCT_POINTER_V(env.tb_env, PowerPCCPU, 1,
> + vmstate_tb_env, ppc_tb_t),
> +
> /* Sanity checking */
> VMSTATE_UINTTL_TEST(mig_msr_mask, PowerPCCPU, cpu_pre_2_8_migration),
> VMSTATE_UINT64_TEST(mig_insns_flags, PowerPCCPU, cpu_pre_2_8_migration),
> VMSTATE_UINT64_TEST(mig_insns_flags2, PowerPCCPU,
> cpu_pre_2_8_migration),
> VMSTATE_UINT32_TEST(mig_nb_BATs, PowerPCCPU, cpu_pre_2_8_migration),
> +
> VMSTATE_END_OF_LIST()
> },
> .subsections = (const VMStateDescription*[]) {
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [RFC PATCH 2/4] spapr: TCG: Migrate spapr_cpu->prod
2022-02-24 18:58 ` [RFC PATCH 2/4] spapr: TCG: Migrate spapr_cpu->prod Fabiano Rosas
@ 2022-02-25 3:17 ` David Gibson
2022-02-25 16:08 ` Fabiano Rosas
0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2022-02-25 3:17 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: aik, danielhb413, qemu-devel, npiggin, qemu-ppc, clg
On Thu, Feb 24, 2022 at 03:58:15PM -0300, Fabiano Rosas wrote:
> I'm seeing some stack traces in the migrated guest going through cede
> and some hangs at the plpar_hcall_norets so let's make sure everything
> related to cede/prod is being migrated just in case.
This is a poor approach in general. Migration becomes even harder to
maintain than it already is if you don't pare down the set of migrated
data to something minimal and non-redundant.
If you want to migrate prod, you have to give a case for why you
*need* it, not "just in case".
Also, you have to put this in a subsection with a needed function in
order not to break compatibility.
>
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> ---
> hw/ppc/spapr_cpu_core.c | 1 +
> include/hw/ppc/spapr_cpu_core.h | 2 +-
> 2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
> index ed84713960..efda7730f1 100644
> --- a/hw/ppc/spapr_cpu_core.c
> +++ b/hw/ppc/spapr_cpu_core.c
> @@ -179,6 +179,7 @@ static const VMStateDescription vmstate_spapr_cpu_state = {
> .version_id = 1,
> .minimum_version_id = 1,
> .fields = (VMStateField[]) {
> + VMSTATE_BOOL(prod, SpaprCpuState),
> VMSTATE_END_OF_LIST()
> },
> .subsections = (const VMStateDescription * []) {
> diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
> index b560514560..2772689c84 100644
> --- a/include/hw/ppc/spapr_cpu_core.h
> +++ b/include/hw/ppc/spapr_cpu_core.h
> @@ -45,7 +45,7 @@ typedef struct SpaprCpuState {
> uint64_t vpa_addr;
> uint64_t slb_shadow_addr, slb_shadow_size;
> uint64_t dtl_addr, dtl_size;
> - bool prod; /* not migrated, only used to improve dispatch latencies */
> + bool prod;
> struct ICPState *icp;
> struct XiveTCTX *tctx;
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [RFC PATCH 3/4] hw/ppc: Take nested guest into account when saving timebase
2022-02-24 18:58 ` [RFC PATCH 3/4] hw/ppc: Take nested guest into account when saving timebase Fabiano Rosas
@ 2022-02-25 3:21 ` David Gibson
2022-02-25 16:08 ` Fabiano Rosas
0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2022-02-25 3:21 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: aik, danielhb413, qemu-devel, npiggin, qemu-ppc, clg
On Thu, Feb 24, 2022 at 03:58:16PM -0300, Fabiano Rosas wrote:
> When saving the guest "timebase" we look to the first_cpu for its
> tb_offset. If that CPU happens to be running a nested guest at this
> time, the tb_offset will have the nested guest value.
>
> This was caught by code inspection.
This doesn't seem right. Isn't the real problem that nested_tb_offset
isn't being migrated? If you migrate that, shouldn't everything be
fixed up when the L1 cpu leaves the nested guest on the destination
host?
>
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> ---
> hw/ppc/ppc.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
> index 9e99625ea9..093cd87014 100644
> --- a/hw/ppc/ppc.c
> +++ b/hw/ppc/ppc.c
> @@ -36,6 +36,7 @@
> #include "kvm_ppc.h"
> #include "migration/vmstate.h"
> #include "trace.h"
> +#include "hw/ppc/spapr_cpu_core.h"
>
> static void cpu_ppc_tb_stop (CPUPPCState *env);
> static void cpu_ppc_tb_start (CPUPPCState *env);
> @@ -961,19 +962,33 @@ static void timebase_save(PPCTimebase *tb)
> {
> uint64_t ticks = cpu_get_host_ticks();
> PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
> + int64_t tb_offset;
>
> if (!first_ppc_cpu->env.tb_env) {
> error_report("No timebase object");
> return;
> }
>
> + tb_offset = first_ppc_cpu->env.tb_env->tb_offset;
> +
> + if (first_ppc_cpu->vhyp && vhyp_cpu_in_nested(first_ppc_cpu)) {
> + SpaprCpuState *spapr_cpu = spapr_cpu_state(first_ppc_cpu);
> +
> + /*
> + * If the first_cpu happens to be running a nested guest at
> + * this time, tb_env->tb_offset will contain the nested guest
> + * offset.
> + */
> + tb_offset -= spapr_cpu->nested_tb_offset;
> + }
> +
> /* not used anymore, we keep it for compatibility */
> tb->time_of_the_day_ns = qemu_clock_get_ns(QEMU_CLOCK_HOST);
> /*
> * tb_offset is only expected to be changed by QEMU so
> * there is no need to update it from KVM here
> */
> - tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset;
> + tb->guest_timebase = ticks + tb_offset;
>
> tb->runstate_paused =
> runstate_check(RUN_STATE_PAUSED) || runstate_check(RUN_STATE_SAVE_VM);
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support
2022-02-24 18:58 ` [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support Fabiano Rosas
2022-02-25 0:51 ` Nicholas Piggin
@ 2022-02-25 3:42 ` David Gibson
2022-02-25 10:57 ` Nicholas Piggin
1 sibling, 1 reply; 20+ messages in thread
From: David Gibson @ 2022-02-25 3:42 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: aik, danielhb413, qemu-devel, npiggin, qemu-ppc, clg
[-- Attachment #1: Type: text/plain, Size: 8062 bytes --]
On Thu, Feb 24, 2022 at 03:58:17PM -0300, Fabiano Rosas wrote:
> This adds migration support for TCG pseries machines running a KVM-HV
> guest.
>
> The state that needs to be migrated is:
>
> - the nested PTCR value;
> - the in_nested flag;
> - the nested_tb_offset.
> - the saved host CPUPPCState structure;
>
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
>
> ---
> (this migrates just fine with L2 running stress and 1 VCPU in L1. With
> 32 VCPUs in L1 there's crashes which I still don't understand. They might
> be related to L1 migration being flaky right now)
> ---
> hw/ppc/spapr.c | 19 +++++++++++
> hw/ppc/spapr_cpu_core.c | 76 +++++++++++++++++++++++++++++++++++++++++
> target/ppc/machine.c | 44 ++++++++++++++++++++++++
> 3 files changed, 139 insertions(+)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index f0b75b22bb..6e87c515db 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1934,6 +1934,13 @@ static bool spapr_patb_entry_needed(void *opaque)
> return !!spapr->patb_entry;
> }
>
> +static bool spapr_nested_ptcr_needed(void *opaque)
> +{
> + SpaprMachineState *spapr = opaque;
> +
> + return !!spapr->nested_ptcr;
> +}
> +
> static const VMStateDescription vmstate_spapr_patb_entry = {
> .name = "spapr_patb_entry",
> .version_id = 1,
> @@ -1945,6 +1952,17 @@ static const VMStateDescription vmstate_spapr_patb_entry = {
> },
> };
>
> +static const VMStateDescription vmstate_spapr_nested_ptcr = {
> + .name = "spapr_nested_ptcr",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .needed = spapr_nested_ptcr_needed,
> + .fields = (VMStateField[]) {
> + VMSTATE_UINT64(nested_ptcr, SpaprMachineState),
> + VMSTATE_END_OF_LIST()
> + },
> +};
> +
> static bool spapr_irq_map_needed(void *opaque)
> {
> SpaprMachineState *spapr = opaque;
> @@ -2069,6 +2087,7 @@ static const VMStateDescription vmstate_spapr = {
> &vmstate_spapr_cap_fwnmi,
> &vmstate_spapr_fwnmi,
> &vmstate_spapr_cap_rpt_invalidate,
> + &vmstate_spapr_nested_ptcr,
Ok, the nested_ptcr stuff looks good.
> NULL
> }
> };
> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
> index efda7730f1..3ec13c0660 100644
> --- a/hw/ppc/spapr_cpu_core.c
> +++ b/hw/ppc/spapr_cpu_core.c
> @@ -25,6 +25,7 @@
> #include "sysemu/reset.h"
> #include "sysemu/hw_accel.h"
> #include "qemu/error-report.h"
> +#include "migration/cpu.h"
>
> static void spapr_reset_vcpu(PowerPCCPU *cpu)
> {
> @@ -174,6 +175,80 @@ static const VMStateDescription vmstate_spapr_cpu_vpa = {
> }
> };
>
> +static bool nested_needed(void *opaque)
> +{
> + SpaprCpuState *spapr_cpu = opaque;
> +
> + return spapr_cpu->in_nested;
> +}
> +
> +static int nested_state_pre_save(void *opaque)
> +{
> + CPUPPCState *env = opaque;
> +
> + env->spr[SPR_LR] = env->lr;
> + env->spr[SPR_CTR] = env->ctr;
> + env->spr[SPR_XER] = cpu_read_xer(env);
> + env->spr[SPR_CFAR] = env->cfar;
> + return 0;
> +}
> +
> +static int nested_state_post_load(void *opaque, int version_id)
> +{
> + CPUPPCState *env = opaque;
> +
> + env->lr = env->spr[SPR_LR];
> + env->ctr = env->spr[SPR_CTR];
> + cpu_write_xer(env, env->spr[SPR_XER]);
> + env->cfar = env->spr[SPR_CFAR];
> +
> + return 0;
> +}
> +
> +static const VMStateDescription vmstate_nested_host_state = {
> + .name = "spapr_nested_host_state",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .pre_save = nested_state_pre_save,
> + .post_load = nested_state_post_load,
> + .fields = (VMStateField[]) {
> + VMSTATE_UINTTL_ARRAY(gpr, CPUPPCState, 32),
> + VMSTATE_UINTTL_ARRAY(spr, CPUPPCState, 1024),
> + VMSTATE_UINT32_ARRAY(crf, CPUPPCState, 8),
> + VMSTATE_UINTTL(nip, CPUPPCState),
> + VMSTATE_UINTTL(msr, CPUPPCState),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +static int nested_cpu_pre_load(void *opaque)
> +{
> + SpaprCpuState *spapr_cpu = opaque;
> +
> + spapr_cpu->nested_host_state = g_try_malloc(sizeof(CPUPPCState));
> + if (!spapr_cpu->nested_host_state) {
> + return -1;
> + }
> +
> + return 0;
> +}
> +
> +static const VMStateDescription vmstate_spapr_cpu_nested = {
> + .name = "spapr_cpu/nested",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .needed = nested_needed,
> + .pre_load = nested_cpu_pre_load,
> + .fields = (VMStateField[]) {
> + VMSTATE_BOOL(in_nested, SpaprCpuState),
> + VMSTATE_INT64(nested_tb_offset, SpaprCpuState),
> + VMSTATE_STRUCT_POINTER_V(nested_host_state, SpaprCpuState, 1,
> + vmstate_nested_host_state, CPUPPCState),
> + VMSTATE_END_OF_LIST()
> + },
> +};
> +
> static const VMStateDescription vmstate_spapr_cpu_state = {
> .name = "spapr_cpu",
> .version_id = 1,
> @@ -184,6 +259,7 @@ static const VMStateDescription vmstate_spapr_cpu_state = {
> },
> .subsections = (const VMStateDescription * []) {
> &vmstate_spapr_cpu_vpa,
> + &vmstate_spapr_cpu_nested,
> NULL
> }
The vmstate_spapr_cpu_nested stuff looks good too, this is real
information that we weren't migrating and can't be recovered from elsewhere.
> };
> diff --git a/target/ppc/machine.c b/target/ppc/machine.c
> index 7ee1984500..ae09b1bcfe 100644
> --- a/target/ppc/machine.c
> +++ b/target/ppc/machine.c
> @@ -10,6 +10,7 @@
> #include "kvm_ppc.h"
> #include "power8-pmu.h"
> #include "hw/ppc/ppc.h"
> +#include "hw/ppc/spapr_cpu_core.h"
>
> static void post_load_update_msr(CPUPPCState *env)
> {
> @@ -679,6 +680,48 @@ static const VMStateDescription vmstate_tb_env = {
> }
> };
>
> +static const VMStateDescription vmstate_hdecr = {
> + .name = "cpu/hdecr",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .fields = (VMStateField[]) {
> + VMSTATE_UINT64(hdecr_next, ppc_tb_t),
> + VMSTATE_TIMER_PTR(hdecr_timer, ppc_tb_t),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> +static bool nested_needed(void *opaque)
> +{
> + PowerPCCPU *cpu = opaque;
> + SpaprCpuState *spapr_cpu = spapr_cpu_state(cpu);
> +
> + return spapr_cpu->in_nested;
> +}
> +
> +static int nested_pre_load(void *opaque)
> +{
> + PowerPCCPU *cpu = opaque;
> + CPUPPCState *env = &cpu->env;
> +
> + cpu_ppc_hdecr_init(env);
> +
> + return 0;
> +}
> +
> +static const VMStateDescription vmstate_nested = {
> + .name = "cpu/nested-guest",
> + .version_id = 1,
> + .minimum_version_id = 1,
> + .needed = nested_needed,
> + .pre_load = nested_pre_load,
> + .fields = (VMStateField[]) {
> + VMSTATE_STRUCT_POINTER_V(env.tb_env, PowerPCCPU, 1,
> + vmstate_hdecr, ppc_tb_t),
> + VMSTATE_END_OF_LIST()
> + }
> +};
> +
> const VMStateDescription vmstate_ppc_cpu = {
> .name = "cpu",
> .version_id = 5,
> @@ -734,6 +777,7 @@ const VMStateDescription vmstate_ppc_cpu = {
> &vmstate_tlbemb,
> &vmstate_tlbmas,
> &vmstate_compat,
> + &vmstate_nested,
The hdecr stuff doesn't seem quite right. Notionally the L1 cpu,
since it is in PAPR mode, doesn't *have* an HDECR. It's only the L0
nested-KVM extensions that allow it to kind of fake access to an
HDECR. We're kind of abusing the HDECR fields in the cpu structure
for this. At the very least I think the fake-HDECR migration stuff
needs to go in the spapr_cpu_state not the general cpu state, since it
would make no sense if the L1 were a powernv system.
* Re: [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG)
2022-02-24 21:00 ` [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Mark Cave-Ayland
@ 2022-02-25 3:54 ` David Gibson
2022-02-25 16:11 ` Fabiano Rosas
1 sibling, 0 replies; 20+ messages in thread
From: David Gibson @ 2022-02-25 3:54 UTC (permalink / raw)
To: Mark Cave-Ayland
Cc: Fabiano Rosas, danielhb413, qemu-devel, npiggin, qemu-ppc, clg
On Thu, Feb 24, 2022 at 09:00:24PM +0000, Mark Cave-Ayland wrote:
> On 24/02/2022 18:58, Fabiano Rosas wrote:
>
> > This series implements the migration for a TCG pseries guest running a
> > nested KVM guest. This is just like migrating a pseries TCG guest, but
> > with some extra state to allow a nested guest to continue to run on
> > the destination.
> >
> > Unfortunately the regular TCG migration scenario (not nested) is not
> > fully working so I cannot be entirely sure the nested migration is
> > correct. I have included a couple of patches for the general migration
> > case that (I think?) improve the situation a bit, but I'm still seeing
> > hard lockups and other issues with more than 1 vcpu.
> >
> > This is more of an early RFC to see if anyone spots something right
> > away. I haven't made much progress in debugging the general TCG
> > migration case so if anyone has any input there as well I'd appreciate
> > it.
> >
> > Thanks
> >
> > Fabiano Rosas (4):
> > target/ppc: TCG: Migrate tb_offset and decr
> > spapr: TCG: Migrate spapr_cpu->prod
> > hw/ppc: Take nested guest into account when saving timebase
> > spapr: Add KVM-on-TCG migration support
> >
> > hw/ppc/ppc.c | 17 +++++++-
> > hw/ppc/spapr.c | 19 ++++++++
> > hw/ppc/spapr_cpu_core.c | 77 +++++++++++++++++++++++++++++++++
> > include/hw/ppc/spapr_cpu_core.h | 2 +-
> > target/ppc/machine.c | 61 ++++++++++++++++++++++++++
> > 5 files changed, 174 insertions(+), 2 deletions(-)
>
> FWIW I noticed there were some issues with migrating the decrementer on Mac
> machines a while ago which causes a hang on the destination with TCG (for
> MacOS on a x86 host in my case). Have a look at the following threads for
> reference:
>
> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00546.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg04622.html
>
> IIRC there is code that assumes any migration in PPC is being done live, and
> so adjusts the timebase on the destination to reflect wall clock time by
> recalculating tb_offset. I haven't looked at the code for a while but I
> think the outcome was that there needs to be 2 phases in migration: the
> first is to migrate the timebase as-is for guests that are paused during
> migration, whilst the second is to notify hypervisor-aware guest OSs such as
> Linux to make the timebase adjustment if required if the guest is running.
Whether the timebase is adjusted for the migration downtime depends on
whether the guest clock is pinned to wall clock time or not. Usually
it should be (because you don't want your clocks to go wrong on
migration of a production system). However, in neither case should
the guest be involved.
There may be guest side code related to this in Linux, but that's
probably for migration under pHyp, which is a guest aware migration
system. That's essentially unrelated to migration under qemu/kvm,
which is a guest unaware system.
Guest aware migration has some nice-sounding advantages; in particular
it can allow migrations across a heterogeneous cluster with differences
between hosts that the hypervisor can't hide, or can't efficiently
hide. However, it is, IMO, a deeply broken approach, because it can
allow an un-cooperative guest to indefinitely block migration, and for
it to be reliably correct it requires *much* more pinning down of
exactly what host system changes the guest can and can't be expected
to cope with than PAPR has ever bothered to do.
* Re: [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support
2022-02-25 3:42 ` David Gibson
@ 2022-02-25 10:57 ` Nicholas Piggin
0 siblings, 0 replies; 20+ messages in thread
From: Nicholas Piggin @ 2022-02-25 10:57 UTC (permalink / raw)
To: David Gibson, Fabiano Rosas; +Cc: aik, danielhb413, qemu-ppc, clg, qemu-devel
Excerpts from David Gibson's message of February 25, 2022 1:42 pm:
> On Thu, Feb 24, 2022 at 03:58:17PM -0300, Fabiano Rosas wrote:
>> This adds migration support for TCG pseries machines running a KVM-HV
>> guest.
>> @@ -734,6 +777,7 @@ const VMStateDescription vmstate_ppc_cpu = {
>> &vmstate_tlbemb,
>> &vmstate_tlbmas,
>> &vmstate_compat,
>> + &vmstate_nested,
>
> The hdecr stuff doesn't seem quite right. Notionally the L1 cpu,
> since it is in PAPR mode, doesn't *have* an HDECR. It's only the L0
> nested-KVM extensions that allow it to kind of fake access to an
> HDECR. We're kind of abusing the HDECR fields in the cpu structure
> for this. At the very least I think the fake-HDECR migration stuff
> needs to go in the spapr_cpu_state not the general cpu state, since it
> would make no sense if the L1 were a powernv system.
We could possibly just make a new timer for this, btw. It's not really
buying a lot and may be slightly confusing to share hdecr. It could
still cause an HDEC exception though.
(If we really wanted to divorce that confusion about the HV architecture
from the vhyp nested L0 implementation we could come up with an entirely
new set of "exit nested" kind of exceptions that don't look like HV
interrupts, but reusing the HV interrupts was simple and kind of neat
in a way).
Thanks,
Nick
* Re: [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr
2022-02-25 3:15 ` David Gibson
@ 2022-02-25 16:08 ` Fabiano Rosas
2022-02-28 2:04 ` David Gibson
0 siblings, 1 reply; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-25 16:08 UTC (permalink / raw)
To: David Gibson; +Cc: danielhb413, qemu-ppc, qemu-devel, npiggin, clg
David Gibson <david@gibson.dropbear.id.au> writes:
> On Thu, Feb 24, 2022 at 03:58:14PM -0300, Fabiano Rosas wrote:
>> These two were not migrated so the remote end was starting with the
>> decrementer expired.
>>
>> I am seeing less frequent crashes with this patch (tested with -smp 4
>> and -smp 32). It certainly doesn't fix all issues but it looks like it
>> helps.
>>
>> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
>
> Nack. This completely breaks migration compatibility for all ppc
> machines. In order to maintain compatibility as Richard says new info
> has to go into a subsection, with a needed function that allows
> migration of existing machine types both to and from a new qemu
> version.
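The compatibility requirement can be shown with a toy serializer. This is not QEMU's vmstate API, just a sketch of the subsection idea: an optional chunk is emitted only when its "needed" predicate is true, so a guest without the new state produces exactly the old byte stream (loadable by an old QEMU), and an old stream with no subsection still loads on a new QEMU.

```c
#include <stdint.h>
#include <string.h>

/* Toy sketch of vmstate-style subsections; not QEMU's API. */
typedef struct {
    uint64_t base;      /* always-migrated state */
    int64_t tb_offset;  /* new, optional state */
    int in_nested;      /* subsection needed only when set */
} State;

static int subsection_needed(const State *s) { return s->in_nested; }

static size_t save(const State *s, uint8_t *buf)
{
    size_t n = 0;
    memcpy(buf + n, &s->base, sizeof s->base); n += sizeof s->base;
    if (subsection_needed(s)) {
        buf[n++] = 1;  /* subsection marker */
        memcpy(buf + n, &s->tb_offset, sizeof s->tb_offset);
        n += sizeof s->tb_offset;
    }
    return n;
}

static void load(State *s, const uint8_t *buf, size_t len)
{
    size_t n = 0;
    memcpy(&s->base, buf + n, sizeof s->base); n += sizeof s->base;
    s->in_nested = 0;
    s->tb_offset = 0;
    if (n < len && buf[n] == 1) {   /* optional subsection present */
        n++;
        memcpy(&s->tb_offset, buf + n, sizeof s->tb_offset);
        s->in_nested = 1;
    }
}
```

Adding a mandatory field to the main section, by contrast, changes the stream for every machine and breaks both directions of migration with older versions.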
Ok, I'm still learning the tenets of migration. I'll give more thought
to that in the next versions.
>
> There are also some other problems.
>
>> ---
>> target/ppc/machine.c | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/target/ppc/machine.c b/target/ppc/machine.c
>> index 1b63146ed1..7ee1984500 100644
>> --- a/target/ppc/machine.c
>> +++ b/target/ppc/machine.c
>> @@ -9,6 +9,7 @@
>> #include "qemu/main-loop.h"
>> #include "kvm_ppc.h"
>> #include "power8-pmu.h"
>> +#include "hw/ppc/ppc.h"
>>
>> static void post_load_update_msr(CPUPPCState *env)
>> {
>> @@ -666,6 +667,18 @@ static const VMStateDescription vmstate_compat = {
>> }
>> };
>>
>> +static const VMStateDescription vmstate_tb_env = {
>> + .name = "cpu/env/tb_env",
>
> Because spapr requires that all cpus have synchronized timebases, we
> migrate the timebase state from vmstate_spapr, not from each cpu
> individually, to make sure it stays synchronized across migration. If
> that's not working right, that's a bug, but it needs to be fixed
> there, not just plastered over with extra information transmitted at
> cpu level.
Ok, so is that what guest_timebase is about? We shouldn't need to
migrate DECR individually then.
>> + .version_id = 1,
>> + .minimum_version_id = 1,
>> + .fields = (VMStateField[]) {
>> + VMSTATE_INT64(tb_offset, ppc_tb_t),
>
> tb_offset isn't a good thing to put directly in the migration stream.
> tb_offset has kind of non-obvious semantics to begin with: when we're
> dealing with TCG (which is your explicit case), there is no host TB,
> so what's this an offset from? That's why vmstate_ppc_timebase uses
> an explicit guest timebase value matched with a time of day in real
> units. Again, if there's a bug, that needs fixing there.
This should be in patch 4, but tb_offset is needed for the nested
case. When we enter the nested guest we increment tb_offset with
nested_tb_offset and decrement it when leaving the nested guest. So
tb_offset needs to carry this value over.
But maybe we could alternatively modify the nested code to just zero
tb_offset when leaving the guest instead? We could then remove
nested_tb_offset altogether.
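The save/restore variant of that alternative might look like this. It's only a sketch with hypothetical names (`enter_l2`, `exit_l2`): snapshot the L1 tb_offset on nested entry and put it back on exit, so nothing nested-specific is left in tb_offset and nothing extra needs migrating when the vcpu is outside the L2.

```c
#include <stdint.h>

/* Sketch of the alternative floated above; names are hypothetical. */
typedef struct {
    int64_t tb_offset;          /* offset currently applied to the TB */
    int64_t saved_l1_tb_offset; /* snapshot taken at nested entry */
} Vcpu;

static void enter_l2(Vcpu *v, int64_t l2_tb_offset)
{
    v->saved_l1_tb_offset = v->tb_offset;
    v->tb_offset = l2_tb_offset;  /* switch to the L2's own offset */
}

static void exit_l2(Vcpu *v)
{
    v->tb_offset = v->saved_l1_tb_offset; /* back to the L1 view */
}
```

Whether this fits the existing nested entry/exit paths is an open question; the point is only that the L1/L2 split can be kept out of the migrated tb_offset.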
>> + VMSTATE_UINT64(decr_next, ppc_tb_t),
>> + VMSTATE_TIMER_PTR(decr_timer, ppc_tb_t),
>
> You're attempting to migrate decr_next and decr_timer, but not the
> actual DECR value, which makes me suspect that *is* being migrated.
> In which case you should be able to reconstruct the next tick and
> timer state in a post_load function on receive. If that's buggy, it
> needs to be fixed there.
There isn't any "actual DECR" when running TCG, is there? My
understanding is that it is embedded in the QEMUTimer.
Mark mentioned this years ago:
"I can't see anything in __cpu_ppc_store_decr() that
updates the spr[SPR_DECR] value when the decrementer register is
changed"
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg01114.html
Your answer was along the lines of:
"we should be reconstructing the decrementer on
the destination based on an offset from the timebase"
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg01373.html
So if I'm getting this right, in TCG we don't touch SPR_DECR because we
only effectively care about what is in the decr_timer->expire_time.
And in KVM we don't migrate DECR explicitly because we use the tb_offset
and timebase_save/timebase_load to ensure the tb_offset in the
destination has the correct value.
But timebase_save/timebase_load are not used for TCG currently. So there
would be nothing transferring the current decr value to the other side.
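The relationship described above can be made concrete. A rough sketch (illustrative names and constants, not QEMU code): in TCG the DECR lives implicitly in the timer's expire time, so the current value is just the ticks remaining until the timer fires.

```c
#include <stdint.h>

/* Sketch: recovering a TCG decrementer value from its timer.
 * The DECR is not stored anywhere; only the expire time is. */
#define NS_PER_SEC 1000000000LL

static int64_t ns_to_tb(int64_t ns, int64_t tb_freq_hz)
{
    return ns * tb_freq_hz / NS_PER_SEC;
}

/* Current DECR = ticks remaining until the timer fires. */
static int64_t decr_from_timer(int64_t expire_ns, int64_t now_ns,
                               int64_t tb_freq_hz)
{
    return ns_to_tb(expire_ns - now_ns, tb_freq_hz);
}
```

A pre_save hook could derive a DECR value this way and a post_load hook could re-arm the timer from it, rather than migrating the timer state itself.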
>> + VMSTATE_END_OF_LIST()
>> + }
>> +};
>> +
>> const VMStateDescription vmstate_ppc_cpu = {
>> .name = "cpu",
>> .version_id = 5,
>> @@ -696,12 +709,16 @@ const VMStateDescription vmstate_ppc_cpu = {
>> /* Backward compatible internal state */
>> VMSTATE_UINTTL(env.hflags_compat_nmsr, PowerPCCPU),
>>
>> + VMSTATE_STRUCT_POINTER_V(env.tb_env, PowerPCCPU, 1,
>> + vmstate_tb_env, ppc_tb_t),
>> +
>> /* Sanity checking */
>> VMSTATE_UINTTL_TEST(mig_msr_mask, PowerPCCPU, cpu_pre_2_8_migration),
>> VMSTATE_UINT64_TEST(mig_insns_flags, PowerPCCPU, cpu_pre_2_8_migration),
>> VMSTATE_UINT64_TEST(mig_insns_flags2, PowerPCCPU,
>> cpu_pre_2_8_migration),
>> VMSTATE_UINT32_TEST(mig_nb_BATs, PowerPCCPU, cpu_pre_2_8_migration),
>> +
>> VMSTATE_END_OF_LIST()
>> },
>> .subsections = (const VMStateDescription*[]) {
* Re: [RFC PATCH 2/4] spapr: TCG: Migrate spapr_cpu->prod
2022-02-25 3:17 ` David Gibson
@ 2022-02-25 16:08 ` Fabiano Rosas
0 siblings, 0 replies; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-25 16:08 UTC (permalink / raw)
To: David Gibson; +Cc: danielhb413, qemu-ppc, qemu-devel, npiggin, clg
David Gibson <david@gibson.dropbear.id.au> writes:
> On Thu, Feb 24, 2022 at 03:58:15PM -0300, Fabiano Rosas wrote:
>> I'm seeing some stack traces in the migrated guest going through cede
>> and some hangs at the plpar_hcall_norets so let's make sure everything
>> related to cede/prod is being migrated just in case.
>
> This is a poor approach in general. Migration becomes even harder to
> maintain than it already is if you don't pare down the set of migrated
> data to something minimal and non-redundant.
>
> If you want to migrate prod, you have to give a case for why you
> *need* it, not "just in case".
Ah yes, I'm not actually trying to merge stuff without a good
explanation. I haven't even delineated the problem properly. But I know
little about migration so I need to do some probing, bear with me
please.
> Also, you have to put this in a subsection with a needed function in
> order not to break compatibility.
>
>>
>> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
>> ---
>> hw/ppc/spapr_cpu_core.c | 1 +
>> include/hw/ppc/spapr_cpu_core.h | 2 +-
>> 2 files changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
>> index ed84713960..efda7730f1 100644
>> --- a/hw/ppc/spapr_cpu_core.c
>> +++ b/hw/ppc/spapr_cpu_core.c
>> @@ -179,6 +179,7 @@ static const VMStateDescription vmstate_spapr_cpu_state = {
>> .version_id = 1,
>> .minimum_version_id = 1,
>> .fields = (VMStateField[]) {
>> + VMSTATE_BOOL(prod, SpaprCpuState),
>> VMSTATE_END_OF_LIST()
>> },
>> .subsections = (const VMStateDescription * []) {
>> diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
>> index b560514560..2772689c84 100644
>> --- a/include/hw/ppc/spapr_cpu_core.h
>> +++ b/include/hw/ppc/spapr_cpu_core.h
>> @@ -45,7 +45,7 @@ typedef struct SpaprCpuState {
>> uint64_t vpa_addr;
>> uint64_t slb_shadow_addr, slb_shadow_size;
>> uint64_t dtl_addr, dtl_size;
>> - bool prod; /* not migrated, only used to improve dispatch latencies */
>> + bool prod;
>> struct ICPState *icp;
>> struct XiveTCTX *tctx;
>>
* Re: [RFC PATCH 3/4] hw/ppc: Take nested guest into account when saving timebase
2022-02-25 3:21 ` David Gibson
@ 2022-02-25 16:08 ` Fabiano Rosas
2022-02-28 2:06 ` David Gibson
0 siblings, 1 reply; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-25 16:08 UTC (permalink / raw)
To: David Gibson; +Cc: danielhb413, qemu-ppc, qemu-devel, npiggin, clg
David Gibson <david@gibson.dropbear.id.au> writes:
> On Thu, Feb 24, 2022 at 03:58:16PM -0300, Fabiano Rosas wrote:
>> When saving the guest "timebase" we look to the first_cpu for its
>> tb_offset. If that CPU happens to be running a nested guest at this
>> time, the tb_offset will have the nested guest value.
>>
>> This was caught by code inspection.
>
> This doesn't seem right. Isn't the real problem that nested_tb_offset
> isn't being migrated? If you migrate that, shouldn't everything be
> fixed up when the L1 cpu leaves the nested guest on the destination
> host?
This uses first_cpu, so after we introduced the nested guest code, this
value has become dependent on what the first_cpu is doing. If it happens
to be running the nested guest when we migrate, then guest_timebase here
will have a different value from the one it would have if we had used
another cpu's tb_offset.
Now, we might have a bug or at least an inefficiency here because
timebase_load is never called for the TCG migration case. The
cpu_ppc_clock_vm_state_change callback is only registered for KVM. So in
TCG we call timebase_save during pre_save, migrate the guest_timebase,
but never do anything with it on the remote side.
>>
>> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
>> ---
>> hw/ppc/ppc.c | 17 ++++++++++++++++-
>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
>> index 9e99625ea9..093cd87014 100644
>> --- a/hw/ppc/ppc.c
>> +++ b/hw/ppc/ppc.c
>> @@ -36,6 +36,7 @@
>> #include "kvm_ppc.h"
>> #include "migration/vmstate.h"
>> #include "trace.h"
>> +#include "hw/ppc/spapr_cpu_core.h"
>>
>> static void cpu_ppc_tb_stop (CPUPPCState *env);
>> static void cpu_ppc_tb_start (CPUPPCState *env);
>> @@ -961,19 +962,33 @@ static void timebase_save(PPCTimebase *tb)
>> {
>> uint64_t ticks = cpu_get_host_ticks();
>> PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
>> + int64_t tb_offset;
>>
>> if (!first_ppc_cpu->env.tb_env) {
>> error_report("No timebase object");
>> return;
>> }
>>
>> + tb_offset = first_ppc_cpu->env.tb_env->tb_offset;
>> +
>> + if (first_ppc_cpu->vhyp && vhyp_cpu_in_nested(first_ppc_cpu)) {
>> + SpaprCpuState *spapr_cpu = spapr_cpu_state(first_ppc_cpu);
>> +
>> + /*
>> + * If the first_cpu happens to be running a nested guest at
>> + * this time, tb_env->tb_offset will contain the nested guest
>> + * offset.
>> + */
>> + tb_offset -= spapr_cpu->nested_tb_offset;
>> + }
>> +
>> /* not used anymore, we keep it for compatibility */
>> tb->time_of_the_day_ns = qemu_clock_get_ns(QEMU_CLOCK_HOST);
>> /*
>> * tb_offset is only expected to be changed by QEMU so
>> * there is no need to update it from KVM here
>> */
>> - tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset;
>> + tb->guest_timebase = ticks + tb_offset;
>>
>> tb->runstate_paused =
>> runstate_check(RUN_STATE_PAUSED) || runstate_check(RUN_STATE_SAVE_VM);
* Re: [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG)
2022-02-24 21:00 ` [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Mark Cave-Ayland
2022-02-25 3:54 ` David Gibson
@ 2022-02-25 16:11 ` Fabiano Rosas
1 sibling, 0 replies; 20+ messages in thread
From: Fabiano Rosas @ 2022-02-25 16:11 UTC (permalink / raw)
To: Mark Cave-Ayland, qemu-devel; +Cc: danielhb413, qemu-ppc, clg, npiggin, david
Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk> writes:
> On 24/02/2022 18:58, Fabiano Rosas wrote:
>
>> This series implements the migration for a TCG pseries guest running a
>> nested KVM guest. This is just like migrating a pseries TCG guest, but
>> with some extra state to allow a nested guest to continue to run on
>> the destination.
>>
>> Unfortunately the regular TCG migration scenario (not nested) is not
>> fully working so I cannot be entirely sure the nested migration is
>> correct. I have included a couple of patches for the general migration
>> case that (I think?) improve the situation a bit, but I'm still seeing
>> hard lockups and other issues with more than 1 vcpu.
>>
>> This is more of an early RFC to see if anyone spots something right
>> away. I haven't made much progress in debugging the general TCG
>> migration case so if anyone has any input there as well I'd appreciate
>> it.
>>
>> Thanks
>>
>> Fabiano Rosas (4):
>> target/ppc: TCG: Migrate tb_offset and decr
>> spapr: TCG: Migrate spapr_cpu->prod
>> hw/ppc: Take nested guest into account when saving timebase
>> spapr: Add KVM-on-TCG migration support
>>
>> hw/ppc/ppc.c | 17 +++++++-
>> hw/ppc/spapr.c | 19 ++++++++
>> hw/ppc/spapr_cpu_core.c | 77 +++++++++++++++++++++++++++++++++
>> include/hw/ppc/spapr_cpu_core.h | 2 +-
>> target/ppc/machine.c | 61 ++++++++++++++++++++++++++
>> 5 files changed, 174 insertions(+), 2 deletions(-)
>
> FWIW I noticed there were some issues with migrating the decrementer on Mac machines
> a while ago which causes a hang on the destination with TCG (for MacOS on a x86 host
> in my case). Have a look at the following threads for reference:
>
> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00546.html
> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg04622.html
Thanks, Mark! There's a lot of helpful information in these threads.
> IIRC there is code that assumes any migration in PPC is being done
> live, and so adjusts the timebase on the destination to reflect wall
> clock time by recalculating tb_offset. I haven't looked at the code for
> a while but I think the outcome was that there needs to be 2 phases in
> migration: the first is to migrate the timebase as-is for guests that
> are paused during migration, whilst the second is to notify
> hypervisor-aware guest OSs such as Linux to make the timebase
> adjustment if required if the guest is running.
>
>
> ATB,
>
> Mark.
* Re: [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr
2022-02-25 16:08 ` Fabiano Rosas
@ 2022-02-28 2:04 ` David Gibson
0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2022-02-28 2:04 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: danielhb413, qemu-ppc, qemu-devel, npiggin, clg
On Fri, Feb 25, 2022 at 01:08:10PM -0300, Fabiano Rosas wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
>
> > On Thu, Feb 24, 2022 at 03:58:14PM -0300, Fabiano Rosas wrote:
> >> These two were not migrated so the remote end was starting with the
> >> decrementer expired.
> >>
> >> I am seeing less frequent crashes with this patch (tested with -smp 4
> >> and -smp 32). It certainly doesn't fix all issues but it looks like it
> >> helps.
> >>
> >> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> >
> > Nack. This completely breaks migration compatibility for all ppc
> > machines. In order to maintain compatibility as Richard says new info
> > has to go into a subsection, with a needed function that allows
> > migration of existing machine types both to and from a new qemu
> > version.
>
> Ok, I'm still learning the tenets of migration. I'll give more thought
> to that in the next versions.
Fair enough. I'd had a very frustrating week for entirely unrelated
reasons, so I was probably a bit unfairly critical.
> > There are also some other problems.
> >
> >> ---
> >> target/ppc/machine.c | 17 +++++++++++++++++
> >> 1 file changed, 17 insertions(+)
> >>
> >> diff --git a/target/ppc/machine.c b/target/ppc/machine.c
> >> index 1b63146ed1..7ee1984500 100644
> >> --- a/target/ppc/machine.c
> >> +++ b/target/ppc/machine.c
> >> @@ -9,6 +9,7 @@
> >> #include "qemu/main-loop.h"
> >> #include "kvm_ppc.h"
> >> #include "power8-pmu.h"
> >> +#include "hw/ppc/ppc.h"
> >>
> >> static void post_load_update_msr(CPUPPCState *env)
> >> {
> >> @@ -666,6 +667,18 @@ static const VMStateDescription vmstate_compat = {
> >> }
> >> };
> >>
> >> +static const VMStateDescription vmstate_tb_env = {
> >> + .name = "cpu/env/tb_env",
> >
> > Because spapr requires that all cpus have synchronized timebases, we
> > migrate the timebase state from vmstate_spapr, not from each cpu
> > individually, to make sure it stays synchronized across migration. If
> > that's not working right, that's a bug, but it needs to be fixed
> > there, not just plastered over with extra information transmitted at
> > cpu level.
>
> Ok, so is that what guest_timebase is about? We shouldn't need to
> migrate DECR individually then.
Probably not. Unlike the TB there is obviously per-cpu state that has
to be migrated for DECR, and I'm not immediately sure how that's
handled right now. I think we would be a lot more broken if we
weren't currently migrating the DECRs in at least most ordinary
circumstances.
> >> + .version_id = 1,
> >> + .minimum_version_id = 1,
> >> + .fields = (VMStateField[]) {
> >> + VMSTATE_INT64(tb_offset, ppc_tb_t),
> >
> > tb_offset isn't a good thing to put directly in the migration stream.
> > tb_offset has kind of non-obvious semantics to begin with: when we're
> > dealing with TCG (which is your explicit case), there is no host TB,
> > so what's this an offset from? That's why vmstate_ppc_timebase uses
> > an explicit guest timebase value matched with a time of day in real
> > units. Again, if there's a bug, that needs fixing there.
>
> This should be in patch 4, but tb_offset is needed for the nested
> case. When we enter the nested guest we increment tb_offset with
> nested_tb_offset and decrement it when leaving the nested guest. So
> tb_offset needs to carry this value over.
Yeah.. that's not really going to work. We'll have to instead
reconstruct tb_offset from the real-time based stuff we have, then use
that on the destination where we need it.
> But maybe we could alternatively modify the nested code to just zero
> tb_offset when leaving the guest instead? We could then remove
> nested_tb_offset altogether.
Uh.. maybe? I don't remember how the nested implementation works well
enough to quickly assess if that will work.
>
> >> + VMSTATE_UINT64(decr_next, ppc_tb_t),
> >> + VMSTATE_TIMER_PTR(decr_timer, ppc_tb_t),
> >
> > You're attempting to migrate decr_next and decr_timer, but not the
> > actual DECR value, which makes me suspect that *is* being migrated.
> > In which case you should be able to reconstruct the next tick and
> > timer state in a post_load function on receive. If that's buggy, it
> > needs to be fixed there.
>
> There isn't any "actual DECR" when running TCG, is there? My
> understanding is that it is embedded in the QEMUTimer.
>
> Mark mentioned this years ago:
>
> "I can't see anything in __cpu_ppc_store_decr() that
> updates the spr[SPR_DECR] value when the decrementer register is
> changed"
>
> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg01114.html
>
> Your answer was along the lines of:
>
> "we should be reconstructing the decrementer on
> the destination based on an offset from the timebase"
>
> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg01373.html
>
> So if I'm getting this right, in TCG we don't touch SPR_DECR because we
> only effectively care about what is in the decr_timer->expire_time.
>
> And in KVM we don't migrate DECR explicitly because we use the tb_offset
> and timebase_save/timebase_load to ensure the tb_offset in the
> destination has the correct value.
>
> But timebase_save/timebase_load are not used for TCG currently. So there
> would be nothing transferring the current decr value to the other side.
Ah.. good points. What we need to make sure of is that all the values
which spr_read_decr / gen_helper_load_decr need to make their
calculation are correctly migrated.
> >> + VMSTATE_END_OF_LIST()
> >> + }
> >> +};
> >> +
> >> const VMStateDescription vmstate_ppc_cpu = {
> >> .name = "cpu",
> >> .version_id = 5,
> >> @@ -696,12 +709,16 @@ const VMStateDescription vmstate_ppc_cpu = {
> >> /* Backward compatible internal state */
> >> VMSTATE_UINTTL(env.hflags_compat_nmsr, PowerPCCPU),
> >>
> >> + VMSTATE_STRUCT_POINTER_V(env.tb_env, PowerPCCPU, 1,
> >> + vmstate_tb_env, ppc_tb_t),
> >> +
> >> /* Sanity checking */
> >> VMSTATE_UINTTL_TEST(mig_msr_mask, PowerPCCPU, cpu_pre_2_8_migration),
> >> VMSTATE_UINT64_TEST(mig_insns_flags, PowerPCCPU, cpu_pre_2_8_migration),
> >> VMSTATE_UINT64_TEST(mig_insns_flags2, PowerPCCPU,
> >> cpu_pre_2_8_migration),
> >> VMSTATE_UINT32_TEST(mig_nb_BATs, PowerPCCPU, cpu_pre_2_8_migration),
> >> +
> >> VMSTATE_END_OF_LIST()
> >> },
> >> .subsections = (const VMStateDescription*[]) {
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [RFC PATCH 3/4] hw/ppc: Take nested guest into account when saving timebase
2022-02-25 16:08 ` Fabiano Rosas
@ 2022-02-28 2:06 ` David Gibson
0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2022-02-28 2:06 UTC (permalink / raw)
To: Fabiano Rosas; +Cc: danielhb413, qemu-ppc, qemu-devel, npiggin, clg
On Fri, Feb 25, 2022 at 01:08:46PM -0300, Fabiano Rosas wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
>
> > On Thu, Feb 24, 2022 at 03:58:16PM -0300, Fabiano Rosas wrote:
> >> When saving the guest "timebase" we look to the first_cpu for its
> >> tb_offset. If that CPU happens to be running a nested guest at this
> >> time, the tb_offset will have the nested guest value.
> >>
> >> This was caught by code inspection.
> >
> > This doesn't seem right. Isn't the real problem that nested_tb_offset
> > isn't being migrated? If you migrate that, shouldn't everything be
> > fixed up when the L1 cpu leaves the nested guest on the destination
> > host?
>
> This uses first_cpu, so after we introduced the nested guest code, this
> value has become dependent on what the first_cpu is doing. If it happens
> to be running the nested guest when we migrate, then guest_timebase here
> will have a different value from the one it would have if we had used
> another cpu's tb_offset.
Oh, good point. Yes, you probably do need this.
> Now, we might have a bug or at least an inefficiency here because
> timebase_load is never called for the TCG migration case. The
> cpu_ppc_clock_vm_state_change callback is only registered for KVM. So in
> TCG we call timebase_save during pre_save, migrate the guest_timebase,
> but never do anything with it on the remote side.
Oh.. yeah.. that does sound buggy. Unless the logic you need is
done at the time the TB or DECR is read.
>
> >>
> >> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> >> ---
> >> hw/ppc/ppc.c | 17 ++++++++++++++++-
> >> 1 file changed, 16 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
> >> index 9e99625ea9..093cd87014 100644
> >> --- a/hw/ppc/ppc.c
> >> +++ b/hw/ppc/ppc.c
> >> @@ -36,6 +36,7 @@
> >> #include "kvm_ppc.h"
> >> #include "migration/vmstate.h"
> >> #include "trace.h"
> >> +#include "hw/ppc/spapr_cpu_core.h"
> >>
> >> static void cpu_ppc_tb_stop (CPUPPCState *env);
> >> static void cpu_ppc_tb_start (CPUPPCState *env);
> >> @@ -961,19 +962,33 @@ static void timebase_save(PPCTimebase *tb)
> >> {
> >> uint64_t ticks = cpu_get_host_ticks();
> >> PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
> >> + int64_t tb_offset;
> >>
> >> if (!first_ppc_cpu->env.tb_env) {
> >> error_report("No timebase object");
> >> return;
> >> }
> >>
> >> + tb_offset = first_ppc_cpu->env.tb_env->tb_offset;
> >> +
> >> + if (first_ppc_cpu->vhyp && vhyp_cpu_in_nested(first_ppc_cpu)) {
> >> + SpaprCpuState *spapr_cpu = spapr_cpu_state(first_ppc_cpu);
> >> +
> >> + /*
> >> + * If the first_cpu happens to be running a nested guest at
> >> + * this time, tb_env->tb_offset will contain the nested guest
> >> + * offset.
> >> + */
> >> + tb_offset -= spapr_cpu->nested_tb_offset;
> >> + }
> >> +
> >> /* not used anymore, we keep it for compatibility */
> >> tb->time_of_the_day_ns = qemu_clock_get_ns(QEMU_CLOCK_HOST);
> >> /*
> >> * tb_offset is only expected to be changed by QEMU so
> >> * there is no need to update it from KVM here
> >> */
> >> - tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset;
> >> + tb->guest_timebase = ticks + tb_offset;
> >>
> >> tb->runstate_paused =
> >> runstate_check(RUN_STATE_PAUSED) || runstate_check(RUN_STATE_SAVE_VM);
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
end of thread, other threads:[~2022-02-28 2:18 UTC | newest]
Thread overview: 20+ messages
2022-02-24 18:58 [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Fabiano Rosas
2022-02-24 18:58 ` [RFC PATCH 1/4] target/ppc: TCG: Migrate tb_offset and decr Fabiano Rosas
2022-02-24 20:06 ` Richard Henderson
2022-02-25 3:15 ` David Gibson
2022-02-25 16:08 ` Fabiano Rosas
2022-02-28 2:04 ` David Gibson
2022-02-24 18:58 ` [RFC PATCH 2/4] spapr: TCG: Migrate spapr_cpu->prod Fabiano Rosas
2022-02-25 3:17 ` David Gibson
2022-02-25 16:08 ` Fabiano Rosas
2022-02-24 18:58 ` [RFC PATCH 3/4] hw/ppc: Take nested guest into account when saving timebase Fabiano Rosas
2022-02-25 3:21 ` David Gibson
2022-02-25 16:08 ` Fabiano Rosas
2022-02-28 2:06 ` David Gibson
2022-02-24 18:58 ` [RFC PATCH 4/4] spapr: Add KVM-on-TCG migration support Fabiano Rosas
2022-02-25 0:51 ` Nicholas Piggin
2022-02-25 3:42 ` David Gibson
2022-02-25 10:57 ` Nicholas Piggin
2022-02-24 21:00 ` [RFC PATCH 0/4] ppc: nested TCG migration (KVM-on-TCG) Mark Cave-Ayland
2022-02-25 3:54 ` David Gibson
2022-02-25 16:11 ` Fabiano Rosas