* [PATCH 0/3] migration: Downtime tracepoints
@ 2023-10-26 15:53 Peter Xu
  2023-10-26 15:53 ` [PATCH 1/3] migration: Set downtime_start even for postcopy Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Peter Xu @ 2023-10-26 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: peterx, Fabiano Rosas, Juan Quintela, Joao Martins

This small series (actually only the last patch; the first two are cleanups)
aims to improve QEMU's downtime analysis capability, similar to what Joao
proposed here:

  https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com

But with a few differences:

  - Nothing exported yet to qapi, all tracepoints so far

  - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
    provide finer granularity with downtime measurements for each vmstate
    (I chose microseconds as the unit for accuracy).  So far it seems the
    iterable / non-iterable phases are the core of the problem, and I want
    to nail it down per-device.

  - Trace dest QEMU too

For the last bullet: consider the case where a device's save() can be super
fast, while its load() can actually be super slow.  Both of them will
contribute to the ultimate downtime, but not as a simple sum: while the src
QEMU is save()ing device1, the dst QEMU can be load()ing device2, so they
can run in parallel.  However, the only way to figure out all components of
the downtime is to record both sides.
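
To give an idea of the output (the values below are made up; the format
follows the tracepoints added in the last patch), the per-vmstate
tracepoints produce lines like:

  vmstate_downtime_save type=iterable idstr=ram instance_id=0 downtime=8237
  vmstate_downtime_save type=non-iterable idstr=0000:00:02.0/virtio-net instance_id=0 downtime=125
  vmstate_downtime_load type=non-iterable idstr=0000:00:02.0/virtio-net instance_id=0 downtime=941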

Please have a look, thanks.

Peter Xu (3):
  migration: Set downtime_start even for postcopy
  migration: Add migration_downtime_start|end() helpers
  migration: Add per vmstate downtime tracepoints

 migration/migration.c  | 38 +++++++++++++++++++++-----------
 migration/savevm.c     | 49 ++++++++++++++++++++++++++++++++++++++----
 migration/trace-events |  2 ++
 3 files changed, 72 insertions(+), 17 deletions(-)

-- 
2.41.0




* [PATCH 1/3] migration: Set downtime_start even for postcopy
  2023-10-26 15:53 [PATCH 0/3] migration: Downtime tracepoints Peter Xu
@ 2023-10-26 15:53 ` Peter Xu
  2023-10-26 17:05   ` Fabiano Rosas
  2023-10-26 15:53 ` [PATCH 2/3] migration: Add migration_downtime_start|end() helpers Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-26 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: peterx, Fabiano Rosas, Juan Quintela, Joao Martins

Postcopy calculates its downtime separately.  It always sets
MigrationState.downtime properly, but not MigrationState.downtime_start.

Make postcopy do the same as the other modes by properly recording the
timestamp when the VM is about to be stopped.  Drop the temporary variable
in postcopy_start() along the way.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 67547eb6a1..f8a54ff4d1 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2121,7 +2121,6 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     int ret;
     QIOChannelBuffer *bioc;
     QEMUFile *fb;
-    int64_t time_at_stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     uint64_t bandwidth = migrate_max_postcopy_bandwidth();
     bool restart_block = false;
     int cur_state = MIGRATION_STATUS_ACTIVE;
@@ -2143,6 +2142,8 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     qemu_mutex_lock_iothread();
     trace_postcopy_start_set_run();
 
+    ms->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
     qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
     global_state_store();
     ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
@@ -2245,7 +2246,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     ms->postcopy_after_devices = true;
     migration_call_notifiers(ms);
 
-    ms->downtime =  qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - time_at_stop;
+    ms->downtime = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - ms->downtime_start;
 
     qemu_mutex_unlock_iothread();
 
-- 
2.41.0




* [PATCH 2/3] migration: Add migration_downtime_start|end() helpers
  2023-10-26 15:53 [PATCH 0/3] migration: Downtime tracepoints Peter Xu
  2023-10-26 15:53 ` [PATCH 1/3] migration: Set downtime_start even for postcopy Peter Xu
@ 2023-10-26 15:53 ` Peter Xu
  2023-10-26 17:11   ` Fabiano Rosas
  2023-10-26 15:53 ` [PATCH 3/3] migration: Add per vmstate downtime tracepoints Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-26 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: peterx, Fabiano Rosas, Juan Quintela, Joao Martins

Unify the three places that record downtime with the same pair of helpers.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index f8a54ff4d1..70d775942a 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -101,6 +101,24 @@ static int migration_maybe_pause(MigrationState *s,
 static void migrate_fd_cancel(MigrationState *s);
 static int close_return_path_on_source(MigrationState *s);
 
+static void migration_downtime_start(MigrationState *s)
+{
+    s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+}
+
+static void migration_downtime_end(MigrationState *s)
+{
+    int64_t now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    /*
+     * If downtime already set, should mean that postcopy already set it,
+     * then that should be the real downtime already.
+     */
+    if (!s->downtime) {
+        s->downtime = now - s->downtime_start;
+    }
+}
+
 static bool migration_needs_multiple_sockets(void)
 {
     return migrate_multifd() || migrate_postcopy_preempt();
@@ -2142,7 +2160,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     qemu_mutex_lock_iothread();
     trace_postcopy_start_set_run();
 
-    ms->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    migration_downtime_start(ms);
 
     qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
     global_state_store();
@@ -2246,7 +2264,7 @@ static int postcopy_start(MigrationState *ms, Error **errp)
     ms->postcopy_after_devices = true;
     migration_call_notifiers(ms);
 
-    ms->downtime = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - ms->downtime_start;
+    migration_downtime_end(ms);
 
     qemu_mutex_unlock_iothread();
 
@@ -2342,7 +2360,7 @@ static int migration_completion_precopy(MigrationState *s,
     int ret;
 
     qemu_mutex_lock_iothread();
-    s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    migration_downtime_start(s);
     qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
 
     s->vm_old_state = runstate_get();
@@ -2699,15 +2717,8 @@ static void migration_calculate_complete(MigrationState *s)
     int64_t end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     int64_t transfer_time;
 
+    migration_downtime_end(s);
     s->total_time = end_time - s->start_time;
-    if (!s->downtime) {
-        /*
-         * It's still not set, so we are precopy migration.  For
-         * postcopy, downtime is calculated during postcopy_start().
-         */
-        s->downtime = end_time - s->downtime_start;
-    }
-
     transfer_time = s->total_time - s->setup_time;
     if (transfer_time) {
         s->mbps = ((double) bytes * 8.0) / transfer_time / 1000;
@@ -3126,7 +3137,7 @@ static void bg_migration_vm_start_bh(void *opaque)
     s->vm_start_bh = NULL;
 
     vm_start();
-    s->downtime = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - s->downtime_start;
+    migration_downtime_end(s);
 }
 
 /**
@@ -3193,7 +3204,7 @@ static void *bg_migration_thread(void *opaque)
     s->setup_time = qemu_clock_get_ms(QEMU_CLOCK_HOST) - setup_start;
 
     trace_migration_thread_setup_complete();
-    s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    migration_downtime_start(s);
 
     qemu_mutex_lock_iothread();
 
-- 
2.41.0




* [PATCH 3/3] migration: Add per vmstate downtime tracepoints
  2023-10-26 15:53 [PATCH 0/3] migration: Downtime tracepoints Peter Xu
  2023-10-26 15:53 ` [PATCH 1/3] migration: Set downtime_start even for postcopy Peter Xu
  2023-10-26 15:53 ` [PATCH 2/3] migration: Add migration_downtime_start|end() helpers Peter Xu
@ 2023-10-26 15:53 ` Peter Xu
  2023-10-26 16:06 ` [PATCH 0/3] migration: Downtime tracepoints Joao Martins
  2023-10-26 19:01 ` [PATCH 4/3] migration: Add tracepoints for downtime checkpoints Peter Xu
  4 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2023-10-26 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: peterx, Fabiano Rosas, Juan Quintela, Joao Martins

We have a bunch of savevm_section* tracepoints.  They are good for
analyzing the migration stream, but not always suitable for analyzing
migration downtime.  Two major problems:

  - The savevm_section* tracepoints dump all sections, while we only care
    about the sections that contribute to the downtime

  - They don't have an identifier to show the type of section, so there is
    no easy way to filter the downtime information either.

We could add the type into those tracepoints, but instead this patch keeps
them untouched and adds a set of downtime-specific tracepoints, so one can
enable "vmstate_downtime*" tracepoints and get a full picture of how the
downtime is distributed across iterable and non-iterable vmstate save/load.

Note that both save() and load() need to be traced here, because both of
them may contribute to the downtime.  The contribution is not a simple "add
them together", though: consider that the src may be save()ing device1
while the dst is load()ing device2, so they can happen concurrently.

Tracking both sides makes sense because device load() and save() can be
imbalanced: one device can save() super fast but load() super slow, or vice
versa.  We can't figure that out without tracing both.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/savevm.c     | 49 ++++++++++++++++++++++++++++++++++++++----
 migration/trace-events |  2 ++
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index 8622f229e5..cd6d6ba493 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1459,6 +1459,7 @@ void qemu_savevm_state_complete_postcopy(QEMUFile *f)
 static
 int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
 {
+    int64_t start_ts_each, end_ts_each;
     SaveStateEntry *se;
     int ret;
 
@@ -1475,6 +1476,8 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
                 continue;
             }
         }
+
+        start_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
         trace_savevm_section_start(se->idstr, se->section_id);
 
         save_section_header(f, se, QEMU_VM_SECTION_END);
@@ -1486,6 +1489,9 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
             qemu_file_set_error(f, ret);
             return -1;
         }
+        end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
+        trace_vmstate_downtime_save("iterable", se->idstr, se->instance_id,
+                                    end_ts_each - start_ts_each);
     }
 
     return 0;
@@ -1496,6 +1502,7 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
                                                     bool inactivate_disks)
 {
     MigrationState *ms = migrate_get_current();
+    int64_t start_ts_each, end_ts_each;
     JSONWriter *vmdesc = ms->vmdesc;
     int vmdesc_len;
     SaveStateEntry *se;
@@ -1507,11 +1514,17 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
             continue;
         }
 
+        start_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
+
         ret = vmstate_save(f, se, vmdesc);
         if (ret) {
             qemu_file_set_error(f, ret);
             return ret;
         }
+
+        end_ts_each = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
+        trace_vmstate_downtime_save("non-iterable", se->idstr, se->instance_id,
+                                    end_ts_each - start_ts_each);
     }
 
     if (inactivate_disks) {
@@ -2506,9 +2519,12 @@ static bool check_section_footer(QEMUFile *f, SaveStateEntry *se)
 }
 
 static int
-qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis)
+qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis,
+                               uint8_t type)
 {
+    bool trace_downtime = (type == QEMU_VM_SECTION_FULL);
     uint32_t instance_id, version_id, section_id;
+    int64_t start_ts, end_ts;
     SaveStateEntry *se;
     char idstr[256];
     int ret;
@@ -2557,12 +2573,23 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis)
         return -EINVAL;
     }
 
+    if (trace_downtime) {
+        start_ts = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
+    }
+
     ret = vmstate_load(f, se);
     if (ret < 0) {
         error_report("error while loading state for instance 0x%"PRIx32" of"
                      " device '%s'", instance_id, idstr);
         return ret;
     }
+
+    if (trace_downtime) {
+        end_ts = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
+        trace_vmstate_downtime_load("non-iterable", se->idstr,
+                                    se->instance_id, end_ts - start_ts);
+    }
+
     if (!check_section_footer(f, se)) {
         return -EINVAL;
     }
@@ -2571,8 +2598,11 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis)
 }
 
 static int
-qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
+qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
+                             uint8_t type)
 {
+    bool trace_downtime = (type == QEMU_VM_SECTION_END);
+    int64_t start_ts, end_ts;
     uint32_t section_id;
     SaveStateEntry *se;
     int ret;
@@ -2597,12 +2627,23 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis)
         return -EINVAL;
     }
 
+    if (trace_downtime) {
+        start_ts = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
+    }
+
     ret = vmstate_load(f, se);
     if (ret < 0) {
         error_report("error while loading state section id %d(%s)",
                      section_id, se->idstr);
         return ret;
     }
+
+    if (trace_downtime) {
+        end_ts = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
+        trace_vmstate_downtime_load("iterable", se->idstr,
+                                    se->instance_id, end_ts - start_ts);
+    }
+
     if (!check_section_footer(f, se)) {
         return -EINVAL;
     }
@@ -2791,14 +2832,14 @@ retry:
         switch (section_type) {
         case QEMU_VM_SECTION_START:
         case QEMU_VM_SECTION_FULL:
-            ret = qemu_loadvm_section_start_full(f, mis);
+            ret = qemu_loadvm_section_start_full(f, mis, section_type);
             if (ret < 0) {
                 goto out;
             }
             break;
         case QEMU_VM_SECTION_PART:
         case QEMU_VM_SECTION_END:
-            ret = qemu_loadvm_section_part_end(f, mis);
+            ret = qemu_loadvm_section_part_end(f, mis, section_type);
             if (ret < 0) {
                 goto out;
             }
diff --git a/migration/trace-events b/migration/trace-events
index fa9486dffe..5820add1f3 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -48,6 +48,8 @@ savevm_state_cleanup(void) ""
 savevm_state_complete_precopy(void) ""
 vmstate_save(const char *idstr, const char *vmsd_name) "%s, %s"
 vmstate_load(const char *idstr, const char *vmsd_name) "%s, %s"
+vmstate_downtime_save(const char *type, const char *idstr, uint32_t instance_id, int64_t downtime) "type=%s idstr=%s instance_id=%d downtime=%"PRIi64
+vmstate_downtime_load(const char *type, const char *idstr, uint32_t instance_id, int64_t downtime) "type=%s idstr=%s instance_id=%d downtime=%"PRIi64
 postcopy_pause_incoming(void) ""
 postcopy_pause_incoming_continued(void) ""
 postcopy_page_req_sync(void *host_addr) "sync page req %p"
-- 
2.41.0




* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-26 15:53 [PATCH 0/3] migration: Downtime tracepoints Peter Xu
                   ` (2 preceding siblings ...)
  2023-10-26 15:53 ` [PATCH 3/3] migration: Add per vmstate downtime tracepoints Peter Xu
@ 2023-10-26 16:06 ` Joao Martins
  2023-10-26 17:03   ` Peter Xu
  2023-10-26 19:01 ` [PATCH 4/3] migration: Add tracepoints for downtime checkpoints Peter Xu
  4 siblings, 1 reply; 21+ messages in thread
From: Joao Martins @ 2023-10-26 16:06 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Fabiano Rosas, Juan Quintela



On 26/10/2023 16:53, Peter Xu wrote:
> This small series (actually only the last patch; first two are cleanups)
> wants to improve ability of QEMU downtime analysis similarly to what Joao
> used to propose here:
> 
>   https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com
> 
Thanks for following up on the idea; it's been hard to have enough bandwidth
for everything over the past few weeks :(

> But with a few differences:
> 
>   - Nothing exported yet to qapi, all tracepoints so far
> 
>   - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
>     finer granule by providing downtime measurements for each vmstate (I
>     made microsecond to be the unit to be accurate).  So far it seems
>     iterable / non-iterable is the core of the problem, and I want to nail
>     it to per-device.
> 
>   - Trace dest QEMU too
> 
> For the last bullet: consider the case where a device save() can be super
> fast, while load() can actually be super slow.  Both of them will
> contribute to the ultimate downtime, but not a simple summary: when src
> QEMU is save()ing on device1, dst QEMU can be load()ing on device2.  So
> they can run in parallel.  However the only way to figure all components of
> the downtime is to record both.
> 
> Please have a look, thanks.
>

I like your series, as it allows a user to pinpoint one particular bad device,
while covering the load side too. The checkpoints of migration, on the other
hand, were useful -- while also a bit ugly -- for the sort of big picture of
how downtime breaks down. Perhaps we could add those /also/ as tracepoints
without specifically committing to exposing them in QAPI.

More fundamentally, how can one capture the 'stop' part? There's also time
spent there, e.g. quiescing/stopping vhost-net workers, or suspending the VF
device. All of that is likely as significant as the device-state/ram related
parts these tracepoints cover (the iterable and non-iterable portions).
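
Just to illustrate what I mean (a sketch only; the tracepoint name here is
made up), even a coarse measurement around the stop itself would already
help attribute that part:

    int migration_stop_vm(RunState state)
    {
        int64_t start = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
        int ret = vm_stop_force_state(state);

        /* time spent stopping vCPUs and quiescing backends (vhost, VFs, ...) */
        trace_vmstate_downtime_vm_stop(qemu_clock_get_us(QEMU_CLOCK_REALTIME)
                                       - start);

        return ret;
    }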


> Peter Xu (3):
>   migration: Set downtime_start even for postcopy
>   migration: Add migration_downtime_start|end() helpers
>   migration: Add per vmstate downtime tracepoints
> 
>  migration/migration.c  | 38 +++++++++++++++++++++-----------
>  migration/savevm.c     | 49 ++++++++++++++++++++++++++++++++++++++----
>  migration/trace-events |  2 ++
>  3 files changed, 72 insertions(+), 17 deletions(-)
> 



* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-26 16:06 ` [PATCH 0/3] migration: Downtime tracepoints Joao Martins
@ 2023-10-26 17:03   ` Peter Xu
  2023-10-26 18:18     ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-26 17:03 UTC (permalink / raw)
  To: Joao Martins; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On Thu, Oct 26, 2023 at 05:06:37PM +0100, Joao Martins wrote:
> On 26/10/2023 16:53, Peter Xu wrote:
> > This small series (actually only the last patch; first two are cleanups)
> > wants to improve ability of QEMU downtime analysis similarly to what Joao
> > used to propose here:
> > 
> >   https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com
> > 
> Thanks for following up on the idea; It's been hard to have enough bandwidth for
> everything on the past set of weeks :(

Yeah, totally understood.  I think our QE team pushed me towards some
series like this, while my plan was to wait for your new version. :)

Then when I started, I decided to go per-device.  I was thinking of also
persisting that information, but then I remembered some ppc guests can
have ~40,000 vmstates..  and the memory to maintain that may or may not
regress a ppc user.  So I figured I should first keep it simple with
tracepoints.

> 
> > But with a few differences:
> > 
> >   - Nothing exported yet to qapi, all tracepoints so far
> > 
> >   - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
> >     finer granule by providing downtime measurements for each vmstate (I
> >     made microsecond to be the unit to be accurate).  So far it seems
> >     iterable / non-iterable is the core of the problem, and I want to nail
> >     it to per-device.
> > 
> >   - Trace dest QEMU too
> > 
> > For the last bullet: consider the case where a device save() can be super
> > fast, while load() can actually be super slow.  Both of them will
> > contribute to the ultimate downtime, but not a simple summary: when src
> > QEMU is save()ing on device1, dst QEMU can be load()ing on device2.  So
> > they can run in parallel.  However the only way to figure all components of
> > the downtime is to record both.
> > 
> > Please have a look, thanks.
> >
> 
> I like your series, as it allows a user to pinpoint one particular bad device,
> while covering the load side too. The checkpoints of migration on the other hand
> were useful -- while also a bit ugly -- for the sort of big picture of how
> downtime breaks down. Perhaps we could add that /also/ as tracepoitns without
> specifically commiting to be exposed in QAPI.
> 
> More fundamentally, how can one capture the 'stop' part? There's also time spent
> there like e.g. quiescing/stopping vhost-net workers, or suspending the VF
> device. All likely as bad to those tracepoints pertaining device-state/ram
> related stuff (iterable and non-iterable portions).

Yeah, that's a good point.  I didn't cover "stop" yet because I think it's
just more tricky and I haven't thought it all through yet.

The first question is, when stopping some backends, the vCPUs are still
running, so it's not 100% clear to me which of that should be counted as
part of the real downtime.

Meanwhile, that's another angle besides vmstates: we need to keep an eye on
the state change handlers, and those can belong to a device, or something
else.

Did you measure the stop process in some way before?  Do you have some
rough numbers or anything surprising you already observed?

Thanks,

-- 
Peter Xu




* Re: [PATCH 1/3] migration: Set downtime_start even for postcopy
  2023-10-26 15:53 ` [PATCH 1/3] migration: Set downtime_start even for postcopy Peter Xu
@ 2023-10-26 17:05   ` Fabiano Rosas
  0 siblings, 0 replies; 21+ messages in thread
From: Fabiano Rosas @ 2023-10-26 17:05 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: peterx, Juan Quintela, Joao Martins

Peter Xu <peterx@redhat.com> writes:

> Postcopy calculates its downtime separately.  It always sets
> MigrationState.downtime properly, but not MigrationState.downtime_start.
>
> Make postcopy do the same as other modes on properly recording the
> timestamp when the VM is going to be stopped.  Drop the temporary variable
> in postcopy_start() along the way.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH 2/3] migration: Add migration_downtime_start|end() helpers
  2023-10-26 15:53 ` [PATCH 2/3] migration: Add migration_downtime_start|end() helpers Peter Xu
@ 2023-10-26 17:11   ` Fabiano Rosas
  0 siblings, 0 replies; 21+ messages in thread
From: Fabiano Rosas @ 2023-10-26 17:11 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: peterx, Juan Quintela, Joao Martins

Peter Xu <peterx@redhat.com> writes:

> Unify the three users on recording downtimes with the same pair of helpers.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-26 17:03   ` Peter Xu
@ 2023-10-26 18:18     ` Peter Xu
  2023-10-26 19:33       ` Joao Martins
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-26 18:18 UTC (permalink / raw)
  To: Joao Martins; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On Thu, Oct 26, 2023 at 01:03:57PM -0400, Peter Xu wrote:
> On Thu, Oct 26, 2023 at 05:06:37PM +0100, Joao Martins wrote:
> > On 26/10/2023 16:53, Peter Xu wrote:
> > > This small series (actually only the last patch; first two are cleanups)
> > > wants to improve ability of QEMU downtime analysis similarly to what Joao
> > > used to propose here:
> > > 
> > >   https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com
> > > 
> > Thanks for following up on the idea; It's been hard to have enough bandwidth for
> > everything on the past set of weeks :(
> 
> Yeah, totally understdood.  I think our QE team pushed me towards some
> series like this, while my plan was waiting for your new version. :)
> 
> Then when I started I decided to go into per-device.  I was thinking of
> also persist that information, but then I remembered some ppc guest can
> have ~40,000 vmstates..  and memory to maintain that may or may not regress
> a ppc user.  So I figured I should first keep it simple with tracepoints.
> 
> > 
> > > But with a few differences:
> > > 
> > >   - Nothing exported yet to qapi, all tracepoints so far
> > > 
> > >   - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
> > >     finer granule by providing downtime measurements for each vmstate (I
> > >     made microsecond to be the unit to be accurate).  So far it seems
> > >     iterable / non-iterable is the core of the problem, and I want to nail
> > >     it to per-device.
> > > 
> > >   - Trace dest QEMU too
> > > 
> > > For the last bullet: consider the case where a device save() can be super
> > > fast, while load() can actually be super slow.  Both of them will
> > > contribute to the ultimate downtime, but not a simple summary: when src
> > > QEMU is save()ing on device1, dst QEMU can be load()ing on device2.  So
> > > they can run in parallel.  However the only way to figure all components of
> > > the downtime is to record both.
> > > 
> > > Please have a look, thanks.
> > >
> > 
> > I like your series, as it allows a user to pinpoint one particular bad device,
> > while covering the load side too. The checkpoints of migration on the other hand
> > were useful -- while also a bit ugly -- for the sort of big picture of how
> > downtime breaks down. Perhaps we could add that /also/ as tracepoitns without
> > specifically commiting to be exposed in QAPI.
> > 
> > More fundamentally, how can one capture the 'stop' part? There's also time spent
> > there like e.g. quiescing/stopping vhost-net workers, or suspending the VF
> > device. All likely as bad to those tracepoints pertaining device-state/ram
> > related stuff (iterable and non-iterable portions).
> 
> Yeah that's a good point.  I didn't cover "stop" yet because I think it's
> just more tricky and I didn't think it all through, yet.
> 
> The first question is, when stopping some backends, the vCPUs are still
> running, so it's not 100% clear to me on which should be contributed as
> part of real downtime.

I was wrong... we always stop vCPUs first.

If you don't mind, I can add some tracepoints for all those spots in this
series to cover your other series.  I'll also make sure I do that for both
sides.

Thanks,

> 
> Meanwhile that'll be another angle besides vmstates: need to keep some eye
> on the state change handlers, and that can be a device, or something else.
> 
> Did you measure the stop process in some way before?  Do you have some
> rough number or anything surprising you already observed?
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu




* [PATCH 4/3] migration: Add tracepoints for downtime checkpoints
  2023-10-26 15:53 [PATCH 0/3] migration: Downtime tracepoints Peter Xu
                   ` (3 preceding siblings ...)
  2023-10-26 16:06 ` [PATCH 0/3] migration: Downtime tracepoints Joao Martins
@ 2023-10-26 19:01 ` Peter Xu
  2023-10-26 19:43   ` Joao Martins
  4 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-26 19:01 UTC (permalink / raw)
  To: qemu-devel; +Cc: peterx, farosas, quintela, joao.m.martins

Add tracepoints for major downtime checkpoints on both src and dst.  They
all share the same tracepoint, with a string showing the stage.

On src, we have these checkpoints added:

  - downtime-start: right before the VM stops on src
  - vm-stopped: after the VM is fully stopped
  - iterable-saved: after all iterables are saved (END sections)
  - non-iterable-saved: after all non-iterables are saved (FULL sections)
  - downtime-end: migration fully completed

On dst, we have these checkpoints added:

  - precopy-loadvm-completed: after loadvm is all done for precopy
  - precopy-bh-*: record BH steps to resume the VM for precopy
  - postcopy-bh-*: record BH steps to resume the VM for postcopy

On the dst side, we don't have a good way to trace the total time consumed
by iterable or non-iterable sections for now.  We could mark it at the
first time a FULL / END section is received, but rather than that, let's
just rely on the other per-vmstate tracepoints to provide that information.

With the series up to this patch, one can enable "vmstate_downtime*" and it
will enable all tracepoints needed for downtime measurements.

Since the downtime timestamp tracepoints will cover postcopy too, drop the
loadvm_postcopy_handle_run_bh() tracepoint along the way, because it serves
the same purpose, but only for postcopy.  We then have a unified prefix for
all downtime-relevant tracepoints.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c  | 16 +++++++++++++++-
 migration/savevm.c     | 14 +++++++++-----
 migration/trace-events |  2 +-
 3 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 9013c1b500..f1f1d2ae2b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -103,6 +103,7 @@ static int close_return_path_on_source(MigrationState *s);
 
 static void migration_downtime_start(MigrationState *s)
 {
+    trace_vmstate_downtime_timestamp("downtime-start");
     s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 }
 
@@ -117,6 +118,8 @@ static void migration_downtime_end(MigrationState *s)
     if (!s->downtime) {
         s->downtime = now - s->downtime_start;
     }
+
+    trace_vmstate_downtime_timestamp("downtime-end");
 }
 
 static bool migration_needs_multiple_sockets(void)
@@ -151,7 +154,11 @@ static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
 
 int migration_stop_vm(RunState state)
 {
-    return vm_stop_force_state(state);
+    int ret = vm_stop_force_state(state);
+
+    trace_vmstate_downtime_timestamp("vm-stopped");
+
+    return ret;
 }
 
 void migration_object_init(void)
@@ -500,6 +507,8 @@ static void process_incoming_migration_bh(void *opaque)
     Error *local_err = NULL;
     MigrationIncomingState *mis = opaque;
 
+    trace_vmstate_downtime_timestamp("precopy-bh-enter");
+
     /* If capability late_block_activate is set:
      * Only fire up the block code now if we're going to restart the
      * VM, else 'cont' will do it.
@@ -525,6 +534,8 @@ static void process_incoming_migration_bh(void *opaque)
      */
     qemu_announce_self(&mis->announce_timer, migrate_announce_params());
 
+    trace_vmstate_downtime_timestamp("precopy-bh-announced");
+
     multifd_load_shutdown();
 
     dirty_bitmap_mig_before_vm_start();
@@ -542,6 +553,7 @@ static void process_incoming_migration_bh(void *opaque)
     } else {
         runstate_set(global_state_get_runstate());
     }
+    trace_vmstate_downtime_timestamp("precopy-bh-vm-started");
     /*
      * This must happen after any state changes since as soon as an external
      * observer sees this event they might start to prod at the VM assuming
@@ -576,6 +588,8 @@ process_incoming_migration_co(void *opaque)
     ret = qemu_loadvm_state(mis->from_src_file);
     mis->loadvm_co = NULL;
 
+    trace_vmstate_downtime_timestamp("precopy-loadvm-completed");
+
     ps = postcopy_state_get();
     trace_process_incoming_migration_co_end(ret, ps);
     if (ps != POSTCOPY_INCOMING_NONE) {
diff --git a/migration/savevm.c b/migration/savevm.c
index cd6d6ba493..49cbbd151c 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1494,6 +1494,8 @@ int qemu_savevm_state_complete_precopy_iterable(QEMUFile *f, bool in_postcopy)
                                     end_ts_each - start_ts_each);
     }
 
+    trace_vmstate_downtime_timestamp("iterable-saved");
+
     return 0;
 }
 
@@ -1560,6 +1562,8 @@ int qemu_savevm_state_complete_precopy_non_iterable(QEMUFile *f,
     json_writer_free(vmdesc);
     ms->vmdesc = NULL;
 
+    trace_vmstate_downtime_timestamp("non-iterable-saved");
+
     return 0;
 }
 
@@ -2102,18 +2106,18 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
     Error *local_err = NULL;
     MigrationIncomingState *mis = opaque;
 
-    trace_loadvm_postcopy_handle_run_bh("enter");
+    trace_vmstate_downtime_timestamp("postcopy-bh-enter");
 
     /* TODO we should move all of this lot into postcopy_ram.c or a shared code
      * in migration.c
      */
     cpu_synchronize_all_post_init();
 
-    trace_loadvm_postcopy_handle_run_bh("after cpu sync");
+    trace_vmstate_downtime_timestamp("postcopy-bh-cpu-synced");
 
     qemu_announce_self(&mis->announce_timer, migrate_announce_params());
 
-    trace_loadvm_postcopy_handle_run_bh("after announce");
+    trace_vmstate_downtime_timestamp("postcopy-bh-announced");
 
     /* Make sure all file formats throw away their mutable metadata.
      * If we get an error here, just don't restart the VM yet. */
@@ -2124,7 +2128,7 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
         autostart = false;
     }
 
-    trace_loadvm_postcopy_handle_run_bh("after invalidate cache");
+    trace_vmstate_downtime_timestamp("postcopy-bh-cache-invalidated");
 
     dirty_bitmap_mig_before_vm_start();
 
@@ -2138,7 +2142,7 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
 
     qemu_bh_delete(mis->bh);
 
-    trace_loadvm_postcopy_handle_run_bh("return");
+    trace_vmstate_downtime_timestamp("postcopy-bh-vm-started");
 }
 
 /* After all discards we can start running and asking for pages */
diff --git a/migration/trace-events b/migration/trace-events
index 5820add1f3..aebdd939b7 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -17,7 +17,6 @@ loadvm_handle_recv_bitmap(char *s) "%s"
 loadvm_postcopy_handle_advise(void) ""
 loadvm_postcopy_handle_listen(const char *str) "%s"
 loadvm_postcopy_handle_run(void) ""
-loadvm_postcopy_handle_run_bh(const char *str) "%s"
 loadvm_postcopy_handle_resume(void) ""
 loadvm_postcopy_ram_handle_discard(void) ""
 loadvm_postcopy_ram_handle_discard_end(void) ""
@@ -50,6 +49,7 @@ vmstate_save(const char *idstr, const char *vmsd_name) "%s, %s"
 vmstate_load(const char *idstr, const char *vmsd_name) "%s, %s"
 vmstate_downtime_save(const char *type, const char *idstr, uint32_t instance_id, int64_t downtime) "type=%s idstr=%s instance_id=%d downtime=%"PRIi64
 vmstate_downtime_load(const char *type, const char *idstr, uint32_t instance_id, int64_t downtime) "type=%s idstr=%s instance_id=%d downtime=%"PRIi64
+vmstate_downtime_timestamp(const char *checkpoint) "%s"
 postcopy_pause_incoming(void) ""
 postcopy_pause_incoming_continued(void) ""
 postcopy_page_req_sync(void *host_addr) "sync page req %p"
-- 
2.41.0




* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-26 18:18     ` Peter Xu
@ 2023-10-26 19:33       ` Joao Martins
  2023-10-26 20:07         ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Joao Martins @ 2023-10-26 19:33 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On 26/10/2023 19:18, Peter Xu wrote:
> On Thu, Oct 26, 2023 at 01:03:57PM -0400, Peter Xu wrote:
>> On Thu, Oct 26, 2023 at 05:06:37PM +0100, Joao Martins wrote:
>>> On 26/10/2023 16:53, Peter Xu wrote:
>>>> This small series (actually only the last patch; first two are cleanups)
>>>> wants to improve ability of QEMU downtime analysis similarly to what Joao
>>>> used to propose here:
>>>>
>>>>   https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com
>>>>
>>> Thanks for following up on the idea; It's been hard to have enough bandwidth for
>>> everything on the past set of weeks :(
>>
>> Yeah, totally understdood.  I think our QE team pushed me towards some
>> series like this, while my plan was waiting for your new version. :)
>>

On my end, it was similar (though not from QE/QA), with folks drawing a
blank when they see a bigger downtime.

Having an explainer/breakdown totally makes it easier to pinpoint where the
problems are.

>> Then when I started I decided to go into per-device.  I was thinking of
>> also persist that information, but then I remembered some ppc guest can
>> have ~40,000 vmstates..  and memory to maintain that may or may not regress
>> a ppc user.  So I figured I should first keep it simple with tracepoints.
>>

Yeah, I should have removed that last patch for QAPI.

The vmstates part is something I wasn't quite happy with how it looked, but
I think you managed to square it in a relatively clean way in that last
patch.

>>>
>>>> But with a few differences:
>>>>
>>>>   - Nothing exported yet to qapi, all tracepoints so far
>>>>
>>>>   - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
>>>>     finer granule by providing downtime measurements for each vmstate (I
>>>>     made microsecond to be the unit to be accurate).  So far it seems
>>>>     iterable / non-iterable is the core of the problem, and I want to nail
>>>>     it to per-device.
>>>>
>>>>   - Trace dest QEMU too
>>>>
>>>> For the last bullet: consider the case where a device save() can be super
>>>> fast, while load() can actually be super slow.  Both of them will
>>>> contribute to the ultimate downtime, but not a simple summary: when src
>>>> QEMU is save()ing on device1, dst QEMU can be load()ing on device2.  So
>>>> they can run in parallel.  However the only way to figure all components of
>>>> the downtime is to record both.
>>>>
>>>> Please have a look, thanks.
>>>>
>>>
>>> I like your series, as it allows a user to pinpoint one particular bad device,
>>> while covering the load side too. The checkpoints of migration on the other hand
>>> were useful -- while also a bit ugly -- for the sort of big picture of how
>>> downtime breaks down. Perhaps we could add that /also/ as tracepoitns without
>>> specifically commiting to be exposed in QAPI.
>>>
>>> More fundamentally, how can one capture the 'stop' part? There's also time spent
>>> there like e.g. quiescing/stopping vhost-net workers, or suspending the VF
>>> device. All likely as bad to those tracepoints pertaining device-state/ram
>>> related stuff (iterable and non-iterable portions).
>>
>> Yeah that's a good point.  I didn't cover "stop" yet because I think it's
>> just more tricky and I didn't think it all through, yet.
>>

It could follow your previous line of thought where you do it per vmstate.

But the catch is that VM state change handlers are nameless, so tracepoints
wouldn't be able to tell which handler the time is being spent in.
>> The first question is, when stopping some backends, the vCPUs are still
>> running, so it's not 100% clear to me on which should be contributed as
>> part of real downtime.
> 
> I was wrong.. we always stop vcpus first.
> 

I was about to say this, but I guess you figured it out. Even if the vCPUs
weren't stopped first, the external I/O threads (QEMU or kernel) wouldn't
service the guest's own I/O, which is a portion of the outage.

> If you won't mind, I can add some traceopints for all those spots in this
> series to cover your other series.  I'll also make sure I do that for both
> sides.
> 
Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
considering it started from the other patches (if you also agree that is
right). The patches of course are entirely different, but at least I like to
believe the ideas initially presented and then subsequently improved are what
led to the downtime observability improvements in this series.

	Joao



* Re: [PATCH 4/3] migration: Add tracepoints for downtime checkpoints
  2023-10-26 19:01 ` [PATCH 4/3] migration: Add tracepoints for downtime checkpoints Peter Xu
@ 2023-10-26 19:43   ` Joao Martins
  2023-10-26 20:08     ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Joao Martins @ 2023-10-26 19:43 UTC (permalink / raw)
  To: Peter Xu; +Cc: farosas, quintela, qemu-devel

On 26/10/2023 20:01, Peter Xu wrote:
> Add tracepoints for major downtime checkpoints on both src and dst.  They
> share the same tracepoint with a string showing its stage.
> 
> On src, we have these checkpoints added:
> 
>   - downtime-start: right before vm stops on src
>   - vm-stopped: after vm is fully stopped
>   - iterable-saved: after all iterables saved (END sections)
>   - non-iterable-saved: after all non-iterable saved (FULL sections)
>   - downtime-stop: migration fully completed
> 
> On dst, we have these checkpoints added:
> 
>   - precopy-loadvm-completes: after loadvm all done for precopy
>   - precopy-bh-*: record BH steps to resume VM for precopy
>   - postcopy-bh-*: record BH steps to resume VM for postcopy
> 
> On dst side, we don't have a good way to trace total time consumed by
> iterable or non-iterable for now.  We can mark it by 1st time receiving a
> FULL / END section, but rather than that let's just rely on the other
> tracepoints added for vmstates to back up the information.
> 
> Until this patch, one can enable "vmstate_downtime*" and it'll enable all
> tracepoints for downtime measurements.
> 
> Since the downtime timestamp tracepoints will cover postcopy too, drop
> loadvm_postcopy_handle_run_bh() tracepoint alongside, because they service
> the same purpose, but that only for postcopy.  We then have unified prefix
> for all downtime relevant tracepoints.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c  | 16 +++++++++++++++-
>  migration/savevm.c     | 14 +++++++++-----
>  migration/trace-events |  2 +-
>  3 files changed, 25 insertions(+), 7 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 9013c1b500..f1f1d2ae2b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -103,6 +103,7 @@ static int close_return_path_on_source(MigrationState *s);
>  
>  static void migration_downtime_start(MigrationState *s)
>  {
> +    trace_vmstate_downtime_timestamp("downtime-start");
>      s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>  }
>  
> @@ -117,6 +118,8 @@ static void migration_downtime_end(MigrationState *s)
>      if (!s->downtime) {
>          s->downtime = now - s->downtime_start;
>      }
> +
> +    trace_vmstate_downtime_timestamp("downtime-end");
>  }
> 

Considering we aren't including any downtime timestamps in the tracepoints
themselves, is this a way of saying that the timestamps printed by the
tracing tool are what we use to extract the downtime contributions?

It might be obvious, but perhaps it should be spelled out in the commit
message?



* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-26 19:33       ` Joao Martins
@ 2023-10-26 20:07         ` Peter Xu
  2023-10-27  8:58           ` Joao Martins
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-26 20:07 UTC (permalink / raw)
  To: Joao Martins; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
> considering it started on the other patches (if you also agree it is right). The
> patches ofc are enterily different, but at least I like to believe the ideas
> initially presented and then subsequently improved are what lead to the downtime
> observability improvements in this series.

Sure, I'll add that.

If you like, I would definitely be happy to add Co-developed-by: with you.
I just don't know whether that addresses all your needs, and I need some
patch like this for our builds.

Thanks,

-- 
Peter Xu




* Re: [PATCH 4/3] migration: Add tracepoints for downtime checkpoints
  2023-10-26 19:43   ` Joao Martins
@ 2023-10-26 20:08     ` Peter Xu
  2023-10-26 20:14       ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-26 20:08 UTC (permalink / raw)
  To: Joao Martins; +Cc: farosas, quintela, qemu-devel

On Thu, Oct 26, 2023 at 08:43:59PM +0100, Joao Martins wrote:
> Considering we aren't including any downtime timestamps in the tracing, is this
> a way to say that the tracing tool printing timestamps is what we use to extract
> downtime contribution?
> 
> It might be obvious, but perhaps should be spelled out in the commit message?

Sure, I'll state that in the commit message in a new version.

-- 
Peter Xu




* Re: [PATCH 4/3] migration: Add tracepoints for downtime checkpoints
  2023-10-26 20:08     ` Peter Xu
@ 2023-10-26 20:14       ` Peter Xu
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2023-10-26 20:14 UTC (permalink / raw)
  To: Joao Martins; +Cc: farosas, quintela, qemu-devel

On Thu, Oct 26, 2023 at 04:08:20PM -0400, Peter Xu wrote:
> On Thu, Oct 26, 2023 at 08:43:59PM +0100, Joao Martins wrote:
> > Considering we aren't including any downtime timestamps in the tracing, is this
> > a way to say that the tracing tool printing timestamps is what we use to extract
> > downtime contribution?
> > 
> > It might be obvious, but perhaps should be spelled out in the commit message?
> 
> Sure, I'll state that in the commit message in a new version.

On second thought, I'll rename it to vmstate_downtime_checkpoint().

I'll wait for 1-2 more days for further review comments before a repost.

Thanks,

-- 
Peter Xu




* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-26 20:07         ` Peter Xu
@ 2023-10-27  8:58           ` Joao Martins
  2023-10-27 14:41             ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Joao Martins @ 2023-10-27  8:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On 26/10/2023 21:07, Peter Xu wrote:
> On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
>> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
>> considering it started on the other patches (if you also agree it is right). The
>> patches ofc are enterily different, but at least I like to believe the ideas
>> initially presented and then subsequently improved are what lead to the downtime
>> observability improvements in this series.
> 
> Sure, I'll add that.
> 
> If you like, I would be definitely happy to have Co-developed-by: with you,
> if you agree. 

Oh, that's great, thanks!

> I just don't know whether that addressed all your need, and
> I need some patch like that for our builds.

I think it achieves the same as the other series. Or rather, it re-implements
it with less compromise on QAPI and makes the tracepoints more 'generic' for
other use cases and less specific to the 'checkpoint breakdown'. That makes
the implementation simpler (e.g. we don't need that array storing the
checkpoint timestamps), given that it's just tracing and not QAPI.

Though it puts more work on users to develop new tracing tooling, I think
it's a good start towards downtime breakdown "clarity" without trading off
maintenance.

	Joao



* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-27  8:58           ` Joao Martins
@ 2023-10-27 14:41             ` Peter Xu
  2023-10-27 22:17               ` Joao Martins
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-27 14:41 UTC (permalink / raw)
  To: Joao Martins; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On Fri, Oct 27, 2023 at 09:58:03AM +0100, Joao Martins wrote:
> On 26/10/2023 21:07, Peter Xu wrote:
> > On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
> >> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
> >> considering it started on the other patches (if you also agree it is right). The
> >> patches ofc are enterily different, but at least I like to believe the ideas
> >> initially presented and then subsequently improved are what lead to the downtime
> >> observability improvements in this series.
> > 
> > Sure, I'll add that.
> > 
> > If you like, I would be definitely happy to have Co-developed-by: with you,
> > if you agree. 
> 
> Oh, that's great, thanks!

Great!  I apologize for not asking before the formal patch was posted.

> 
> > I just don't know whether that addressed all your need, and
> > I need some patch like that for our builds.
> 
> I think it achieves the same as the other series. Or rather it re-implements it
> but with less compromise on QAPI and made the tracepoints more 'generic' to even
> other usecases and less specific to the 'checkpoint breakdown'. Which makes the
> implementation simpler (like we don't need that array storing the checkpoint
> timestamps) given that it's just tracing and not for QAPI.

Yes.  Please also feel free to have a closer look at the exact checkpoints
in that patch.  I just want to make sure it will serve the same purpose as
the patch you proposed, but with tracepoints, and that I didn't miss
anything important.

The dest checkpoints are all new; I hope I got them all right, as we would
expect.

For the src checkpoints, IIRC your patch explicitly tracked closing the
return path, while patch 4 only makes it part of the final stretch; the 4th
checkpoint is after the non-iterable data is sent, and the 5th is the final
"downtime-end".  That span covers more than just "return path close":

    qemu_savevm_state_complete_precopy_non_iterable <--- 4th checkpoint
    qemu_fflush (after non-iterable sent)
    close_return_path_on_source
    cpu_throttle_stop
    qemu_mutex_lock_iothread
    migration_downtime_end                          <---- 5th checkpoint

If you think it is necessary, I can, for example, also add one more
checkpoint right after closing the return path.  I just didn't know whether
it was your intention to explicitly check that point.  Just let me know if
so.
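
For reference, that would look something like the below (the checkpoint
name is just an example):

    qemu_savevm_state_complete_precopy_non_iterable <--- 4th checkpoint
    qemu_fflush (after non-iterable sent)
    close_return_path_on_source
    trace_vmstate_downtime_timestamp("src-return-path-closed")  <--- new
    cpu_throttle_stop
    qemu_mutex_lock_iothread
    migration_downtime_end                          <---- 5th checkpoint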

Also, on whether you prefer to keep a timestamp in the tracepoint itself: I
only use either "log" or "dtrace"+qemu-trace-stap for tracing, and both of
them contain timestamps already.  But I can also add the timestamps (offset
by downtime_start) if you prefer.

I plan to repost by early next week (I want to catch the train for 8.2, if
it works out), so it would be great to collect all your feedback and amend
things before the repost.

> 
> Though while it puts more work over developing new tracing tooling for users, I
> think it's a good start towards downtime breakdown "clearity" without trading
> off maintenance.

Thanks,

-- 
Peter Xu




* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-27 14:41             ` Peter Xu
@ 2023-10-27 22:17               ` Joao Martins
  2023-10-30 15:13                 ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Joao Martins @ 2023-10-27 22:17 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On 27/10/2023 15:41, Peter Xu wrote:
> On Fri, Oct 27, 2023 at 09:58:03AM +0100, Joao Martins wrote:
>> On 26/10/2023 21:07, Peter Xu wrote:
>>> On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
>>>> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
>>>> considering it started on the other patches (if you also agree it is right). The
>>>> patches ofc are enterily different, but at least I like to believe the ideas
>>>> initially presented and then subsequently improved are what lead to the downtime
>>>> observability improvements in this series.
>>>
>>> Sure, I'll add that.
>>>
>>> If you like, I would be definitely happy to have Co-developed-by: with you,
>>> if you agree. 
>>
>> Oh, that's great, thanks!
> 
> Great!  I apologize on not asking already before a formal patch is post.
> 
>>
>>> I just don't know whether that addressed all your need, and
>>> I need some patch like that for our builds.
>>
>> I think it achieves the same as the other series. Or rather it re-implements it
>> but with less compromise on QAPI and made the tracepoints more 'generic' to even
>> other usecases and less specific to the 'checkpoint breakdown'. Which makes the
>> implementation simpler (like we don't need that array storing the checkpoint
>> timestamps) given that it's just tracing and not for QAPI.
> 
> Yes.  Please also feel free to have a closer look on the exact checkpoints
> in that patch.  I just want to make sure that'll be able to service the
> same as the patch you proposed, but with tracepoints, and I don't miss
> anything important.
> 
> The dest checkpoints are all new, I hope I nailed them all right as we
> would expect.
> 
Yeah, it looks like it; but I am no master of postcopy, so I don't have
quite the surgical eye there.

Perhaps for the loadvm side, 'precopy-bh-enter' is a bit ambiguous (given
that precopy happens on both source and destination?). Perhaps be explicit
with something like 'incoming-bh-enter', or even prefix the checkpoint names
with 'source' on the source side, for example, to distinguish them?

> For src checkpoints, IIRC your patch explicitly tracked return path closing
> while patch 4 only made it just part of final enclosure; the 4th checkpoint
> is after non-iterable sent, until 5th to be the last "downtime-end". It can
> cover more than "return path close":
> 
>     qemu_savevm_state_complete_precopy_non_iterable <--- 4th checkpoint
>     qemu_fflush (after non-iterable sent)
>     close_return_path_on_source
>     cpu_throttle_stop
>     qemu_mutex_lock_iothread
>     migration_downtime_end                          <---- 5th checkpoint
> 
> If you see fit or necessary, I can, for example, also add one more
> checkpoint right after close return path.  I just didn't know whether it's
> your intention to explicitly check that point.  Just let me know if so.
> 
It was put there because it was a simple catch-all, from the source PoV, of
how long the destination is taking. Of course, bearing in mind that without
a return path such a metric/tracepoint is not available. On the other hand,
with migration_downtime_end I think we have the equivalent in that resume
checkpoint, so we just need to take the diff between that and the
non-iterable one, and I think we have the same thing I was looking for (even
more, I would say).

> Also on whether you prefer to keep a timestamp in the tracepoint itself;
> I only use either "log" or "dtrace"+qemu-trace-stap for tracing: both of
> them contain timestamps already.  But I can also add the timestamps
> (offseted by downtime_start) if you prefer.
> 
Perhaps it would be easy to wrap the checkpoint tracepoint in its own
function to allow extensions, e.g. adding the timestamp (or any other data)
to the checkpoints, or doing something at a particular checkpoint of the
migration.  The timing function from QEMU would give QEMU's own idea of time
(directly correlatable with the downtime metric that applications consume),
which would be more consistent?  Though I am in two minds on whether to rely
on the tracing timestamps or to align with what's provided to applications.
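
For instance, something along these lines (a sketch only; the helper, the
extra offset argument, and the tracepoint signature are hypothetical):

    static void migration_downtime_checkpoint(MigrationState *s,
                                              const char *name)
    {
        int64_t now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

        /* Report the checkpoint along with time since downtime_start */
        trace_vmstate_downtime_checkpoint(name,
            s->downtime_start ? now - s->downtime_start : 0);
    }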

> I plan to repost before early next week (want to catch the train for 8.2,
> if it works out), so it'll be great to collect all your feedback and amend
> that before the repost.
> 
>>
>> Though while it puts more work over developing new tracing tooling for users, I
>> think it's a good start towards downtime breakdown "clarity" without trading
>> off maintenance.
> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-27 22:17               ` Joao Martins
@ 2023-10-30 15:13                 ` Peter Xu
  2023-10-30 16:09                   ` Peter Xu
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-30 15:13 UTC (permalink / raw)
  To: Joao Martins; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On Fri, Oct 27, 2023 at 11:17:41PM +0100, Joao Martins wrote:
> On 27/10/2023 15:41, Peter Xu wrote:
> > On Fri, Oct 27, 2023 at 09:58:03AM +0100, Joao Martins wrote:
> >> On 26/10/2023 21:07, Peter Xu wrote:
> >>> On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
> >>>> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
> >>>> considering it started on the other patches (if you also agree it is right). The
> >>>> patches ofc are entirely different, but at least I like to believe the ideas
> >>>> initially presented and then subsequently improved are what led to the downtime
> >>>> observability improvements in this series.
> >>>
> >>> Sure, I'll add that.
> >>>
> >>> If you like, I would be definitely happy to have Co-developed-by: with you,
> >>> if you agree. 
> >>
> >> Oh, that's great, thanks!
> > 
> > Great!  I apologize on not asking already before a formal patch is post.
> > 
> >>
> >>> I just don't know whether that addressed all your need, and
> >>> I need some patch like that for our builds.
> >>
> >> I think it achieves the same as the other series. Or rather it re-implements it
> >> but with less compromise on QAPI and made the tracepoints more 'generic' to even
> >> other usecases and less specific to the 'checkpoint breakdown'. Which makes the
> >> implementation simpler (like we don't need that array storing the checkpoint
> >> timestamps) given that it's just tracing and not for QAPI.
> > 
> > Yes.  Please also feel free to have a closer look on the exact checkpoints
> > in that patch.  I just want to make sure that'll be able to service the
> > same as the patch you proposed, but with tracepoints, and I don't miss
> > anything important.
> > 
> > The dest checkpoints are all new, I hope I nailed them all right as we
> > would expect.
> > 
> Yeah, it looks like so; but I am no master of postcopy, so I don't have quite
> the surgical eye there.
> 
> Perhaps for the loadvm side, 'precopy-bh-enter' is a bit ambiguous (given that
> precopy happens on both source and destination?). Maybe be explicit with
> something like 'incoming-bh-enter', or even prefix the checkpoint names with
> 'source' on the source side, for example, to distinguish them?

I don't think the src has a bottom half, if we don't consider cleanup_bh as
part of migration at all.  The destination's BH is a major part of migration.

In any case, let me prefix them with "src"/"dst" then, hopefully even clearer.
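
For example (names below are only illustrative, the repost may still use
different ones):

    src-downtime-start, src-non-iterable-saved, src-downtime-end        (source)
    dst-precopy-bh-enter, dst-precopy-bh-vm-started                     (destination)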

> 
> > For src checkpoints, IIRC your patch explicitly tracked return path closing
> > while patch 4 only made it just part of final enclosure; the 4th checkpoint
> > is after non-iterable sent, until 5th to be the last "downtime-end". It can
> > cover more than "return path close":
> > 
> >     qemu_savevm_state_complete_precopy_non_iterable <--- 4th checkpoint
> >     qemu_fflush (after non-iterable sent)
> >     close_return_path_on_source
> >     cpu_throttle_stop
> >     qemu_mutex_lock_iothread
> >     migration_downtime_end                          <---- 5th checkpoint
> > 
> > If you see fit or necessary, I can, for example, also add one more
> > checkpoint right after close return path.  I just didn't know whether it's
> > your intention to explicitly check that point.  Just let me know if so.
> > 
> It was put there because it was a simple catch-all, from the source PoV, of how
> long the destination is taking. Of course, bearing in mind that without a
> return path such a metric/tracepoint is not available. On the other hand, with
> migration_downtime_end I think we have the equivalent in that resume checkpoint.
> So we just need to take the diff between that and the non-iterable checkpoint,
> and I think we have the same thing I was looking for (even more, I would say).
> 
> > Also on whether you prefer to keep a timestamp in the tracepoint itself;
> > I only use either "log" or "dtrace"+qemu-trace-stap for tracing: both of
> > them contain timestamps already.  But I can also add the timestamps
> > (offseted by downtime_start) if you prefer.
> > 
> Perhaps it is easy to wrap the checkpoint tracepoint in its own function, to
> allow extending it later, e.g. adding the timestamp (or any other data) to the
> checkpoints, or doing something at a particular checkpoint of the migration.
> The timing function from QEMU would give QEMU's own idea of time (directly
> correlatable with the downtime metric that applications consume), which would be
> more consistent?  Though I am of two minds on whether to rely on the tracing
> timestamps or to align with what's provided to applications.

Yes it should be more consistent.  Let me add them into the checkpoints.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-30 15:13                 ` Peter Xu
@ 2023-10-30 16:09                   ` Peter Xu
  2023-10-30 16:11                     ` Joao Martins
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Xu @ 2023-10-30 16:09 UTC (permalink / raw)
  To: Joao Martins; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On Mon, Oct 30, 2023 at 11:13:55AM -0400, Peter Xu wrote:
> > Perhaps it is easy to wrap the checkpoint tracepoint in its own function, to
> > allow extending it later, e.g. adding the timestamp (or any other data) to the
> > checkpoints, or doing something at a particular checkpoint of the migration.
> > The timing function from QEMU would give QEMU's own idea of time (directly
> > correlatable with the downtime metric that applications consume), which would be
> > more consistent?  Though I am of two minds on whether to rely on the tracing
> > timestamps or to align with what's provided to applications.
> 
> Yes it should be more consistent.  Let me add them into the checkpoints.

I just noticed dst qemu doesn't have a downtime_start to offset against.  Only
offsetting the src downtime checkpoints would make it less consistent.  I'll
leave this one for later.  Posting a new version soon.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] migration: Downtime tracepoints
  2023-10-30 16:09                   ` Peter Xu
@ 2023-10-30 16:11                     ` Joao Martins
  0 siblings, 0 replies; 21+ messages in thread
From: Joao Martins @ 2023-10-30 16:11 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Fabiano Rosas, Juan Quintela

On 30/10/2023 16:09, Peter Xu wrote:
> On Mon, Oct 30, 2023 at 11:13:55AM -0400, Peter Xu wrote:
>>> Perhaps it is easy to wrap the checkpoint tracepoint in its own function, to
>>> allow extending it later, e.g. adding the timestamp (or any other data) to the
>>> checkpoints, or doing something at a particular checkpoint of the migration.
>>> The timing function from QEMU would give QEMU's own idea of time (directly
>>> correlatable with the downtime metric that applications consume), which would be
>>> more consistent?  Though I am of two minds on whether to rely on the tracing
>>> timestamps or to align with what's provided to applications.
>>
>> Yes it should be more consistent.  Let me add them into the checkpoints.
> 
> I just noticed dst qemu doesn't have a downtime_start to offset against.  Only
> offsetting the src downtime checkpoints would make it less consistent.  I'll
> leave this one for later.  Posting a new version soon.
> 
Argh, yes.

Somehow I was thinking of the source only -- let's leave it for later then.


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-10-30 16:12 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2023-10-26 15:53 [PATCH 0/3] migration: Downtime tracepoints Peter Xu
2023-10-26 15:53 ` [PATCH 1/3] migration: Set downtime_start even for postcopy Peter Xu
2023-10-26 17:05   ` Fabiano Rosas
2023-10-26 15:53 ` [PATCH 2/3] migration: Add migration_downtime_start|end() helpers Peter Xu
2023-10-26 17:11   ` Fabiano Rosas
2023-10-26 15:53 ` [PATCH 3/3] migration: Add per vmstate downtime tracepoints Peter Xu
2023-10-26 16:06 ` [PATCH 0/3] migration: Downtime tracepoints Joao Martins
2023-10-26 17:03   ` Peter Xu
2023-10-26 18:18     ` Peter Xu
2023-10-26 19:33       ` Joao Martins
2023-10-26 20:07         ` Peter Xu
2023-10-27  8:58           ` Joao Martins
2023-10-27 14:41             ` Peter Xu
2023-10-27 22:17               ` Joao Martins
2023-10-30 15:13                 ` Peter Xu
2023-10-30 16:09                   ` Peter Xu
2023-10-30 16:11                     ` Joao Martins
2023-10-26 19:01 ` [PATCH 4/3] migration: Add tracepoints for downtime checkpoints Peter Xu
2023-10-26 19:43   ` Joao Martins
2023-10-26 20:08     ` Peter Xu
2023-10-26 20:14       ` Peter Xu
