qemu-devel.nongnu.org archive mirror
* [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul
@ 2025-05-27 23:12 Peter Xu
  2025-05-27 23:12 ` [PATCH 01/13] migration: Add option to set postcopy-blocktime Peter Xu
                   ` (12 more replies)
  0 siblings, 13 replies; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

This series is based on the other series I posted here:

Based-on: <20250527215850.1271072-1-peterx@redhat.com>
https://lore.kernel.org/r/20250527215850.1271072-1-peterx@redhat.com

Overview
========

This series almost rewrites the blocktime feature.  It is a postcopy
feature that tracks how long a vCPU was blocked on a remote page fault,
and how long the whole system was blocked.  I would guess most people are
not aware of it, or have never tried to use it.

Recently, when I was doing some postcopy tests using my usual scripts for
trapping faults, I remembered once again that we have the blocktime
feature.  I decided to bite the bullet this time and make this feature
more useful.

The feature in general hasn't been extremely helpful so far.  One major
reason might be the existence of KVM async page fault (which is also on by
default in many environments).  KVM async page fault allows the guest OS
to schedule out a guest thread that is accessing a missing page and run
another task instead.  It means the vCPUs will _not_ be blocked even if
they're accessing missing pages.  However, a blocktime report that is all
close to zero doesn't really mean the guest is unaffected: the workload is
definitely impacted.  That's also why I normally measure page fault
latencies instead for most postcopy tests, because they're more critical
to me.

That said, the blocktime layer is actually close to what we want here,
namely trapping fault latencies too.  So I added that in this series.  I
also added tracking of non-vCPU threads, so that the latency results stay
valid even when KVM async PF is ON.

While at it, I found many things missing in this feature.  I tackled each
of them in a separate patch.  The following section gives a quick look at
what has changed.

Major Changes
=============

- Locking refactor: remove atomic ops, rely on the page request mutex instead

  It used to rely on atomic ops, but probably not correctly either..  I
  have a paragraph in the patch "migration/postcopy: Optimize blocktime
  fault tracking with hashtable" explaining why it was buggy.

- Extend all blocktime records internally from 32 bits to 64 bits

  This is mostly to support microsecond tracking (which used to be in
  milliseconds).  Note that this does not change the units of results
  already reported in QMP, because those are ABI; however, we'll use
  microseconds for any new results reported later.

- Added support to report average fault latencies (globally, per-vCPU,
  and non-vCPU); results are in microseconds.

- Initialize blocktime reliably only when POSTCOPY_LISTEN

  Rather than hackily initializing it whenever any userfaultfd is created..

- Provide tid->vcpu cache

  Add a quick cache for the tid->vcpu hash mapping.  It used to be a for
  loop searching for the CPU index, which is unnecessarily slow.

- Replace fault record arrays with a hashtable

  This is an optimization for fast inject/lookup of fault records.

  Again, it used to be yet another array keeping per-vCPU data.  That is
  not only slower when there are many vCPUs (especially on lookups, which
  were yet another for loop..), but also buggy, because in reality each
  vCPU can sometimes have more than one fault in flight.

  Please see the patch "migration/postcopy: Optimize blocktime fault
  tracking with hashtable" for more information; a rough sketch of the
  idea follows this list.

- Added support for non-vCPU fault tracking

  This will be extremely useful when e.g. KVM async page fault is enabled,
  because vCPUs almost never block.
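
To make the hashtable idea concrete, here is a minimal sketch of the
shape (illustrative only; the names are made up and do not match the
series, see the hashtable patch for the real code):

    #include <glib.h>
    #include <stdint.h>

    typedef struct {
        uint64_t start_us;   /* timestamp of the first fault on this page */
        int nr_waiters;      /* threads currently blocked on this page */
    } FaultEntry;

    /* Table created as: g_hash_table_new_full(g_direct_hash,
     *                                         g_direct_equal,
     *                                         NULL, g_free); */

    /* On fault: insert (or find) the record keyed by the faulted address */
    static void fault_begin(GHashTable *faults, uintptr_t addr,
                            uint64_t now_us)
    {
        FaultEntry *e = g_hash_table_lookup(faults, (gpointer)addr);

        if (!e) {
            e = g_new0(FaultEntry, 1);
            e->start_us = now_us;
            g_hash_table_insert(faults, (gpointer)addr, e);
        }
        e->nr_waiters++;
    }

    /* On page arrival: O(1) lookup and removal, no per-vCPU array scan */
    static void fault_end(GHashTable *faults, uintptr_t addr,
                          uint64_t now_us)
    {
        FaultEntry *e = g_hash_table_lookup(faults, (gpointer)addr);

        if (e) {
            uint64_t blocked_us = now_us - e->start_us;

            /* account blocked_us once per waiter, then drop the record */
            (void)blocked_us;
            g_hash_table_remove(faults, (gpointer)addr);
        }
    }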

Test Results
============

I did quite a few tests with the feature after the rewrite.  It looks
pretty good so far, and I plan to throw my own scripts away unless they
prove more useful.

I was testing on an 80-vCPU VM with 16GB memory, the best I could find at
hand.  The overhead on page fault latency is almost negligible (numbers
in microseconds, averaged over three runs for each):

  Disabled:
  Average: 236.00 (+-4.66%)

  Enabled:
  Average: 232.67 (+-2.01%)

Enabling the feature even makes the latency slightly smaller?  Well,
that's probably noise..

Surprisingly, I also tried the old code, and its overhead is likewise
almost unmeasurable compared to the faults themselves.  I guess a few
"for loop"s over arrays of 80 elements aren't much of a hurdle when the
round trip is still ~200us.  Still, such a rewrite isn't only about
performance.  For example, arrays cannot track non-vCPU faults; the hash
table is pretty much needed, one way or another.

I think this means we should be able to enable this feature together with
postcopy-ram if we want, and unless the VM is extremely huge we shouldn't
expect much overhead.  Actually, I'd bet it works fine even with hundreds
of vCPUs, especially after this rewrite.  Adding preempt mode into the
picture, I think all three postcopy features should be enabled by default
at some point.

TODO: we may also add buckets to provide better statistics for the latency
report, e.g. how many faults were resolved within each (2^N, 2^(N+1)) us
window, etc.  A rough sketch of that idea follows.
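
Purely as an illustration of that TODO (not part of this series, and
assuming the GCC/Clang count-leading-zeros builtin), such buckets could
look like:

    #include <stdint.h>

    #define LATENCY_BUCKETS 20   /* last bucket: >= 2^19 us, ~0.5s */
    static uint64_t latency_buckets[LATENCY_BUCKETS];

    static void account_latency(uint64_t latency_us)
    {
        /* bucket index N covers the [2^N, 2^(N+1)) us window */
        int idx = latency_us ? (63 - __builtin_clzll(latency_us)) : 0;

        if (idx >= LATENCY_BUCKETS) {
            idx = LATENCY_BUCKETS - 1;
        }
        latency_buckets[idx]++;
    }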

Comments welcomed, thanks.

Peter Xu (13):
  migration: Add option to set postcopy-blocktime
  migration/postcopy: Push blocktime start/end into page req mutex
  migration/postcopy: Drop all atomic ops in blocktime feature
  migration/postcopy: Make all blocktime vars 64bits
  migration/postcopy: Drop PostcopyBlocktimeContext.start_time
  migration/postcopy: Bring blocktime layer to us level
  migration/postcopy: Add blocktime fault counts per-vcpu
  migration/postcopy: Report fault latencies in blocktime
  migration/postcopy: Initialize blocktime context only until listen
  migration/postcopy: Cache the tid->vcpu mapping for blocktime
  migration/postcopy: Cleanup the total blocktime accounting
  migration/postcopy: Optimize blocktime fault tracking with hashtable
  migration/postcopy: blocktime allows track / report non-vCPU faults

 qapi/migration.json                   |  20 +
 migration/migration.h                 |   2 +-
 migration/postcopy-ram.h              |   2 +
 migration/migration-hmp-cmds.c        |  75 ++--
 migration/migration.c                 |  24 +-
 migration/options.c                   |   2 +
 migration/postcopy-ram.c              | 518 ++++++++++++++++++++------
 tests/qtest/migration/migration-qmp.c |   4 +
 migration/trace-events                |   8 +-
 9 files changed, 497 insertions(+), 158 deletions(-)

-- 
2.49.0




* [PATCH 01/13] migration: Add option to set postcopy-blocktime
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-02 16:50   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex Peter Xu
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

Add a global property to allow enabling the postcopy-blocktime feature.
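
For illustration, assuming the usual syntax for migration global
properties, this allows flipping the capability on from the command line:

    qemu-system-x86_64 ... -global migration.postcopy-blocktime=on

which should be equivalent to enabling the "postcopy-blocktime" capability
via QMP migrate-set-capabilities.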

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/options.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/options.c b/migration/options.c
index 162c72cda4..4e923a2e07 100644
--- a/migration/options.c
+++ b/migration/options.c
@@ -187,6 +187,8 @@ const Property migration_properties[] = {
     DEFINE_PROP_MIG_CAP("x-postcopy-ram", MIGRATION_CAPABILITY_POSTCOPY_RAM),
     DEFINE_PROP_MIG_CAP("x-postcopy-preempt",
                         MIGRATION_CAPABILITY_POSTCOPY_PREEMPT),
+    DEFINE_PROP_MIG_CAP("postcopy-blocktime",
+                        MIGRATION_CAPABILITY_POSTCOPY_BLOCKTIME),
     DEFINE_PROP_MIG_CAP("x-colo", MIGRATION_CAPABILITY_X_COLO),
     DEFINE_PROP_MIG_CAP("x-release-ram", MIGRATION_CAPABILITY_RELEASE_RAM),
     DEFINE_PROP_MIG_CAP("x-return-path", MIGRATION_CAPABILITY_RETURN_PATH),
-- 
2.49.0




* [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
  2025-05-27 23:12 ` [PATCH 01/13] migration: Add option to set postcopy-blocktime Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-02 17:46   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 03/13] migration/postcopy: Drop all atomic ops in blocktime feature Peter Xu
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

The postcopy blocktime feature was tricky in that it used quite a few
atomic operations over several arrays and variables, without explaining
how that would be thread safe.  The thread safety concern here is the
concurrency between the fault thread and the fault resolution threads,
which may access the same chunks of data.  All these atomic ops can also
be expensive before it's clear how they work.

OTOH, postcopy has one page_request_mutex used to serialize the received
bitmap updates.  So far that's OK - we don't yet have a lot of threads
contending for the lock.  That might change after multifd is supported,
but that's a separate story.  What is important is that, with this mutex,
it's pretty cheap to move all the blocktime maintenance into the mutex
critical section, because the blocktime layer is lightweight: it's almost
just "remember which vCPU faulted on which address" and "OK, some fault
got resolved, calculate how long it took".  It's also an optional feature
for now (though I have thought about changing that, maybe in the future).

Let's push the blocktime layer into the mutex, so that it's always
thread-safe even without any atomic ops.

To achieve that, add a tid parameter on the fault path so that the
faulted thread ID is passed deeper into the stack, but not too deep.
While at it, add a TODO comment for the shared fault handler (for
example, vhost-user devices running with postcopy).  One reason it might
not be trivial is that vhost-user's userfaultfds are opened by the
vhost-user process, so it's pretty hard to guarantee the TID feature will
be available.  It wasn't supported before, so keep it like that for now.

Now we can be at ease, with everything protected by a mutex that we
always take anyway.

One side effect: we can finally remove one ramblock_recv_bitmap_test() in
mark_postcopy_blocktime_begin(), which was pretty weird and also included
a strange (maybe necessary.. maybe not?) operation of injecting a
blocktime entry and then quickly erasing it..  With the mutex held, and
with the call made only after checking the receive bitmap, that isn't
needed anymore.  Instead, we assert.

As another side effect, this paves the way for removing all atomic ops
from the memory accesses in the blocktime layer.

Note that we need a stub for mark_postcopy_blocktime_begin() for Windows
builds.
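
Schematically, the resulting shape (simplified from the diff below) is:

    WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
        received = ramblock_recv_bitmap_test_byte_offset(rb, start);
        if (!received) {
            /* queue the page request, then.. */
            mark_postcopy_blocktime_begin(haddr, tid, rb);
        }
    }

with mark_postcopy_blocktime_end() in qemu_ufd_copy_ioctl() moved inside
the same critical section, so that begin and end can never race.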

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.h    |  2 +-
 migration/postcopy-ram.h |  2 ++
 migration/migration.c    | 24 ++++++++++-------
 migration/postcopy-ram.c | 56 +++++++++++++++++++++-------------------
 migration/trace-events   |  2 +-
 5 files changed, 47 insertions(+), 39 deletions(-)

diff --git a/migration/migration.h b/migration/migration.h
index d53f7cad84..9ca2b21459 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -546,7 +546,7 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
 void migrate_send_rp_pong(MigrationIncomingState *mis,
                           uint32_t value);
 int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
-                              ram_addr_t start, uint64_t haddr);
+                              ram_addr_t start, uint64_t haddr, uint32_t tid);
 int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
                                       RAMBlock *rb, ram_addr_t start);
 void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index a6df1b2811..3852141d7e 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -196,5 +196,7 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
 void postcopy_preempt_setup(MigrationState *s);
 int postcopy_preempt_establish_channel(MigrationState *s);
 bool postcopy_is_paused(MigrationStatus status);
+void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
+                                   RAMBlock *rb);
 
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index e542c09755..4e917b5fb0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -576,22 +576,26 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 }
 
 int migrate_send_rp_req_pages(MigrationIncomingState *mis,
-                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)
+                              RAMBlock *rb, ram_addr_t start, uint64_t haddr,
+                              uint32_t tid)
 {
     void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
     bool received = false;
 
     WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
         received = ramblock_recv_bitmap_test_byte_offset(rb, start);
-        if (!received && !g_tree_lookup(mis->page_requested, aligned)) {
-            /*
-             * The page has not been received, and it's not yet in the page
-             * request list.  Queue it.  Set the value of element to 1, so that
-             * things like g_tree_lookup() will return TRUE (1) when found.
-             */
-            g_tree_insert(mis->page_requested, aligned, (gpointer)1);
-            qatomic_inc(&mis->page_requested_count);
-            trace_postcopy_page_req_add(aligned, mis->page_requested_count);
+        if (!received) {
+            if (!g_tree_lookup(mis->page_requested, aligned)) {
+                /*
+                 * The page has not been received, and it's not yet in the page
+                 * request list.  Queue it.  Set the value of element to 1, so that
+                 * things like g_tree_lookup() will return TRUE (1) when found.
+                 */
+                g_tree_insert(mis->page_requested, aligned, (gpointer)1);
+                qatomic_inc(&mis->page_requested_count);
+                trace_postcopy_page_req_add(aligned, mis->page_requested_count);
+            }
+            mark_postcopy_blocktime_begin(haddr, tid, rb);
         }
     }
 
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 995614b38c..9db3eefec4 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -752,8 +752,12 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
                        pagesize);
 }
 
+/*
+ * NOTE: @tid is only used when postcopy-blocktime feature is enabled, and
+ * also optional: when zero is provided, the fault accounting will be ignored.
+ */
 static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
-                                 ram_addr_t start, uint64_t haddr)
+                                 ram_addr_t start, uint64_t haddr, uint32_t tid)
 {
     void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
 
@@ -772,7 +776,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
         return received ? 0 : postcopy_place_page_zero(mis, aligned, rb);
     }
 
-    return migrate_send_rp_req_pages(mis, rb, start, haddr);
+    return migrate_send_rp_req_pages(mis, rb, start, haddr, tid);
 }
 
 /*
@@ -793,7 +797,8 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                         qemu_ram_get_idstr(rb), rb_offset);
         return postcopy_wake_shared(pcfd, client_addr, rb);
     }
-    postcopy_request_page(mis, rb, aligned_rbo, client_addr);
+    /* TODO: support blocktime tracking */
+    postcopy_request_page(mis, rb, aligned_rbo, client_addr, 0);
     return 0;
 }
 
@@ -819,17 +824,17 @@ static uint32_t get_low_time_offset(PostcopyBlocktimeContext *dc)
 }
 
 /*
- * This function is being called when pagefault occurs. It
- * tracks down vCPU blocking time.
+ * This function is being called when pagefault occurs. It tracks down vCPU
+ * blocking time.  It's protected by @page_request_mutex.
  *
  * @addr: faulted host virtual address
  * @ptid: faulted process thread id
  * @rb: ramblock appropriate to addr
  */
-static void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
-                                          RAMBlock *rb)
+void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
+                                   RAMBlock *rb)
 {
-    int cpu, already_received;
+    int cpu;
     MigrationIncomingState *mis = migration_incoming_get_current();
     PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
     uint32_t low_time_offset;
@@ -852,24 +857,19 @@ static void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
     qatomic_xchg(&dc->vcpu_addr[cpu], addr);
 
     /*
-     * check it here, not at the beginning of the function,
-     * due to, check could occur early than bitmap_set in
-     * qemu_ufd_copy_ioctl
+     * The caller should only inject a blocktime entry when the page is
+     * yet missing.
      */
-    already_received = ramblock_recv_bitmap_test(rb, (void *)addr);
-    if (already_received) {
-        qatomic_xchg(&dc->vcpu_addr[cpu], 0);
-        qatomic_xchg(&dc->page_fault_vcpu_time[cpu], 0);
-        qatomic_dec(&dc->smp_cpus_down);
-    }
+    assert(!ramblock_recv_bitmap_test(rb, (void *)addr));
+
     trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
-                                        cpu, already_received);
+                                        cpu);
 }
 
 /*
- *  This function just provide calculated blocktime per cpu and trace it.
- *  Total blocktime is calculated in mark_postcopy_blocktime_end.
- *
+ * This function just provide calculated blocktime per cpu and trace it.
+ * Total blocktime is calculated in mark_postcopy_blocktime_end.  It's
+ * protected by @page_request_mutex.
  *
  * Assume we have 3 CPU
  *
@@ -1068,17 +1068,14 @@ static void *postcopy_ram_fault_thread(void *opaque)
                                                 qemu_ram_get_idstr(rb),
                                                 rb_offset,
                                                 msg.arg.pagefault.feat.ptid);
-            mark_postcopy_blocktime_begin(
-                    (uintptr_t)(msg.arg.pagefault.address),
-                                msg.arg.pagefault.feat.ptid, rb);
-
 retry:
             /*
              * Send the request to the source - we want to request one
              * of our host page sizes (which is >= TPS)
              */
             ret = postcopy_request_page(mis, rb, rb_offset,
-                                        msg.arg.pagefault.address);
+                                        msg.arg.pagefault.address,
+                                        msg.arg.pagefault.feat.ptid);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
                 postcopy_pause_fault_thread(mis);
@@ -1299,8 +1296,8 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
                 qemu_cond_signal(&mis->page_request_cond);
             }
         }
-        qemu_mutex_unlock(&mis->page_request_mutex);
         mark_postcopy_blocktime_end((uintptr_t)host_addr);
+        qemu_mutex_unlock(&mis->page_request_mutex);
     }
     return ret;
 }
@@ -1430,6 +1427,11 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
 {
     g_assert_not_reached();
 }
+
+void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
+                                   RAMBlock *rb)
+{
+}
 #endif
 
 /* ------------------------------------------------------------------------- */
diff --git a/migration/trace-events b/migration/trace-events
index dcd8fe9a0c..917f521e88 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -285,7 +285,7 @@ postcopy_nhp_range(const char *ramblock, void *host_addr, size_t offset, size_t
 postcopy_place_page(void *host_addr) "host=%p"
 postcopy_place_page_zero(void *host_addr) "host=%p"
 postcopy_ram_enable_notify(void) ""
-mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu, int received) "addr: 0x%" PRIx64 ", dd: %p, time: %u, cpu: %d, already_received: %d"
+mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, cpu: %d"
 mark_postcopy_blocktime_end(uint64_t addr, void *dd, uint32_t time, int affected_cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, affected_cpu: %d"
 postcopy_pause_fault_thread(void) ""
 postcopy_pause_fault_thread_continued(void) ""
-- 
2.49.0




* [PATCH 03/13] migration/postcopy: Drop all atomic ops in blocktime feature
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
  2025-05-27 23:12 ` [PATCH 01/13] migration: Add option to set postcopy-blocktime Peter Xu
  2025-05-27 23:12 ` [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-02 18:00   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 04/13] migration/postcopy: Make all blocktime vars 64bits Peter Xu
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

Now with the mutex protection it's not needed anymore.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 9db3eefec4..0b4efdb7fe 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -849,12 +849,12 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
 
     low_time_offset = get_low_time_offset(dc);
     if (dc->vcpu_addr[cpu] == 0) {
-        qatomic_inc(&dc->smp_cpus_down);
+        dc->smp_cpus_down++;
     }
 
-    qatomic_xchg(&dc->last_begin, low_time_offset);
-    qatomic_xchg(&dc->page_fault_vcpu_time[cpu], low_time_offset);
-    qatomic_xchg(&dc->vcpu_addr[cpu], addr);
+    dc->last_begin = low_time_offset;
+    dc->page_fault_vcpu_time[cpu] = low_time_offset;
+    dc->vcpu_addr[cpu] = addr;
 
     /*
      * The caller should only inject a blocktime entry when the page is
@@ -915,29 +915,26 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
     for (i = 0; i < smp_cpus; i++) {
         uint32_t vcpu_blocktime = 0;
 
-        read_vcpu_time = qatomic_fetch_add(&dc->page_fault_vcpu_time[i], 0);
-        if (qatomic_fetch_add(&dc->vcpu_addr[i], 0) != addr ||
-            read_vcpu_time == 0) {
+        read_vcpu_time = dc->page_fault_vcpu_time[i];
+        if (dc->vcpu_addr[i] != addr || read_vcpu_time == 0) {
             continue;
         }
-        qatomic_xchg(&dc->vcpu_addr[i], 0);
+        dc->vcpu_addr[i] = 0;
         vcpu_blocktime = low_time_offset - read_vcpu_time;
         affected_cpu += 1;
         /* we need to know is that mark_postcopy_end was due to
          * faulted page, another possible case it's prefetched
          * page and in that case we shouldn't be here */
-        if (!vcpu_total_blocktime &&
-            qatomic_fetch_add(&dc->smp_cpus_down, 0) == smp_cpus) {
+        if (!vcpu_total_blocktime && dc->smp_cpus_down == smp_cpus) {
             vcpu_total_blocktime = true;
         }
         /* continue cycle, due to one page could affect several vCPUs */
         dc->vcpu_blocktime[i] += vcpu_blocktime;
     }
 
-    qatomic_sub(&dc->smp_cpus_down, affected_cpu);
+    dc->smp_cpus_down -= affected_cpu;
     if (vcpu_total_blocktime) {
-        dc->total_blocktime += low_time_offset - qatomic_fetch_add(
-                &dc->last_begin, 0);
+        dc->total_blocktime += low_time_offset - dc->last_begin;
     }
     trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime,
                                       affected_cpu);
-- 
2.49.0




* [PATCH 04/13] migration/postcopy: Make all blocktime vars 64bits
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (2 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 03/13] migration/postcopy: Drop all atomic ops in blocktime feature Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-02 20:42   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time Peter Xu
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

I am guessing it used to be 32 bits because of the atomic ops.  Now that
all the atomic ops are gone and we're protected by a mutex instead, it's
OK to switch to 64 bits.

Reasons to move over:

  - Allow further patches to change the unit from ms to us: with postcopy
  preempt mode, we're really at the level of hundreds of microseconds for
  blocktime.  We'd better be able to trap those.

  - This also paves the way for removing some tricks the original version
  used to avoid overflows; e.g., start_time was almost only useful for
  making sure the sampled timestamp wouldn't overflow a 32-bit field.

  - This prepares for further reports on top of the data already
  collected, e.g. average page fault latencies.  Once averages are
  involved, milliseconds are simply too coarse-grained.
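
For reference on the overflow concern (simple arithmetic, not from the
series): a 32-bit microsecond counter wraps after 2^32 us, i.e. roughly
4295 seconds or ~72 minutes, which a long postcopy run can exceed; with
64 bits the wrap point is effectively unreachable.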

While at it:

  - Rename page_fault_vcpu_time to vcpu_blocktime_start.

  - Rename vcpu_blocktime to vcpu_blocktime_total.

  - Touch up the trace-events to not dump the blocktime ctx pointer.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 50 ++++++++++++++++++++--------------------
 migration/trace-events   |  4 ++--
 2 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 0b4efdb7fe..72f9075e69 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -112,14 +112,15 @@ void postcopy_thread_create(MigrationIncomingState *mis,
 
 typedef struct PostcopyBlocktimeContext {
     /* time when page fault initiated per vCPU */
-    uint32_t *page_fault_vcpu_time;
+    uint64_t *vcpu_blocktime_start;
+    /* blocktime per vCPU */
+    uint64_t *vcpu_blocktime_total;
     /* page address per vCPU */
     uintptr_t *vcpu_addr;
-    uint32_t total_blocktime;
-    /* blocktime per vCPU */
-    uint32_t *vcpu_blocktime;
+    /* total blocktime when all vCPUs are stopped */
+    uint64_t total_blocktime;
     /* point in time when last page fault was initiated */
-    uint32_t last_begin;
+    uint64_t last_begin;
     /* number of vCPU are suspended */
     int smp_cpus_down;
     uint64_t start_time;
@@ -133,9 +134,9 @@ typedef struct PostcopyBlocktimeContext {
 
 static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
 {
-    g_free(ctx->page_fault_vcpu_time);
+    g_free(ctx->vcpu_blocktime_start);
+    g_free(ctx->vcpu_blocktime_total);
     g_free(ctx->vcpu_addr);
-    g_free(ctx->vcpu_blocktime);
     g_free(ctx);
 }
 
@@ -151,13 +152,14 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
     MachineState *ms = MACHINE(qdev_get_machine());
     unsigned int smp_cpus = ms->smp.cpus;
     PostcopyBlocktimeContext *ctx = g_new0(PostcopyBlocktimeContext, 1);
-    ctx->page_fault_vcpu_time = g_new0(uint32_t, smp_cpus);
-    ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus);
-    ctx->vcpu_blocktime = g_new0(uint32_t, smp_cpus);
 
+    ctx->vcpu_blocktime_start = g_new0(uint64_t, smp_cpus);
+    ctx->vcpu_blocktime_total = g_new0(uint64_t, smp_cpus);
+    ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus);
     ctx->exit_notifier.notify = migration_exit_cb;
     ctx->start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     qemu_add_exit_notifier(&ctx->exit_notifier);
+
     return ctx;
 }
 
@@ -168,7 +170,7 @@ static uint32List *get_vcpu_blocktime_list(PostcopyBlocktimeContext *ctx)
     int i;
 
     for (i = ms->smp.cpus - 1; i >= 0; i--) {
-        QAPI_LIST_PREPEND(list, ctx->vcpu_blocktime[i]);
+        QAPI_LIST_PREPEND(list, (uint32_t)ctx->vcpu_blocktime_total[i]);
     }
 
     return list;
@@ -191,12 +193,12 @@ void fill_destination_postcopy_migration_info(MigrationInfo *info)
     }
 
     info->has_postcopy_blocktime = true;
-    info->postcopy_blocktime = bc->total_blocktime;
+    info->postcopy_blocktime = (uint32_t)bc->total_blocktime;
     info->has_postcopy_vcpu_blocktime = true;
     info->postcopy_vcpu_blocktime = get_vcpu_blocktime_list(bc);
 }
 
-static uint32_t get_postcopy_total_blocktime(void)
+static uint64_t get_postcopy_total_blocktime(void)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     PostcopyBlocktimeContext *bc = mis->blocktime_ctx;
@@ -816,11 +818,9 @@ static int get_mem_fault_cpu_index(uint32_t pid)
     return -1;
 }
 
-static uint32_t get_low_time_offset(PostcopyBlocktimeContext *dc)
+static uint64_t get_low_time_offset(PostcopyBlocktimeContext *dc)
 {
-    int64_t start_time_offset = qemu_clock_get_ms(QEMU_CLOCK_REALTIME) -
-                                    dc->start_time;
-    return start_time_offset < 1 ? 1 : start_time_offset & UINT32_MAX;
+    return (uint64_t)qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - dc->start_time;
 }
 
 /*
@@ -837,7 +837,7 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
     int cpu;
     MigrationIncomingState *mis = migration_incoming_get_current();
     PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
-    uint32_t low_time_offset;
+    uint64_t low_time_offset;
 
     if (!dc || ptid == 0) {
         return;
@@ -853,7 +853,7 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
     }
 
     dc->last_begin = low_time_offset;
-    dc->page_fault_vcpu_time[cpu] = low_time_offset;
+    dc->vcpu_blocktime_start[cpu] = low_time_offset;
     dc->vcpu_addr[cpu] = addr;
 
     /*
@@ -862,7 +862,7 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
      */
     assert(!ramblock_recv_bitmap_test(rb, (void *)addr));
 
-    trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
+    trace_mark_postcopy_blocktime_begin(addr, dc->vcpu_blocktime_start[cpu],
                                         cpu);
 }
 
@@ -901,7 +901,7 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
     unsigned int smp_cpus = ms->smp.cpus;
     int i, affected_cpu = 0;
     bool vcpu_total_blocktime = false;
-    uint32_t read_vcpu_time, low_time_offset;
+    uint64_t read_vcpu_time, low_time_offset;
 
     if (!dc) {
         return;
@@ -913,9 +913,9 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
      * optimal, more optimal algorithm is keeping tree or hash
      * where key is address value is a list of  */
     for (i = 0; i < smp_cpus; i++) {
-        uint32_t vcpu_blocktime = 0;
+        uint64_t vcpu_blocktime = 0;
 
-        read_vcpu_time = dc->page_fault_vcpu_time[i];
+        read_vcpu_time = dc->vcpu_blocktime_start[i];
         if (dc->vcpu_addr[i] != addr || read_vcpu_time == 0) {
             continue;
         }
@@ -929,14 +929,14 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
             vcpu_total_blocktime = true;
         }
         /* continue cycle, due to one page could affect several vCPUs */
-        dc->vcpu_blocktime[i] += vcpu_blocktime;
+        dc->vcpu_blocktime_total[i] += vcpu_blocktime;
     }
 
     dc->smp_cpus_down -= affected_cpu;
     if (vcpu_total_blocktime) {
         dc->total_blocktime += low_time_offset - dc->last_begin;
     }
-    trace_mark_postcopy_blocktime_end(addr, dc, dc->total_blocktime,
+    trace_mark_postcopy_blocktime_end(addr, dc->total_blocktime,
                                       affected_cpu);
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index 917f521e88..02cdb6e7cc 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -285,8 +285,8 @@ postcopy_nhp_range(const char *ramblock, void *host_addr, size_t offset, size_t
 postcopy_place_page(void *host_addr) "host=%p"
 postcopy_place_page_zero(void *host_addr) "host=%p"
 postcopy_ram_enable_notify(void) ""
-mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, cpu: %d"
-mark_postcopy_blocktime_end(uint64_t addr, void *dd, uint32_t time, int affected_cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, affected_cpu: %d"
+mark_postcopy_blocktime_begin(uint64_t addr, uint64_t time, int cpu) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", cpu: %d"
+mark_postcopy_blocktime_end(uint64_t addr, uint64_t time, int affected_cpu) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", affected_cpus: %d"
 postcopy_pause_fault_thread(void) ""
 postcopy_pause_fault_thread_continued(void) ""
 postcopy_pause_fast_load(void) ""
-- 
2.49.0




* [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (3 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 04/13] migration/postcopy: Make all blocktime vars 64bits Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-03 15:52   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 06/13] migration/postcopy: Bring blocktime layer to us level Peter Xu
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

Now, with 64 bits, the offsetting using start_time is not needed anymore,
because the fields can always hold the whole timestamp.

Then drop the unused parameter in get_low_time_offset() altogether.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 72f9075e69..c6ae4b650c 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -123,7 +123,6 @@ typedef struct PostcopyBlocktimeContext {
     uint64_t last_begin;
     /* number of vCPU are suspended */
     int smp_cpus_down;
-    uint64_t start_time;
 
     /*
      * Handler for exit event, necessary for
@@ -157,7 +156,6 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
     ctx->vcpu_blocktime_total = g_new0(uint64_t, smp_cpus);
     ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus);
     ctx->exit_notifier.notify = migration_exit_cb;
-    ctx->start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     qemu_add_exit_notifier(&ctx->exit_notifier);
 
     return ctx;
@@ -818,9 +816,9 @@ static int get_mem_fault_cpu_index(uint32_t pid)
     return -1;
 }
 
-static uint64_t get_low_time_offset(PostcopyBlocktimeContext *dc)
+static uint64_t get_low_time_offset(void)
 {
-    return (uint64_t)qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - dc->start_time;
+    return (uint64_t)qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 }
 
 /*
@@ -847,7 +845,7 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
         return;
     }
 
-    low_time_offset = get_low_time_offset(dc);
+    low_time_offset = get_low_time_offset();
     if (dc->vcpu_addr[cpu] == 0) {
         dc->smp_cpus_down++;
     }
@@ -907,7 +905,7 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
         return;
     }
 
-    low_time_offset = get_low_time_offset(dc);
+    low_time_offset = get_low_time_offset();
     /* lookup cpu, to clear it,
      * that algorithm looks straightforward, but it's not
      * optimal, more optimal algorithm is keeping tree or hash
-- 
2.49.0




* [PATCH 06/13] migration/postcopy: Bring blocktime layer to us level
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (4 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-03 15:54   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 07/13] migration/postcopy: Add blocktime fault counts per-vcpu Peter Xu
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

With 64-bit fields, this is trivial.  One caution: the values exposed in
QMP are still declared in milliseconds (ms).  Hence a conversion is
needed when exporting the values to existing QMP queries.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index c6ae4b650c..5cbc7aa77e 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -110,6 +110,7 @@ void postcopy_thread_create(MigrationIncomingState *mis,
 #include <sys/eventfd.h>
 #include <linux/userfaultfd.h>
 
+/* All the time records are in unit of microseconds (us) */
 typedef struct PostcopyBlocktimeContext {
     /* time when page fault initiated per vCPU */
     uint64_t *vcpu_blocktime_start;
@@ -168,7 +169,8 @@ static uint32List *get_vcpu_blocktime_list(PostcopyBlocktimeContext *ctx)
     int i;
 
     for (i = ms->smp.cpus - 1; i >= 0; i--) {
-        QAPI_LIST_PREPEND(list, (uint32_t)ctx->vcpu_blocktime_total[i]);
+        QAPI_LIST_PREPEND(
+            list, (uint32_t)(ctx->vcpu_blocktime_total[i] / 1000));
     }
 
     return list;
@@ -191,7 +193,7 @@ void fill_destination_postcopy_migration_info(MigrationInfo *info)
     }
 
     info->has_postcopy_blocktime = true;
-    info->postcopy_blocktime = (uint32_t)bc->total_blocktime;
+    info->postcopy_blocktime = (uint32_t)(bc->total_blocktime / 1000);
     info->has_postcopy_vcpu_blocktime = true;
     info->postcopy_vcpu_blocktime = get_vcpu_blocktime_list(bc);
 }
@@ -816,9 +818,9 @@ static int get_mem_fault_cpu_index(uint32_t pid)
     return -1;
 }
 
-static uint64_t get_low_time_offset(void)
+static uint64_t get_current_us(void)
 {
-    return (uint64_t)qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    return (uint64_t)qemu_clock_get_us(QEMU_CLOCK_REALTIME);
 }
 
 /*
@@ -835,7 +837,7 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
     int cpu;
     MigrationIncomingState *mis = migration_incoming_get_current();
     PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
-    uint64_t low_time_offset;
+    uint64_t current_us;
 
     if (!dc || ptid == 0) {
         return;
@@ -845,13 +847,13 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
         return;
     }
 
-    low_time_offset = get_low_time_offset();
+    current_us = get_current_us();
     if (dc->vcpu_addr[cpu] == 0) {
         dc->smp_cpus_down++;
     }
 
-    dc->last_begin = low_time_offset;
-    dc->vcpu_blocktime_start[cpu] = low_time_offset;
+    dc->last_begin = current_us;
+    dc->vcpu_blocktime_start[cpu] = current_us;
     dc->vcpu_addr[cpu] = addr;
 
     /*
@@ -899,13 +901,13 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
     unsigned int smp_cpus = ms->smp.cpus;
     int i, affected_cpu = 0;
     bool vcpu_total_blocktime = false;
-    uint64_t read_vcpu_time, low_time_offset;
+    uint64_t read_vcpu_time, current_us;
 
     if (!dc) {
         return;
     }
 
-    low_time_offset = get_low_time_offset();
+    current_us = get_current_us();
     /* lookup cpu, to clear it,
      * that algorithm looks straightforward, but it's not
      * optimal, more optimal algorithm is keeping tree or hash
@@ -918,7 +920,7 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
             continue;
         }
         dc->vcpu_addr[i] = 0;
-        vcpu_blocktime = low_time_offset - read_vcpu_time;
+        vcpu_blocktime = current_us - read_vcpu_time;
         affected_cpu += 1;
         /* we need to know is that mark_postcopy_end was due to
          * faulted page, another possible case it's prefetched
@@ -932,7 +934,7 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
 
     dc->smp_cpus_down -= affected_cpu;
     if (vcpu_total_blocktime) {
-        dc->total_blocktime += low_time_offset - dc->last_begin;
+        dc->total_blocktime += current_us - dc->last_begin;
     }
     trace_mark_postcopy_blocktime_end(addr, dc->total_blocktime,
                                       affected_cpu);
-- 
2.49.0




* [PATCH 07/13] migration/postcopy: Add blocktime fault counts per-vcpu
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (5 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 06/13] migration/postcopy: Bring blocktime layer to us level Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-03 15:59   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime Peter Xu
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

Add a field to count how many remote faults one vCPU has taken.  So far
it's still not used, but will be soon.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 5cbc7aa77e..46a8fdb6c2 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -116,6 +116,8 @@ typedef struct PostcopyBlocktimeContext {
     uint64_t *vcpu_blocktime_start;
     /* blocktime per vCPU */
     uint64_t *vcpu_blocktime_total;
+    /* count of faults per vCPU */
+    uint64_t *vcpu_faults_count;
     /* page address per vCPU */
     uintptr_t *vcpu_addr;
     /* total blocktime when all vCPUs are stopped */
@@ -136,6 +138,7 @@ static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
 {
     g_free(ctx->vcpu_blocktime_start);
     g_free(ctx->vcpu_blocktime_total);
+    g_free(ctx->vcpu_faults_count);
     g_free(ctx->vcpu_addr);
     g_free(ctx);
 }
@@ -155,6 +158,7 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
 
     ctx->vcpu_blocktime_start = g_new0(uint64_t, smp_cpus);
     ctx->vcpu_blocktime_total = g_new0(uint64_t, smp_cpus);
+    ctx->vcpu_faults_count = g_new0(uint64_t, smp_cpus);
     ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus);
     ctx->exit_notifier.notify = migration_exit_cb;
     qemu_add_exit_notifier(&ctx->exit_notifier);
@@ -855,6 +859,7 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
     dc->last_begin = current_us;
     dc->vcpu_blocktime_start[cpu] = current_us;
     dc->vcpu_addr[cpu] = addr;
+    dc->vcpu_faults_count[cpu]++;
 
     /*
      * The caller should only inject a blocktime entry when the page is
-- 
2.49.0




* [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (6 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 07/13] migration/postcopy: Add blocktime fault counts per-vcpu Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-02  9:26   ` Markus Armbruster
  2025-06-03 16:07   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 09/13] migration/postcopy: Initialize blocktime context only until listen Peter Xu
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin, Markus Armbruster

Blocktime so far only cares about the time one vCPU (or the whole system)
was blocked.  It would also be helpful if it could report the latency of
page requests, which can be very sensitive during postcopy.

Blocktime itself is sometimes not very important, especially considering
KVM async PF support, with which vCPUs are literally almost never
blocked, because the guest OS is smart enough to switch to another task
when a remote fault is needed.

However, latency is still sensitive and important, because even if the
guest vCPUs keep running threads that do not need a remote fault, the
workload that accessed the missing page is still affected.

Add two entries to the report, showing how long it takes to resolve a
remote fault.  Mention in the QAPI doc that this is not the real average
fault latency, but only covers faults that required a remote page
request.

Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice;
meanwhile, add checks for the new entries to the qtests for all postcopy
tests.
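
For illustration (field values made up), a query-migrate reply during
postcopy would then carry, next to the existing blocktime fields:

    "postcopy-latency": 215,
    "postcopy-vcpu-latency": [203, 219, 227, 210]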

Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 qapi/migration.json                   | 13 +++++
 migration/migration-hmp-cmds.c        | 70 ++++++++++++++++++---------
 migration/postcopy-ram.c              | 48 ++++++++++++------
 tests/qtest/migration/migration-qmp.c |  3 ++
 4 files changed, 97 insertions(+), 37 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 8b9c53595c..8b13cea169 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -236,6 +236,17 @@
 #     This is only present when the postcopy-blocktime migration
 #     capability is enabled.  (Since 3.0)
 #
+# @postcopy-latency: average remote page fault latency (in us).  Note that
+#     this doesn't include all faults, but only the ones that require a
+#     remote page request.  So it should be always bigger than the real
+#     average page fault latency. This is only present when the
+#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
+#
+# @postcopy-vcpu-latency: average remote page fault latency per vCPU (in
+#     us).  It has the same definition of @postcopy-latency, but instead
+#     this is the per-vCPU statistics. This is only present when the
+#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
+#
 # @socket-address: Only used for tcp, to know what the real port is
 #     (Since 4.0)
 #
@@ -275,6 +286,8 @@
            '*blocked-reasons': ['str'],
            '*postcopy-blocktime': 'uint32',
            '*postcopy-vcpu-blocktime': ['uint32'],
+           '*postcopy-latency': 'uint64',
+           '*postcopy-vcpu-latency': ['uint64'],
            '*socket-address': ['SocketAddress'],
            '*dirty-limit-throttle-time-per-round': 'uint64',
            '*dirty-limit-ring-full-time': 'uint64'} }
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index 3cf890b887..a18049a7e8 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -52,6 +52,53 @@ static void migration_global_dump(Monitor *mon)
                    ms->clear_bitmap_shift);
 }
 
+static void migration_dump_blocktime(Monitor *mon, MigrationInfo *info)
+{
+    if (info->has_postcopy_blocktime) {
+        monitor_printf(mon, "Postcopy Blocktime (ms): %" PRIu32 "\n",
+                       info->postcopy_blocktime);
+    }
+
+    if (info->has_postcopy_vcpu_blocktime) {
+        uint32List *item = info->postcopy_vcpu_blocktime;
+        int count = 0;
+
+        monitor_printf(mon, "Postcopy vCPU Blocktime (ms): \n [");
+
+        while (item) {
+            monitor_printf(mon, "%"PRIu32", ", item->value);
+            item = item->next;
+            /* Each line 10 vcpu results, newline if there's more */
+            if ((++count % 10 == 0) && item) {
+                monitor_printf(mon, "\n  ");
+            }
+        }
+        monitor_printf(mon, "\b\b]\n");
+    }
+
+    if (info->has_postcopy_latency) {
+        monitor_printf(mon, "Postcopy Latency (us): %" PRIu64 "\n",
+                       info->postcopy_latency);
+    }
+
+    if (info->has_postcopy_vcpu_latency) {
+        uint64List *item = info->postcopy_vcpu_latency;
+        int count = 0;
+
+        monitor_printf(mon, "Postcopy vCPU Latencies (us): \n [");
+
+        while (item) {
+            monitor_printf(mon, "%"PRIu64", ", item->value);
+            item = item->next;
+            /* Each line 10 vcpu results, newline if there's more */
+            if ((++count % 10 == 0) && item) {
+                monitor_printf(mon, "\n  ");
+            }
+        }
+        monitor_printf(mon, "\b\b]\n");
+    }
+}
+
 void hmp_info_migrate(Monitor *mon, const QDict *qdict)
 {
     bool show_all = qdict_get_try_bool(qdict, "all", false);
@@ -202,28 +249,7 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
                        info->dirty_limit_ring_full_time);
     }
 
-    if (info->has_postcopy_blocktime) {
-        monitor_printf(mon, "Postcopy Blocktime (ms): %" PRIu32 "\n",
-                       info->postcopy_blocktime);
-    }
-
-    if (info->has_postcopy_vcpu_blocktime) {
-        uint32List *item = info->postcopy_vcpu_blocktime;
-        int count = 0;
-
-        monitor_printf(mon, "Postcopy vCPU Blocktime (ms): \n [");
-
-        while (item) {
-            monitor_printf(mon, "%"PRIu32", ", item->value);
-            item = item->next;
-            /* Each line 10 vcpu results, newline if there's more */
-            if ((++count % 10 == 0) && item) {
-                monitor_printf(mon, "\n  ");
-            }
-        }
-        monitor_printf(mon, "\b\b]\n");
-    }
-
+    migration_dump_blocktime(mon, info);
 out:
     qapi_free_MigrationInfo(info);
 }
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 46a8fdb6c2..2aca41b3d7 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -166,20 +166,6 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
     return ctx;
 }
 
-static uint32List *get_vcpu_blocktime_list(PostcopyBlocktimeContext *ctx)
-{
-    MachineState *ms = MACHINE(qdev_get_machine());
-    uint32List *list = NULL;
-    int i;
-
-    for (i = ms->smp.cpus - 1; i >= 0; i--) {
-        QAPI_LIST_PREPEND(
-            list, (uint32_t)(ctx->vcpu_blocktime_total[i] / 1000));
-    }
-
-    return list;
-}
-
 /*
  * This function just populates MigrationInfo from postcopy's
  * blocktime context. It will not populate MigrationInfo,
@@ -191,15 +177,47 @@ void fill_destination_postcopy_migration_info(MigrationInfo *info)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     PostcopyBlocktimeContext *bc = mis->blocktime_ctx;
+    MachineState *ms = MACHINE(qdev_get_machine());
+    uint64_t latency_total = 0, faults = 0;
+    uint32List *list_blocktime = NULL;
+    uint64List *list_latency = NULL;
+    int i;
 
     if (!bc) {
         return;
     }
 
+    for (i = ms->smp.cpus - 1; i >= 0; i--) {
+        uint64_t latency, total, count;
+
+        /* This is in milliseconds */
+        QAPI_LIST_PREPEND(list_blocktime,
+                          (uint32_t)(bc->vcpu_blocktime_total[i] / 1000));
+
+        /* The rest in microseconds */
+        total = bc->vcpu_blocktime_total[i];
+        latency_total += total;
+        count = bc->vcpu_faults_count[i];
+        faults += count;
+
+        if (count) {
+            latency = total / count;
+        } else {
+            /* No fault detected */
+            latency = 0;
+        }
+
+        QAPI_LIST_PREPEND(list_latency, latency);
+    }
+
     info->has_postcopy_blocktime = true;
     info->postcopy_blocktime = (uint32_t)(bc->total_blocktime / 1000);
     info->has_postcopy_vcpu_blocktime = true;
-    info->postcopy_vcpu_blocktime = get_vcpu_blocktime_list(bc);
+    info->postcopy_vcpu_blocktime = list_blocktime;
+    info->has_postcopy_latency = true;
+    info->postcopy_latency = faults ? (latency_total / faults) : 0;
+    info->has_postcopy_vcpu_latency = true;
+    info->postcopy_vcpu_latency = list_latency;
 }
 
 static uint64_t get_postcopy_total_blocktime(void)
diff --git a/tests/qtest/migration/migration-qmp.c b/tests/qtest/migration/migration-qmp.c
index fb59741b2c..1a5ab2d229 100644
--- a/tests/qtest/migration/migration-qmp.c
+++ b/tests/qtest/migration/migration-qmp.c
@@ -358,6 +358,9 @@ void read_blocktime(QTestState *who)
 
     rsp_return = migrate_query_not_failed(who);
     g_assert(qdict_haskey(rsp_return, "postcopy-blocktime"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-vcpu-blocktime"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-latency"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-vcpu-latency"));
     qobject_unref(rsp_return);
 }
 
-- 
2.49.0




* [PATCH 09/13] migration/postcopy: Initialize blocktime context only until listen
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (7 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-03 16:08   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 10/13] migration/postcopy: Cache the tid->vcpu mapping for blocktime Peter Xu
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

Before this patch, the blocktime context could be created very early,
because postcopy_ram_supported_by_host() <- migrate_caps_check() can
happen during migration object init.

The trick here is that the blocktime context needs system vCPU
information, which can still change after that point.  I didn't verify
it, but it doesn't sound right.

Now move it out and initialize the context only when postcopy listen
starts.  That is already during a migration, so the vCPU topology should
be guaranteed not to change on either side.

While at it, assert that the ctx hasn't been created yet; the old "if"
check isn't needed now that we're sure this happens only once.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2aca41b3d7..e73f5d1cc7 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -324,13 +324,13 @@ static bool ufd_check_and_apply(int ufd, MigrationIncomingState *mis,
     }
 
 #ifdef UFFD_FEATURE_THREAD_ID
+    /*
+     * Postcopy blocktime conditionally needs THREAD_ID feature (introduced
+     * to Linux in 2017). Always try to enable it when QEMU is compiled
+     * with such environment.
+     */
     if (UFFD_FEATURE_THREAD_ID & supported_features) {
         asked_features |= UFFD_FEATURE_THREAD_ID;
-        if (migrate_postcopy_blocktime()) {
-            if (!mis->blocktime_ctx) {
-                mis->blocktime_ctx = blocktime_context_new();
-            }
-        }
     }
 #endif
 
@@ -1238,6 +1238,11 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
         return -1;
     }
 
+    if (migrate_postcopy_blocktime()) {
+        assert(mis->blocktime_ctx == NULL);
+        mis->blocktime_ctx = blocktime_context_new();
+    }
+
     /* Now an eventfd we use to tell the fault-thread to quit */
     mis->userfault_event_fd = eventfd(0, EFD_CLOEXEC);
     if (mis->userfault_event_fd == -1) {
-- 
2.49.0




* [PATCH 10/13] migration/postcopy: Cache the tid->vcpu mapping for blocktime
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (8 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 09/13] migration/postcopy: Initialize blocktime context only until listen Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-03 16:20   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 11/13] migration/postcopy: Cleanup the total blocktime accounting Peter Xu
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

Looking up the vCPU index for each fault can be expensive when there are
hundreds of vCPUs.  Provide a tid->vcpu cache backed by a hash table
instead, and look it up from there.

While at it, add another counter to record how many non-vCPU faults we get.
For example, the main thread can also access a guest page that is missing.
These kinds of faults have not been accounted for by blocktime so far.
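
A note on the data structure: with GLib's direct hash, a value of 0 is
indistinguishable from a failed lookup (it reads back as NULL), and vCPU
index 0 is valid, hence the cache stores a heap-allocated int as the
value.  The lookup then boils down to (simplified from the diff below):

    int *found = g_hash_table_lookup(ctx->tid_to_vcpu_hash,
                                     GUINT_TO_POINTER(tid));

    return found ? *found : -1;    /* -1 means: not a vCPU thread */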

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 68 ++++++++++++++++++++++++++++++++++------
 migration/trace-events   |  3 +-
 2 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index e73f5d1cc7..4fb1d92d7b 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -127,6 +127,17 @@ typedef struct PostcopyBlocktimeContext {
     /* number of vCPU are suspended */
     int smp_cpus_down;
 
+    /*
+     * Fast path for looking up vcpu_index from tid.  NOTE: this result
+     * only reflects the vcpu setup when postcopy is running.  It may not
+     * always match with the current vcpu setup because vcpus can be hot
+     * attached/detached after migration completes.  However this should be
+     * stable when blocktime is using the structure.
+     */
+    GHashTable *tid_to_vcpu_hash;
+    /* Count of non-vCPU faults.  This is only for debugging purposes. */
+    uint64_t non_vcpu_faults;
+
     /*
      * Handler for exit event, necessary for
      * releasing whole blocktime_ctx
@@ -136,6 +147,7 @@ typedef struct PostcopyBlocktimeContext {
 
 static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
 {
+    g_hash_table_destroy(ctx->tid_to_vcpu_hash);
     g_free(ctx->vcpu_blocktime_start);
     g_free(ctx->vcpu_blocktime_total);
     g_free(ctx->vcpu_faults_count);
@@ -150,6 +162,36 @@ static void migration_exit_cb(Notifier *n, void *data)
     destroy_blocktime_context(ctx);
 }
 
+static GHashTable *blocktime_init_tid_to_vcpu_hash(void)
+{
+    /*
+     * TID as an unsigned int can be directly used as the key.  However,
+     * CPU index can NOT be directly used as value, because CPU index can
+     * be 0, which maps to NULL.  Then at lookup time we could never tell
+     * vCPU 0 from "not found".  Hence use an indirection for CPU index.
+     */
+    GHashTable *table = g_hash_table_new_full(g_direct_hash, g_direct_equal,
+                                              NULL, g_free);
+    CPUState *cpu;
+
+    /*
+     * Initialize the tid->cpu_id mapping for lookups.  The caller needs to
+     * make sure when reaching here the CPU topology is frozen and will be
+     * stable for the whole blocktime trapping period.
+     */
+    CPU_FOREACH(cpu) {
+        int *value = g_new(int, 1);
+
+        *value = cpu->cpu_index;
+        g_hash_table_insert(table,
+                            GUINT_TO_POINTER((uint32_t)cpu->thread_id),
+                            value);
+        trace_postcopy_blocktime_tid_cpu_map(cpu->cpu_index, cpu->thread_id);
+    }
+
+    return table;
+}
+
 static struct PostcopyBlocktimeContext *blocktime_context_new(void)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
@@ -160,6 +202,8 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
     ctx->vcpu_blocktime_total = g_new0(uint64_t, smp_cpus);
     ctx->vcpu_faults_count = g_new0(uint64_t, smp_cpus);
     ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus);
+    ctx->tid_to_vcpu_hash = blocktime_init_tid_to_vcpu_hash();
+
     ctx->exit_notifier.notify = migration_exit_cb;
     qemu_add_exit_notifier(&ctx->exit_notifier);
 
@@ -826,18 +870,21 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
     return 0;
 }
 
-static int get_mem_fault_cpu_index(uint32_t pid)
+static int blocktime_get_vcpu(PostcopyBlocktimeContext *ctx, uint32_t tid)
 {
-    CPUState *cpu_iter;
+    int *found;
 
-    CPU_FOREACH(cpu_iter) {
-        if (cpu_iter->thread_id == pid) {
-            trace_get_mem_fault_cpu_index(cpu_iter->cpu_index, pid);
-            return cpu_iter->cpu_index;
-        }
+    found = g_hash_table_lookup(ctx->tid_to_vcpu_hash, GUINT_TO_POINTER(tid));
+    if (!found) {
+        /*
+         * NOTE: this is possible, because QEMU's non-vCPU threads can
+         * also access a missing page.  Or, when KVM async pf is enabled, a
+         * fault can even happen from a kworker..
+         */
+        return -1;
     }
-    trace_get_mem_fault_cpu_index(-1, pid);
-    return -1;
+
+    return *found;
 }
 
 static uint64_t get_current_us(void)
@@ -864,8 +911,9 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
     if (!dc || ptid == 0) {
         return;
     }
-    cpu = get_mem_fault_cpu_index(ptid);
+    cpu = blocktime_get_vcpu(dc, ptid);
     if (cpu < 0) {
+        dc->non_vcpu_faults++;
         return;
     }
 
diff --git a/migration/trace-events b/migration/trace-events
index 02cdb6e7cc..9c1f3b7044 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -310,8 +310,7 @@ postcopy_preempt_tls_handshake(void) ""
 postcopy_preempt_new_channel(void) ""
 postcopy_preempt_thread_entry(void) ""
 postcopy_preempt_thread_exit(void) ""
-
-get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
+postcopy_blocktime_tid_cpu_map(int cpu, uint32_t tid) "cpu: %d, tid: %u"
 
 # exec.c
 migration_exec_outgoing(const char *cmd) "cmd=%s"
-- 
2.49.0




* [PATCH 11/13] migration/postcopy: Cleanup the total blocktime accounting
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (9 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 10/13] migration/postcopy: Cache the tid->vcpu mapping for blocktime Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-03 16:22   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable Peter Xu
  2025-05-27 23:12 ` [PATCH 13/13] migration/postcopy: blocktime allows track / report non-vCPU faults Peter Xu
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

The variable vcpu_total_blocktime isn't easy to follow.  In reality, it
wants to capture the case where all vCPUs were stopped and some vCPUs are
now about to start running again.  For example, with 4 vCPUs all blocked,
the first page resolution that unblocks any of them ends the system-wide
blocked window.

The name now conflicts with vcpu_blocktime_total[]; meanwhile it's actually
not necessary to have the variable at all: since nobody touches
smp_cpus_down except ourselves, we can safely do the calculation at the
end, before decrementing smp_cpus_down.

Hopefully this makes the logic easier to read; a side benefit is that we
drop one temporary variable.
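
The resulting logic collapses to (from the diff below):

    /*
     * If all vCPUs used to be down, and copying this page would free some
     * vCPUs, then the system-level blocktime ends here.
     */
    if (dc->smp_cpus_down == smp_cpus && affected_cpu) {
        dc->total_blocktime += current_us - dc->last_begin;
    }
    dc->smp_cpus_down -= affected_cpu;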

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 4fb1d92d7b..b76ce6b363 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -971,7 +971,6 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
     MachineState *ms = MACHINE(qdev_get_machine());
     unsigned int smp_cpus = ms->smp.cpus;
     int i, affected_cpu = 0;
-    bool vcpu_total_blocktime = false;
     uint64_t read_vcpu_time, current_us;
 
     if (!dc) {
@@ -993,20 +992,19 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
         dc->vcpu_addr[i] = 0;
         vcpu_blocktime = current_us - read_vcpu_time;
         affected_cpu += 1;
-        /* we need to know is that mark_postcopy_end was due to
-         * faulted page, another possible case it's prefetched
-         * page and in that case we shouldn't be here */
-        if (!vcpu_total_blocktime && dc->smp_cpus_down == smp_cpus) {
-            vcpu_total_blocktime = true;
-        }
         /* continue cycle, due to one page could affect several vCPUs */
         dc->vcpu_blocktime_total[i] += vcpu_blocktime;
     }
 
-    dc->smp_cpus_down -= affected_cpu;
-    if (vcpu_total_blocktime) {
+    /*
+     * If all vCPUs used to be down, and copying this page would free some
+     * vCPUs, then the system-level blocktime ends here.
+     */
+    if (dc->smp_cpus_down == smp_cpus && affected_cpu) {
         dc->total_blocktime += current_us - dc->last_begin;
     }
+    dc->smp_cpus_down -= affected_cpu;
+
     trace_mark_postcopy_blocktime_end(addr, dc->total_blocktime,
                                       affected_cpu);
 }
-- 
2.49.0




* [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (10 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 11/13] migration/postcopy: Cleanup the total blocktime accounting Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-03 16:44   ` Fabiano Rosas
  2025-05-27 23:12 ` [PATCH 13/13] migration/postcopy: blocktime allows track / report non-vCPU faults Peter Xu
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin

Currently, the postcopy blocktime feature maintains vCPU fault information
using an array (vcpu_addr[]).  It has two issues.

Issue 1: Performance Concern
============================

The old algorithm was almost OK and fast on inserts, except that the lookup
is slow and won't scale if there are a lot of vCPUs: when a page is copied
during postcopy, mark_postcopy_blocktime_end() will walk the whole array
trying to find which vCPUs are blocked on the address.  So it needs an
O(N) walk for every single page resolution.

Alexey (the author of postcopy blocktime) mentioned the perf issue and how
to optimize it in a comment in the page resolution path.  The comment was
(interestingly..) not complete, but it's relatively clear what he wanted
to say about this perf issue.

Issue 2: Wrong Accounting on re-entrancies
==========================================

People might think that each vCPU should only ever get one fault at a
time, so that when the blocktime layer captures one fault on one vCPU, we
should never see another fault message on that vCPU.

That's almost correct, except in some extremely rare cases.

Case 1: it's possible the fault thread processes the userfaultfd messages
so fast that it can see >1 message on one vCPU before the previous one is
resolved.

Case 2: it's theoretically also possible that one vCPU gets more than one
message on the same fault address if a fault is retried by the
kernel (e.g., handle_userfault() got interrupted before page resolution).

As this info might be important, instead of putting it in the commit
message, I put more details into the code as comments when introducing an
array maintaining concurrent faults on one vCPU.  Please refer to the
comments for details on both cases, especially case 1, which can be
tricky.

Case 1 sounds rare, but I can easily reproduce it locally by running
blocktime together with migration-test on vanilla postcopy.

New Design
==========

This patch should do almost what Alexey mentioned, but slightly
differently: instead of having an array to maintain vCPU fault addresses,
for each fault message we push an entry into a hash, indexed by the
fault address.

With the hash, we can replace the two old structs: the vcpu_addr[] array,
and also the array storing the start time of each fault.  However, due to
the above, we need one more counter array to account for concurrent faults
on the same vCPU.  That was arguably needed even in the old code; it's
just that the old code was buggy and would blindly overwrite an existing
entry..  Now we really track everything.

A hash should be more efficient than a tree for maintaining such
addr->(cpu, fault_time) information, so that the insert() and lookup()
paths should ideally both be ~O(1).  After all, we do not need sorting.

Here we do need one remove() after the lookup(), though.  It could be slow,
but only if many vCPUs faulted on exactly the same address (so that the
list of cpu entries is long), which should be unlikely.  Even then, it's
still a worst-case O(N) (consider 400 vCPUs faulting on the same address,
and how likely that is..), rather than O(N) on every single resolution.

While at it, touch up the tracepoints to make them slightly more useful.
One tracepoint is added when walking all the fault entries.
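
In short, the fault side records an entry under the faulted address, and
the resolve side accounts for and drops all entries of that address in one
go (simplified from the diff below):

    /* fault thread: remember (cpu, timestamp) under the faulted address */
    blocktime_fault_inject(dc, addr, cpu, current_us);

    /* resolve path: account all blocked vCPUs on this address, then drop */
    list = g_hash_table_lookup(dc->vcpu_addr_hash, (gpointer)addr);
    if (list) {
        g_list_foreach(list, blocktime_cpu_list_iter_fn, &iter);
        g_hash_table_remove(dc->vcpu_addr_hash, (gpointer)addr);
    }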

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/postcopy-ram.c | 259 ++++++++++++++++++++++++++++++++-------
 migration/trace-events   |   5 +-
 2 files changed, 216 insertions(+), 48 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index b76ce6b363..066dc9f3ce 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -112,14 +112,69 @@ void postcopy_thread_create(MigrationIncomingState *mis,
 
 /* All the time records are in unit of microseconds (us) */
 typedef struct PostcopyBlocktimeContext {
-    /* time when page fault initiated per vCPU */
-    uint64_t *vcpu_blocktime_start;
     /* blocktime per vCPU */
     uint64_t *vcpu_blocktime_total;
     /* count of faults per vCPU */
     uint64_t *vcpu_faults_count;
-    /* page address per vCPU */
-    uintptr_t *vcpu_addr;
+    /*
+     * count of currently blocked faults per vCPU.
+     *
+     * NOTE: Normally there should only be one fault in progress per vCPU
+     * thread, so logically it _seems_ vcpu_faults_current[] for any vCPU
+     * should be either zero or one.  However, there are reasons we can
+     * see >1 faults on the same vCPU thread.
+     *
+     * CASE (1): since the process of resolving faults (for example,
+     * ioctl(UFFDIO_COPY)) is done before taking the mutex that protects
+     * the blocktime context, it can happen that we read more than one
+     * faulted address per vCPU.
+     *
+     * One example where we can see >1 faulted address for one vCPU:
+     *
+     *  vcpu1 thread       fault thread         resolve thread
+     *  ============       ============         ==============
+     *
+     *  faulted on addr1
+     *                     read uffd msg (addr1)
+     *                     MUTEX_LOCK
+     *                     add entry (cpu1, addr1)
+     *                     MUTEX_UNLOCK
+     *                     request remote fault (addr1)
+     *                                          resolve fault (addr1)
+     *  addr1 resolved, continue..
+     *  faulted on addr2
+     *                     read uffd msg (addr2)
+     *                     MUTEX_LOCK
+     *                     add entry (cpu1, addr2) <--------------- [A]
+     *                     MUTEX_UNLOCK
+     *                                          MUTEX_LOCK
+     *                                          remove entry (cpu1, addr1)
+     *                                          MUTEX_UNLOCK
+     *
+     * In the above case, we may see the (cpu1, addr1) and (cpu1, addr2)
+     * entries together at [A], when the fault thread gets the lock before
+     * the resolve thread.  Use this counter to handle such cases: only
+     * when it reaches zero do we know the vCPU is not blocked anymore.
+     *
+     * CASE (2): theoretically (the author admits to not having verified
+     * this..), one vCPU thread can also generate more than one userfaultfd
+     * message on the same address.  It can happen e.g. when the fault gets
+     * retried by the kernel before a resolution arrives.  In that extremely
+     * rare case, we could also see two (cpu1, addr1) entries.
+     *
+     * In all cases, be prepared with such re-entrancies with this array.
+     *
+     * Using uint8_t should be more than enough for now.  For example,
+     * when there's only one resolve thread (the postcopy ram listening
+     * thread), the max number of concurrent fault entries should be two.
+     */
+    uint8_t *vcpu_faults_current;
+    /*
+     * The hash that contains addr1->[(cpu1,ts1),(cpu2,ts2) ...] mappings.
+     * Each of the entry is a tuple of (CPU index, fault timestamp) showing
+     * that a fault was requested.
+     */
+    GHashTable *vcpu_addr_hash;
     /* total blocktime when all vCPUs are stopped */
     uint64_t total_blocktime;
     /* point in time when last page fault was initiated */
@@ -145,13 +200,38 @@ typedef struct PostcopyBlocktimeContext {
     Notifier exit_notifier;
 } PostcopyBlocktimeContext;
 
+typedef struct {
+    /* The time the fault was triggered */
+    uint64_t fault_time;
+    /* The vCPU index that was blocked */
+    int cpu;
+} BlocktimeVCPUEntry;
+
+/* Alloc an entry to record a vCPU fault */
+static BlocktimeVCPUEntry *
+blocktime_vcpu_entry_alloc(int cpu, uint64_t fault_time)
+{
+    BlocktimeVCPUEntry *entry = g_new(BlocktimeVCPUEntry, 1);
+
+    entry->fault_time = fault_time;
+    entry->cpu = cpu;
+
+    return entry;
+}
+
+/* Free a @GList of @BlocktimeVCPUEntry */
+static void blocktime_vcpu_list_free(gpointer data)
+{
+    g_list_free_full(data, g_free);
+}
+
 static void destroy_blocktime_context(struct PostcopyBlocktimeContext *ctx)
 {
     g_hash_table_destroy(ctx->tid_to_vcpu_hash);
-    g_free(ctx->vcpu_blocktime_start);
+    g_hash_table_destroy(ctx->vcpu_addr_hash);
     g_free(ctx->vcpu_blocktime_total);
     g_free(ctx->vcpu_faults_count);
-    g_free(ctx->vcpu_addr);
+    g_free(ctx->vcpu_faults_current);
     g_free(ctx);
 }
 
@@ -198,12 +278,22 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
     unsigned int smp_cpus = ms->smp.cpus;
     PostcopyBlocktimeContext *ctx = g_new0(PostcopyBlocktimeContext, 1);
 
-    ctx->vcpu_blocktime_start = g_new0(uint64_t, smp_cpus);
     ctx->vcpu_blocktime_total = g_new0(uint64_t, smp_cpus);
     ctx->vcpu_faults_count = g_new0(uint64_t, smp_cpus);
-    ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus);
+    ctx->vcpu_faults_current = g_new0(uint8_t, smp_cpus);
     ctx->tid_to_vcpu_hash = blocktime_init_tid_to_vcpu_hash();
 
+    /*
+     * The key (a host virtual address) will always be gpointer-sized on
+     * both 32-bit and 64-bit systems, so it'll fit as a direct key.
+     *
+     * The value will be a list of BlocktimeVCPUEntry entries.
+     */
+    ctx->vcpu_addr_hash = g_hash_table_new_full(g_direct_hash,
+                                                g_direct_equal,
+                                                NULL,
+                                                blocktime_vcpu_list_free);
+
     ctx->exit_notifier.notify = migration_exit_cb;
     qemu_add_exit_notifier(&ctx->exit_notifier);
 
@@ -892,6 +982,39 @@ static uint64_t get_current_us(void)
     return (uint64_t)qemu_clock_get_us(QEMU_CLOCK_REALTIME);
 }
 
+/* Inject an (cpu, fault_time) entry into the database, using addr as key */
+static void blocktime_fault_inject(PostcopyBlocktimeContext *ctx,
+                                   uintptr_t addr, int cpu, uint64_t time)
+{
+    BlocktimeVCPUEntry *entry = blocktime_vcpu_entry_alloc(cpu, time);
+    GHashTable *table = ctx->vcpu_addr_hash;
+    gpointer key = (gpointer)addr;
+    GList *head, *list;
+    gboolean result;
+
+    head = g_hash_table_lookup(table, key);
+    if (head) {
+        /*
+         * If it existed, steal @head for list operations rather than
+         * letting the hash free it, and assert that the steal succeeded.
+         */
+        result = g_hash_table_steal(table, key);
+        assert(result == TRUE);
+    }
+
+    /*
+     * Now the key is guaranteed to be absent.  Two cases:
+     *
+     * (1) There was no existing entry: the list has only the new one. Insert.
+     * (2) There were existing entries: after stealing we own the list;
+     *     prepend the new entry and re-insert.
+     */
+    list = g_list_prepend(head, entry);
+    g_hash_table_insert(table, key, list);
+
+    trace_postcopy_blocktime_begin(addr, time, cpu, !!head);
+}
+
 /*
  * This function is being called when pagefault occurs. It tracks down vCPU
  * blocking time.  It's protected by @page_request_mutex.
@@ -911,30 +1034,74 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
     if (!dc || ptid == 0) {
         return;
     }
+
+    /*
+     * The caller should only inject a blocktime entry when the page is
+     * yet missing.
+     */
+    assert(!ramblock_recv_bitmap_test(rb, (void *)addr));
+
+    current_us = get_current_us();
     cpu = blocktime_get_vcpu(dc, ptid);
-    if (cpu < 0) {
+
+    if (cpu >= 0) {
+        /* How many faults on this vCPU in total? */
+        dc->vcpu_faults_count[cpu]++;
+
+        /*
+         * Account how many concurrent faults on this vCPU we trapped.  See
+         * comments above vcpu_faults_current[] on why it can be more than one.
+         */
+        if (dc->vcpu_faults_current[cpu]++ == 0) {
+            dc->smp_cpus_down++;
+            /*
+             * We use last_begin to cover (1) the 1st fault on this specific
+             * vCPU, but meanwhile (2) the last vCPU that got blocked.  It's
+             * only used to calculate system-wise blocktime.
+             */
+            dc->last_begin = current_us;
+        }
+
+        /* Making sure it won't overflow - it really should never! */
+        assert(dc->vcpu_faults_current[cpu] <= 255);
+    } else {
+        /* We do not support non-vCPU thread tracking yet */
         dc->non_vcpu_faults++;
         return;
     }
 
-    current_us = get_current_us();
-    if (dc->vcpu_addr[cpu] == 0) {
-        dc->smp_cpus_down++;
-    }
+    blocktime_fault_inject(dc, addr, cpu, current_us);
+}
 
-    dc->last_begin = current_us;
-    dc->vcpu_blocktime_start[cpu] = current_us;
-    dc->vcpu_addr[cpu] = addr;
-    dc->vcpu_faults_count[cpu]++;
+typedef struct {
+    PostcopyBlocktimeContext *ctx;
+    uint64_t current_us;
+    int affected_cpus;
+} BlockTimeVCPUIter;
+
+static void blocktime_cpu_list_iter_fn(gpointer data, gpointer user_data)
+{
+    BlockTimeVCPUIter *iter = user_data;
+    PostcopyBlocktimeContext *ctx = iter->ctx;
+    BlocktimeVCPUEntry *entry = data;
+    int cpu = entry->cpu;
 
     /*
-     * The caller should only inject a blocktime entry when the page is
-     * yet missing.
+     * Time should never go backwards.. so when the fault is resolved it must be
+     * later than when it was faulted.
      */
-    assert(!ramblock_recv_bitmap_test(rb, (void *)addr));
+    assert(iter->current_us >= entry->fault_time);
+
+    /*
+     * If we resolved all pending faults on one vCPU due to this page
+     * resolution, take a note.
+     */
+    if (--ctx->vcpu_faults_current[cpu] == 0) {
+        ctx->vcpu_blocktime_total[cpu] += iter->current_us - entry->fault_time;
+        iter->affected_cpus += 1;
+    }
 
-    trace_mark_postcopy_blocktime_begin(addr, dc->vcpu_blocktime_start[cpu],
-                                        cpu);
+    trace_postcopy_blocktime_end_one(cpu, ctx->vcpu_faults_current[cpu]);
 }
 
 /*
@@ -970,43 +1137,43 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
     PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
     MachineState *ms = MACHINE(qdev_get_machine());
     unsigned int smp_cpus = ms->smp.cpus;
-    int i, affected_cpu = 0;
-    uint64_t read_vcpu_time, current_us;
+    BlockTimeVCPUIter iter = {
+        .current_us = get_current_us(),
+        .affected_cpus = 0,
+        .ctx = dc,
+    };
+    gpointer key = (gpointer)addr;
+    GHashTable *table;
+    GList *list;
 
     if (!dc) {
         return;
     }
 
-    current_us = get_current_us();
-    /* lookup cpu, to clear it,
-     * that algorithm looks straightforward, but it's not
-     * optimal, more optimal algorithm is keeping tree or hash
-     * where key is address value is a list of  */
-    for (i = 0; i < smp_cpus; i++) {
-        uint64_t vcpu_blocktime = 0;
-
-        read_vcpu_time = dc->vcpu_blocktime_start[i];
-        if (dc->vcpu_addr[i] != addr || read_vcpu_time == 0) {
-            continue;
-        }
-        dc->vcpu_addr[i] = 0;
-        vcpu_blocktime = current_us - read_vcpu_time;
-        affected_cpu += 1;
-        /* continue cycle, due to one page could affect several vCPUs */
-        dc->vcpu_blocktime_total[i] += vcpu_blocktime;
+    table = dc->vcpu_addr_hash;
+    /* the address wasn't tracked at all? */
+    list = g_hash_table_lookup(table, key);
+    if (!list) {
+        return;
     }
 
+    /*
+     * Loop over the set of vCPUs that got blocked on this addr, do the
+     * blocktime accounting.  After that, remove the whole list.
+     */
+    g_list_foreach(list, blocktime_cpu_list_iter_fn, &iter);
+    g_hash_table_remove(table, key);
+
     /*
      * If all vCPUs used to be down, and copying this page would free some
      * vCPUs, then the system-level blocktime ends here.
      */
-    if (dc->smp_cpus_down == smp_cpus && affected_cpu) {
-        dc->total_blocktime += current_us - dc->last_begin;
+    if (dc->smp_cpus_down == smp_cpus && iter.affected_cpus) {
+        dc->total_blocktime += iter.current_us - dc->last_begin;
     }
-    dc->smp_cpus_down -= affected_cpu;
+    dc->smp_cpus_down -= iter.affected_cpus;
 
-    trace_mark_postcopy_blocktime_end(addr, dc->total_blocktime,
-                                      affected_cpu);
+    trace_postcopy_blocktime_end(addr, iter.current_us, iter.affected_cpus);
 }
 
 static void postcopy_pause_fault_thread(MigrationIncomingState *mis)
diff --git a/migration/trace-events b/migration/trace-events
index 9c1f3b7044..a36a78f01a 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -285,8 +285,6 @@ postcopy_nhp_range(const char *ramblock, void *host_addr, size_t offset, size_t
 postcopy_place_page(void *host_addr) "host=%p"
 postcopy_place_page_zero(void *host_addr) "host=%p"
 postcopy_ram_enable_notify(void) ""
-mark_postcopy_blocktime_begin(uint64_t addr, uint64_t time, int cpu) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", cpu: %d"
-mark_postcopy_blocktime_end(uint64_t addr, uint64_t time, int affected_cpu) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", affected_cpus: %d"
 postcopy_pause_fault_thread(void) ""
 postcopy_pause_fault_thread_continued(void) ""
 postcopy_pause_fast_load(void) ""
@@ -311,6 +309,9 @@ postcopy_preempt_new_channel(void) ""
 postcopy_preempt_thread_entry(void) ""
 postcopy_preempt_thread_exit(void) ""
 postcopy_blocktime_tid_cpu_map(int cpu, uint32_t tid) "cpu: %d, tid: %u"
+postcopy_blocktime_begin(uint64_t addr, uint64_t time, int cpu, bool exists) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", cpu: %d, exist: %d"
+postcopy_blocktime_end(uint64_t addr, uint64_t time, int affected_cpu) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", affected_cpus: %d"
+postcopy_blocktime_end_one(int cpu, uint8_t left_faults) "cpu: %d, left_faults: %" PRIu8
 
 # exec.c
 migration_exec_outgoing(const char *cmd) "cmd=%s"
-- 
2.49.0




* [PATCH 13/13] migration/postcopy: blocktime allows track / report non-vCPU faults
  2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
                   ` (11 preceding siblings ...)
  2025-05-27 23:12 ` [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable Peter Xu
@ 2025-05-27 23:12 ` Peter Xu
  2025-06-03 16:50   ` Fabiano Rosas
  12 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2025-05-27 23:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Fabiano Rosas,
	Juraj Marcin, Markus Armbruster

When used to report page fault latencies, the blocktime feature can be
almost useless when KVM async page fault is enabled, because in most cases
such a remote fault will kick off an async page fault, which is then not
trackable by the blocktime layer.

After all these recent rewrites of the blocktime layer, it's finally easy
to also support tracking non-vCPU faults.  It would be even faster if we
could always index fault records by TID; unfortunately, we need to maintain
the blocktime API, which reports things by vCPU index.

Of course this works not only for kworkers, but also for any access that
may reach a missing page, very likely including the QEMU main thread too
(and all other threads whenever applicable).

In this case, we don't care about "how long the threads are blocked"; we
only care about "how long it takes for the fault to be resolved".
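
In summary, the new field is a plain average over all non-vCPU faults
(excerpted from the diff below):

    info->has_postcopy_non_vcpu_latency = true;
    info->postcopy_non_vcpu_latency = bc->non_vcpu_faults ?
        (bc->non_vcpu_blocktime_total / bc->non_vcpu_faults) : 0;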

Cc: Markus Armbruster <armbru@redhat.com>
Cc: Dr. David Alan Gilbert <dave@treblig.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 qapi/migration.json                   |  7 +++
 migration/migration-hmp-cmds.c        |  5 +++
 migration/postcopy-ram.c              | 64 +++++++++++++++++++++------
 tests/qtest/migration/migration-qmp.c |  1 +
 migration/trace-events                |  2 +-
 5 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/qapi/migration.json b/qapi/migration.json
index 8b13cea169..f59c473842 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -247,6 +247,12 @@
 #     this is the per-vCPU statistics. This is only present when the
 #     postcopy-blocktime migration capability is enabled.  (Since 10.1)
 #
+# @postcopy-non-vcpu-latency: average remote page fault latency for all
+#     faults that happened in non-vCPU threads (in us).  It has the same
+#     definition as @postcopy-latency, but this only provides statistics
+#     for non-vCPU faults.  This is only present when the
+#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
+#
 # @socket-address: Only used for tcp, to know what the real port is
 #     (Since 4.0)
 #
@@ -288,6 +294,7 @@
            '*postcopy-vcpu-blocktime': ['uint32'],
            '*postcopy-latency': 'uint64',
            '*postcopy-vcpu-latency': ['uint64'],
+           '*postcopy-non-vcpu-latency': 'uint64',
            '*socket-address': ['SocketAddress'],
            '*dirty-limit-throttle-time-per-round': 'uint64',
            '*dirty-limit-ring-full-time': 'uint64'} }
diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
index a18049a7e8..bdfc754d70 100644
--- a/migration/migration-hmp-cmds.c
+++ b/migration/migration-hmp-cmds.c
@@ -81,6 +81,11 @@ static void migration_dump_blocktime(Monitor *mon, MigrationInfo *info)
                        info->postcopy_latency);
     }
 
+    if (info->has_postcopy_non_vcpu_latency) {
+        monitor_printf(mon, "Postcopy non-vCPU Latencies (us): %" PRIu64 "\n",
+                       info->postcopy_non_vcpu_latency);
+    }
+
     if (info->has_postcopy_vcpu_latency) {
         uint64List *item = info->postcopy_vcpu_latency;
         int count = 0;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 066dc9f3ce..bc6736f262 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -192,6 +192,8 @@ typedef struct PostcopyBlocktimeContext {
     GHashTable *tid_to_vcpu_hash;
     /* Count of non-vCPU faults.  This is only for debugging purposes. */
     uint64_t non_vcpu_faults;
+    /* total blocktime when a non-vCPU thread is stopped */
+    uint64_t non_vcpu_blocktime_total;
 
     /*
      * Handler for exit event, necessary for
@@ -203,7 +205,10 @@ typedef struct PostcopyBlocktimeContext {
 typedef struct {
     /* The time the fault was triggered */
     uint64_t fault_time;
-    /* The vCPU index that was blocked */
+    /*
+     * The vCPU index that was blocked.  When cpu==-1, it means it's a
+     * fault from a non-vCPU thread.
+     */
     int cpu;
 } BlocktimeVCPUEntry;
 
@@ -344,6 +349,12 @@ void fill_destination_postcopy_migration_info(MigrationInfo *info)
         QAPI_LIST_PREPEND(list_latency, latency);
     }
 
+    latency_total += bc->non_vcpu_blocktime_total;
+    faults += bc->non_vcpu_faults;
+
+    info->has_postcopy_non_vcpu_latency = true;
+    info->postcopy_non_vcpu_latency = bc->non_vcpu_faults ?
+        (bc->non_vcpu_blocktime_total / bc->non_vcpu_faults) : 0;
     info->has_postcopy_blocktime = true;
     info->postcopy_blocktime = (uint32_t)(bc->total_blocktime / 1000);
     info->has_postcopy_vcpu_blocktime = true;
@@ -982,7 +993,10 @@ static uint64_t get_current_us(void)
     return (uint64_t)qemu_clock_get_us(QEMU_CLOCK_REALTIME);
 }
 
-/* Inject an (cpu, fault_time) entry into the database, using addr as key */
+/*
+ * Inject an (cpu, fault_time) entry into the database, using addr as key.
+ * When cpu==-1, it means it's a non-vCPU fault.
+ */
 static void blocktime_fault_inject(PostcopyBlocktimeContext *ctx,
                                    uintptr_t addr, int cpu, uint64_t time)
 {
@@ -1065,9 +1079,17 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
         /* Making sure it won't overflow - it really should never! */
         assert(dc->vcpu_faults_current[cpu] <= 255);
     } else {
-        /* We do not support non-vCPU thread tracking yet */
+        /*
+         * For non-vCPU thread faults, we don't care about the tid or cpu
+         * index, or how long the thread is blocked (e.g., a kworker
+         * helping KVM when async_pf=on may be blocked without hurting
+         * guest responsiveness), but we do care about latency.  Track it
+         * with cpu=-1.
+         *
+         * Note that this will NOT affect blocktime reports on vCPUs being
+         * blocked, but only the system-wide latency reports.
+         */
         dc->non_vcpu_faults++;
-        return;
     }
 
     blocktime_fault_inject(dc, addr, cpu, current_us);
@@ -1077,6 +1099,7 @@ typedef struct {
     PostcopyBlocktimeContext *ctx;
     uint64_t current_us;
     int affected_cpus;
+    int affected_non_cpus;
 } BlockTimeVCPUIter;
 
 static void blocktime_cpu_list_iter_fn(gpointer data, gpointer user_data)
@@ -1084,6 +1107,7 @@ static void blocktime_cpu_list_iter_fn(gpointer data, gpointer user_data)
     BlockTimeVCPUIter *iter = user_data;
     PostcopyBlocktimeContext *ctx = iter->ctx;
     BlocktimeVCPUEntry *entry = data;
+    uint64_t time_passed;
     int cpu = entry->cpu;
 
     /*
@@ -1091,17 +1115,27 @@ static void blocktime_cpu_list_iter_fn(gpointer data, gpointer user_data)
      * later than when it was faulted.
      */
     assert(iter->current_us >= entry->fault_time);
+    time_passed = iter->current_us - entry->fault_time;
 
-    /*
-     * If we resolved all pending faults on one vCPU due to this page
-     * resolution, take a note.
-     */
-    if (--ctx->vcpu_faults_current[cpu] == 0) {
-        ctx->vcpu_blocktime_total[cpu] += iter->current_us - entry->fault_time;
-        iter->affected_cpus += 1;
+    if (cpu >= 0) {
+        /*
+         * If we resolved all pending faults on one vCPU due to this page
+         * resolution, take a note.
+         */
+        if (--ctx->vcpu_faults_current[cpu] == 0) {
+            ctx->vcpu_blocktime_total[cpu] += time_passed;
+            iter->affected_cpus += 1;
+        }
+        trace_postcopy_blocktime_end_one(cpu, ctx->vcpu_faults_current[cpu]);
+    } else {
+        iter->affected_non_cpus++;
+        ctx->non_vcpu_blocktime_total += time_passed;
+        /*
+         * We do not track how many non-vCPU faults are still pending,
+         * because for those we do not care about blocktime, only latency.
+         */
+        trace_postcopy_blocktime_end_one(-1, 0);
     }
-
-    trace_postcopy_blocktime_end_one(cpu, ctx->vcpu_faults_current[cpu]);
 }
 
 /*
@@ -1140,6 +1174,7 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
     BlockTimeVCPUIter iter = {
         .current_us = get_current_us(),
         .affected_cpus = 0,
+        .affected_non_cpus = 0,
         .ctx = dc,
     };
     gpointer key = (gpointer)addr;
@@ -1173,7 +1208,8 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
     }
     dc->smp_cpus_down -= iter.affected_cpus;
 
-    trace_postcopy_blocktime_end(addr, iter.current_us, iter.affected_cpus);
+    trace_postcopy_blocktime_end(addr, iter.current_us, iter.affected_cpus,
+                                 iter.affected_non_cpus);
 }
 
 static void postcopy_pause_fault_thread(MigrationIncomingState *mis)
diff --git a/tests/qtest/migration/migration-qmp.c b/tests/qtest/migration/migration-qmp.c
index 1a5ab2d229..67a67d4bd6 100644
--- a/tests/qtest/migration/migration-qmp.c
+++ b/tests/qtest/migration/migration-qmp.c
@@ -361,6 +361,7 @@ void read_blocktime(QTestState *who)
     g_assert(qdict_haskey(rsp_return, "postcopy-vcpu-blocktime"));
     g_assert(qdict_haskey(rsp_return, "postcopy-latency"));
     g_assert(qdict_haskey(rsp_return, "postcopy-vcpu-latency"));
+    g_assert(qdict_haskey(rsp_return, "postcopy-non-vcpu-latency"));
     qobject_unref(rsp_return);
 }
 
diff --git a/migration/trace-events b/migration/trace-events
index a36a78f01a..706db97def 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -310,7 +310,7 @@ postcopy_preempt_thread_entry(void) ""
 postcopy_preempt_thread_exit(void) ""
 postcopy_blocktime_tid_cpu_map(int cpu, uint32_t tid) "cpu: %d, tid: %u"
 postcopy_blocktime_begin(uint64_t addr, uint64_t time, int cpu, bool exists) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", cpu: %d, exist: %d"
-postcopy_blocktime_end(uint64_t addr, uint64_t time, int affected_cpu) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", affected_cpus: %d"
+postcopy_blocktime_end(uint64_t addr, uint64_t time, int affected_cpu, int affected_non_cpus) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", affected_cpus: %d, affected_non_cpus: %d"
 postcopy_blocktime_end_one(int cpu, uint8_t left_faults) "cpu: %d, left_faults: %" PRIu8
 
 # exec.c
-- 
2.49.0




* Re: [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime
  2025-05-27 23:12 ` [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime Peter Xu
@ 2025-06-02  9:26   ` Markus Armbruster
  2025-06-02 16:29     ` Peter Xu
  2025-06-03 16:07   ` Fabiano Rosas
  1 sibling, 1 reply; 30+ messages in thread
From: Markus Armbruster @ 2025-06-02  9:26 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Dr . David Alan Gilbert, Alexey Perevalov,
	Fabiano Rosas, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> Blocktime so far only cares about the time one vcpu (or the whole system)
> got blocked.  It would also be helpful if it could report the latency
> of page requests, which could be very sensitive during postcopy.
>
> Blocktime itself is sometimes not very important, especially when one
> thinks about KVM async PF support, which means vCPUs are literally almost
> not blocked at all because the guest OS is smart enough to switch to
> another task when a remote fault is needed.
>
> However, latency is still sensitive and important because even if the guest
> vCPU is running on threads that do not need a remote fault, the workload
> that accesses some missing page is still affected.
>
> Add two entries to the report, showing how long it takes to resolve a
> remote fault.  Mention in the QAPI doc that this is not the real average
> fault latency, but only that of the ones that required a remote fault.
>
> Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
> meanwhile add the entry checks in qtests for all postcopy tests.
>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Dr. David Alan Gilbert <dave@treblig.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  qapi/migration.json                   | 13 +++++
>  migration/migration-hmp-cmds.c        | 70 ++++++++++++++++++---------
>  migration/postcopy-ram.c              | 48 ++++++++++++------
>  tests/qtest/migration/migration-qmp.c |  3 ++
>  4 files changed, 97 insertions(+), 37 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 8b9c53595c..8b13cea169 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -236,6 +236,17 @@
>  #     This is only present when the postcopy-blocktime migration
>  #     capability is enabled.  (Since 3.0)
>  #
> +# @postcopy-latency: average remote page fault latency (in us).  Note that
> +#     this doesn't include all faults, but only the ones that require a
> +#     remote page request.  So it should be always bigger than the real
> +#     average page fault latency. This is only present when the
> +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> +#
> +# @postcopy-vcpu-latency: average remote page fault latency per vCPU (in
> +#     us).  It has the same definition of @postcopy-latency, but instead
> +#     this is the per-vCPU statistics. This is only present when the

Two spaces between sentences for consistency, please.

> +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)

I figure the @i-th array element is for vCPU with index @i.  Correct?

This is also only present when @postcopy-blocktime is enabled.  Correct?

Could a QMP client compute @postcopy-latency from
@postcopy-vcpu-latency?

> +#
>  # @socket-address: Only used for tcp, to know what the real port is
>  #     (Since 4.0)
>  #
> @@ -275,6 +286,8 @@
>             '*blocked-reasons': ['str'],
>             '*postcopy-blocktime': 'uint32',
>             '*postcopy-vcpu-blocktime': ['uint32'],
> +           '*postcopy-latency': 'uint64',
> +           '*postcopy-vcpu-latency': ['uint64'],
>             '*socket-address': ['SocketAddress'],
>             '*dirty-limit-throttle-time-per-round': 'uint64',
>             '*dirty-limit-ring-full-time': 'uint64'} }

[...]




* Re: [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime
  2025-06-02  9:26   ` Markus Armbruster
@ 2025-06-02 16:29     ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2025-06-02 16:29 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: qemu-devel, Dr . David Alan Gilbert, Alexey Perevalov,
	Fabiano Rosas, Juraj Marcin

On Mon, Jun 02, 2025 at 11:26:36AM +0200, Markus Armbruster wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > Blocktime so far only cares about the time one vcpu (or the whole system)
> > got blocked.  It would also be helpful if it could report the latency
> > of page requests, which could be very sensitive during postcopy.
> >
> > Blocktime itself is sometimes not very important, especially when one
> > thinks about KVM async PF support, which means vCPUs are literally almost
> > not blocked at all because the guest OS is smart enough to switch to
> > another task when a remote fault is needed.
> >
> > However, latency is still sensitive and important because even if the guest
> > vCPU is running on threads that do not need a remote fault, the workload
> > that accesses some missing page is still affected.
> >
> > Add two entries to the report, showing how long it takes to resolve a
> > remote fault.  Mention in the QAPI doc that this is not the real average
> > fault latency, but only that of the ones that required a remote fault.
> >
> > Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
> > meanwhile add the entry checks in qtests for all postcopy tests.
> >
> > Cc: Markus Armbruster <armbru@redhat.com>
> > Cc: Dr. David Alan Gilbert <dave@treblig.org>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  qapi/migration.json                   | 13 +++++
> >  migration/migration-hmp-cmds.c        | 70 ++++++++++++++++++---------
> >  migration/postcopy-ram.c              | 48 ++++++++++++------
> >  tests/qtest/migration/migration-qmp.c |  3 ++
> >  4 files changed, 97 insertions(+), 37 deletions(-)
> >
> > diff --git a/qapi/migration.json b/qapi/migration.json
> > index 8b9c53595c..8b13cea169 100644
> > --- a/qapi/migration.json
> > +++ b/qapi/migration.json
> > @@ -236,6 +236,17 @@
> >  #     This is only present when the postcopy-blocktime migration
> >  #     capability is enabled.  (Since 3.0)
> >  #
> > +# @postcopy-latency: average remote page fault latency (in us).  Note that
> > +#     this doesn't include all faults, but only the ones that require a
> > +#     remote page request.  So it should be always bigger than the real
> > +#     average page fault latency. This is only present when the
> > +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> > +#
> > +# @postcopy-vcpu-latency: average remote page fault latency per vCPU (in
> > +#     us).  It has the same definition of @postcopy-latency, but instead
> > +#     this is the per-vCPU statistics. This is only present when the
> 
> Two spaces between sentences for consistency, please.

Fixed.  There's another similar occurrence in the last patch; I'll fix that
too.

> 
> > +#     postcopy-blocktime migration capability is enabled.  (Since 10.1)
> 
> I figure the @i-th array element is for vCPU with index @i.  Correct?
> 
> This is also only present when @postcopy-blocktime is enabled.  Correct?

Correct on both.

> 
> Could a QMP client compute @postcopy-latency from
> @postcopy-vcpu-latency?

Not with the current API.

Right now, the reported values are per-vCPU average latencies and the
global average latency, not the per-vCPU fault counts.  The per-vCPU fault
counts would be needed to do that calculation.

I chose to export only the global average latency because that is the most
important one to me as of now.  The per-vCPU results are pretty much a side
effect of how the blocktime feature does its accounting so far (which is
per-vCPU), so they are very low-hanging fruit.
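
A quick example with made-up numbers: if vCPU0 averages 100us over 1 fault
and vCPU1 averages 10us over 9 faults, the mean of the two averages is
55us, while the true global average is (100*1 + 10*9) / 10 = 19us.  A
sketch of what a client would need, assuming per-vCPU fault counts were
also exposed (they are not, as of this series):

    uint64_t sum_us = 0, sum_faults = 0;

    for (int i = 0; i < nr_vcpus; i++) {
        sum_us += vcpu_latency[i] * vcpu_faults[i];
        sum_faults += vcpu_faults[i];
    }
    global_avg = sum_faults ? sum_us / sum_faults : 0;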

Thanks,

-- 
Peter Xu




* Re: [PATCH 01/13] migration: Add option to set postcopy-blocktime
  2025-05-27 23:12 ` [PATCH 01/13] migration: Add option to set postcopy-blocktime Peter Xu
@ 2025-06-02 16:50   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-02 16:50 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> Add a global property to allow enabling postcopy-blocktime feature.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex
  2025-05-27 23:12 ` [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex Peter Xu
@ 2025-06-02 17:46   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-02 17:46 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> The postcopy blocktime feature was tricky that it used quite some atomic
> operations over quite a few arrays and vars, without explaining how that
> would be thread safe.  The thread safety here is about concurrency between
> the fault thread and the fault resolution threads, possible to access the
> same chunk of data.  All these atomic ops can be expensive too before
> knowing clearly how it works.
>
> OTOH, postcopy has one page_request_mutex used to serialize the received
> bitmap updates.  So far it's ok - we don't yet have a lot of threads
> contending the lock.  It might change after multifd will be supported, but
> that's a separate story.  What is important is, with that mutex, it's
> pretty lightweight to move all the blocktime maintenance into the mutex
> critical section.  It's because the blocktime layer is lightweight:
> almost "remember which vcpu faulted on which address", and "ok we get some
> fault resolved, calculate how long it takes".  It's also an optional
> feature for now (but I have thought of changing that, maybe in the future).
>
> Let's push the blocktime layer into the mutex, so that it's always
> thread-safe even without any atomic ops.
>
> To achieve that, I'll need to add a tid parameter on fault path so that
> it'll start to pass the faulted thread ID deeper into the stack, but not
> too deep.  When at it, add a comment for the shared fault handler (for
> example, vhost-user devices running with postcopy), to mention a TODO.  One
> reason it might not be trivial is that vhost-user's userfaultfds should be
> opened by vhost-user process, so it's pretty hard to control making sure
> the TID feature will be around.  It wasn't supported before, so keep it
> like that for now.
>
> Now we should be as ease when everything is protected by a mutex that we
> always take anyway.
>
> One side effect: we can finally remove one ramblock_recv_bitmap_test() in
> mark_postcopy_blocktime_begin(), which was pretty weird and which also
> includes a weird (but maybe necessary.. but maybe not?) operation to inject
> a blocktime entry then quickly erase it..  When we're with the mutex, and
> when we make sure it's invoked after checking the receive bitmap, it's not
> needed anymore.  Instead, we assert.
>
> As another side effect, this paves way for removing all atomic ops in all
> the mem accesses in blocktime layer.
>
> Note that we need a stub for mark_postcopy_blocktime_begin() for Windows
> builds.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH 03/13] migration/postcopy: Drop all atomic ops in blocktime feature
  2025-05-27 23:12 ` [PATCH 03/13] migration/postcopy: Drop all atomic ops in blocktime feature Peter Xu
@ 2025-06-02 18:00   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-02 18:00 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Dr . David Alan Gilbert, peterx, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> Now with the mutex protection it's not needed anymore.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH 04/13] migration/postcopy: Make all blocktime vars 64bits
  2025-05-27 23:12 ` [PATCH 04/13] migration/postcopy: Make all blocktime vars 64bits Peter Xu
@ 2025-06-02 20:42   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-02 20:42 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> I am guessing it used to be 32 bits because of the atomic ops.  Now that
> all the atomic ops are gone and we're protected by a mutex instead, it's
> OK to switch to 64 bits.
>
> Reasons to move over:
>
>   - Allow further patches to change the unit from ms to us: with postcopy
>   preempt mode, we're really into hundreds of microseconds level on
>   blocktime.  We'd better be able to trap those.
>
>   - This also paves way for some other tricks that the original version
>   used to avoid overflows, e.g., start_time was almost only useful before
>   to make sure the sampled timestamp won't overflow a 32-bit field.
>
>   - This prepares further reports on top of existing data collected,
>   e.g. average page fault latencies.  When average operation is taken into
>   account, milliseconds are simply too coarse grained.
>
> When at it:
>
>   - Rename page_fault_vcpu_time to vcpu_blocktime_start.
>
>   - Rename vcpu_blocktime to vcpu_blocktime_total.
>
>   - Touch up the trace-events to not dump blocktime ctx pointer
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time
  2025-05-27 23:12 ` [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time Peter Xu
@ 2025-06-03 15:52   ` Fabiano Rosas
  2025-06-03 15:55     ` Fabiano Rosas
  0 siblings, 1 reply; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 15:52 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> Now with 64 bits, the offsetting using start_time is not needed anymore,
> because the array can always remember the whole timestamp.
>
> Then drop the unused parameter in get_low_time_offset() altogether.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/postcopy-ram.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 72f9075e69..c6ae4b650c 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -123,7 +123,6 @@ typedef struct PostcopyBlocktimeContext {
>      uint64_t last_begin;
>      /* number of vCPU are suspended */
>      int smp_cpus_down;
> -    uint64_t start_time;
>  
>      /*
>       * Handler for exit event, necessary for
> @@ -157,7 +156,6 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
>      ctx->vcpu_blocktime_total = g_new0(uint64_t, smp_cpus);
>      ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus);
>      ctx->exit_notifier.notify = migration_exit_cb;
> -    ctx->start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>      qemu_add_exit_notifier(&ctx->exit_notifier);
>  
>      return ctx;
> @@ -818,9 +816,9 @@ static int get_mem_fault_cpu_index(uint32_t pid)
>      return -1;
>  }
>  
> -static uint64_t get_low_time_offset(PostcopyBlocktimeContext *dc)
> +static uint64_t get_low_time_offset(void)
>  {
> -    return (uint64_t)qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - dc->start_time;
> +    return (uint64_t)qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>  }

Looking at this and the previous patch, this is not "low" anymore, i.e.,
not the lower 32 bits of a 64-bit value.  This function and related
variables should be renamed.

>  
>  /*
> @@ -847,7 +845,7 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
>          return;
>      }
>  
> -    low_time_offset = get_low_time_offset(dc);
> +    low_time_offset = get_low_time_offset();
>      if (dc->vcpu_addr[cpu] == 0) {
>          dc->smp_cpus_down++;
>      }
> @@ -907,7 +905,7 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
>          return;
>      }
>  
> -    low_time_offset = get_low_time_offset(dc);
> +    low_time_offset = get_low_time_offset();
>      /* lookup cpu, to clear it,
>       * that algorithm looks straightforward, but it's not
>       * optimal, more optimal algorithm is keeping tree or hash



* Re: [PATCH 06/13] migration/postcopy: Bring blocktime layer to us level
  2025-05-27 23:12 ` [PATCH 06/13] migration/postcopy: Bring blocktime layer to us level Peter Xu
@ 2025-06-03 15:54   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 15:54 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> With 64-bit fields, it is trivial.  The caution is that when exposing any
> values in QMP, they are still declared in milliseconds (ms).  Hence we need
> to do the conversion when exporting the values to existing QMP queries.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>



* Re: [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time
  2025-06-03 15:52   ` Fabiano Rosas
@ 2025-06-03 15:55     ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 15:55 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> Now with 64 bits, the offsetting using start_time is not needed anymore,
>> because the array can always remember the whole timestamp.
>>
>> Then drop the unused parameter in get_low_time_offset() altogether.
>>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> ---
>>  migration/postcopy-ram.c | 10 ++++------
>>  1 file changed, 4 insertions(+), 6 deletions(-)
>>
>> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
>> index 72f9075e69..c6ae4b650c 100644
>> --- a/migration/postcopy-ram.c
>> +++ b/migration/postcopy-ram.c
>> @@ -123,7 +123,6 @@ typedef struct PostcopyBlocktimeContext {
>>      uint64_t last_begin;
>>      /* number of vCPU are suspended */
>>      int smp_cpus_down;
>> -    uint64_t start_time;
>>  
>>      /*
>>       * Handler for exit event, necessary for
>> @@ -157,7 +156,6 @@ static struct PostcopyBlocktimeContext *blocktime_context_new(void)
>>      ctx->vcpu_blocktime_total = g_new0(uint64_t, smp_cpus);
>>      ctx->vcpu_addr = g_new0(uintptr_t, smp_cpus);
>>      ctx->exit_notifier.notify = migration_exit_cb;
>> -    ctx->start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>      qemu_add_exit_notifier(&ctx->exit_notifier);
>>  
>>      return ctx;
>> @@ -818,9 +816,9 @@ static int get_mem_fault_cpu_index(uint32_t pid)
>>      return -1;
>>  }
>>  
>> -static uint64_t get_low_time_offset(PostcopyBlocktimeContext *dc)
>> +static uint64_t get_low_time_offset(void)
>>  {
>> -    return (uint64_t)qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - dc->start_time;
>> +    return (uint64_t)qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>  }
>
> Looking at this and the previous patch, this is not "low" anymore, i.e.,
> not the lower 32 bits of a 64-bit value. This function and related
> variables should be renamed.
>

Ok, next patch does it.

Reviewed-by: Fabiano Rosas <farosas@suse.de>


* Re: [PATCH 07/13] migration/postcopy: Add blocktime fault counts per-vcpu
  2025-05-27 23:12 ` [PATCH 07/13] migration/postcopy: Add blocktime fault counts per-vcpu Peter Xu
@ 2025-06-03 15:59   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 15:59 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> Add a field to count how many remote faults each vCPU has taken.  It is
> not used yet, but will be soon.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>


* Re: [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime
  2025-05-27 23:12 ` [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime Peter Xu
  2025-06-02  9:26   ` Markus Armbruster
@ 2025-06-03 16:07   ` Fabiano Rosas
  1 sibling, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 16:07 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin,
	Markus Armbruster

Peter Xu <peterx@redhat.com> writes:

> Blocktime so far only cares about how long one vCPU (or the whole system)
> got blocked.  It would also be helpful if it could report the latency of
> page requests, which can be very sensitive during postcopy.
>
> Blocktime itself is sometimes not very meaningful, especially with KVM
> async PF support, where vCPUs are almost never blocked at all because the
> guest OS is smart enough to switch to another task when a remote fault is
> needed.
>
> However, latency is still sensitive and important: even if the guest vCPU
> is running threads that do not need a remote fault, the workload that
> accesses a missing page is still affected.
>
> Add two entries to the report, showing how long it takes to resolve a
> remote fault.  Mention in the QAPI doc that this is not the real average
> fault latency, but only covers the faults that required a remote page
> request.
>
> Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice;
> meanwhile, add checks for the new entries to all postcopy qtests.
>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Dr. David Alan Gilbert <dave@treblig.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
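
The averaging itself boils down to total blocked time over fault count,
guarded against a zero count; roughly (a sketch using the accumulator
names visible in the later hunks):

    /* average fault latency in microseconds; 0 when nothing was tracked */
    info->postcopy_latency = faults ? (latency_total / faults) : 0;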

Reviewed-by: Fabiano Rosas <farosas@suse.de>


* Re: [PATCH 09/13] migration/postcopy: Initialize blocktime context only until listen
  2025-05-27 23:12 ` [PATCH 09/13] migration/postcopy: Initialize blocktime context only until listen Peter Xu
@ 2025-06-03 16:08   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 16:08 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> Before this patch, the blocktime context can be created very early, because
> postcopy_ram_supported_by_host() <- migrate_caps_check() can happen during
> migration object init.
>
> The problem is that the blocktime context needs the system's vCPU
> information, which can seemingly still change after that point.  I didn't
> verify it, but it doesn't sound right.
>
> Now move it out and initialize the context only when postcopy listen
> starts.  That is already during a migration, so the vCPU topology is
> guaranteed not to change on either side.
>
> While at it, assert that the context isn't created yet; the old "if"
> check isn't needed now that we know this happens only once.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
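
The setup thus becomes a one-shot at listen time, along the lines of (a
sketch, not the exact hunk):

    /* create the blocktime context exactly once, entering POSTCOPY_LISTEN */
    assert(!mis->blocktime_ctx);
    mis->blocktime_ctx = blocktime_context_new();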

Reviewed-by: Fabiano Rosas <farosas@suse.de>


* Re: [PATCH 10/13] migration/postcopy: Cache the tid->vcpu mapping for blocktime
  2025-05-27 23:12 ` [PATCH 10/13] migration/postcopy: Cache the tid->vcpu mapping for blocktime Peter Xu
@ 2025-06-03 16:20   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 16:20 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> Looking up the vCPU index for each fault can be expensive when there are
> hundreds of vCPUs.  Provide a tid->vcpu cache with a hash table instead,
> and look the index up from there.
>
> While at it, add another counter to record how many non-vCPU faults occur.
> For example, the main thread can also access a guest page that is missing.
> These kinds of faults have not been accounted for by blocktime so far.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
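
With the cache, the lookup becomes O(1); one way it could look (a sketch
using GLib's GHashTable and the tid_to_vcpu_hash field from a later hunk;
the helper name and the "+1" offset are illustrative, the latter to tell
"absent" apart from vCPU 0):

    static int blocktime_tid_to_vcpu(PostcopyBlocktimeContext *ctx,
                                     uint32_t tid)
    {
        gpointer val = g_hash_table_lookup(ctx->tid_to_vcpu_hash,
                                           GUINT_TO_POINTER(tid));
        /* entries store (vcpu + 1), so NULL means "not a vCPU tid" */
        return val ? GPOINTER_TO_INT(val) - 1 : -1;
    }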

Reviewed-by: Fabiano Rosas <farosas@suse.de>


* Re: [PATCH 11/13] migration/postcopy: Cleanup the total blocktime accounting
  2025-05-27 23:12 ` [PATCH 11/13] migration/postcopy: Cleanup the total blocktime accounting Peter Xu
@ 2025-06-03 16:22   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 16:22 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> The variable vcpu_total_blocktime isn't easy to follow.  In reality, it
> wants to capture the case where all vCPUs were stopped and some of them
> are now starting to run again.
>
> The name also conflicts with vcpu_blocktime_total[], and it's actually not
> necessary to have the variable at all: since nobody touches smp_cpus_down
> except ourselves, we can safely do the calculation at the end, before
> decrementing smp_cpus_down.
>
> Hopefully this makes the logic easier to read; a side benefit is that we
> drop one temporary variable.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
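
In other words, the accounting can happen inline, right before the
decrement; by the end of the series it takes roughly this shape (a sketch,
the exact condition is paraphrased from the commit message):

    /*
     * If all vCPUs were blocked until now and this page resolution woke
     * some of them up, account the interval into the system-wide total.
     */
    if (dc->smp_cpus_down == smp_cpus && iter.affected_cpus) {
        dc->total_blocktime += iter.current_us - dc->last_begin;
    }
    dc->smp_cpus_down -= iter.affected_cpus;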

Reviewed-by: Fabiano Rosas <farosas@suse.de>


* Re: [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable
  2025-05-27 23:12 ` [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable Peter Xu
@ 2025-06-03 16:44   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 16:44 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin

Peter Xu <peterx@redhat.com> writes:

> Currently, the postcopy blocktime feature maintains vCPU fault information
> using an array (vcpu_addr[]).  It has two issues.
>
> Issue 1: Performance Concern
> ============================
>
> The old algorithm was almost OK and fast on inserts, except that the lookup
> is slow and won't scale if there are a lot of vCPUs: when a page is copied
> during postcopy, mark_postcopy_blocktime_end() will walk the whole array
> trying to find which vCPUs are blocked on the address.  So it needs an
> O(N) walk on every page resolution.
>
> Alexey (the author of postcopy blocktime) mentioned the perf issue, and how
> to optimize it, in a comment in the page resolution path.  The comment was
> (interestingly..) left incomplete, but it's relatively clear what he meant
> about this perf issue.
>
> Issue 2: Wrong Accounting on re-entrancies
> ==========================================
>
> People might think that each vCPU should only and always get one fault at a
> time, so that once the blocktime layer captures one fault on a vCPU, we
> should never see another fault message on that vCPU.
>
> That's almost correct, except in some extremely rare cases.
>
> Case 1: it's possible the fault thread processes the userfaultfd messages
> so fast that it sees more than one message on a vCPU before the previous
> one is resolved.
>
> Case 2: theoretically, one vCPU can also get more than one message on the
> same fault address if the fault is retried by the kernel (e.g.,
> handle_userfault() got interrupted before page resolution).
>
> As this info might be important, I put the details into code comments
> rather than the commit message, when introducing an array that maintains
> the number of concurrent faults on each vCPU.  Please refer to the comments
> for details on both cases, especially case 1, which can be tricky.
>
> Case 1 sounds rare, but I can easily reproduce it locally by running
> blocktime together with the migration-test on vanilla postcopy.
>
> New Design
> ==========
>
> This patch should do almost what Alexey mentioned, but slightly
> differently: instead of keeping an array of vCPU fault addresses, we push
> an entry into a hash table for each fault message, indexed by the fault
> address.
>
> The hash replaces the two old structures: the vcpu_addr[] array, and the
> array storing each fault's start time.  However, due to the above we need
> one more counter array to account for concurrent faults on the same vCPU -
> that was needed even in the old code; it's just that the old code was buggy
> and would blindly overwrite an existing entry..  Now we'll really track
> everything.
>
> A hash should be more efficient than a tree for maintaining such
> addr->(cpu, fault_time) information: the insert() and lookup() paths should
> ideally both be ~O(1).  After all, we do not need sorting.
>
> We do need one remove() after the lookup(), though.  It could be slow, but
> only if many vCPUs faulted on exactly the same address (i.e., when the list
> of cpu entries is long), which should be unlikely.  Even then, it's only a
> worst-case O(N) (consider how likely it is for 400 vCPUs to fault on the
> same address..) rather than the old code's O(N) on every lookup.
>
> While at it, touch up the tracepoints to make them slightly more useful.
> One tracepoint is added when walking the fault entries.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
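
The insert path is then a prepend into a per-address list stored in the
hash; schematically (a sketch: the entry struct and the helper signature
match the follow-up patch, while the hash field name and the body are
illustrative):

    /* one fault record; several records may share the same addr key */
    typedef struct {
        uint64_t fault_time;    /* when the fault was trapped, in us */
        int cpu;                /* the vCPU index that got blocked */
    } BlocktimeVCPUEntry;

    static void blocktime_fault_inject(PostcopyBlocktimeContext *ctx,
                                       uintptr_t addr, int cpu, uint64_t time)
    {
        BlocktimeVCPUEntry *entry = g_new(BlocktimeVCPUEntry, 1);
        gpointer key = (gpointer)addr;
        GList *list;

        entry->fault_time = time;
        entry->cpu = cpu;
        /* prepend is O(1); ordering of concurrent faults doesn't matter */
        list = g_list_prepend(g_hash_table_lookup(ctx->fault_hash, key),
                              entry);
        g_hash_table_replace(ctx->fault_hash, key, list);
    }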

Reviewed-by: Fabiano Rosas <farosas@suse.de>


* Re: [PATCH 13/13] migration/postcopy: blocktime allows track / report non-vCPU faults
  2025-05-27 23:12 ` [PATCH 13/13] migration/postcopy: blocktime allows track / report non-vCPU faults Peter Xu
@ 2025-06-03 16:50   ` Fabiano Rosas
  0 siblings, 0 replies; 30+ messages in thread
From: Fabiano Rosas @ 2025-06-03 16:50 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Dr . David Alan Gilbert, peterx, Alexey Perevalov, Juraj Marcin,
	Markus Armbruster

Peter Xu <peterx@redhat.com> writes:

> When used to report page fault latencies, the blocktime feature can be
> almost useless when KVM async page fault is enabled, because in most cases
> such remote faults will kick off async page faults, which are then not
> trackable by the blocktime layer.
>
> After all the recent rewrites of the blocktime layer, it's finally easy to
> also track non-vCPU faults.  It would be even faster if we could always
> index fault records by TID; unfortunately, we need to maintain the
> blocktime API, which reports things by vCPU index.
>
> Of course this works not only for kworkers, but for any access that may
> reach a missing page, for example very likely from the QEMU main thread
> too (and all other threads, whenever applicable).
>
> For these threads, we don't care about "how long the thread was blocked",
> but only about "how long it took to resolve the fault".
>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Dr. David Alan Gilbert <dave@treblig.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Only a typo below.

Reviewed-by: Fabiano Rosas <farosas@suse.de>

> ---
>  qapi/migration.json                   |  7 +++
>  migration/migration-hmp-cmds.c        |  5 +++
>  migration/postcopy-ram.c              | 64 +++++++++++++++++++++------
>  tests/qtest/migration/migration-qmp.c |  1 +
>  migration/trace-events                |  2 +-
>  5 files changed, 64 insertions(+), 15 deletions(-)
>
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 8b13cea169..f59c473842 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -247,6 +247,12 @@
>  #     this is the per-vCPU statistics. This is only present when the
>  #     postcopy-blocktime migration capability is enabled.  (Since 10.1)
>  #
> +# @postcopy-non-vcpu-latency: average remote page fault latency for all
> +#     faults happened in non-vCPU threads (in us).  It has the same
> +#     definition as @postcopy-latency but only provides statistics for
> +#     non-vCPU faults.  This is only present when the postcopy-blocktime
> +#     migration capability is enabled.  (Since 10.1)
> +#
>  # @socket-address: Only used for tcp, to know what the real port is
>  #     (Since 4.0)
>  #
> @@ -288,6 +294,7 @@
>             '*postcopy-vcpu-blocktime': ['uint32'],
>             '*postcopy-latency': 'uint64',
>             '*postcopy-vcpu-latency': ['uint64'],
> +           '*postcopy-non-vcpu-latency': 'uint64',
>             '*socket-address': ['SocketAddress'],
>             '*dirty-limit-throttle-time-per-round': 'uint64',
>             '*dirty-limit-ring-full-time': 'uint64'} }
> diff --git a/migration/migration-hmp-cmds.c b/migration/migration-hmp-cmds.c
> index a18049a7e8..bdfc754d70 100644
> --- a/migration/migration-hmp-cmds.c
> +++ b/migration/migration-hmp-cmds.c
> @@ -81,6 +81,11 @@ static void migration_dump_blocktime(Monitor *mon, MigrationInfo *info)
>                         info->postcopy_latency);
>      }
>  
> +    if (info->has_postcopy_non_vcpu_latency) {
> +        monitor_printf(mon, "Postcopy non-vCPU Latencies (us): %" PRIu64 "\n",
> +                       info->postcopy_non_vcpu_latency);
> +    }
> +
>      if (info->has_postcopy_vcpu_latency) {
>          uint64List *item = info->postcopy_vcpu_latency;
>          int count = 0;
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 066dc9f3ce..bc6736f262 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -192,6 +192,8 @@ typedef struct PostcopyBlocktimeContext {
>      GHashTable *tid_to_vcpu_hash;
>      /* Count of non-vCPU faults.  This is only for debugging purpose. */
>      uint64_t non_vcpu_faults;
> +    /* total blocktime when a non-vCPU thread is stopped */
> +    uint64_t non_vcpu_blocktime_total;
>  
>      /*
>       * Handler for exit event, necessary for
> @@ -203,7 +205,10 @@ typedef struct PostcopyBlocktimeContext {
>  typedef struct {
>      /* The time the fault was triggered */
>      uint64_t fault_time;
> -    /* The vCPU index that was blocked */
> +    /*
> +     * The vCPU index that was blocked, when cpu==-1, it means it's a
> +     * fault from non-vCPU threads.
> +     */
>      int cpu;
>  } BlocktimeVCPUEntry;
>  
> @@ -344,6 +349,12 @@ void fill_destination_postcopy_migration_info(MigrationInfo *info)
>          QAPI_LIST_PREPEND(list_latency, latency);
>      }
>  
> +    latency_total += bc->non_vcpu_blocktime_total;
> +    faults += bc->non_vcpu_faults;
> +
> +    info->has_postcopy_non_vcpu_latency = true;
> +    info->postcopy_non_vcpu_latency = bc->non_vcpu_faults ?
> +        (bc->non_vcpu_blocktime_total / bc->non_vcpu_faults) : 0;
>      info->has_postcopy_blocktime = true;
>      info->postcopy_blocktime = (uint32_t)(bc->total_blocktime / 1000);
>      info->has_postcopy_vcpu_blocktime = true;
> @@ -982,7 +993,10 @@ static uint64_t get_current_us(void)
>      return (uint64_t)qemu_clock_get_us(QEMU_CLOCK_REALTIME);
>  }
>  
> -/* Inject an (cpu, fault_time) entry into the database, using addr as key */
> +/*
> + * Inject an (cpu, fault_time) entry into the database, using addr as key.
> + * When cpu==-1, it means it's a non-vCPU fault.
> + */
>  static void blocktime_fault_inject(PostcopyBlocktimeContext *ctx,
>                                     uintptr_t addr, int cpu, uint64_t time)
>  {
> @@ -1065,9 +1079,17 @@ void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
>          /* Making sure it won't overflow - it really should never! */
>          assert(dc->vcpu_faults_current[cpu] <= 255);
>      } else {
> -        /* We do not support non-vCPU thread tracking yet */
> +        /*
> +         * For non-vCPU thread faults, we don't care about tid or cpu index
> +         * or time the thread is blocked (e.g., a kworker trying to help
> +         * KVM when async_pf=on is OK to be blocked and not affect guest
> +         * responsiveness), but we care about latency.  Track it with
> +         * cpu=-1.
> +         *
> +         * Note that this will NOT affect blocktime reports on vCPU being
> +         * blocked, but only about system-wise latency reports.

system-wide

> +         */
>          dc->non_vcpu_faults++;
> -        return;
>      }
>  
>      blocktime_fault_inject(dc, addr, cpu, current_us);
> @@ -1077,6 +1099,7 @@ typedef struct {
>      PostcopyBlocktimeContext *ctx;
>      uint64_t current_us;
>      int affected_cpus;
> +    int affected_non_cpus;
>  } BlockTimeVCPUIter;
>  
>  static void blocktime_cpu_list_iter_fn(gpointer data, gpointer user_data)
> @@ -1084,6 +1107,7 @@ static void blocktime_cpu_list_iter_fn(gpointer data, gpointer user_data)
>      BlockTimeVCPUIter *iter = user_data;
>      PostcopyBlocktimeContext *ctx = iter->ctx;
>      BlocktimeVCPUEntry *entry = data;
> +    uint64_t time_passed;
>      int cpu = entry->cpu;
>  
>      /*
> @@ -1091,17 +1115,27 @@ static void blocktime_cpu_list_iter_fn(gpointer data, gpointer user_data)
>       * later than when it was faulted.
>       */
>      assert(iter->current_us >= entry->fault_time);
> +    time_passed = iter->current_us - entry->fault_time;
>  
> -    /*
> -     * If we resolved all pending faults on one vCPU due to this page
> -     * resolution, take a note.
> -     */
> -    if (--ctx->vcpu_faults_current[cpu] == 0) {
> -        ctx->vcpu_blocktime_total[cpu] += iter->current_us - entry->fault_time;
> -        iter->affected_cpus += 1;
> +    if (cpu >= 0) {
> +        /*
> +         * If we resolved all pending faults on one vCPU due to this page
> +         * resolution, take a note.
> +         */
> +        if (--ctx->vcpu_faults_current[cpu] == 0) {
> +            ctx->vcpu_blocktime_total[cpu] += time_passed;
> +            iter->affected_cpus += 1;
> +        }
> +        trace_postcopy_blocktime_end_one(cpu, ctx->vcpu_faults_current[cpu]);
> +    } else {
> +        iter->affected_non_cpus++;
> +        ctx->non_vcpu_blocktime_total += time_passed;
> +        /*
> +         * We do not maintain how many pending non-vCPU faults because we
> +         * do not care about blocktime, only latency.
> +         */
> +        trace_postcopy_blocktime_end_one(-1, 0);
>      }
> -
> -    trace_postcopy_blocktime_end_one(cpu, ctx->vcpu_faults_current[cpu]);
>  }
>  
>  /*
> @@ -1140,6 +1174,7 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
>      BlockTimeVCPUIter iter = {
>          .current_us = get_current_us(),
>          .affected_cpus = 0,
> +        .affected_non_cpus = 0,
>          .ctx = dc,
>      };
>      gpointer key = (gpointer)addr;
> @@ -1173,7 +1208,8 @@ static void mark_postcopy_blocktime_end(uintptr_t addr)
>      }
>      dc->smp_cpus_down -= iter.affected_cpus;
>  
> -    trace_postcopy_blocktime_end(addr, iter.current_us, iter.affected_cpus);
> +    trace_postcopy_blocktime_end(addr, iter.current_us, iter.affected_cpus,
> +                                 iter.affected_non_cpus);
>  }
>  
>  static void postcopy_pause_fault_thread(MigrationIncomingState *mis)
> diff --git a/tests/qtest/migration/migration-qmp.c b/tests/qtest/migration/migration-qmp.c
> index 1a5ab2d229..67a67d4bd6 100644
> --- a/tests/qtest/migration/migration-qmp.c
> +++ b/tests/qtest/migration/migration-qmp.c
> @@ -361,6 +361,7 @@ void read_blocktime(QTestState *who)
>      g_assert(qdict_haskey(rsp_return, "postcopy-vcpu-blocktime"));
>      g_assert(qdict_haskey(rsp_return, "postcopy-latency"));
>      g_assert(qdict_haskey(rsp_return, "postcopy-vcpu-latency"));
> +    g_assert(qdict_haskey(rsp_return, "postcopy-non-vcpu-latency"));
>      qobject_unref(rsp_return);
>  }
>  
> diff --git a/migration/trace-events b/migration/trace-events
> index a36a78f01a..706db97def 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -310,7 +310,7 @@ postcopy_preempt_thread_entry(void) ""
>  postcopy_preempt_thread_exit(void) ""
>  postcopy_blocktime_tid_cpu_map(int cpu, uint32_t tid) "cpu: %d, tid: %u"
>  postcopy_blocktime_begin(uint64_t addr, uint64_t time, int cpu, bool exists) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", cpu: %d, exist: %d"
> -postcopy_blocktime_end(uint64_t addr, uint64_t time, int affected_cpu) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", affected_cpus: %d"
> +postcopy_blocktime_end(uint64_t addr, uint64_t time, int affected_cpu, int affected_non_cpus) "addr: 0x%" PRIx64 ", time: %" PRIu64 ", affected_cpus: %d, affected_non_cpus: %d"
>  postcopy_blocktime_end_one(int cpu, uint8_t left_faults) "cpu: %d, left_faults: %" PRIu8
>  
>  # exec.c


end of thread, other threads: [~2025-06-03 16:51 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
2025-05-27 23:12 ` [PATCH 01/13] migration: Add option to set postcopy-blocktime Peter Xu
2025-06-02 16:50   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex Peter Xu
2025-06-02 17:46   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 03/13] migration/postcopy: Drop all atomic ops in blocktime feature Peter Xu
2025-06-02 18:00   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 04/13] migration/postcopy: Make all blocktime vars 64bits Peter Xu
2025-06-02 20:42   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time Peter Xu
2025-06-03 15:52   ` Fabiano Rosas
2025-06-03 15:55     ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 06/13] migration/postcopy: Bring blocktime layer to us level Peter Xu
2025-06-03 15:54   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 07/13] migration/postcopy: Add blocktime fault counts per-vcpu Peter Xu
2025-06-03 15:59   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime Peter Xu
2025-06-02  9:26   ` Markus Armbruster
2025-06-02 16:29     ` Peter Xu
2025-06-03 16:07   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 09/13] migration/postcopy: Initialize blocktime context only until listen Peter Xu
2025-06-03 16:08   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 10/13] migration/postcopy: Cache the tid->vcpu mapping for blocktime Peter Xu
2025-06-03 16:20   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 11/13] migration/postcopy: Cleanup the total blocktime accounting Peter Xu
2025-06-03 16:22   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable Peter Xu
2025-06-03 16:44   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 13/13] migration/postcopy: blocktime allows track / report non-vCPU faults Peter Xu
2025-06-03 16:50   ` Fabiano Rosas
