From: Peter Xu <peterx@redhat.com>
To: qemu-devel@nongnu.org
Cc: "Dr . David Alan Gilbert" <dave@treblig.org>,
peterx@redhat.com, Alexey Perevalov <a.perevalov@samsung.com>,
Fabiano Rosas <farosas@suse.de>,
Juraj Marcin <jmarcin@redhat.com>
Subject: [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex
Date: Tue, 27 May 2025 19:12:37 -0400
Message-ID: <20250527231248.1279174-3-peterx@redhat.com>
In-Reply-To: <20250527231248.1279174-1-peterx@redhat.com>
The postcopy blocktime feature was tricky in that it used quite a few atomic
operations over several arrays and variables, without explaining how that
would be thread safe. The thread safety concern here is the concurrency
between the fault thread and the fault resolution threads, which may access
the same chunks of data. All these atomic ops can also be expensive,
especially before it is clear how the whole thing works.
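For context, the shared state in question is roughly the following (abridged
from the pre-series PostcopyBlocktimeContext; see postcopy-ram.c for the full
definition):

    typedef struct PostcopyBlocktimeContext {
        /* time when page fault was initiated, per vCPU */
        uint32_t *page_fault_vcpu_time;
        /* faulted page address, per vCPU */
        uintptr_t *vcpu_addr;
        /* accumulated blocktime, per vCPU */
        uint32_t *vcpu_blocktime;
        /* total blocktime while all vCPUs were stopped */
        uint32_t total_blocktime;
        /* number of vCPUs currently blocked on a fault */
        int smp_cpus_down;
        ...
    } PostcopyBlocktimeContext;

The fault thread writes these when a fault arrives, while the fault
resolution path updates them when a page gets placed, hence the concurrency
concern above.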
OTOH, postcopy already has one page_request_mutex used to serialize the
received bitmap updates. So far that's ok - we don't yet have a lot of
threads contending the lock. It might change after multifd is supported, but
that's a separate story. What is important is that, with that mutex, it's
pretty cheap to move all the blocktime maintenance into the mutex critical
section, because the blocktime layer itself is lightweight: it is mostly
"remember which vCPU faulted on which address" and "ok, a fault got resolved,
calculate how long it took". It's also an optional feature for now (though I
have thought of changing that, maybe in the future).
Let's push the blocktime layer into the mutex, so that it's always
thread-safe even without any atomic ops.
To achieve that, add a tid parameter on the fault path so that the faulted
thread ID gets passed deeper into the stack, but not too deep. While at it,
add a TODO comment for the shared fault handler (for example, vhost-user
devices running with postcopy). One reason it might not be trivial is that
vhost-user's userfaultfds are opened by the vhost-user process, so it's
pretty hard to guarantee that the TID feature will be available there. It
wasn't supported before, so keep it like that for now.
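For reference, the thread ID now flows down the fault path like this
(condensed from the hunks below; the shared fault path cannot provide a tid
yet, so it passes 0 to skip the accounting):

    /* postcopy_ram_fault_thread(): tid comes straight from the uffd message */
    ret = postcopy_request_page(mis, rb, rb_offset,
                                msg.arg.pagefault.address,
                                msg.arg.pagefault.feat.ptid);

    /* postcopy_request_page() just forwards it */
    return migrate_send_rp_req_pages(mis, rb, start, haddr, tid);

    /* shared fault handler (e.g. vhost-user): no tid available, pass 0 */
    postcopy_request_page(mis, rb, aligned_rbo, client_addr, 0);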
Now we should be at ease, as everything is protected by a mutex that we
always take anyway.
One side effect: we can finally remove one ramblock_recv_bitmap_test() in
mark_postcopy_blocktime_begin(), which was pretty weird and which also
included an odd (perhaps necessary, perhaps not) operation of injecting a
blocktime entry and then quickly erasing it. Now that we hold the mutex and
make sure the function is only invoked after checking the receive bitmap,
that check is not needed anymore. Instead, we assert.
As another side effect, this paves the way for removing all atomic ops from
the memory accesses in the blocktime layer.
Note that we need a stub for mark_postcopy_blocktime_begin() for Windows
builds.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.h | 2 +-
migration/postcopy-ram.h | 2 ++
migration/migration.c | 24 ++++++++++-------
migration/postcopy-ram.c | 56 +++++++++++++++++++++-------------------
migration/trace-events | 2 +-
5 files changed, 47 insertions(+), 39 deletions(-)
diff --git a/migration/migration.h b/migration/migration.h
index d53f7cad84..9ca2b21459 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -546,7 +546,7 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
void migrate_send_rp_pong(MigrationIncomingState *mis,
uint32_t value);
int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
- ram_addr_t start, uint64_t haddr);
+ ram_addr_t start, uint64_t haddr, uint32_t tid);
int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
RAMBlock *rb, ram_addr_t start);
void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index a6df1b2811..3852141d7e 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -196,5 +196,7 @@ void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
void postcopy_preempt_setup(MigrationState *s);
int postcopy_preempt_establish_channel(MigrationState *s);
bool postcopy_is_paused(MigrationStatus status);
+void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
+ RAMBlock *rb);
#endif
diff --git a/migration/migration.c b/migration/migration.c
index e542c09755..4e917b5fb0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -576,22 +576,26 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
}
int migrate_send_rp_req_pages(MigrationIncomingState *mis,
- RAMBlock *rb, ram_addr_t start, uint64_t haddr)
+ RAMBlock *rb, ram_addr_t start, uint64_t haddr,
+ uint32_t tid)
{
void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
bool received = false;
WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
received = ramblock_recv_bitmap_test_byte_offset(rb, start);
- if (!received && !g_tree_lookup(mis->page_requested, aligned)) {
- /*
- * The page has not been received, and it's not yet in the page
- * request list. Queue it. Set the value of element to 1, so that
- * things like g_tree_lookup() will return TRUE (1) when found.
- */
- g_tree_insert(mis->page_requested, aligned, (gpointer)1);
- qatomic_inc(&mis->page_requested_count);
- trace_postcopy_page_req_add(aligned, mis->page_requested_count);
+ if (!received) {
+ if (!g_tree_lookup(mis->page_requested, aligned)) {
+ /*
+ * The page has not been received, and it's not yet in the page
+ * request list. Queue it. Set the value of element to 1, so that
+ * things like g_tree_lookup() will return TRUE (1) when found.
+ */
+ g_tree_insert(mis->page_requested, aligned, (gpointer)1);
+ qatomic_inc(&mis->page_requested_count);
+ trace_postcopy_page_req_add(aligned, mis->page_requested_count);
+ }
+ mark_postcopy_blocktime_begin(haddr, tid, rb);
}
}
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 995614b38c..9db3eefec4 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -752,8 +752,12 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
pagesize);
}
+/*
+ * NOTE: @tid is only used when postcopy-blocktime feature is enabled, and
+ * also optional: when zero is provided, the fault accounting will be ignored.
+ */
static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
- ram_addr_t start, uint64_t haddr)
+ ram_addr_t start, uint64_t haddr, uint32_t tid)
{
void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
@@ -772,7 +776,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
return received ? 0 : postcopy_place_page_zero(mis, aligned, rb);
}
- return migrate_send_rp_req_pages(mis, rb, start, haddr);
+ return migrate_send_rp_req_pages(mis, rb, start, haddr, tid);
}
/*
@@ -793,7 +797,8 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
qemu_ram_get_idstr(rb), rb_offset);
return postcopy_wake_shared(pcfd, client_addr, rb);
}
- postcopy_request_page(mis, rb, aligned_rbo, client_addr);
+ /* TODO: support blocktime tracking */
+ postcopy_request_page(mis, rb, aligned_rbo, client_addr, 0);
return 0;
}
@@ -819,17 +824,17 @@ static uint32_t get_low_time_offset(PostcopyBlocktimeContext *dc)
}
/*
- * This function is being called when pagefault occurs. It
- * tracks down vCPU blocking time.
+ * This function is being called when pagefault occurs. It tracks down vCPU
+ * blocking time. It's protected by @page_request_mutex.
*
* @addr: faulted host virtual address
* @ptid: faulted process thread id
* @rb: ramblock appropriate to addr
*/
-static void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
- RAMBlock *rb)
+void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
+ RAMBlock *rb)
{
- int cpu, already_received;
+ int cpu;
MigrationIncomingState *mis = migration_incoming_get_current();
PostcopyBlocktimeContext *dc = mis->blocktime_ctx;
uint32_t low_time_offset;
@@ -852,24 +857,19 @@ static void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
qatomic_xchg(&dc->vcpu_addr[cpu], addr);
/*
- * check it here, not at the beginning of the function,
- * due to, check could occur early than bitmap_set in
- * qemu_ufd_copy_ioctl
+ * The caller should only inject a blocktime entry when the page is
+ * yet missing.
*/
- already_received = ramblock_recv_bitmap_test(rb, (void *)addr);
- if (already_received) {
- qatomic_xchg(&dc->vcpu_addr[cpu], 0);
- qatomic_xchg(&dc->page_fault_vcpu_time[cpu], 0);
- qatomic_dec(&dc->smp_cpus_down);
- }
+ assert(!ramblock_recv_bitmap_test(rb, (void *)addr));
+
trace_mark_postcopy_blocktime_begin(addr, dc, dc->page_fault_vcpu_time[cpu],
- cpu, already_received);
+ cpu);
}
/*
- * This function just provide calculated blocktime per cpu and trace it.
- * Total blocktime is calculated in mark_postcopy_blocktime_end.
- *
+ * This function just provide calculated blocktime per cpu and trace it.
+ * Total blocktime is calculated in mark_postcopy_blocktime_end. It's
+ * protected by @page_request_mutex.
*
* Assume we have 3 CPU
*
@@ -1068,17 +1068,14 @@ static void *postcopy_ram_fault_thread(void *opaque)
qemu_ram_get_idstr(rb),
rb_offset,
msg.arg.pagefault.feat.ptid);
- mark_postcopy_blocktime_begin(
- (uintptr_t)(msg.arg.pagefault.address),
- msg.arg.pagefault.feat.ptid, rb);
-
retry:
/*
* Send the request to the source - we want to request one
* of our host page sizes (which is >= TPS)
*/
ret = postcopy_request_page(mis, rb, rb_offset,
- msg.arg.pagefault.address);
+ msg.arg.pagefault.address,
+ msg.arg.pagefault.feat.ptid);
if (ret) {
/* May be network failure, try to wait for recovery */
postcopy_pause_fault_thread(mis);
@@ -1299,8 +1296,8 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
qemu_cond_signal(&mis->page_request_cond);
}
}
- qemu_mutex_unlock(&mis->page_request_mutex);
mark_postcopy_blocktime_end((uintptr_t)host_addr);
+ qemu_mutex_unlock(&mis->page_request_mutex);
}
return ret;
}
@@ -1430,6 +1427,11 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
{
g_assert_not_reached();
}
+
+void mark_postcopy_blocktime_begin(uintptr_t addr, uint32_t ptid,
+ RAMBlock *rb)
+{
+}
#endif
/* ------------------------------------------------------------------------- */
diff --git a/migration/trace-events b/migration/trace-events
index dcd8fe9a0c..917f521e88 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -285,7 +285,7 @@ postcopy_nhp_range(const char *ramblock, void *host_addr, size_t offset, size_t
postcopy_place_page(void *host_addr) "host=%p"
postcopy_place_page_zero(void *host_addr) "host=%p"
postcopy_ram_enable_notify(void) ""
-mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu, int received) "addr: 0x%" PRIx64 ", dd: %p, time: %u, cpu: %d, already_received: %d"
+mark_postcopy_blocktime_begin(uint64_t addr, void *dd, uint32_t time, int cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, cpu: %d"
mark_postcopy_blocktime_end(uint64_t addr, void *dd, uint32_t time, int affected_cpu) "addr: 0x%" PRIx64 ", dd: %p, time: %u, affected_cpu: %d"
postcopy_pause_fault_thread(void) ""
postcopy_pause_fault_thread_continued(void) ""
--
2.49.0
Thread overview: 30+ messages
2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
2025-05-27 23:12 ` [PATCH 01/13] migration: Add option to set postcopy-blocktime Peter Xu
2025-06-02 16:50 ` Fabiano Rosas
2025-05-27 23:12 ` Peter Xu [this message]
2025-06-02 17:46 ` [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex Fabiano Rosas
2025-05-27 23:12 ` [PATCH 03/13] migration/postcopy: Drop all atomic ops in blocktime feature Peter Xu
2025-06-02 18:00 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 04/13] migration/postcopy: Make all blocktime vars 64bits Peter Xu
2025-06-02 20:42 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time Peter Xu
2025-06-03 15:52 ` Fabiano Rosas
2025-06-03 15:55 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 06/13] migration/postcopy: Bring blocktime layer to us level Peter Xu
2025-06-03 15:54 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 07/13] migration/postcopy: Add blocktime fault counts per-vcpu Peter Xu
2025-06-03 15:59 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime Peter Xu
2025-06-02 9:26 ` Markus Armbruster
2025-06-02 16:29 ` Peter Xu
2025-06-03 16:07 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 09/13] migration/postcopy: Initialize blocktime context only until listen Peter Xu
2025-06-03 16:08 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 10/13] migration/postcopy: Cache the tid->vcpu mapping for blocktime Peter Xu
2025-06-03 16:20 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 11/13] migration/postcopy: Cleanup the total blocktime accounting Peter Xu
2025-06-03 16:22 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable Peter Xu
2025-06-03 16:44 ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 13/13] migration/postcopy: blocktime allows track / report non-vCPU faults Peter Xu
2025-06-03 16:50 ` Fabiano Rosas