Re: [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Fabiano Rosas <farosas@suse.de>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: "Dr . David Alan Gilbert" <dave@treblig.org>,
	peterx@redhat.com, Alexey Perevalov <a.perevalov@samsung.com>,
	Juraj Marcin <jmarcin@redhat.com>
Subject: Re: [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable
Date: Tue, 03 Jun 2025 13:44:40 -0300	[thread overview]
Message-ID: <87tt4w3hc7.fsf@suse.de> (raw)
In-Reply-To: <20250527231248.1279174-13-peterx@redhat.com>

Peter Xu <peterx@redhat.com> writes:

> Currently, the postcopy blocktime feature maintains vCPU fault information
> using an array (vcpu_addr[]).  It has two issues.
>
> Issue 1: Performance Concern
> ============================
>
> The old algorithm was almost OK and fast on inserts, except that the lookup
> is slow and won't scale if there are a lot of vCPUs: when a page is copied
> during postcopy, mark_postcopy_blocktime_end() will walk the whole array
> trying to find which vCPUs are blocked by the address.  So it needs
> constant O(N) walk for each page resolution.
>
> Alexey (the author of postcopy blocktime) mentioned the perf issue and how
> to optimize it in a piece of comment in the page resolution path.  The
> comment was (interestingly..) not complete, but it's relatively clear what
> he wanted to say about this perf issue.
>
> Issue 2: Wrong Accounting on re-entrancies
> ==========================================
>
> People might think that each vCPU should only and always get one fault at a
> time, so that when the blocktime layer captured one fault on one vCPU, we
> should never see another fault message on this vCPU.
>
> It's almost correct, except in some extreme rare cases.
>
> Case 1: it's possible the fault thread processes the userfaultfd messages
> too fast so it can see >1 messages on one vCPU before the previous one was
> resolved.
>
> Case 2: it's theoretically also possible one vCPU can get even more than
> one message on the same fault address if a fault is retried by the
> kernel (e.g., handle_userfault() got interrupted before page resolution).
>
> As this info might be important, instead of using commit message, I put
> more details into the code as comment, when introducing an array
> maintaining concurrent faults on one vCPU.  Please refer to the comments
> for details on both cases, especially case 1 which can be tricky.
>
> Case 1 sounds rare, but it can be easily reproduced locally for me when we
> run blocktime together with the migration-test on the vanilla postcopy.
>
> New Design
> ==========
>
> This patch should do almost what Alexey mentioned, but slightly
> differently: instead of having an array to maintain vCPU fault addresses,
> for each of the fault message we push a message into a hash, indexed by the
> fault address.
>
> With the hash, it can replace the old two structs: both the vcpu_addr[]
> array, and also the array to store the start time of the fault.  However
> due to above we need one more counter array to account concurrent faults on
> the same vCPU - that should even be needed in the old code, it's just that
> the old code was buggy and it will blindly overwrite an existing
> entry.. now we'll start to really track everything.
>
> The hash structure might be more efficient than tree to maintain such
> addr->(cpu, fault_time) information, so that the insert() and lookup()
> paths should ideally both be ~O(1).  After all, we do not need to sort.
>
> Here we need to do one remove() though after the lookup().  It could be
> slow but only if many vCPUs faulted exactly on the same address (so when
> the list of cpu entries is long), which should be unlikely. Even with that,
> it's still a worst case O(N) (consider 400 vCPUs faulted on the same
> address and how likely is it..) rather than a constant O(N) complexity.
>
> When at it, touch up the tracepoints to make them slightly more useful.
> One tracepoint is added when walking all the fault entries.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Fabiano Rosas <farosas@suse.de>

next prev parent reply	other threads:[~2025-06-03 16:45 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-27 23:12 [PATCH 00/13] migration/postcopy: Blocktime tracking overhaul Peter Xu
2025-05-27 23:12 ` [PATCH 01/13] migration: Add option to set postcopy-blocktime Peter Xu
2025-06-02 16:50   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 02/13] migration/postcopy: Push blocktime start/end into page req mutex Peter Xu
2025-06-02 17:46   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 03/13] migration/postcopy: Drop all atomic ops in blocktime feature Peter Xu
2025-06-02 18:00   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 04/13] migration/postcopy: Make all blocktime vars 64bits Peter Xu
2025-06-02 20:42   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 05/13] migration/postcopy: Drop PostcopyBlocktimeContext.start_time Peter Xu
2025-06-03 15:52   ` Fabiano Rosas
2025-06-03 15:55     ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 06/13] migration/postcopy: Bring blocktime layer to us level Peter Xu
2025-06-03 15:54   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 07/13] migration/postcopy: Add blocktime fault counts per-vcpu Peter Xu
2025-06-03 15:59   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 08/13] migration/postcopy: Report fault latencies in blocktime Peter Xu
2025-06-02  9:26   ` Markus Armbruster
2025-06-02 16:29     ` Peter Xu
2025-06-03 16:07   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 09/13] migration/postcopy: Initialize blocktime context only until listen Peter Xu
2025-06-03 16:08   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 10/13] migration/postcopy: Cache the tid->vcpu mapping for blocktime Peter Xu
2025-06-03 16:20   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 11/13] migration/postcopy: Cleanup the total blocktime accounting Peter Xu
2025-06-03 16:22   ` Fabiano Rosas
2025-05-27 23:12 ` [PATCH 12/13] migration/postcopy: Optimize blocktime fault tracking with hashtable Peter Xu
2025-06-03 16:44   ` Fabiano Rosas [this message]
2025-05-27 23:12 ` [PATCH 13/13] migration/postcopy: blocktime allows track / report non-vCPU faults Peter Xu
2025-06-03 16:50   ` Fabiano Rosas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87tt4w3hc7.fsf@suse.de \
    --to=farosas@suse.de \
    --cc=a.perevalov@samsung.com \
    --cc=dave@treblig.org \
    --cc=jmarcin@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).