Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing

From: "Roger Pau Monné" <roger.pau@citrix.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ross Lagerwall <ross.lagerwall@citrix.com>,
	Jan Beulich <jbeulich@suse.com>,
	Anthony PERARD <anthony.perard@vates.tech>,
	Michal Orzel <michal.orzel@amd.com>,
	Julien Grall <julien@xen.org>,
	Stefano Stabellini <sstabellini@kernel.org>,
	xen-devel@lists.xenproject.org
Subject: Re: [PATCH v1 0/3] Lockless SMP function call and TLB flushing
Date: Thu, 9 Apr 2026 10:13:02 +0200
Message-ID: <addfjkfzYtmm_flx@macbook.local>
In-Reply-To: <8b754178-94c8-448f-9ec2-26b6a23565b8@citrix.com>

On Thu, Apr 02, 2026 at 01:57:00PM +0200, Andrew Cooper wrote:
> On 02/04/2026 12:57 pm, Ross Lagerwall wrote:
> > On 4/2/26 9:49 AM, Jan Beulich wrote:
> >> On 02.04.2026 10:40, Ross Lagerwall wrote:
> >>> On 4/2/26 7:09 AM, Jan Beulich wrote:
> >>>> On 01.04.2026 18:35, Ross Lagerwall wrote:
> >>>>> We have observed that the TLB flush lock can be a point of contention
> >>>>> for certain workloads, e.g. migrating 10 VMs off a host during a host
> >>>>> evacuation.
> >>>>>
> >>>>> Performance numbers:
> >>>>>
> >>>>> I wrote a synthetic benchmark to measure the performance. The
> >>>>> benchmark has one or more CPUs in Xen calling on_selected_cpus() with
> >>>>> between 1 and 64 CPUs in the selected mask. The executed function
> >>>>> simply delays for 500 microseconds.
> >>>>>
> >>>>> The table below shows the % change in execution time of
> >>>>> on_selected_cpus():
> >>>>>
> >>>>>                    1 thread   2 threads   4 threads
> >>>>> 1 CPU in mask          0.02      -35.23      -51.18
> >>>>> 2 CPUs in mask         0.01      -47.20      -69.27
> >>>>> 4 CPUs in mask        -0.02      -42.40      -66.55
> >>>>> 8 CPUs in mask        -0.03      -47.82      -68.39
> >>>>> 16 CPUs in mask        0.12      -41.95      -58.26
> >>>>> 32 CPUs in mask        0.02      -25.43      -39.35
> >>>>> 64 CPUs in mask        0.00      -24.70      -37.83
> >>>>>
> >>>>> With 1 thread (i.e. no contention), there is no regression in
> >>>>> execution time. With multiple threads, as expected there is a
> >>>>> significant improvement in execution time.
> >>>>>
> >>>>> As a more practical benchmark to simulate host evacuation, I measured
> >>>>> the memory dirtying rate across 10 VMs after enabling log dirty (on an
> >>>>> AMD system, so without PML). The rate increased by 16% with this patch
> >>>>> series, even after the recent deferred TLB flush changes.
> >>>>
> >>>> Is this a positive thing though? In the context of some related work
> >>>> something similar was mentioned iirc, accompanied by stating that this
> >>>> is actually problematic. A guest in log-dirty mode generally wants to
> >>>> be making progress, but also wants to be throttled enough to limit
> >>>> re-dirtying, such that subsequent iterations (in particular the final
> >>>> one) of page contents migration won't have to process overly many
> >>>> pages a 2nd time.
> >>>
> >>> In the context of a real migration, both the process copying the pages
> >>> out of the guest and the guest itself will be hitting the TLB flush
> >>> lock so reducing that bottleneck may increase throughput on both sides.
> >>> Whether or not the overall migration time increases or decreases
> >>> depends on many factors (number of migrations in parallel, the rate the
> >>> guest is dirtying memory, the line speed of the NIC, whether PML is
> >>> used, ...) which is why I measured a more controlled scenario to
> >>> demonstrate the change.
> >>>
> >>> IMO throttling of a guest during a migration should be something
> >>> intentional and controlled by userspace policy rather than a side
> >>> effect of some internal global locks.
> >>
> >> I definitely agree here, but side effects going away may make it
> >> necessary to add such explicit throttling.
> >>
> >
> > Explicit throttling is much more important for the already existing
> > case of Intel systems with PML. With log dirty enabled, a VM on an Intel
> > system can dirty memory an order of magnitude faster than an AMD system
> > without PML.
> >
> > As an aside, for the same test an Intel machine without PML is still a
> > lot faster than AMD so there is probably something to improve in this
> > area for AMD machines. 
> 
> AMD have PML on the way. 
> https://docs.amd.com/v/u/en-US/69208_1.00_AMD64_PML_PUB
> 
> There is a mis-step with how support for Intel's PML is done, meaning
> that draining the vCPU's PML buffers is extraordinarily expensive even
> when there's no action to take.  (Specifically, the remote VMCS acquire)
> 
> A better option is this:  When logdirty is active, any VMExit will drain
> the PML buffer into the logdirty bitmap before processing the main exit
> reason.  This way, you drain all the PML buffers by just IPI-ing the
> domain dirty mask.

Seems like a good and easy-to-implement optimization.  However, we are
already too fast when using PML, in the sense that the toolstack cannot
keep up with the rate at which memory is dirtied :).
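
To make the drain-on-exit idea concrete, here is a rough standalone C
model of it.  It is not Xen code: every name below (struct vcpu_model,
drain_pml(), handle_vmexit(), ...) is hypothetical, the guest size is
made up, and the buffer index is simplified to a plain entry count
rather than mirroring the hardware index.  It only illustrates the
ordering: drain the local PML buffer into the logdirty bitmap first,
then process the exit reason.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PML_ENTRIES 512   /* entries per buffer in this model */
#define NR_PAGES    4096  /* hypothetical guest size, in pages */
#define PAGE_SHIFT  12

/* Hypothetical stand-in for per-domain logdirty state. */
struct domain_model {
    bool logdirty_active;
    uint8_t dirty_bitmap[NR_PAGES / 8];
};

/* Hypothetical stand-in for per-vCPU PML state. */
struct vcpu_model {
    struct domain_model *d;
    uint64_t pml_buf[PML_ENTRIES];  /* logged guest-physical addresses */
    unsigned int pml_count;         /* valid entries (simplified index) */
};

static void mark_dirty(struct domain_model *d, uint64_t gfn)
{
    if (gfn < NR_PAGES)
        d->dirty_bitmap[gfn / 8] |= (uint8_t)(1u << (gfn % 8));
}

static bool test_dirty(const struct domain_model *d, uint64_t gfn)
{
    return gfn < NR_PAGES && ((d->dirty_bitmap[gfn / 8] >> (gfn % 8)) & 1);
}

/*
 * Meant to run on the pCPU that took the exit, so only the local
 * vCPU's buffer is touched and no remote state has to be acquired.
 */
static void drain_pml(struct vcpu_model *v)
{
    assert(v->pml_count <= PML_ENTRIES);
    for (unsigned int i = 0; i < v->pml_count; i++)
        mark_dirty(v->d, v->pml_buf[i] >> PAGE_SHIFT);
    v->pml_count = 0;
}

/* Hypothetical exit handler: drain first, then handle the real reason. */
static int handle_vmexit(struct vcpu_model *v, int exit_reason)
{
    if (v->d->logdirty_active)
        drain_pml(v);
    return exit_reason;  /* placeholder for the actual exit processing */
}
```

With that ordering, an IPI to the CPUs in the domain dirty mask forces
an exit on each of them, and the exit path itself does the draining.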

Thanks, Roger.
