Re: [PATCH v3 2/2] xen/x86: Change stub page allocation/free

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Roger Pau Monné" <roger.pau@citrix.com>
To: Jason Andryuk <jason.andryuk@amd.com>
Cc: xen-devel@lists.xenproject.org, Jan Beulich <jbeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Teddy Astie <teddy.astie@vates.tech>
Subject: Re: [PATCH v3 2/2] xen/x86: Change stub page allocation/free
Date: Wed, 10 Jun 2026 20:20:02 +0200	[thread overview]
Message-ID: <aimq0g_XqJQ-7Gqe@macbook.local> (raw)
In-Reply-To: <085ebfb8-d76c-4a13-8b5b-130959b25b51@amd.com>

On Wed, Jun 10, 2026 at 11:23:46AM -0400, Jason Andryuk wrote:
> On 2026-06-10 11:01, Roger Pau Monné wrote:
> > On Mon, Jun 08, 2026 at 08:06:38PM -0400, Jason Andryuk wrote:
> > > Today the inline tracking of the stub page is problematic.  0xcc is used
> > > to indicate unused, but it is also a "clear value."  A !CONFIG_PV build
> > > with smt=0 will bring up CPU0, bring up CPU1, bring down CPU1, and free
> > > the in-use stub page.  Subsequent CPU onlining can write to the re-used
> > > page.
> > > 
> > > The new approach uses a global, CPU-indexed array of stub pages.
> > > However, to handle NUMA aware allocations, we cannot allocate all the
> > > pages in advance because the NUMA information is not available.  Keep
> > > track of 1 current page for each NUMA node, allocated on demand, and
> > > allocate the stub buffers out of those pages.
> > > 
> > > The current NUMA allocation approach is opportunistic sharing among the
> > > groups of 32 processors.  The new approach will allocate buffers densely
> > > in a NUMA node.
> > > 
> > > stub pages are no longer freed.  They remain referenced in the global
> > > CPU-indexed array and are re-used if the CPU is re-onlined.
> > > 
> > > stubs and node_stubs don't have an explicit lock.  During boot they are
> > > accessed single threaded.  During runtime, &cpu_add_remove_lock
> > > serializes access.
> > > 
> > > Fixes: 7a66ac8d1633 ("x86: move syscall trampolines off the stack")
> > > Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> > > ---
> > > I'm not sure how to test the NUMA part - I don't have an NUMA system.
> > > Also, if NUMA is active, is a cpu node of NUMA_NO_NODE still possible?
> > > I used the MAX_NUMNODES + 1 array sizing to handle that, but it's not
> > > obvious to me if that is necessary.
> > > 
> > > Roger mentioned removing the per-cpu stubs.mfn.  We'd need to replace
> > > that with exposing the stubs array for traps and the emulator.  I have
> > > no idea if that will be an improvement and am looking for agreement on
> > > this patch before attempting.
> > > ---
> > >   xen/arch/x86/include/asm/stubs.h |   2 +-
> > >   xen/arch/x86/setup.c             |   3 +-
> > >   xen/arch/x86/smpboot.c           | 110 +++++++++++++++++++++----------
> > >   3 files changed, 77 insertions(+), 38 deletions(-)
> > > 
> > > diff --git a/xen/arch/x86/include/asm/stubs.h b/xen/arch/x86/include/asm/stubs.h
> > > index a520928e9a..9d776f81dd 100644
> > > --- a/xen/arch/x86/include/asm/stubs.h
> > > +++ b/xen/arch/x86/include/asm/stubs.h
> > > @@ -32,6 +32,6 @@ struct stubs {
> > >   };
> > >   DECLARE_PER_CPU(struct stubs, stubs);
> > > -unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn);
> > > +unsigned long assign_stub_page(unsigned int cpu);
> > >   #endif /* X86_ASM_STUBS_H */
> > > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> > > index 19ee857abf..0cac94cbdb 100644
> > > --- a/xen/arch/x86/setup.c
> > > +++ b/xen/arch/x86/setup.c
> > > @@ -2089,8 +2089,7 @@ void asmlinkage __init noreturn __start_xen(void)
> > >       init_idle_domain();
> > > -    this_cpu(stubs.addr) = alloc_stub_page(smp_processor_id(),
> > > -                                           &this_cpu(stubs).mfn);
> > > +    this_cpu(stubs.addr) = assign_stub_page(0);
> > 
> > Given stub pages is first used quite late in the boot process, the above
> > arrays would better be dynamically allocated using xvmalloc_array().
> 
> Ok.  At some point I intended to dynamically allocate.  But x86 doesn't have
> num_possible_cpus(), and I thought num_present_cpus() wouldn't have the
> correct value.  nr_cpu_ids seemed close to the value, but then I convinced
> myself NR_CPUS would be okay.

I will double check, but I think num_present_cpus() accounts for the
maximum number of online CPUs possible at any point.

> > diff --git a/xen/arch/x86/include/asm/stubs.h b/xen/arch/x86/include/asm/stubs.h
> > index a520928e9a50..d575f1eb0631 100644
> > --- a/xen/arch/x86/include/asm/stubs.h
> > +++ b/xen/arch/x86/include/asm/stubs.h
> > @@ -32,6 +32,7 @@ struct stubs {
> >   };
> >   DECLARE_PER_CPU(struct stubs, stubs);
> > -unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn);
> > +unsigned long assign_stub_page(unsigned int cpu);
> > +void init_bsp_stub(void);
> 
> With init_bsp_stub(), assign_stub_page can be static and not exported.

Oh, nice one.

> 
> >   #endif /* X86_ASM_STUBS_H */
> 
> > diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> > index b3045eac5b5e..dd0972a3025e 100644
> > --- a/xen/arch/x86/smpboot.c
> > +++ b/xen/arch/x86/smpboot.c
> > @@ -20,6 +20,7 @@
> >   #include <xen/serial.h>
> >   #include <xen/softirq.h>
> >   #include <xen/tasklet.h>
> > +#include <xen/xvmalloc.h>
> >   #include <asm/apic.h>
> >   #include <asm/cpuidle.h>
> > @@ -641,41 +642,61 @@ static int do_boot_cpu(int apicid, int cpu)
> >       return rc;
> >   }
> > -#define STUB_BUF_CPU_OFFS(cpu) (((cpu) & (STUBS_PER_PAGE - 1)) * STUB_BUF_SIZE)
> > +/* Dynamically allocated, indexed by CPU.  Store physical address of stubs. */
> > +static paddr_t *__ro_after_init stubs;
> > -unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
> > +unsigned long assign_stub_page(unsigned int cpu)
> >   {
> >       unsigned long stub_va;
> > -    struct page_info *pg;
> > +    paddr_t addr = stubs[cpu];
> > -    BUILD_BUG_ON(STUBS_PER_PAGE & (STUBS_PER_PAGE - 1));
> > -
> > -    if ( *mfn )
> > -        pg = mfn_to_page(_mfn(*mfn));
> > -    else
> > +    if ( addr == INVALID_PADDR )
> >       {
> > -        nodeid_t node = cpu_to_node(cpu);
> > -        unsigned int memflags = node != NUMA_NO_NODE ? MEMF_node(node) : 0;
> > +        nodeid_t nid = cpu_to_node(cpu);
> > -        pg = alloc_domheap_page(NULL, memflags);
> > -        if ( !pg )
> > -            return 0;
> > +        /*
> > +         * Attempt to use the same page as the previous CPU if possible,
> > +         * otherwise allocate a new one.
> > +         */
> > +        if ( cpu && nid == cpu_to_node(cpu - 1) &&
> > +             PAGE_OFFSET(stubs[cpu - 1] + STUB_BUF_SIZE) )
>     PAGE_OFFSET(stubs[cpu - 1] + STUB_BUF_SIZE)
> is to ensure we it remains inside the allocated stub page?

Yup, if there's no offset it means we have consumed the full page, and
we need to allocate a new one (at least that was my intention).

> 
> > +            addr = stubs[cpu - 1] + STUB_BUF_SIZE;
> > +        else
> > +        {
> > +            struct page_info *pg = alloc_domheap_page(NULL, MEMF_node(nid));
> 
> > @@ -1092,15 +1106,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
> >       memcpy(per_cpu(idt, cpu), bsp_idt, sizeof(bsp_idt));
> >       disable_each_ist(per_cpu(idt, cpu));
> > -    for ( stub_page = 0, i = cpu & ~(STUBS_PER_PAGE - 1);
> > -          i < nr_cpu_ids && i <= (cpu | (STUBS_PER_PAGE - 1)); ++i )
> > -        if ( cpu_online(i) && cpu_to_node(i) == node )
> > -        {
> > -            per_cpu(stubs.mfn, cpu) = per_cpu(stubs.mfn, i);
> 
> This loop tries hard to re-use the same page for a NUMA node.  My posted
> approach will densely allocate the stubs.  Your approach would re-use less,
> unless the CPUs are contiguous in a node.

From what I saw on a couple of systems, CPUs are contiguous inside a
node, for example:

(XEN) [ 2384.850395] CPU0...39 -> NODE0
(XEN) [ 2384.850397] CPU40...79 -> NODE1

> This is just an observation.  I have no idea how NUMA nodes are allocated.
> The "round robin" code in numa_init_array() made me worry that CPUs are more
> likely to be non-contiguous.

Well, the approach here would still work for non-contiguously assigned
CPUs, it would just be sub-optimal.  We can adjust if we ever find
such a system, but I think it's unlikely.

> If you have NUMA systems in XenRT, I think it would be worthwhile to test.
> Some printks to see how many pages are allocated would be useful.

Sure, I will give it a try across what we have on the lab.



> 
> Thanks,
> Jason

next prev parent reply	other threads:[~2026-06-10 18:20 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-09  0:06 [PATCH v3 0/2] xen/x86: Change stub page freeing to fix smt=0 Jason Andryuk
2026-06-09  0:06 ` [PATCH v3 1/2] xen/x86: Return virtual address from alloc_stub_page() Jason Andryuk
2026-06-09  0:06 ` [PATCH v3 2/2] xen/x86: Change stub page allocation/free Jason Andryuk
2026-06-10 15:01   ` Roger Pau Monné
2026-06-10 15:23     ` Jason Andryuk
2026-06-10 18:20       ` Roger Pau Monné [this message]
2026-06-11 15:10       ` Jan Beulich
2026-06-10 11:54 ` [PATCH v3 0/2] xen/x86: Change stub page freeing to fix smt=0 Oleksii Kurochko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aimq0g_XqJQ-7Gqe@macbook.local \
    --to=roger.pau@citrix.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=jason.andryuk@amd.com \
    --cc=jbeulich@suse.com \
    --cc=teddy.astie@vates.tech \
    --cc=xen-devel@lists.xenproject.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.