Generic Linux architectural discussions
From: Alistair Popple <apopple@nvidia.com>
To: Li Zhe <lizhe.67@bytedance.com>
Cc: akpm@linux-foundation.org, arnd@arndb.de, bp@alien8.de,
	 dave.hansen@linux.intel.com, david@kernel.org,
	linux-arch@vger.kernel.org,  linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, mingo@redhat.com, rppt@kernel.org,
	 tglx@kernel.org, x86@kernel.org
Subject: Re: [PATCH 4/4] mm: use arch store helpers in zone-device template copies
Date: Thu, 21 May 2026 08:42:05 +1000
Message-ID: <ag43be6JFA_feZTi@nvdebian.thelocal>
In-Reply-To: <20260518064242.57313-1-lizhe.67@bytedance.com>

On 2026-05-18 at 16:42 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> On Mon, 18 May 2026 10:32:03 +1000, apopple@nvidia.com wrote:
> 
> > On 2026-05-15 at 18:20 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> > > The template-based fast path still leaves the actual copy sequence up to
> > > the compiler. On x86-64 that can easily degrade back into a runtime copy
> > > loop in the hot path, which leaves performance on the table.
> > >
> > > Introduce arch_optimize_store_u64() and arch_optimize_store_drain(),
> > > with a generic fallback and an x86-64 MOVNTI/SFENCE implementation, and
> > > use them in the template copy path. Also open-code the word-at-a-time
> > > copy so the compiler emits fixed-offset stores for the hot path instead
> > > of a runtime loop.
> > >
> > > On x86-64, MOVNTI is a better fit for this write-once, streaming
> > > initialization pattern than normal cached stores. It reduces the
> > > write-allocate traffic and cache pollution that a regular store sequence
> > > would otherwise generate while filling large ranges of struct page.
> > 
> > The perf improvement looks good so thanks for looking at this, however open
> > coding this and introducing arch-specific code layout into a generic layer is
> > not the right approach. The correct solution would be to implement a memcpy
> > implementation/variant that is optimised for write-once streaming operations
> > that can transparently degrade to memcpy on unoptimised architectures.
> > 
> > A grep of the kernel sources for movnti shows there is a memcpy_flushcache()
> > variant. Maybe that could work here?
> 
> Thank you for pointing this out. Using memcpy_flushcache is indeed a
> more generic approach. I will implement the fix in the v2 revision.
> 
> I found that memcpy_flushcache() is implemented on multiple architectures,
> although not all of them can achieve performance benefits during
> ZONE_DEVICE memmap initialization from it. For example, the arm64
> implementation of memcpy_flushcache() simply uses memcpy in conjunction
> with dcache_clean_pop. Therefore, I believe it would be a reasonable choice
> on x86 to introduce a new memcpy variant that invokes memcpy_flushcache().
> 
> > > Refresh the PFN-dependent section bits and page->virtual state in the
> > > reusable template before each copy, instead of patching the destination
> > > page afterwards. This keeps the hot path as a fixed-offset store
> > > sequence and avoids post-copy normal stores to cachelines that were
> > > just written with non-temporal stores.
> > >
> > > Because non-temporal stores are not ordered against later normal stores,
> > > drain outstanding stores before memmap_init_compound() updates compound
> > > heads and before memmap_init_zone_device() returns.
> > >
> > > Disable the x86-64 override under KASAN or KMSAN so those builds keep
> > > their instrumented stores through the generic fallback.
> > >
> > > Tested in a VM with a 100 GB fsdax namespace device configured with
> > > map=dev and a 100 GB devdax namespace (align=2097152) on Intel Ice Lake
> > > server.
> > >
> > > Test procedure:
> > > Rebind the nd_pmem and dax_pmem driver 30 times and collect the memmap
> > > initialization time from the pr_debug() output of
> > > memmap_init_zone_device().
> > >
> > > Base(v7.1-rc3):
> > >   First binding for nd_pmem driver: 1486 ms
> > >   Average of subsequent rebinds: 273.52 ms
> > >
> > >   First binding for dax_pmem driver: 1515 ms
> > >   Average of subsequent rebinds: 313.45 ms
> > >
> > > With this patch:
> > >   First binding for nd_pmem driver: 1272 ms
> > >   Average of subsequent rebinds: 104.59 ms
> > >
> > >   First binding for dax_pmem driver: 1286 ms
> > >   Average of subsequent rebinds: 116.93 ms
> > >
> > 
> > > This reduces the average rebind time by about 61.8% for nd_pmem and
> > > 62.7% for dax_pmem.
> > 
> > Nice - is this the improvement from applying the whole patch series or just this
> > change?
> 
> These performance improvements are attributable to the entire patch series.
> Maybe it would be clearer to use "With this series" instead of the above
> "With this patch".
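
(For reference, the quoted percentages follow directly from the rebind
averages above. A small C helper, with a name chosen here purely for
illustration, reproduces them:)

```c
/*
 * Illustrative helper (not part of the series): percentage by which
 * @after is lower than @before.
 */
static double reduction_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}

/*
 * Average rebind times quoted above, in ms:
 *   nd_pmem:  reduction_pct(273.52, 104.59), about 61.8
 *   dax_pmem: reduction_pct(313.45, 116.93), about 62.7
 */
```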

Thanks for the clarification. I was mostly asking to get a feel for how much
this specific patch contributes to the overall improvement, and so whether its
complexity was justified. That said, using memcpy_flushcache() simplifies things
a lot from the perspective of the memmap code, so it's less of an issue, so long
as its use shows some benefit.
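
As a rough userspace illustration of that shape (not kernel code, and not the
v2 implementation): a streaming copy that uses non-temporal stores where the
build supports them and otherwise degrades to memcpy(). The names
memcpy_streaming() and memcpy_streaming_drain() are made up for this sketch;
_mm_stream_si64() and _mm_sfence() are the SSE2 intrinsics for MOVNTI and
SFENCE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#define HAVE_STREAM_STORE 1
#else
#define HAVE_STREAM_STORE 0
#endif

/*
 * Hypothetical helper: copy @len bytes as a write-once, streaming
 * operation. Falls back to memcpy() when non-temporal stores are not
 * available, or when @dst/@len are not 8-byte aligned (a conservative
 * restriction chosen for this sketch).
 */
static inline void memcpy_streaming(void *dst, const void *src, size_t len)
{
#if HAVE_STREAM_STORE
	if ((uintptr_t)dst % 8 == 0 && len % 8 == 0) {
		const unsigned char *s = src;
		long long *d = dst;
		size_t i;

		for (i = 0; i < len / 8; i++) {
			long long v;

			/* memcpy() keeps the load safe for any @src alignment. */
			memcpy(&v, s + i * 8, sizeof(v));
			_mm_stream_si64(&d[i], v);
		}
		return;
	}
#endif
	memcpy(dst, src, len);
}

/*
 * Hypothetical helper: order earlier streaming stores before later
 * stores. A no-op where memcpy_streaming() is just memcpy().
 */
static inline void memcpy_streaming_drain(void)
{
#if HAVE_STREAM_STORE
	_mm_sfence();
#endif
}
```

A caller would issue one memcpy_streaming_drain() after a batch of copies and
before any normal store that must be ordered after them, which matches the
drain points described in the commit message.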

 - Alistair

> > 
> > > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > > ---
> > >  arch/x86/include/asm/struct_page_init.h | 28 ++++++++
> > >  include/asm-generic/Kbuild              |  1 +
> > >  include/asm-generic/struct_page_init.h  | 17 +++++
> > >  mm/mm_init.c                            | 89 +++++++++++++++++++++----
> > >  4 files changed, 122 insertions(+), 13 deletions(-)
> > >  create mode 100644 arch/x86/include/asm/struct_page_init.h
> > >  create mode 100644 include/asm-generic/struct_page_init.h
> > >
> > > diff --git a/arch/x86/include/asm/struct_page_init.h b/arch/x86/include/asm/struct_page_init.h
> > > new file mode 100644
> > > index 000000000000..de8b4eab44de
> > > --- /dev/null
> > > +++ b/arch/x86/include/asm/struct_page_init.h
> > > @@ -0,0 +1,28 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _ASM_X86_STRUCT_PAGE_INIT_H
> > > +#define _ASM_X86_STRUCT_PAGE_INIT_H
> > > +
> > > +#include <linux/compiler.h>
> > > +#include <linux/types.h>
> > > +
> > > +/*
> > > + * x86-64 guarantees SSE2, so MOVNTI and SFENCE are always available there.
> > > + *
> > > + * KASAN/KMSAN rely on compiler-instrumented stores. Keep the x86 override
> > > + * disabled for those configs and fall back to plain stores instead.
> > > + */
> > > +#if defined(CONFIG_X86_64) && !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN)
> > > +static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
> > > +{
> > > +	asm volatile("movnti %1, %0" : "=m"(*dst) : "r"(val));
> > > +}
> > > +
> > > +static __always_inline void arch_optimize_store_drain(void)
> > > +{
> > > +	asm volatile("sfence" : : : "memory");
> > > +}
> > > +#else
> > > +#include <asm-generic/struct_page_init.h>
> > > +#endif
> > > +
> > > +#endif /* _ASM_X86_STRUCT_PAGE_INIT_H */
> > > diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
> > > index 2c53a1e0b760..3a493fed6803 100644
> > > --- a/include/asm-generic/Kbuild
> > > +++ b/include/asm-generic/Kbuild
> > > @@ -65,3 +65,4 @@ mandatory-y += vermagic.h
> > >  mandatory-y += vga.h
> > >  mandatory-y += video.h
> > >  mandatory-y += word-at-a-time.h
> > > +mandatory-y += struct_page_init.h
> > > diff --git a/include/asm-generic/struct_page_init.h b/include/asm-generic/struct_page_init.h
> > > new file mode 100644
> > > index 000000000000..45a722103a51
> > > --- /dev/null
> > > +++ b/include/asm-generic/struct_page_init.h
> > > @@ -0,0 +1,17 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _ASM_GENERIC_STRUCT_PAGE_INIT_H
> > > +#define _ASM_GENERIC_STRUCT_PAGE_INIT_H
> > > +
> > > +#include <linux/compiler.h>
> > > +#include <linux/types.h>
> > > +
> > > +static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
> > > +{
> > > +	*dst = val;
> > > +}
> > > +
> > > +static __always_inline void arch_optimize_store_drain(void)
> > > +{
> > > +}
> > > +
> > > +#endif /* _ASM_GENERIC_STRUCT_PAGE_INIT_H */
> > > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > > index 5a9e6ecfa894..a3211666ccd4 100644
> > > --- a/mm/mm_init.c
> > > +++ b/mm/mm_init.c
> > > @@ -37,6 +37,7 @@
> > >  #include "shuffle.h"
> > >
> > >  #include <asm/setup.h>
> > > +#include <asm/struct_page_init.h>
> > >
> > >  #ifndef CONFIG_NUMA
> > >  unsigned long max_mapnr;
> > > @@ -1078,9 +1079,21 @@ static inline bool zone_device_page_init_optimization_enabled(void)
> > >  	return !page_ref_tracepoint_active(page_ref_set);
> > >  }
> > >
> > > +/*
> > > + * The fast path copies struct page with fixed-offset u64 stores instead of
> > > + * a runtime loop. Keep that copy sequence in sync with the struct page
> > > + * layouts supported by this build.
> > > + *
> > > + * The sequence below requires struct page to be u64-aligned and currently
> > > + * handles layouts from 7 to 12 u64 words (56 to 96 bytes). If a future
> > > + * layout falls outside that range, fail the build so the store sequence is
> > > + * updated together with the layout change.
> > > + */
> > >  static inline void struct_page_layout_check(void)
> > >  {
> > >  	BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
> > > +	BUILD_BUG_ON(sizeof(struct page) < 56);
> > > +	BUILD_BUG_ON(sizeof(struct page) > 96);
> > 
> > This would be unnecessary without the open-coded memcpy and is another reason to
> > prefer a more generic approach.
> 
> Yes, I will fix this issue in v2.
> 
> > >  }
> > >
> > >  static inline void init_template_head_page(struct page *template,
> > > @@ -1108,30 +1121,67 @@ static inline void init_template_tail_page(struct page *template,
> > >  }
> > >
> > >  /*
> > > - * Initialize parts that differ from the template
> > > + * 'template' is a reusable page prototype rather than a strictly immutable
> > > + * object. Most ZONE_DEVICE fields stay constant across the pages covered by
> > > + * the current template, but section bits and page->virtual may still depend
> > > + * on the PFN. Refresh those PFN-dependent fields in the template before
> > > + * copying it into @page.
> > >   */
> > > -static inline void generic_init_zone_device_page_finish(struct page *page,
> > > -							unsigned long pfn)
> > > +static inline void zone_device_page_update_template(struct page *template,
> > > +						    unsigned long pfn)
> > >  {
> > >  #ifdef SECTION_IN_PAGE_FLAGS
> > > -	set_page_section(page, pfn_to_section_nr(pfn));
> > > +	set_page_section(template, pfn_to_section_nr(pfn));
> > >  #endif
> > >  #ifdef WANT_PAGE_VIRTUAL
> > >  	if (!is_highmem_idx(ZONE_DEVICE))
> > > -		set_page_address(page, __va(pfn << PAGE_SHIFT));
> > > +		set_page_address(template, __va(pfn << PAGE_SHIFT));
> > >  #endif
> > >  }
> > >
> > >  static void init_zone_device_page_from_template(struct page *page,
> > > -		unsigned long pfn, const struct page *template)
> > > +		unsigned long pfn, struct page *template)
> > >  {
> > >  	const u64 *src = (const u64 *)template;
> > >  	u64 *dst = (u64 *)page;
> > > -	unsigned int i;
> > >
> > > -	for (i = 0; i < sizeof(struct page) / sizeof(u64); i++)
> > > -		dst[i] = src[i];
> > > -	generic_init_zone_device_page_finish(page, pfn);
> > > +	/*
> > > +	 * 'template' carries the invariant portion of a ZONE_DEVICE struct
> > > +	 * page. Update the PFN-dependent fields in place before copying it
> > > +	 * to the destination page.
> > > +	 */
> > > +	zone_device_page_update_template(template, pfn);
> > > +
> > > +	/*
> > > +	 * Keep the copy open-coded so the compiler emits fixed-offset stores
> > > +	 * for the hot path instead of a runtime copy loop.
> > > +	 */
> > > +	switch (sizeof(struct page)) {
> > > +	case 96:
> > > +		arch_optimize_store_u64(&dst[11], src[11]);
> > > +		fallthrough;
> > > +	case 88:
> > > +		arch_optimize_store_u64(&dst[10], src[10]);
> > > +		fallthrough;
> > > +	case 80:
> > > +		arch_optimize_store_u64(&dst[9], src[9]);
> > > +		fallthrough;
> > > +	case 72:
> > > +		arch_optimize_store_u64(&dst[8], src[8]);
> > > +		fallthrough;
> > > +	case 64:
> > > +		arch_optimize_store_u64(&dst[7], src[7]);
> > > +		fallthrough;
> > > +	case 56:
> > > +		arch_optimize_store_u64(&dst[6], src[6]);
> > > +		arch_optimize_store_u64(&dst[5], src[5]);
> > > +		arch_optimize_store_u64(&dst[4], src[4]);
> > > +		arch_optimize_store_u64(&dst[3], src[3]);
> > > +		arch_optimize_store_u64(&dst[2], src[2]);
> > > +		arch_optimize_store_u64(&dst[1], src[1]);
> > > +		arch_optimize_store_u64(&dst[0], src[0]);
> > > +	}
> > > +
> > 
> > I don't think unrolling the copy here is the right approach. This belongs in
> > some kind of generic streaming memcpy routine.
> 
> Yes. I've taken a look at the memcpy_flushcache() implementation on x86,
> and it only unrolls for sizes of 4, 8, and 16 bytes; all other sizes fall
> back to the generic loop. I think we need to extend the x86 implementation
> of memcpy_flushcache() so that its fast path covers at least
> sizeof(struct page).
> 
> Thanks,
> Zhe
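
For illustration, one way a constant-size fast path could cover any small
multiple of 8 bytes, rather than only 4, 8 and 16, is to let the compiler
unroll a loop whose trip count is known at compile time. This is a userspace
sketch under that assumption, not the x86 memcpy_flushcache() code;
copy_const_streaming(), store_u64_streaming() and the 128-byte cap are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si64() */
#endif

/* Store one 64-bit word, non-temporally (MOVNTI) where available. */
static inline void store_u64_streaming(void *dst, uint64_t val)
{
#if defined(__x86_64__) && defined(__SSE2__)
	_mm_stream_si64((long long *)dst, (long long)val);
#else
	memcpy(dst, &val, sizeof(val));
#endif
}

/*
 * Hypothetical fast path: when @len is a compile-time constant, a
 * multiple of 8 and at most 128 bytes, the loop below has a fixed trip
 * count that an optimizing compiler can unroll into fixed-offset
 * stores. Everything else (including unoptimized builds, where
 * __builtin_constant_p() is false) takes the plain memcpy() path.
 *
 * The caller still needs a store fence (SFENCE on x86) before later
 * stores that must be ordered after the copy.
 */
static inline __attribute__((always_inline))
void copy_const_streaming(void *dst, const void *src, size_t len)
{
	if (__builtin_constant_p(len) &&
	    len % sizeof(uint64_t) == 0 && len <= 128) {
		const unsigned char *s = src;
		unsigned char *d = dst;
		size_t off;

		for (off = 0; off < len; off += sizeof(uint64_t)) {
			uint64_t v;

			memcpy(&v, s + off, sizeof(v));
			store_u64_streaming(d + off, v);
		}
		return;
	}
	memcpy(dst, src, len);
}
```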


Thread overview: 20+ messages
2026-05-15  8:20 [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
2026-05-15  8:20 ` [PATCH 1/4] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
2026-05-18  6:32   ` Mike Rapoport
2026-05-18  9:11     ` Li Zhe
2026-05-15  8:20 ` [PATCH 2/4] mm: add a template-based fast path for zone-device page init Li Zhe
2026-05-18  6:51   ` Mike Rapoport
2026-05-18  9:54     ` Li Zhe
2026-05-18 11:42       ` Mike Rapoport
2026-05-15  8:20 ` [PATCH 3/4] mm: extend the template fast path to zone-device compound tails Li Zhe
2026-05-15  8:20 ` [PATCH 4/4] mm: use arch store helpers in zone-device template copies Li Zhe
2026-05-18  0:32   ` Alistair Popple
2026-05-18  6:42     ` Li Zhe
2026-05-20 22:42       ` Alistair Popple [this message]
2026-05-19  3:09     ` Balbir Singh
2026-05-18  6:23 ` [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Mike Rapoport
2026-05-18  8:57   ` Li Zhe
2026-05-20  6:20     ` Mike Rapoport
2026-05-20 11:57       ` Li Zhe
2026-05-20 22:36         ` Alistair Popple
2026-05-21  3:00           ` Li Zhe
