* [PATCH 1/4] mm: factor zone-device page init helpers out of __init_zone_device_page
2026-05-15 8:20 [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
@ 2026-05-15 8:20 ` Li Zhe
2026-05-18 6:32 ` Mike Rapoport
2026-05-15 8:20 ` [PATCH 2/4] mm: add a template-based fast path for zone-device page init Li Zhe
` (3 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Li Zhe @ 2026-05-15 8:20 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, arnd, rppt, akpm, david
Cc: x86, linux-kernel, linux-arch, linux-mm, lizhe.67
__init_zone_device_page() currently mixes three different jobs: deciding
the initial page refcount, initializing the generic ZONE_DEVICE state, and
setting up pageblock metadata.

Split the refcount policy into zone_device_page_init_refcount() and move
the generic page initialization into generic_init_zone_device_page(). This
keeps the slow path behavior unchanged, but makes the individual pieces
reusable by later fast-path patches.

No functional change intended.
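
The policy/mechanism split described in this message can be sketched in
userspace C. This is only an illustration under stated assumptions: the
sketch_ names, the SKETCH_ enum values and struct page_sketch are
hypothetical stand-ins, not kernel definitions, and the starting refcount
of 1 models what the generic init leaves behind before the policy is applied.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's memory type values. */
enum sketch_memory_type {
	SKETCH_DEVICE_PRIVATE = 1,
	SKETCH_DEVICE_COHERENT,
	SKETCH_DEVICE_FS_DAX,
	SKETCH_DEVICE_GENERIC,
	SKETCH_DEVICE_PCI_P2PDMA,
};

/* Hypothetical one-field page; the real struct page is far larger. */
struct page_sketch {
	int refcount;
};

/* Policy only: report which refcount a freshly initialized page gets. */
static int sketch_init_refcount(enum sketch_memory_type type)
{
	switch (type) {
	case SKETCH_DEVICE_FS_DAX:
	case SKETCH_DEVICE_PRIVATE:
	case SKETCH_DEVICE_COHERENT:
	case SKETCH_DEVICE_PCI_P2PDMA:
		return 0;
	case SKETCH_DEVICE_GENERIC:
		return 1;
	default:
		/* Unknown types keep the default refcount of 1. */
		return 1;
	}
}

/* Mechanism: start at refcount 1, then lower it if the policy says so. */
static void sketch_init_page(struct page_sketch *page,
			     enum sketch_memory_type type)
{
	page->refcount = 1;
	if (!sketch_init_refcount(type))
		page->refcount = 0;
}
```

Keeping the policy in a function with no side effects is what lets a later
caller, such as a template builder, ask for the refcount without touching a
real page.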
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
mm/mm_init.c | 62 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 38 insertions(+), 24 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..5244acb96dbb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -987,11 +987,36 @@ static void __init memmap_init(void)
}
#ifdef CONFIG_ZONE_DEVICE
-static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
- unsigned long zone_idx, int nid,
- struct dev_pagemap *pgmap)
+static inline int zone_device_page_init_refcount(
+ const struct dev_pagemap *pgmap)
{
+ /*
+ * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
+ * directly to the driver page allocator which will set the page count
+ * to 1 when allocating the page.
+ *
+ * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
+ * their refcount reset to one whenever they are freed (ie. after
+ * their refcount drops to 0).
+ */
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_FS_DAX:
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
+ case MEMORY_DEVICE_PCI_P2PDMA:
+ return 0;
+ case MEMORY_DEVICE_GENERIC:
+ return 1;
+ default:
+ WARN_ONCE(1, "Unknown memory type!");
+ return 1;
+ }
+}
+static void __ref generic_init_zone_device_page(struct page *page,
+ unsigned long pfn, unsigned long zone_idx, int nid,
+ struct dev_pagemap *pgmap)
+{
__init_single_page(page, pfn, zone_idx, nid);
/*
@@ -1011,6 +1036,16 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
page_folio(page)->pgmap = pgmap;
page->zone_device_data = NULL;
+ if (!zone_device_page_init_refcount(pgmap))
+ set_page_count(page, 0);
+}
+
+static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+ unsigned long zone_idx, int nid,
+ struct dev_pagemap *pgmap)
+{
+ generic_init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
/*
* Mark the block movable so that blocks are reserved for
* movable at startup. This will force kernel allocations
@@ -1025,27 +1060,6 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
init_pageblock_migratetype(page, MIGRATE_MOVABLE, false);
cond_resched();
}
-
- /*
- * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
- * directly to the driver page allocator which will set the page count
- * to 1 when allocating the page.
- *
- * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
- * their refcount reset to one whenever they are freed (ie. after
- * their refcount drops to 0).
- */
- switch (pgmap->type) {
- case MEMORY_DEVICE_FS_DAX:
- case MEMORY_DEVICE_PRIVATE:
- case MEMORY_DEVICE_COHERENT:
- case MEMORY_DEVICE_PCI_P2PDMA:
- set_page_count(page, 0);
- break;
-
- case MEMORY_DEVICE_GENERIC:
- break;
- }
}
/*
--
2.20.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] mm: factor zone-device page init helpers out of __init_zone_device_page
2026-05-15 8:20 ` [PATCH 1/4] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
@ 2026-05-18 6:32 ` Mike Rapoport
2026-05-18 9:11 ` Li Zhe
0 siblings, 1 reply; 15+ messages in thread
From: Mike Rapoport @ 2026-05-18 6:32 UTC (permalink / raw)
To: Li Zhe
Cc: tglx, mingo, bp, dave.hansen, arnd, akpm, david, x86,
linux-kernel, linux-arch, linux-mm
Hi,
On Fri, May 15, 2026 at 04:20:42PM +0800, Li Zhe wrote:
> __init_zone_device_page() currently mixes three different jobs: deciding
> the initial page refcount, initializing the generic ZONE_DEVICE state, and
> setting up pageblock metadata.
>
> Split the refcount policy into zone_device_page_init_refcount() and move
> the generic page initialization into generic_init_zone_device_page(). This
> keeps the slow path behavior unchanged, but makes the individual pieces
> reusable by later fast-path patches.
>
> No functional change intended.
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
> mm/mm_init.c | 62 ++++++++++++++++++++++++++++++++--------------------
> 1 file changed, 38 insertions(+), 24 deletions(-)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index f9f8e1af921c..5244acb96dbb 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -987,11 +987,36 @@ static void __init memmap_init(void)
> }
>
> #ifdef CONFIG_ZONE_DEVICE
> -static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> - unsigned long zone_idx, int nid,
> - struct dev_pagemap *pgmap)
Since you are already changing __init_zone_device_page(), I'd suggest
renaming it to zone_device_page_init().
> +static inline int zone_device_page_init_refcount(
> + const struct dev_pagemap *pgmap)
> {
> + /*
> + * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
> + * directly to the driver page allocator which will set the page count
> + * to 1 when allocating the page.
> + *
> + * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
> + * their refcount reset to one whenever they are freed (ie. after
> + * their refcount drops to 0).
> + */
> + switch (pgmap->type) {
> + case MEMORY_DEVICE_FS_DAX:
> + case MEMORY_DEVICE_PRIVATE:
> + case MEMORY_DEVICE_COHERENT:
> + case MEMORY_DEVICE_PCI_P2PDMA:
> + return 0;
> + case MEMORY_DEVICE_GENERIC:
> + return 1;
> + default:
> + WARN_ONCE(1, "Unknown memory type!");
> + return 1;
> + }
> +}
>
> +static void __ref generic_init_zone_device_page(struct page *page,
> + unsigned long pfn, unsigned long zone_idx, int nid,
> + struct dev_pagemap *pgmap)
> +{
Here too it would be better to use the zone_device_page_ prefix.
And I don't think "generic" adds clarity about what this function is doing.
Seeing that later patches rename it again to a _slow variant, I'd suggest
calling it __zone_device_page_init() and keeping this name going forward.
> __init_single_page(page, pfn, zone_idx, nid);
>
> /*
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/4] mm: factor zone-device page init helpers out of __init_zone_device_page
2026-05-18 6:32 ` Mike Rapoport
@ 2026-05-18 9:11 ` Li Zhe
0 siblings, 0 replies; 15+ messages in thread
From: Li Zhe @ 2026-05-18 9:11 UTC (permalink / raw)
To: rppt
Cc: akpm, arnd, bp, dave.hansen, david, linux-arch, linux-kernel,
linux-mm, lizhe.67, mingo, tglx, x86
On Mon, 18 May 2026 09:23:33 +0300, rppt@kernel.org wrote:
> Hi,
>
> On Fri, May 15, 2026 at 04:20:42PM +0800, Li Zhe wrote:
> > __init_zone_device_page() currently mixes three different jobs: deciding
> > the initial page refcount, initializing the generic ZONE_DEVICE state, and
> > setting up pageblock metadata.
> >
> > Split the refcount policy into zone_device_page_init_refcount() and move
> > the generic page initialization into generic_init_zone_device_page(). This
> > keeps the slow path behavior unchanged, but makes the individual pieces
> > reusable by later fast-path patches.
> >
> > No functional change intended.
> >
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > ---
> > mm/mm_init.c | 62 ++++++++++++++++++++++++++++++++--------------------
> > 1 file changed, 38 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index f9f8e1af921c..5244acb96dbb 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -987,11 +987,36 @@ static void __init memmap_init(void)
> > }
> >
> > #ifdef CONFIG_ZONE_DEVICE
> > -static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> > - unsigned long zone_idx, int nid,
> > - struct dev_pagemap *pgmap)
>
> Since you are already changing __init_zone_device_page(), I'd suggest
> renaming it to zone_device_page_init().
>
> > +static inline int zone_device_page_init_refcount(
> > + const struct dev_pagemap *pgmap)
> > {
> > + /*
> > + * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
> > + * directly to the driver page allocator which will set the page count
> > + * to 1 when allocating the page.
> > + *
> > + * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
> > + * their refcount reset to one whenever they are freed (ie. after
> > + * their refcount drops to 0).
> > + */
> > + switch (pgmap->type) {
> > + case MEMORY_DEVICE_FS_DAX:
> > + case MEMORY_DEVICE_PRIVATE:
> > + case MEMORY_DEVICE_COHERENT:
> > + case MEMORY_DEVICE_PCI_P2PDMA:
> > + return 0;
> > + case MEMORY_DEVICE_GENERIC:
> > + return 1;
> > + default:
> > + WARN_ONCE(1, "Unknown memory type!");
> > + return 1;
> > + }
> > +}
> >
> > +static void __ref generic_init_zone_device_page(struct page *page,
> > + unsigned long pfn, unsigned long zone_idx, int nid,
> > + struct dev_pagemap *pgmap)
> > +{
>
> Here too it would be better to use the zone_device_page_ prefix.
> And I don't think "generic" adds clarity about what this function is doing.
>
> Seeing that later patches rename it again to a _slow variant, I'd suggest
> calling it __zone_device_page_init() and keeping this name going forward.
Thanks, I will fix this in the v2 revision.
Thanks,
Zhe
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 2/4] mm: add a template-based fast path for zone-device page init
2026-05-15 8:20 [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
2026-05-15 8:20 ` [PATCH 1/4] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
@ 2026-05-15 8:20 ` Li Zhe
2026-05-18 6:51 ` Mike Rapoport
2026-05-15 8:20 ` [PATCH 3/4] mm: extend the template fast path to zone-device compound tails Li Zhe
` (2 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Li Zhe @ 2026-05-15 8:20 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, arnd, rppt, akpm, david
Cc: x86, linux-kernel, linux-arch, linux-mm, lizhe.67
On 64-bit builds, memmap_init_zone_device() spends most of its time
repeating the same struct page initialization for every PFN. Prepare a
template page through the existing slow path once, then copy that
template into each destination page and fix up the PFN-dependent state
afterwards.

Keep the optimized path disabled when the page_ref_set tracepoint is
active, because the template-copy path bypasses set_page_count() and
would otherwise hide the corresponding trace event.

Non-64-bit builds continue to use the existing slow path.
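
The template approach described above can be sketched in userspace C. This
is an illustration under stated assumptions: struct page_sketch, its four
fields and the sketch_ functions are hypothetical, and a single pfn_state
field stands in for whatever PFN-dependent state the real struct page
carries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page whose size is a multiple of sizeof(uint64_t). */
struct page_sketch {
	uint64_t flags;
	uint64_t pgmap;
	uint64_t data;
	uint64_t pfn_state;	/* stands in for PFN-dependent state */
};

/* Slow path: every field is computed and stored individually. */
static void sketch_init_slow(struct page_sketch *page, uint64_t pfn,
			     uint64_t pgmap)
{
	page->flags = 0x1;
	page->pgmap = pgmap;
	page->data = 0;
	page->pfn_state = pfn;
}

/* Fast path: copy the template word by word, then fix up the PFN state. */
static void sketch_init_from_template(struct page_sketch *page, uint64_t pfn,
				      const struct page_sketch *tmpl)
{
	const uint64_t *src = (const uint64_t *)tmpl;
	uint64_t *dst = (uint64_t *)page;
	size_t i;

	assert(sizeof(*page) % sizeof(uint64_t) == 0);
	for (i = 0; i < sizeof(*page) / sizeof(uint64_t); i++)
		dst[i] = src[i];
	page->pfn_state = pfn;
}

/* Build the template once, then stamp it onto every page in the range. */
static void sketch_init_range(struct page_sketch *pages, size_t nr,
			      uint64_t start_pfn, uint64_t pgmap)
{
	struct page_sketch tmpl;
	size_t i;

	sketch_init_slow(&tmpl, start_pfn, pgmap);
	for (i = 0; i < nr; i++)
		sketch_init_from_template(&pages[i], start_pfn + i, &tmpl);
}
```

The per-page cost becomes a fixed number of 64-bit stores plus the fix-up,
instead of recomputing every field for each PFN.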

Tested in a VM with a 100 GB fsdax namespace device configured with
map=dev on an Intel Ice Lake server. This test exercises the nd_pmem rebind
path (pfns_per_compound == 1).

Test procedure:
Rebind the nd_pmem driver 30 times and collect the memmap initialization
time from the pr_debug() output of memmap_init_zone_device().

Base (v7.1-rc3):
  First binding: 1486 ms
  Average of subsequent rebinds: 273.52 ms

With this patch:
  First binding: 1421 ms
  Average of subsequent rebinds: 246.14 ms

This reduces the average rebind time from 273.52 ms to 246.14 ms, or
about 10%.
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
mm/mm_init.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 96 insertions(+), 7 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5244acb96dbb..4c475c71a9d6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1013,7 +1013,7 @@ static inline int zone_device_page_init_refcount(
}
}
-static void __ref generic_init_zone_device_page(struct page *page,
+static void __ref generic_init_zone_device_page_slow(struct page *page,
unsigned long pfn, unsigned long zone_idx, int nid,
struct dev_pagemap *pgmap)
{
@@ -1040,12 +1040,9 @@ static void __ref generic_init_zone_device_page(struct page *page,
set_page_count(page, 0);
}
-static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
- unsigned long zone_idx, int nid,
- struct dev_pagemap *pgmap)
+static void __ref zone_device_page_init_pageblock(struct page *page,
+ unsigned long pfn)
{
- generic_init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
-
/*
* Mark the block movable so that blocks are reserved for
* movable at startup. This will force kernel allocations
@@ -1062,6 +1059,88 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
}
}
+static inline void __init_zone_device_page(struct page *page, unsigned long pfn,
+ unsigned long zone_idx, int nid,
+ struct dev_pagemap *pgmap)
+{
+ generic_init_zone_device_page_slow(page, pfn, zone_idx, nid, pgmap);
+ zone_device_page_init_pageblock(page, pfn);
+}
+
+#if BITS_PER_LONG == 64
+static inline bool zone_device_page_init_optimization_enabled(void)
+{
+ /*
+ * We use template pages and assign page->_refcount via memory copy.
+ * This means the optimized path bypasses set_page_count(), so the
+ * page_ref_set tracepoint cannot observe this initialization.
+ * Skip the optimized path when the tracepoint is enabled.
+ */
+ return !page_ref_tracepoint_active(page_ref_set);
+}
+
+static inline void struct_page_layout_check(void)
+{
+ BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
+}
+
+static inline void init_template_page(struct page *template,
+ unsigned long pfn,
+ unsigned long zone_idx,
+ int nid,
+ struct dev_pagemap *pgmap)
+{
+ generic_init_zone_device_page_slow(template, pfn, zone_idx, nid, pgmap);
+}
+
+/*
+ * Initialize parts that differ from the template
+ */
+static inline void generic_init_zone_device_page_finish(struct page *page,
+ unsigned long pfn)
+{
+#ifdef SECTION_IN_PAGE_FLAGS
+ set_page_section(page, pfn_to_section_nr(pfn));
+#endif
+#ifdef WANT_PAGE_VIRTUAL
+ if (!is_highmem_idx(ZONE_DEVICE))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
+static void init_zone_device_page_from_template(struct page *page,
+ unsigned long pfn, const struct page *template)
+{
+ const u64 *src = (const u64 *)template;
+ u64 *dst = (u64 *)page;
+ unsigned int i;
+
+ for (i = 0; i < sizeof(struct page) / sizeof(u64); i++)
+ dst[i] = src[i];
+ generic_init_zone_device_page_finish(page, pfn);
+ zone_device_page_init_pageblock(page, pfn);
+}
+#else
+static inline bool zone_device_page_init_optimization_enabled(void)
+{
+ return false;
+}
+static inline void init_template_page(struct page *template,
+ unsigned long pfn,
+ unsigned long zone_idx,
+ int nid,
+ struct dev_pagemap *pgmap)
+{
+}
+static inline void struct_page_layout_check(void)
+{
+}
+static void init_zone_device_page_from_template(struct page *page,
+ unsigned long pfn, const struct page *template)
+{
+}
+#endif
+
/*
* With compound page geometry and when struct pages are stored in ram most
* tail pages are reused. Consequently, the amount of unique struct pages to
@@ -1110,6 +1189,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
unsigned long nr_pages,
struct dev_pagemap *pgmap)
{
+ bool use_template = zone_device_page_init_optimization_enabled();
unsigned long pfn, end_pfn = start_pfn + nr_pages;
struct pglist_data *pgdat = zone->zone_pgdat;
struct vmem_altmap *altmap = pgmap_altmap(pgmap);
@@ -1117,6 +1197,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
unsigned long zone_idx = zone_idx(zone);
unsigned long start = jiffies;
int nid = pgdat->node_id;
+ struct page template;
if (WARN_ON_ONCE(!pgmap || zone_idx != ZONE_DEVICE))
return;
@@ -1131,10 +1212,18 @@ void __ref memmap_init_zone_device(struct zone *zone,
nr_pages = end_pfn - start_pfn;
}
+ if (use_template) {
+ struct_page_layout_check();
+ init_template_page(&template, start_pfn, zone_idx, nid, pgmap);
+ }
+
for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
struct page *page = pfn_to_page(pfn);
- __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+ if (use_template)
+ init_zone_device_page_from_template(page, pfn, &template);
+ else
+ __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
if (pfns_per_compound == 1)
continue;
--
2.20.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] mm: add a template-based fast path for zone-device page init
2026-05-15 8:20 ` [PATCH 2/4] mm: add a template-based fast path for zone-device page init Li Zhe
@ 2026-05-18 6:51 ` Mike Rapoport
2026-05-18 9:54 ` Li Zhe
0 siblings, 1 reply; 15+ messages in thread
From: Mike Rapoport @ 2026-05-18 6:51 UTC (permalink / raw)
To: Li Zhe
Cc: tglx, mingo, bp, dave.hansen, arnd, akpm, david, x86,
linux-kernel, linux-arch, linux-mm
Hi,
On Fri, May 15, 2026 at 04:20:43PM +0800, Li Zhe wrote:
> On 64-bit builds, memmap_init_zone_device() spends most of its time
> repeating the same struct page initialization for every PFN. Prepare a
> template page through the existing slow path once, then copy that
> template into each destination page and fix up the PFN-dependent state
> afterwards.
>
> Keep the optimized path disabled when the page_ref_set tracepoint is
> active, because the template-copy path bypasses set_page_count() and
> would otherwise hide the corresponding trace event.
>
> Non-64-bit builds continue to use the existing slow path.
ZONE_DEVICE depends on MEMORY_HOTPLUG and MEMORY_HOTPLUG is only supported
for 64 bits, so there can't be 32-bit builds for ZONE_DEVICE functionality.
> Tested in a VM with a 100 GB fsdax namespace device configured with
> map=dev on Intel Ice Lake server. This test exercises the nd_pmem rebind
> path (pfns_per_compound == 1).
>
> Test procedure:
> Rebind the nd_pmem driver 30 times and collect the memmap initialization
> time from the pr_debug() output of memmap_init_zone_device().
>
> Base(v7.1-rc3):
> First binding: 1486 ms
> Average of subsequent rebinds: 273.52 ms
>
> With this patch:
> First binding: 1421 ms
> Average of subsequent rebinds: 246.14 ms
>
> This reduces the average rebind time from 273.52 ms to 246.14 ms, or
> about 10%.
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
> mm/mm_init.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 96 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 5244acb96dbb..4c475c71a9d6 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1013,7 +1013,7 @@ static inline int zone_device_page_init_refcount(
> }
> }
>
> -static void __ref generic_init_zone_device_page(struct page *page,
> +static void __ref generic_init_zone_device_page_slow(struct page *page,
> unsigned long pfn, unsigned long zone_idx, int nid,
> struct dev_pagemap *pgmap)
> {
> @@ -1040,12 +1040,9 @@ static void __ref generic_init_zone_device_page(struct page *page,
> set_page_count(page, 0);
> }
>
> -static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> - unsigned long zone_idx, int nid,
> - struct dev_pagemap *pgmap)
> +static void __ref zone_device_page_init_pageblock(struct page *page,
> + unsigned long pfn)
Please move splitting _pageblock helper into the first patch, so that the
first patch would contain all code movement.
> {
> - generic_init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> -
> /*
> * Mark the block movable so that blocks are reserved for
> * movable at startup. This will force kernel allocations
> @@ -1062,6 +1059,88 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> }
> }
>
> +static inline void __init_zone_device_page(struct page *page, unsigned long pfn,
> + unsigned long zone_idx, int nid,
> + struct dev_pagemap *pgmap)
> +{
> + generic_init_zone_device_page_slow(page, pfn, zone_idx, nid, pgmap);
> + zone_device_page_init_pageblock(page, pfn);
> +}
> +
> +#if BITS_PER_LONG == 64
> +static inline bool zone_device_page_init_optimization_enabled(void)
> +{
> + /*
> + * We use template pages and assign page->_refcount via memory copy.
> + * This means the optimized path bypasses set_page_count(), so the
> + * page_ref_set tracepoint cannot observe this initialization.
> + * Skip the optimized path when the tracepoint is enabled.
> + */
> + return !page_ref_tracepoint_active(page_ref_set);
> +}
> +
> +static inline void struct_page_layout_check(void)
> +{
> + BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
Does it have to be a BUILD_BUG_ON()? Can't we fall back to the slow path if
struct page has a weird size?
Just do the check in zone_device_page_init_optimization_enabled().
> +}
> +
> +static inline void init_template_page(struct page *template,
> + unsigned long pfn,
> + unsigned long zone_idx,
> + int nid,
> + struct dev_pagemap *pgmap)
The name should include zone_device to avoid confusion with regular pages.
> +{
> + generic_init_zone_device_page_slow(template, pfn, zone_idx, nid, pgmap);
> +}
> +
> +/*
> + * Initialize parts that differ from the template
> + */
> +static inline void generic_init_zone_device_page_finish(struct page *page,
> + unsigned long pfn)
> +{
> +#ifdef SECTION_IN_PAGE_FLAGS
> + set_page_section(page, pfn_to_section_nr(pfn));
Can we add a stub for set_page_section() for the !SECTION_IN_PAGE_FLAGS case
and drop the #ifdef here and in set_page_links()?
> +#endif
> +#ifdef WANT_PAGE_VIRTUAL
> + if (!is_highmem_idx(ZONE_DEVICE))
> + set_page_address(page, __va(pfn << PAGE_SHIFT));
set_page_address() is a no-op without WANT_PAGE_VIRTUAL, so you can drop the ifdef.
> +#endif
> +}
> +
> +static void init_zone_device_page_from_template(struct page *page,
> + unsigned long pfn, const struct page *template)
zone_device_page_init_from_template() please.
> +{
> + const u64 *src = (const u64 *)template;
> + u64 *dst = (u64 *)page;
> + unsigned int i;
> +
> + for (i = 0; i < sizeof(struct page) / sizeof(u64); i++)
> + dst[i] = src[i];
> + generic_init_zone_device_page_finish(page, pfn);
> + zone_device_page_init_pageblock(page, pfn);
> +}
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] mm: add a template-based fast path for zone-device page init
2026-05-18 6:51 ` Mike Rapoport
@ 2026-05-18 9:54 ` Li Zhe
2026-05-18 11:42 ` Mike Rapoport
0 siblings, 1 reply; 15+ messages in thread
From: Li Zhe @ 2026-05-18 9:54 UTC (permalink / raw)
To: rppt
Cc: akpm, arnd, bp, dave.hansen, david, linux-arch, linux-kernel,
linux-mm, lizhe.67, mingo, tglx, x86
On Mon, 18 May 2026 09:51:34 +0300, rppt@kernel.org wrote:
> Hi,
>
> On Fri, May 15, 2026 at 04:20:43PM +0800, Li Zhe wrote:
> > On 64-bit builds, memmap_init_zone_device() spends most of its time
> > repeating the same struct page initialization for every PFN. Prepare a
> > template page through the existing slow path once, then copy that
> > template into each destination page and fix up the PFN-dependent state
> > afterwards.
> >
> > Keep the optimized path disabled when the page_ref_set tracepoint is
> > active, because the template-copy path bypasses set_page_count() and
> > would otherwise hide the corresponding trace event.
> >
> > Non-64-bit builds continue to use the existing slow path.
>
> ZONE_DEVICE depends on MEMORY_HOTPLUG and MEMORY_HOTPLUG is only supported
> for 64 bits, so there can't be 32-bit builds for ZONE_DEVICE functionality.
Thanks for the clarification.
Indeed ZONE_DEVICE depends on MEMORY_HOTPLUG which is 64-bit only. I
will refine the description accordingly in v2.
> > Tested in a VM with a 100 GB fsdax namespace device configured with
> > map=dev on Intel Ice Lake server. This test exercises the nd_pmem rebind
> > path (pfns_per_compound == 1).
> >
> > Test procedure:
> > Rebind the nd_pmem driver 30 times and collect the memmap initialization
> > time from the pr_debug() output of memmap_init_zone_device().
> >
> > Base(v7.1-rc3):
> > First binding: 1486 ms
> > Average of subsequent rebinds: 273.52 ms
> >
> > With this patch:
> > First binding: 1421 ms
> > Average of subsequent rebinds: 246.14 ms
> >
> > This reduces the average rebind time from 273.52 ms to 246.14 ms, or
> > about 10%.
> >
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > ---
> > mm/mm_init.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 96 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 5244acb96dbb..4c475c71a9d6 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -1013,7 +1013,7 @@ static inline int zone_device_page_init_refcount(
> > }
> > }
> >
> > -static void __ref generic_init_zone_device_page(struct page *page,
> > +static void __ref generic_init_zone_device_page_slow(struct page *page,
> > unsigned long pfn, unsigned long zone_idx, int nid,
> > struct dev_pagemap *pgmap)
> > {
> > @@ -1040,12 +1040,9 @@ static void __ref generic_init_zone_device_page(struct page *page,
> > set_page_count(page, 0);
> > }
> >
> > -static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> > - unsigned long zone_idx, int nid,
> > - struct dev_pagemap *pgmap)
> > +static void __ref zone_device_page_init_pageblock(struct page *page,
> > + unsigned long pfn)
>
> Please move splitting _pageblock helper into the first patch, so that the
> first patch would contain all code movement.
Thanks, I will move the _pageblock helper split into the first patch
in v2.
> > {
> > - generic_init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> > -
> > /*
> > * Mark the block movable so that blocks are reserved for
> > * movable at startup. This will force kernel allocations
> > @@ -1062,6 +1059,88 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> > }
> > }
> >
> > +static inline void __init_zone_device_page(struct page *page, unsigned long pfn,
> > + unsigned long zone_idx, int nid,
> > + struct dev_pagemap *pgmap)
> > +{
> > + generic_init_zone_device_page_slow(page, pfn, zone_idx, nid, pgmap);
> > + zone_device_page_init_pageblock(page, pfn);
> > +}
> > +
> > +#if BITS_PER_LONG == 64
> > +static inline bool zone_device_page_init_optimization_enabled(void)
> > +{
> > + /*
> > + * We use template pages and assign page->_refcount via memory copy.
> > + * This means the optimized path bypasses set_page_count(), so the
> > + * page_ref_set tracepoint cannot observe this initialization.
> > + * Skip the optimized path when the tracepoint is enabled.
> > + */
> > + return !page_ref_tracepoint_active(page_ref_set);
> > +}
> > +
> > +static inline void struct_page_layout_check(void)
> > +{
> > + BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
>
> Does it have to be a BUILD_BUG_ON()? Can't we fall back to the slow path if
> struct page has a weird size?
> Just do the check in zone_device_page_init_optimization_enabled().
Thanks, I'll replace the BUILD_BUG_ON() with a runtime check and fall
back to the slow path accordingly.
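
A runtime fallback of this kind might look like the sketch below. It is an
assumption about the shape of the v2 check, not code from the series: the
sketch_ names are hypothetical, and the tracepoint state is passed in as a
plain flag instead of being queried from the tracing infrastructure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when an object of this size can be copied as whole 64-bit words. */
static bool sketch_size_is_u64_multiple(size_t size)
{
	return (size & (sizeof(uint64_t) - 1)) == 0;
}

/*
 * Decide whether the template fast path may be used. An unsuitable
 * struct size no longer breaks the build; it only selects the slow path.
 */
static bool sketch_fast_path_enabled(size_t page_struct_size,
				     bool tracepoint_active)
{
	if (!sketch_size_is_u64_multiple(page_struct_size))
		return false;
	return !tracepoint_active;
}
```

Since the size is a compile-time constant, a compiler can fold the first
test away, so the runtime form should cost nothing on configurations where
the fast path is usable.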
> > +}
> > +
> > +static inline void init_template_page(struct page *template,
> > + unsigned long pfn,
> > + unsigned long zone_idx,
> > + int nid,
> > + struct dev_pagemap *pgmap)
>
> The name should include zone_device to avoid confusion with regular pages.
Thanks, I will rename it to include zone_device in v2.
> > +{
> > + generic_init_zone_device_page_slow(template, pfn, zone_idx, nid, pgmap);
> > +}
> > +
> > +/*
> > + * Initialize parts that differ from the template
> > + */
> > +static inline void generic_init_zone_device_page_finish(struct page *page,
> > + unsigned long pfn)
> > +{
> > +#ifdef SECTION_IN_PAGE_FLAGS
> > + set_page_section(page, pfn_to_section_nr(pfn));
>
> Can we add a stub for set_page_section() for the !SECTION_IN_PAGE_FLAGS case
> and drop the #ifdef here and in set_page_links()?
Thanks, I will add the stub and remove the #ifdef in the next version.
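
The stub pattern under discussion can be sketched as follows, taking the
guarded set_page_section() call in the patch as the target (an assumption
about which helper is meant). Everything here is hypothetical: the
SKETCH_SECTION_IN_PAGE_FLAGS macro, the bit position and struct page_sketch
only illustrate how an empty variant removes the #ifdef from call sites.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simplified page; the real struct page is far larger. */
struct page_sketch {
	uint64_t flags;
};

/* Assumed bit position of the section number, for illustration only. */
#define SKETCH_SECTIONS_PGSHIFT 40

#ifdef SKETCH_SECTION_IN_PAGE_FLAGS
/* Real variant: store the section number in the upper flag bits. */
static inline void sketch_set_page_section(struct page_sketch *page,
					   uint64_t section)
{
	page->flags &= (UINT64_C(1) << SKETCH_SECTIONS_PGSHIFT) - 1;
	page->flags |= section << SKETCH_SECTIONS_PGSHIFT;
}
#else
/* Stub variant: nothing to store, so the call compiles away. */
static inline void sketch_set_page_section(struct page_sketch *page,
					   uint64_t section)
{
	(void)page;
	(void)section;
}
#endif

/* The caller needs no #ifdef; the configuration picks the variant. */
static void sketch_page_init_finish(struct page_sketch *page, uint64_t section)
{
	sketch_set_page_section(page, section);
}
```

With the macro undefined, as in this sketch, the call leaves the flags
untouched.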
> > +#endif
> > +#ifdef WANT_PAGE_VIRTUAL
> > + if (!is_highmem_idx(ZONE_DEVICE))
> > + set_page_address(page, __va(pfn << PAGE_SHIFT));
>
> set_page_address() is a no-op without WANT_PAGE_VIRTUAL, so you can drop the ifdef.
Upon checking the implementation, set_page_address() also has a separate
implementation for HASHED_PAGE_VIRTUAL.
Following the style of __init_single_page(), we only want to call
set_page_address() under WANT_PAGE_VIRTUAL for ZONE_DEVICE initialization,
so would it be acceptable to keep the #ifdef guard here?
> > +#endif
> > +}
> > +
> > +static void init_zone_device_page_from_template(struct page *page,
> > + unsigned long pfn, const struct page *template)
>
> zone_device_page_init_from_template() please.
Thanks, I will rename it to zone_device_page_init_from_template in v2.
Thanks,
Zhe
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 2/4] mm: add a template-based fast path for zone-device page init
2026-05-18 9:54 ` Li Zhe
@ 2026-05-18 11:42 ` Mike Rapoport
0 siblings, 0 replies; 15+ messages in thread
From: Mike Rapoport @ 2026-05-18 11:42 UTC (permalink / raw)
To: Li Zhe
Cc: akpm, arnd, bp, dave.hansen, david, linux-arch, linux-kernel,
linux-mm, mingo, tglx, x86
On Mon, May 18, 2026 at 05:54:05PM +0800, Li Zhe wrote:
> On Mon, 18 May 2026 09:51:34 +0300, rppt@kernel.org wrote:
>
> > > +#ifdef WANT_PAGE_VIRTUAL
> > > + if (!is_highmem_idx(ZONE_DEVICE))
> > > + set_page_address(page, __va(pfn << PAGE_SHIFT));
> >
> > set_page_address() is a no-op without WANT_PAGE_VIRTUAL, so you can drop the ifdef.
>
> Upon checking the implementation, set_page_address() also has a separate
> implementation for HASHED_PAGE_VIRTUAL.
>
> Following the style of __init_single_page(), we only want to call
> set_page_address() under WANT_PAGE_VIRTUAL for ZONE_DEVICE initialization,
> so would it be acceptable to keep the #ifdef guard here?
Since there's no other option, keep the ifdef.
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH 3/4] mm: extend the template fast path to zone-device compound tails
2026-05-15 8:20 [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
2026-05-15 8:20 ` [PATCH 1/4] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
2026-05-15 8:20 ` [PATCH 2/4] mm: add a template-based fast path for zone-device page init Li Zhe
@ 2026-05-15 8:20 ` Li Zhe
2026-05-15 8:20 ` [PATCH 4/4] mm: use arch store helpers in zone-device template copies Li Zhe
2026-05-18 6:23 ` [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Mike Rapoport
4 siblings, 0 replies; 15+ messages in thread
From: Li Zhe @ 2026-05-15 8:20 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, arnd, rppt, akpm, david
Cc: x86, linux-kernel, linux-arch, linux-mm, lizhe.67
The template-based fast path currently only accelerates head-page
initialization in memmap_init_zone_device(). Compound tails still go
through the slow path one by one in memmap_init_compound().

Add separate head and tail template builders and reuse a prepared tail
template for all tail pages in the same compound range. Move the
head-page refcount handling out of generic_init_zone_device_page_slow()
so the two template builders can set their different initial refcount
states explicitly: head pages follow zone_device_page_init_refcount(),
while compound tails always start with a refcount of 0.

This extends the template-copy fast path to pfns_per_compound > 1
without changing the existing slow path.
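
The head/tail split described above can be sketched in userspace C. The
layout is hypothetical: struct page_sketch, its fields, the sketch_ function
and the "head address | 1" tail encoding are illustrative assumptions, not
the kernel's struct page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page; a tail records its head, a head records 0. */
struct page_sketch {
	uint64_t compound_head;	/* 0 for a head, (head address | 1) for a tail */
	uint64_t refcount;
	uint64_t pgmap;
};

/*
 * Initialize one compound range from two templates: the head takes the
 * refcount chosen by the caller's policy, every tail starts at 0.
 */
static void sketch_init_compound(struct page_sketch *head, size_t nr_pages,
				 uint64_t pgmap, uint64_t head_refcount)
{
	const struct page_sketch head_tmpl = {
		.compound_head = 0,
		.refcount = head_refcount,
		.pgmap = pgmap,
	};
	/* Built once per range, because every tail points at the same head. */
	const struct page_sketch tail_tmpl = {
		.compound_head = (uint64_t)(uintptr_t)head | 1,
		.refcount = 0,
		.pgmap = pgmap,
	};
	size_t i;

	assert(nr_pages >= 1);
	head[0] = head_tmpl;
	for (i = 1; i < nr_pages; i++)
		head[i] = tail_tmpl;
}
```

The tail template depends on the head it belongs to, which is why it is
prepared per compound range instead of once for the whole device.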
Tested in a VM with a 100 GB devdax namespace (align=2097152) on Intel
Ice Lake server. This test exercises the dax_pmem rebind path and
measures memmap initialization latency.
Test procedure:
Unbind and rebind the dax_pmem driver, collect memmap initialization
time from the pr_debug() output of memmap_init_zone_device().
Base(v7.1-rc3):
First binding: 1515 ms
Average of subsequent rebinds: 313.45 ms
With this patch:
First binding: 1425 ms
Average of subsequent rebinds: 255.47 ms
This reduces the average rebind time from 313.45 ms to 255.47 ms, a
reduction of about 18.5%.
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
mm/mm_init.c | 80 +++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 61 insertions(+), 19 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4c475c71a9d6..5a9e6ecfa894 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1035,9 +1035,6 @@ static void __ref generic_init_zone_device_page_slow(struct page *page,
*/
page_folio(page)->pgmap = pgmap;
page->zone_device_data = NULL;
-
- if (!zone_device_page_init_refcount(pgmap))
- set_page_count(page, 0);
}
static void __ref zone_device_page_init_pageblock(struct page *page,
@@ -1064,6 +1061,8 @@ static inline void __init_zone_device_page(struct page *page, unsigned long pfn,
struct dev_pagemap *pgmap)
{
generic_init_zone_device_page_slow(page, pfn, zone_idx, nid, pgmap);
+ if (!zone_device_page_init_refcount(pgmap))
+ set_page_count(page, 0);
zone_device_page_init_pageblock(page, pfn);
}
@@ -1084,13 +1083,28 @@ static inline void struct_page_layout_check(void)
BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
}
-static inline void init_template_page(struct page *template,
- unsigned long pfn,
- unsigned long zone_idx,
- int nid,
- struct dev_pagemap *pgmap)
+static inline void init_template_head_page(struct page *template,
+ unsigned long pfn,
+ unsigned long zone_idx,
+ int nid,
+ struct dev_pagemap *pgmap)
+{
+ generic_init_zone_device_page_slow(template, pfn, zone_idx, nid, pgmap);
+ if (!zone_device_page_init_refcount(pgmap))
+ set_page_count(template, 0);
+}
+
+static inline void init_template_tail_page(struct page *template,
+ unsigned long pfn,
+ unsigned long zone_idx,
+ int nid,
+ struct dev_pagemap *pgmap,
+ const struct page *head,
+ unsigned int order)
{
generic_init_zone_device_page_slow(template, pfn, zone_idx, nid, pgmap);
+ prep_compound_tail(template, head, order);
+ set_page_count(template, 0);
}
/*
@@ -1125,11 +1139,11 @@ static inline bool zone_device_page_init_optimization_enabled(void)
{
return false;
}
-static inline void init_template_page(struct page *template,
- unsigned long pfn,
- unsigned long zone_idx,
- int nid,
- struct dev_pagemap *pgmap)
+static inline void init_template_head_page(struct page *template,
+ unsigned long pfn,
+ unsigned long zone_idx,
+ int nid,
+ struct dev_pagemap *pgmap)
{
}
static inline void struct_page_layout_check(void)
@@ -1139,6 +1153,15 @@ static void init_zone_device_page_from_template(struct page *page,
unsigned long pfn, const struct page *template)
{
}
+static inline void init_template_tail_page(struct page *template,
+ unsigned long pfn,
+ unsigned long zone_idx,
+ int nid,
+ struct dev_pagemap *pgmap,
+ const struct page *head,
+ unsigned int order)
+{
+}
#endif
/*
@@ -1162,10 +1185,12 @@ static void __ref memmap_init_compound(struct page *head,
unsigned long head_pfn,
unsigned long zone_idx, int nid,
struct dev_pagemap *pgmap,
- unsigned long nr_pages)
+ unsigned long nr_pages,
+ bool use_template)
{
unsigned long pfn, end_pfn = head_pfn + nr_pages;
unsigned int order = pgmap->vmemmap_shift;
+ struct page template;
/*
* We have to initialize the pages, including setting up page links.
@@ -1174,12 +1199,27 @@ static void __ref memmap_init_compound(struct page *head,
* the pages in the same go.
*/
__SetPageHead(head);
+
+ /*
+ * A tail template can be reused for all tail pages in the same compound page
+ * because shared state for compound tails is pre-set by prep_compound_tail().
+ * The per-page page->virtual and section in flags are fixed up after copying.
+ */
+ if (use_template)
+ init_template_tail_page(&template, head_pfn + 1, zone_idx, nid,
+ pgmap, head, order);
+
for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
struct page *page = pfn_to_page(pfn);
- __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
- prep_compound_tail(page, head, order);
- set_page_count(page, 0);
+ if (use_template) {
+ init_zone_device_page_from_template(page, pfn,
+ &template);
+ } else {
+ __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+ prep_compound_tail(page, head, order);
+ set_page_count(page, 0);
+ }
}
prep_compound_head(head, order);
}
@@ -1214,7 +1254,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
if (use_template) {
struct_page_layout_check();
- init_template_page(&template, start_pfn, zone_idx, nid, pgmap);
+ init_template_head_page(&template, start_pfn, zone_idx,
+ nid, pgmap);
}
for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
@@ -1229,7 +1270,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
continue;
memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
- compound_nr_pages(altmap, pgmap));
+ compound_nr_pages(altmap, pgmap),
+ use_template);
}
pr_debug("%s initialised %lu pages in %ums\n", __func__,
--
2.20.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH 4/4] mm: use arch store helpers in zone-device template copies
2026-05-15 8:20 [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
` (2 preceding siblings ...)
2026-05-15 8:20 ` [PATCH 3/4] mm: extend the template fast path to zone-device compound tails Li Zhe
@ 2026-05-15 8:20 ` Li Zhe
2026-05-18 0:32 ` Alistair Popple
2026-05-18 6:23 ` [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Mike Rapoport
4 siblings, 1 reply; 15+ messages in thread
From: Li Zhe @ 2026-05-15 8:20 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, arnd, rppt, akpm, david
Cc: x86, linux-kernel, linux-arch, linux-mm, lizhe.67
The template-based fast path still leaves the actual copy sequence up to
the compiler. On x86-64 that can easily degrade back into a runtime copy
loop in the hot path, which leaves performance on the table.
Introduce arch_optimize_store_u64() and arch_optimize_store_drain(),
with a generic fallback and an x86-64 MOVNTI/SFENCE implementation, and
use them in the template copy path. Also open-code the word-at-a-time
copy so the compiler emits fixed-offset stores for the hot path instead
of a runtime loop.
On x86-64, MOVNTI is a better fit for this write-once, streaming
initialization pattern than normal cached stores. It reduces the
write-allocate traffic and cache pollution that a regular store sequence
would otherwise generate while filling large ranges of struct page.
Refresh the PFN-dependent section bits and page->virtual state in the
reusable template before each copy, instead of patching the destination
page afterwards. This keeps the hot path as a fixed-offset store
sequence and avoids post-copy normal stores to cachelines that were
just written with non-temporal stores.
Because non-temporal stores are not ordered against later normal stores,
drain outstanding stores before memmap_init_compound() updates compound
heads and before memmap_init_zone_device() returns.
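The store-then-drain pattern described above can be modelled in user space roughly as follows. This is only a sketch: `sketch_store_u64()`, `sketch_store_drain()`, `struct sketch_page`, and the fixed 64-byte size are names and values assumed for illustration, and the compiler intrinsics `_mm_stream_si64()` and `_mm_sfence()` stand in for the inline asm used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#endif

typedef unsigned long long u64;	/* local stand-in for the kernel type */

#define SKETCH_WORDS 8		/* models a 64-byte struct page */

struct sketch_page {
	u64 w[SKETCH_WORDS];
};

static inline void sketch_store_u64(u64 *dst, u64 val)
{
#if defined(__x86_64__)
	_mm_stream_si64((long long *)dst, (long long)val);	/* MOVNTI */
#else
	*dst = val;		/* plain store on other architectures */
#endif
}

static inline void sketch_store_drain(void)
{
#if defined(__x86_64__)
	_mm_sfence();		/* order streaming stores before later stores */
#endif
}

static void sketch_fill_pages(struct sketch_page *pages, size_t nr,
			      const struct sketch_page *tmpl)
{
	size_t i;

	assert(pages != NULL && tmpl != NULL);
	for (i = 0; i < nr; i++) {
		u64 *dst = pages[i].w;
		const u64 *src = tmpl->w;

		/* Fixed offsets: no per-word loop counter at run time. */
		sketch_store_u64(&dst[7], src[7]);
		sketch_store_u64(&dst[6], src[6]);
		sketch_store_u64(&dst[5], src[5]);
		sketch_store_u64(&dst[4], src[4]);
		sketch_store_u64(&dst[3], src[3]);
		sketch_store_u64(&dst[2], src[2]);
		sketch_store_u64(&dst[1], src[1]);
		sketch_store_u64(&dst[0], src[0]);
	}
	/* One drain for the whole batch, before any later normal store. */
	sketch_store_drain();
}
```

Draining once per batch rather than once per page keeps the fence out of the inner copy sequence, which mirrors where the patch places its drain calls.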
Disable the x86-64 override under KASAN or KMSAN so those builds keep
their instrumented stores through the generic fallback.
Tested in a VM with a 100 GB fsdax namespace device configured with
map=dev and a 100 GB devdax namespace (align=2097152) on Intel Ice Lake
server.
Test procedure:
Rebind the nd_pmem and dax_pmem driver 30 times and collect the memmap
initialization time from the pr_debug() output of
memmap_init_zone_device().
Base(v7.1-rc3):
First binding for nd_pmem driver: 1486 ms
Average of subsequent rebinds: 273.52 ms
First binding for dax_pmem driver: 1515 ms
Average of subsequent rebinds: 313.45 ms
With this patch:
First binding for nd_pmem driver: 1272 ms
Average of subsequent rebinds: 104.59 ms
First binding for dax_pmem driver: 1286 ms
Average of subsequent rebinds: 116.93 ms
This reduces the average rebind time by about 61.8% for nd_pmem and
62.7% for dax_pmem.
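For reference, the percentages above follow from (before - after) / before. A tiny helper along these lines reproduces them; `percent_reduction()` is a name invented for this note, not something in the series.

```c
#include <assert.h>

/* Percentage by which 'after' is lower than 'before'. */
static double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Feeding it the averages quoted above, percent_reduction(273.52, 104.59) comes out at roughly 61.8 and percent_reduction(313.45, 116.93) at roughly 62.7.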
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
arch/x86/include/asm/struct_page_init.h | 28 ++++++++
include/asm-generic/Kbuild | 1 +
include/asm-generic/struct_page_init.h | 17 +++++
mm/mm_init.c | 89 +++++++++++++++++++++----
4 files changed, 122 insertions(+), 13 deletions(-)
create mode 100644 arch/x86/include/asm/struct_page_init.h
create mode 100644 include/asm-generic/struct_page_init.h
diff --git a/arch/x86/include/asm/struct_page_init.h b/arch/x86/include/asm/struct_page_init.h
new file mode 100644
index 000000000000..de8b4eab44de
--- /dev/null
+++ b/arch/x86/include/asm/struct_page_init.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_STRUCT_PAGE_INIT_H
+#define _ASM_X86_STRUCT_PAGE_INIT_H
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+/*
+ * x86-64 guarantees SSE2, so MOVNTI and SFENCE are always available there.
+ *
+ * KASAN/KMSAN rely on compiler-instrumented stores. Keep the x86 override
+ * disabled for those configs and fall back to plain stores instead.
+ */
+#if defined(CONFIG_X86_64) && !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN)
+static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
+{
+ asm volatile("movnti %1, %0" : "=m"(*dst) : "r"(val));
+}
+
+static __always_inline void arch_optimize_store_drain(void)
+{
+ asm volatile("sfence" : : : "memory");
+}
+#else
+#include <asm-generic/struct_page_init.h>
+#endif
+
+#endif /* _ASM_X86_STRUCT_PAGE_INIT_H */
diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 2c53a1e0b760..3a493fed6803 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -65,3 +65,4 @@ mandatory-y += vermagic.h
mandatory-y += vga.h
mandatory-y += video.h
mandatory-y += word-at-a-time.h
+mandatory-y += struct_page_init.h
diff --git a/include/asm-generic/struct_page_init.h b/include/asm-generic/struct_page_init.h
new file mode 100644
index 000000000000..45a722103a51
--- /dev/null
+++ b/include/asm-generic/struct_page_init.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_STRUCT_PAGE_INIT_H
+#define _ASM_GENERIC_STRUCT_PAGE_INIT_H
+
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
+{
+ *dst = val;
+}
+
+static __always_inline void arch_optimize_store_drain(void)
+{
+}
+
+#endif /* _ASM_GENERIC_STRUCT_PAGE_INIT_H */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5a9e6ecfa894..a3211666ccd4 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -37,6 +37,7 @@
#include "shuffle.h"
#include <asm/setup.h>
+#include <asm/struct_page_init.h>
#ifndef CONFIG_NUMA
unsigned long max_mapnr;
@@ -1078,9 +1079,21 @@ static inline bool zone_device_page_init_optimization_enabled(void)
return !page_ref_tracepoint_active(page_ref_set);
}
+/*
+ * The fast path copies struct page with fixed-offset u64 stores instead of
+ * a runtime loop. Keep that copy sequence in sync with the struct page
+ * layouts supported by this build.
+ *
+ * The sequence below requires struct page to be u64-aligned and currently
+ * handles layouts from 7 to 12 u64 words (56 to 96 bytes). If a future
+ * layout falls outside that range, fail the build so the store sequence is
+ * updated together with the layout change.
+ */
static inline void struct_page_layout_check(void)
{
BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
+ BUILD_BUG_ON(sizeof(struct page) < 56);
+ BUILD_BUG_ON(sizeof(struct page) > 96);
}
static inline void init_template_head_page(struct page *template,
@@ -1108,30 +1121,67 @@ static inline void init_template_tail_page(struct page *template,
}
/*
- * Initialize parts that differ from the template
+ * 'template' is a reusable page prototype rather than a strictly immutable
+ * object. Most ZONE_DEVICE fields stay constant across the pages covered by
+ * the current template, but section bits and page->virtual may still depend
+ * on the PFN. Refresh those PFN-dependent fields in the template before
+ * copying it into @page.
*/
-static inline void generic_init_zone_device_page_finish(struct page *page,
- unsigned long pfn)
+static inline void zone_device_page_update_template(struct page *template,
+ unsigned long pfn)
{
#ifdef SECTION_IN_PAGE_FLAGS
- set_page_section(page, pfn_to_section_nr(pfn));
+ set_page_section(template, pfn_to_section_nr(pfn));
#endif
#ifdef WANT_PAGE_VIRTUAL
if (!is_highmem_idx(ZONE_DEVICE))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
+ set_page_address(template, __va(pfn << PAGE_SHIFT));
#endif
}
static void init_zone_device_page_from_template(struct page *page,
- unsigned long pfn, const struct page *template)
+ unsigned long pfn, struct page *template)
{
const u64 *src = (const u64 *)template;
u64 *dst = (u64 *)page;
- unsigned int i;
- for (i = 0; i < sizeof(struct page) / sizeof(u64); i++)
- dst[i] = src[i];
- generic_init_zone_device_page_finish(page, pfn);
+ /*
+ * 'template' carries the invariant portion of a ZONE_DEVICE struct
+ * page. Update the PFN-dependent fields in place before copying it
+ * to the destination page.
+ */
+ zone_device_page_update_template(template, pfn);
+
+ /*
+ * Keep the copy open-coded so the compiler emits fixed-offset stores
+ * for the hot path instead of a runtime copy loop.
+ */
+ switch (sizeof(struct page)) {
+ case 96:
+ arch_optimize_store_u64(&dst[11], src[11]);
+ fallthrough;
+ case 88:
+ arch_optimize_store_u64(&dst[10], src[10]);
+ fallthrough;
+ case 80:
+ arch_optimize_store_u64(&dst[9], src[9]);
+ fallthrough;
+ case 72:
+ arch_optimize_store_u64(&dst[8], src[8]);
+ fallthrough;
+ case 64:
+ arch_optimize_store_u64(&dst[7], src[7]);
+ fallthrough;
+ case 56:
+ arch_optimize_store_u64(&dst[6], src[6]);
+ arch_optimize_store_u64(&dst[5], src[5]);
+ arch_optimize_store_u64(&dst[4], src[4]);
+ arch_optimize_store_u64(&dst[3], src[3]);
+ arch_optimize_store_u64(&dst[2], src[2]);
+ arch_optimize_store_u64(&dst[1], src[1]);
+ arch_optimize_store_u64(&dst[0], src[0]);
+ }
+
zone_device_page_init_pageblock(page, pfn);
}
#else
@@ -1201,9 +1251,10 @@ static void __ref memmap_init_compound(struct page *head,
__SetPageHead(head);
/*
- * A tail template can be reused for all tail pages in the same compound page
- * because shared state for compound tails is pre-set by prep_compound_tail().
- * The per-page page->virtual and section in flags are fixed up after copying.
+ * All tails of the same compound page share the state established by
+ * prep_compound_tail(). Reuse one tail template for the whole range
+ * and refresh only the PFN-dependent fields in that template before
+ * each copy.
*/
if (use_template)
init_template_tail_page(&template, head_pfn + 1, zone_idx, nid,
@@ -1269,10 +1320,22 @@ void __ref memmap_init_zone_device(struct zone *zone,
if (pfns_per_compound == 1)
continue;
+ /*
+ * Compound-head setup immediately updates head->flags, so make
+ * the template copy visible before entering memmap_init_compound().
+ */
+ if (use_template)
+ arch_optimize_store_drain();
+
memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
compound_nr_pages(altmap, pgmap),
use_template);
}
+ /*
+ * Drain any remaining non-temporal stores before returning.
+ */
+ if (use_template)
+ arch_optimize_store_drain();
pr_debug("%s initialised %lu pages in %ums\n", __func__,
nr_pages, jiffies_to_msecs(jiffies - start));
--
2.20.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] mm: use arch store helpers in zone-device template copies
2026-05-15 8:20 ` [PATCH 4/4] mm: use arch store helpers in zone-device template copies Li Zhe
@ 2026-05-18 0:32 ` Alistair Popple
2026-05-18 6:42 ` Li Zhe
2026-05-19 3:09 ` Balbir Singh
0 siblings, 2 replies; 15+ messages in thread
From: Alistair Popple @ 2026-05-18 0:32 UTC (permalink / raw)
To: Li Zhe
Cc: tglx, mingo, bp, dave.hansen, arnd, rppt, akpm, david, x86,
linux-kernel, linux-arch, linux-mm
On 2026-05-15 at 18:20 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> The template-based fast path still leaves the actual copy sequence up to
> the compiler. On x86-64 that can easily degrade back into a runtime copy
> loop in the hot path, which leaves performance on the table.
>
> Introduce arch_optimize_store_u64() and arch_optimize_store_drain(),
> with a generic fallback and an x86-64 MOVNTI/SFENCE implementation, and
> use them in the template copy path. Also open-code the word-at-a-time
> copy so the compiler emits fixed-offset stores for the hot path instead
> of a runtime loop.
>
> On x86-64, MOVNTI is a better fit for this write-once, streaming
> initialization pattern than normal cached stores. It reduces the
> write-allocate traffic and cache pollution that a regular store sequence
> would otherwise generate while filling large ranges of struct page.
The perf improvement looks good, so thanks for looking at this. However, open
coding this and introducing arch-specific code layout into a generic layer is
not the right approach. The correct solution would be to implement a memcpy
variant that is optimised for write-once streaming operations and that can
transparently degrade to memcpy on unoptimised architectures.
A grep of the kernel sources for movnti shows there is a memcpy_flushcache()
variant. Maybe that could work here?
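To make the suggestion concrete, a user-space model of such a variant might look like the sketch below. `sketch_memcpy_streaming()` is a hypothetical name, not an existing kernel interface and not memcpy_flushcache() itself; the alignment and length conditions are simplifying assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */
#endif

/*
 * Hypothetical helper: copy with non-temporal stores where that is cheap
 * to do, otherwise degrade transparently to memcpy().
 */
static void sketch_memcpy_streaming(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
#if defined(__x86_64__)
	if (((uintptr_t)dst % 8) == 0 && (len % 8) == 0) {
		const unsigned char *s = src;
		unsigned char *d = dst;
		size_t off;

		for (off = 0; off < len; off += 8) {
			long long word;

			/* Load through memcpy() so src needs no alignment. */
			memcpy(&word, s + off, sizeof(word));
			_mm_stream_si64((long long *)(d + off), word);
		}
		_mm_sfence();	/* order streaming stores before later stores */
		return;
	}
#endif
	memcpy(dst, src, len);
}
```

With a routine of this shape the caller needs neither the struct page size switch nor an explicit drain, at the cost of a fence per call rather than one per batch.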
> Refresh the PFN-dependent section bits and page->virtual state in the
> reusable template before each copy, instead of patching the destination
> page afterwards. This keeps the hot path as a fixed-offset store
> sequence and avoids post-copy normal stores to cachelines that were
> just written with non-temporal stores.
>
> Because non-temporal stores are not ordered against later normal stores,
> drain outstanding stores before memmap_init_compound() updates compound
> heads and before memmap_init_zone_device() returns.
>
> Disable the x86-64 override under KASAN or KMSAN so those builds keep
> their instrumented stores through the generic fallback.
>
> Tested in a VM with a 100 GB fsdax namespace device configured with
> map=dev and a 100 GB devdax namespace (align=2097152) on Intel Ice Lake
> server.
>
> Test procedure:
> Rebind the nd_pmem and dax_pmem driver 30 times and collect the memmap
> initialization time from the pr_debug() output of
> memmap_init_zone_device().
>
> Base(v7.1-rc3):
> First binding for nd_pmem driver: 1486 ms
> Average of subsequent rebinds: 273.52 ms
>
> First binding for dax_pmem driver: 1515 ms
> Average of subsequent rebinds: 313.45 ms
>
> With this patch:
> First binding for nd_pmem driver: 1272 ms
> Average of subsequent rebinds: 104.59 ms
>
> First binding for dax_pmem driver: 1286 ms
> Average of subsequent rebinds: 116.93 ms
>
> This reduces the average rebind time by about 61.8% for nd_pmem and
> 62.7% for dax_pmem.
Nice - is this the improvement from applying the whole patch series or just this
change?
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
> arch/x86/include/asm/struct_page_init.h | 28 ++++++++
> include/asm-generic/Kbuild | 1 +
> include/asm-generic/struct_page_init.h | 17 +++++
> mm/mm_init.c | 89 +++++++++++++++++++++----
> 4 files changed, 122 insertions(+), 13 deletions(-)
> create mode 100644 arch/x86/include/asm/struct_page_init.h
> create mode 100644 include/asm-generic/struct_page_init.h
>
> diff --git a/arch/x86/include/asm/struct_page_init.h b/arch/x86/include/asm/struct_page_init.h
> new file mode 100644
> index 000000000000..de8b4eab44de
> --- /dev/null
> +++ b/arch/x86/include/asm/struct_page_init.h
> @@ -0,0 +1,28 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_STRUCT_PAGE_INIT_H
> +#define _ASM_X86_STRUCT_PAGE_INIT_H
> +
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +/*
> + * x86-64 guarantees SSE2, so MOVNTI and SFENCE are always available there.
> + *
> + * KASAN/KMSAN rely on compiler-instrumented stores. Keep the x86 override
> + * disabled for those configs and fall back to plain stores instead.
> + */
> +#if defined(CONFIG_X86_64) && !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN)
> +static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
> +{
> + asm volatile("movnti %1, %0" : "=m"(*dst) : "r"(val));
> +}
> +
> +static __always_inline void arch_optimize_store_drain(void)
> +{
> + asm volatile("sfence" : : : "memory");
> +}
> +#else
> +#include <asm-generic/struct_page_init.h>
> +#endif
> +
> +#endif /* _ASM_X86_STRUCT_PAGE_INIT_H */
> diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
> index 2c53a1e0b760..3a493fed6803 100644
> --- a/include/asm-generic/Kbuild
> +++ b/include/asm-generic/Kbuild
> @@ -65,3 +65,4 @@ mandatory-y += vermagic.h
> mandatory-y += vga.h
> mandatory-y += video.h
> mandatory-y += word-at-a-time.h
> +mandatory-y += struct_page_init.h
> diff --git a/include/asm-generic/struct_page_init.h b/include/asm-generic/struct_page_init.h
> new file mode 100644
> index 000000000000..45a722103a51
> --- /dev/null
> +++ b/include/asm-generic/struct_page_init.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_GENERIC_STRUCT_PAGE_INIT_H
> +#define _ASM_GENERIC_STRUCT_PAGE_INIT_H
> +
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
> +{
> + *dst = val;
> +}
> +
> +static __always_inline void arch_optimize_store_drain(void)
> +{
> +}
> +
> +#endif /* _ASM_GENERIC_STRUCT_PAGE_INIT_H */
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 5a9e6ecfa894..a3211666ccd4 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -37,6 +37,7 @@
> #include "shuffle.h"
>
> #include <asm/setup.h>
> +#include <asm/struct_page_init.h>
>
> #ifndef CONFIG_NUMA
> unsigned long max_mapnr;
> @@ -1078,9 +1079,21 @@ static inline bool zone_device_page_init_optimization_enabled(void)
> return !page_ref_tracepoint_active(page_ref_set);
> }
>
> +/*
> + * The fast path copies struct page with fixed-offset u64 stores instead of
> + * a runtime loop. Keep that copy sequence in sync with the struct page
> + * layouts supported by this build.
> + *
> + * The sequence below requires struct page to be u64-aligned and currently
> + * handles layouts from 7 to 12 u64 words (56 to 96 bytes). If a future
> + * layout falls outside that range, fail the build so the store sequence is
> + * updated together with the layout change.
> + */
> static inline void struct_page_layout_check(void)
> {
> BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
> + BUILD_BUG_ON(sizeof(struct page) < 56);
> + BUILD_BUG_ON(sizeof(struct page) > 96);
This would be unnecessary without the open-coded memcpy and is another reason to
prefer a more generic approach.
> }
>
> static inline void init_template_head_page(struct page *template,
> @@ -1108,30 +1121,67 @@ static inline void init_template_tail_page(struct page *template,
> }
>
> /*
> - * Initialize parts that differ from the template
> + * 'template' is a reusable page prototype rather than a strictly immutable
> + * object. Most ZONE_DEVICE fields stay constant across the pages covered by
> + * the current template, but section bits and page->virtual may still depend
> + * on the PFN. Refresh those PFN-dependent fields in the template before
> + * copying it into @page.
> */
> -static inline void generic_init_zone_device_page_finish(struct page *page,
> - unsigned long pfn)
> +static inline void zone_device_page_update_template(struct page *template,
> + unsigned long pfn)
> {
> #ifdef SECTION_IN_PAGE_FLAGS
> - set_page_section(page, pfn_to_section_nr(pfn));
> + set_page_section(template, pfn_to_section_nr(pfn));
> #endif
> #ifdef WANT_PAGE_VIRTUAL
> if (!is_highmem_idx(ZONE_DEVICE))
> - set_page_address(page, __va(pfn << PAGE_SHIFT));
> + set_page_address(template, __va(pfn << PAGE_SHIFT));
> #endif
> }
>
> static void init_zone_device_page_from_template(struct page *page,
> - unsigned long pfn, const struct page *template)
> + unsigned long pfn, struct page *template)
> {
> const u64 *src = (const u64 *)template;
> u64 *dst = (u64 *)page;
> - unsigned int i;
>
> - for (i = 0; i < sizeof(struct page) / sizeof(u64); i++)
> - dst[i] = src[i];
> - generic_init_zone_device_page_finish(page, pfn);
> + /*
> + * 'template' carries the invariant portion of a ZONE_DEVICE struct
> + * page. Update the PFN-dependent fields in place before copying it
> + * to the destination page.
> + */
> + zone_device_page_update_template(template, pfn);
> +
> + /*
> + * Keep the copy open-coded so the compiler emits fixed-offset stores
> + * for the hot path instead of a runtime copy loop.
> + */
> + switch (sizeof(struct page)) {
> + case 96:
> + arch_optimize_store_u64(&dst[11], src[11]);
> + fallthrough;
> + case 88:
> + arch_optimize_store_u64(&dst[10], src[10]);
> + fallthrough;
> + case 80:
> + arch_optimize_store_u64(&dst[9], src[9]);
> + fallthrough;
> + case 72:
> + arch_optimize_store_u64(&dst[8], src[8]);
> + fallthrough;
> + case 64:
> + arch_optimize_store_u64(&dst[7], src[7]);
> + fallthrough;
> + case 56:
> + arch_optimize_store_u64(&dst[6], src[6]);
> + arch_optimize_store_u64(&dst[5], src[5]);
> + arch_optimize_store_u64(&dst[4], src[4]);
> + arch_optimize_store_u64(&dst[3], src[3]);
> + arch_optimize_store_u64(&dst[2], src[2]);
> + arch_optimize_store_u64(&dst[1], src[1]);
> + arch_optimize_store_u64(&dst[0], src[0]);
> + }
> +
I don't think unrolling the copy here is the right approach. This belongs in
some kind of generic streaming memcpy routine.
- Alistair
> zone_device_page_init_pageblock(page, pfn);
> }
> #else
> @@ -1201,9 +1251,10 @@ static void __ref memmap_init_compound(struct page *head,
> __SetPageHead(head);
>
> /*
> - * A tail template can be reused for all tail pages in the same compound page
> - * because shared state for compound tails is pre-set by prep_compound_tail().
> - * The per-page page->virtual and section in flags are fixed up after copying.
> + * All tails of the same compound page share the state established by
> + * prep_compound_tail(). Reuse one tail template for the whole range
> + * and refresh only the PFN-dependent fields in that template before
> + * each copy.
> */
> if (use_template)
> init_template_tail_page(&template, head_pfn + 1, zone_idx, nid,
> @@ -1269,10 +1320,22 @@ void __ref memmap_init_zone_device(struct zone *zone,
> if (pfns_per_compound == 1)
> continue;
>
> + /*
> + * Compound-head setup immediately updates head->flags, so make
> + * the template copy visible before entering memmap_init_compound().
> + */
> + if (use_template)
> + arch_optimize_store_drain();
> +
> memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
> compound_nr_pages(altmap, pgmap),
> use_template);
> }
> + /*
> + * Drain any remaining non-temporal stores before returning.
> + */
> + if (use_template)
> + arch_optimize_store_drain();
>
> pr_debug("%s initialised %lu pages in %ums\n", __func__,
> nr_pages, jiffies_to_msecs(jiffies - start));
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] mm: use arch store helpers in zone-device template copies
2026-05-18 0:32 ` Alistair Popple
@ 2026-05-18 6:42 ` Li Zhe
2026-05-19 3:09 ` Balbir Singh
1 sibling, 0 replies; 15+ messages in thread
From: Li Zhe @ 2026-05-18 6:42 UTC (permalink / raw)
To: apopple
Cc: akpm, arnd, bp, dave.hansen, david, linux-arch, linux-kernel,
linux-mm, lizhe.67, mingo, rppt, tglx, x86
On Mon, 18 May 2026 10:32:03 +1000, apopple@nvidia.com wrote:
> On 2026-05-15 at 18:20 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> > The template-based fast path still leaves the actual copy sequence up to
> > the compiler. On x86-64 that can easily degrade back into a runtime copy
> > loop in the hot path, which leaves performance on the table.
> >
> > Introduce arch_optimize_store_u64() and arch_optimize_store_drain(),
> > with a generic fallback and an x86-64 MOVNTI/SFENCE implementation, and
> > use them in the template copy path. Also open-code the word-at-a-time
> > copy so the compiler emits fixed-offset stores for the hot path instead
> > of a runtime loop.
> >
> > On x86-64, MOVNTI is a better fit for this write-once, streaming
> > initialization pattern than normal cached stores. It reduces the
> > write-allocate traffic and cache pollution that a regular store sequence
> > would otherwise generate while filling large ranges of struct page.
>
> The perf improvement looks good, so thanks for looking at this. However, open
> coding this and introducing arch-specific code layout into a generic layer is
> not the right approach. The correct solution would be to implement a memcpy
> variant that is optimised for write-once streaming operations and that can
> transparently degrade to memcpy on unoptimised architectures.
>
> A grep of the kernel sources for movnti shows there is a memcpy_flushcache()
> variant. Maybe that could work here?
Thank you for pointing this out. Using memcpy_flushcache() is indeed a
more generic approach. I will make this change in v2.
I found that memcpy_flushcache() is implemented on multiple architectures,
although not all of them would see a performance benefit from it during
ZONE_DEVICE memmap initialization. For example, the arm64 implementation of
memcpy_flushcache() simply uses memcpy() in conjunction with
dcache_clean_pop(). Therefore, I believe a reasonable choice would be to
introduce a new memcpy variant that invokes memcpy_flushcache() on x86.
> > Refresh the PFN-dependent section bits and page->virtual state in the
> > reusable template before each copy, instead of patching the destination
> > page afterwards. This keeps the hot path as a fixed-offset store
> > sequence and avoids post-copy normal stores to cachelines that were
> > just written with non-temporal stores.
> >
> > Because non-temporal stores are not ordered against later normal stores,
> > drain outstanding stores before memmap_init_compound() updates compound
> > heads and before memmap_init_zone_device() returns.
> >
> > Disable the x86-64 override under KASAN or KMSAN so those builds keep
> > their instrumented stores through the generic fallback.
> >
> > Tested in a VM with a 100 GB fsdax namespace device configured with
> > map=dev and a 100 GB devdax namespace (align=2097152) on Intel Ice Lake
> > server.
> >
> > Test procedure:
> > Rebind the nd_pmem and dax_pmem driver 30 times and collect the memmap
> > initialization time from the pr_debug() output of
> > memmap_init_zone_device().
> >
> > Base(v7.1-rc3):
> > First binding for nd_pmem driver: 1486 ms
> > Average of subsequent rebinds: 273.52 ms
> >
> > First binding for dax_pmem driver: 1515 ms
> > Average of subsequent rebinds: 313.45 ms
> >
> > With this patch:
> > First binding for nd_pmem driver: 1272 ms
> > Average of subsequent rebinds: 104.59 ms
> >
> > First binding for dax_pmem driver: 1286 ms
> > Average of subsequent rebinds: 116.93 ms
> >
>
> > This reduces the average rebind time by about 61.8% for nd_pmem and
> > 62.7% for dax_pmem.
>
> Nice - is this the improvement from applying the whole patch series or just this
> change?
These performance improvements are attributable to the entire patch series.
Maybe it would be clearer to use "With this series" instead of the above
"With this patch".
>
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > ---
> > arch/x86/include/asm/struct_page_init.h | 28 ++++++++
> > include/asm-generic/Kbuild | 1 +
> > include/asm-generic/struct_page_init.h | 17 +++++
> > mm/mm_init.c | 89 +++++++++++++++++++++----
> > 4 files changed, 122 insertions(+), 13 deletions(-)
> > create mode 100644 arch/x86/include/asm/struct_page_init.h
> > create mode 100644 include/asm-generic/struct_page_init.h
> >
> > diff --git a/arch/x86/include/asm/struct_page_init.h b/arch/x86/include/asm/struct_page_init.h
> > new file mode 100644
> > index 000000000000..de8b4eab44de
> > --- /dev/null
> > +++ b/arch/x86/include/asm/struct_page_init.h
> > @@ -0,0 +1,28 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_STRUCT_PAGE_INIT_H
> > +#define _ASM_X86_STRUCT_PAGE_INIT_H
> > +
> > +#include <linux/compiler.h>
> > +#include <linux/types.h>
> > +
> > +/*
> > + * x86-64 guarantees SSE2, so MOVNTI and SFENCE are always available there.
> > + *
> > + * KASAN/KMSAN rely on compiler-instrumented stores. Keep the x86 override
> > + * disabled for those configs and fall back to plain stores instead.
> > + */
> > +#if defined(CONFIG_X86_64) && !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN)
> > +static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
> > +{
> > + asm volatile("movnti %1, %0" : "=m"(*dst) : "r"(val));
> > +}
> > +
> > +static __always_inline void arch_optimize_store_drain(void)
> > +{
> > + asm volatile("sfence" : : : "memory");
> > +}
> > +#else
> > +#include <asm-generic/struct_page_init.h>
> > +#endif
> > +
> > +#endif /* _ASM_X86_STRUCT_PAGE_INIT_H */
> > diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
> > index 2c53a1e0b760..3a493fed6803 100644
> > --- a/include/asm-generic/Kbuild
> > +++ b/include/asm-generic/Kbuild
> > @@ -65,3 +65,4 @@ mandatory-y += vermagic.h
> > mandatory-y += vga.h
> > mandatory-y += video.h
> > mandatory-y += word-at-a-time.h
> > +mandatory-y += struct_page_init.h
> > diff --git a/include/asm-generic/struct_page_init.h b/include/asm-generic/struct_page_init.h
> > new file mode 100644
> > index 000000000000..45a722103a51
> > --- /dev/null
> > +++ b/include/asm-generic/struct_page_init.h
> > @@ -0,0 +1,17 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_GENERIC_STRUCT_PAGE_INIT_H
> > +#define _ASM_GENERIC_STRUCT_PAGE_INIT_H
> > +
> > +#include <linux/compiler.h>
> > +#include <linux/types.h>
> > +
> > +static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
> > +{
> > + *dst = val;
> > +}
> > +
> > +static __always_inline void arch_optimize_store_drain(void)
> > +{
> > +}
> > +
> > +#endif /* _ASM_GENERIC_STRUCT_PAGE_INIT_H */
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 5a9e6ecfa894..a3211666ccd4 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -37,6 +37,7 @@
> > #include "shuffle.h"
> >
> > #include <asm/setup.h>
> > +#include <asm/struct_page_init.h>
> >
> > #ifndef CONFIG_NUMA
> > unsigned long max_mapnr;
> > @@ -1078,9 +1079,21 @@ static inline bool zone_device_page_init_optimization_enabled(void)
> > return !page_ref_tracepoint_active(page_ref_set);
> > }
> >
> > +/*
> > + * The fast path copies struct page with fixed-offset u64 stores instead of
> > + * a runtime loop. Keep that copy sequence in sync with the struct page
> > + * layouts supported by this build.
> > + *
> > + * The sequence below requires struct page to be u64-aligned and currently
> > + * handles layouts from 7 to 12 u64 words (56 to 96 bytes). If a future
> > + * layout falls outside that range, fail the build so the store sequence is
> > + * updated together with the layout change.
> > + */
> > static inline void struct_page_layout_check(void)
> > {
> > BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
> > + BUILD_BUG_ON(sizeof(struct page) < 56);
> > + BUILD_BUG_ON(sizeof(struct page) > 96);
>
> This would be unnecessary without the open-coded memcpy and is another reason to
> prefer a more generic approach.
Yes, I will fix this issue in v2.
> > }
> >
> > static inline void init_template_head_page(struct page *template,
> > @@ -1108,30 +1121,67 @@ static inline void init_template_tail_page(struct page *template,
> > }
> >
> > /*
> > - * Initialize parts that differ from the template
> > + * 'template' is a reusable page prototype rather than a strictly immutable
> > + * object. Most ZONE_DEVICE fields stay constant across the pages covered by
> > + * the current template, but section bits and page->virtual may still depend
> > + * on the PFN. Refresh those PFN-dependent fields in the template before
> > + * copying it into @page.
> > */
> > -static inline void generic_init_zone_device_page_finish(struct page *page,
> > - unsigned long pfn)
> > +static inline void zone_device_page_update_template(struct page *template,
> > + unsigned long pfn)
> > {
> > #ifdef SECTION_IN_PAGE_FLAGS
> > - set_page_section(page, pfn_to_section_nr(pfn));
> > + set_page_section(template, pfn_to_section_nr(pfn));
> > #endif
> > #ifdef WANT_PAGE_VIRTUAL
> > if (!is_highmem_idx(ZONE_DEVICE))
> > - set_page_address(page, __va(pfn << PAGE_SHIFT));
> > + set_page_address(template, __va(pfn << PAGE_SHIFT));
> > #endif
> > }
> >
> > static void init_zone_device_page_from_template(struct page *page,
> > - unsigned long pfn, const struct page *template)
> > + unsigned long pfn, struct page *template)
> > {
> > const u64 *src = (const u64 *)template;
> > u64 *dst = (u64 *)page;
> > - unsigned int i;
> >
> > - for (i = 0; i < sizeof(struct page) / sizeof(u64); i++)
> > - dst[i] = src[i];
> > - generic_init_zone_device_page_finish(page, pfn);
> > + /*
> > + * 'template' carries the invariant portion of a ZONE_DEVICE struct
> > + * page. Update the PFN-dependent fields in place before copying it
> > + * to the destination page.
> > + */
> > + zone_device_page_update_template(template, pfn);
> > +
> > + /*
> > + * Keep the copy open-coded so the compiler emits fixed-offset stores
> > + * for the hot path instead of a runtime copy loop.
> > + */
> > + switch (sizeof(struct page)) {
> > + case 96:
> > + arch_optimize_store_u64(&dst[11], src[11]);
> > + fallthrough;
> > + case 88:
> > + arch_optimize_store_u64(&dst[10], src[10]);
> > + fallthrough;
> > + case 80:
> > + arch_optimize_store_u64(&dst[9], src[9]);
> > + fallthrough;
> > + case 72:
> > + arch_optimize_store_u64(&dst[8], src[8]);
> > + fallthrough;
> > + case 64:
> > + arch_optimize_store_u64(&dst[7], src[7]);
> > + fallthrough;
> > + case 56:
> > + arch_optimize_store_u64(&dst[6], src[6]);
> > + arch_optimize_store_u64(&dst[5], src[5]);
> > + arch_optimize_store_u64(&dst[4], src[4]);
> > + arch_optimize_store_u64(&dst[3], src[3]);
> > + arch_optimize_store_u64(&dst[2], src[2]);
> > + arch_optimize_store_u64(&dst[1], src[1]);
> > + arch_optimize_store_u64(&dst[0], src[0]);
> > + }
> > +
>
> I don't think unrolling the copy here is the right approach. This belongs in
> some kind of generic streaming memcpy routine.
Yes. I've taken a look at the memcpy_flushcache() implementation on x86,
and it only unrolls for sizes of 4, 8, and 16 bytes; all other sizes fall
back to the generic loop. I think we need to extend the x86 implementation
of memcpy_flushcache() so that its fast path covers at least
sizeof(struct page).
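
As a rough illustration of that direction, here is a minimal userspace sketch of a word-at-a-time streaming copy. It is not the kernel's memcpy_flushcache(); the names stream_store_u64(), stream_drain() and stream_copy_words() are hypothetical, and the sketch assumes 8-byte-aligned buffers whose length is a multiple of 8 bytes. On x86-64 it uses the same MOVNTI/SFENCE pair as the patch, and elsewhere it degrades to plain stores:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one 64-bit store, non-temporal on x86-64. */
static inline void stream_store_u64(uint64_t *dst, uint64_t val)
{
#if defined(__x86_64__)
	__asm__ volatile("movnti %1, %0" : "=m"(*dst) : "r"(val));
#else
	*dst = val;	/* plain store on other architectures */
#endif
}

/* Hypothetical helper: order earlier non-temporal stores before later stores. */
static inline void stream_drain(void)
{
#if defined(__x86_64__)
	__asm__ volatile("sfence" : : : "memory");
#endif
}

/*
 * Copy 'len' bytes as 64-bit words. When 'len' is a compile-time constant
 * and the call is inlined, the compiler is free to unroll this loop into
 * fixed-offset stores, so no per-size switch statement is needed.
 */
static inline void stream_copy_words(void *dst, const void *src, size_t len)
{
	uint64_t *d = dst;
	const uint64_t *s = src;
	size_t i;

	/* This sketch only handles whole 64-bit words. */
	assert((len & (sizeof(uint64_t) - 1)) == 0);

	for (i = 0; i < len / sizeof(uint64_t); i++)
		stream_store_u64(&d[i], s[i]);
}
```

A caller would pass sizeof() of the structure as 'len' and call stream_drain() once after the last copy, before any normal store that must be ordered after the streamed data.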
Thanks,
Zhe
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 4/4] mm: use arch store helpers in zone-device template copies
2026-05-18 0:32 ` Alistair Popple
2026-05-18 6:42 ` Li Zhe
@ 2026-05-19 3:09 ` Balbir Singh
1 sibling, 0 replies; 15+ messages in thread
From: Balbir Singh @ 2026-05-19 3:09 UTC (permalink / raw)
To: Alistair Popple
Cc: Li Zhe, tglx, mingo, bp, dave.hansen, arnd, rppt, akpm, david,
x86, linux-kernel, linux-arch, linux-mm
On Mon, May 18, 2026 at 10:32:03AM +1000, Alistair Popple wrote:
> On 2026-05-15 at 18:20 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> > The template-based fast path still leaves the actual copy sequence up to
> > the compiler. On x86-64 that can easily degrade back into a runtime copy
> > loop in the hot path, which leaves performance on the table.
> >
> > Introduce arch_optimize_store_u64() and arch_optimize_store_drain(),
> > with a generic fallback and an x86-64 MOVNTI/SFENCE implementation, and
> > use them in the template copy path. Also open-code the word-at-a-time
> > copy so the compiler emits fixed-offset stores for the hot path instead
> > of a runtime loop.
> >
> > On x86-64, MOVNTI is a better fit for this write-once, streaming
> > initialization pattern than normal cached stores. It reduces the
> > write-allocate traffic and cache pollution that a regular store sequence
> > would otherwise generate while filling large ranges of struct page.
>
> The perf improvement looks good so thanks for looking at this, however open
> coding this and introducing arch-specific code layout into a generic layer is
> not the right approach. The correct solution would be to implement a memcpy
> implementation/variant that is optimised for write-once streaming operations
> that can transparently degrade to memcpy on unoptimised architectures.
>
> A grep of the kernel sources for movnti shows there is a memcpy_flushcache()
> variant. Maybe that could work here?
>
> > Refresh the PFN-dependent section bits and page->virtual state in the
> > reusable template before each copy, instead of patching the destination
> > page afterwards. This keeps the hot path as a fixed-offset store
> > sequence and avoids post-copy normal stores to cachelines that were
> > just written with non-temporal stores.
> >
> > Because non-temporal stores are not ordered against later normal stores,
> > drain outstanding stores before memmap_init_compound() updates compound
> > heads and before memmap_init_zone_device() returns.
> >
> > Disable the x86-64 override under KASAN or KMSAN so those builds keep
> > their instrumented stores through the generic fallback.
> >
> > Tested in a VM with a 100 GB fsdax namespace device configured with
> > map=dev and a 100 GB devdax namespace (align=2097152) on Intel Ice Lake
> > server.
> >
> > Test procedure:
> > Rebind the nd_pmem and dax_pmem driver 30 times and collect the memmap
> > initialization time from the pr_debug() output of
> > memmap_init_zone_device().
> >
> > Base(v7.1-rc3):
> > First binding for nd_pmem driver: 1486 ms
> > Average of subsequent rebinds: 273.52 ms
> >
> > First binding for dax_pmem driver: 1515 ms
> > Average of subsequent rebinds: 313.45 ms
> >
> > With this patch:
> > First binding for nd_pmem driver: 1272 ms
> > Average of subsequent rebinds: 104.59 ms
> >
> > First binding for dax_pmem driver: 1286 ms
> > Average of subsequent rebinds: 116.93 ms
> >
>
> > This reduces the average rebind time by about 61.8% for nd_pmem and
> > 62.7% for dax_pmem.
>
> Nice - is this the improvement from applying the whole patch series or just this
> change?
>
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > ---
> > arch/x86/include/asm/struct_page_init.h | 28 ++++++++
> > include/asm-generic/Kbuild | 1 +
> > include/asm-generic/struct_page_init.h | 17 +++++
> > mm/mm_init.c | 89 +++++++++++++++++++++----
> > 4 files changed, 122 insertions(+), 13 deletions(-)
> > create mode 100644 arch/x86/include/asm/struct_page_init.h
> > create mode 100644 include/asm-generic/struct_page_init.h
> >
> > diff --git a/arch/x86/include/asm/struct_page_init.h b/arch/x86/include/asm/struct_page_init.h
> > new file mode 100644
> > index 000000000000..de8b4eab44de
> > --- /dev/null
> > +++ b/arch/x86/include/asm/struct_page_init.h
> > @@ -0,0 +1,28 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_STRUCT_PAGE_INIT_H
> > +#define _ASM_X86_STRUCT_PAGE_INIT_H
> > +
> > +#include <linux/compiler.h>
> > +#include <linux/types.h>
> > +
> > +/*
> > + * x86-64 guarantees SSE2, so MOVNTI and SFENCE are always available there.
> > + *
> > + * KASAN/KMSAN rely on compiler-instrumented stores. Keep the x86 override
> > + * disabled for those configs and fall back to plain stores instead.
> > + */
> > +#if defined(CONFIG_X86_64) && !defined(CONFIG_KASAN) && !defined(CONFIG_KMSAN)
> > +static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
> > +{
> > + asm volatile("movnti %1, %0" : "=m"(*dst) : "r"(val));
> > +}
> > +
> > +static __always_inline void arch_optimize_store_drain(void)
> > +{
> > + asm volatile("sfence" : : : "memory");
> > +}
> > +#else
> > +#include <asm-generic/struct_page_init.h>
> > +#endif
> > +
> > +#endif /* _ASM_X86_STRUCT_PAGE_INIT_H */
> > diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
> > index 2c53a1e0b760..3a493fed6803 100644
> > --- a/include/asm-generic/Kbuild
> > +++ b/include/asm-generic/Kbuild
> > @@ -65,3 +65,4 @@ mandatory-y += vermagic.h
> > mandatory-y += vga.h
> > mandatory-y += video.h
> > mandatory-y += word-at-a-time.h
> > +mandatory-y += struct_page_init.h
> > diff --git a/include/asm-generic/struct_page_init.h b/include/asm-generic/struct_page_init.h
> > new file mode 100644
> > index 000000000000..45a722103a51
> > --- /dev/null
> > +++ b/include/asm-generic/struct_page_init.h
> > @@ -0,0 +1,17 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_GENERIC_STRUCT_PAGE_INIT_H
> > +#define _ASM_GENERIC_STRUCT_PAGE_INIT_H
> > +
> > +#include <linux/compiler.h>
> > +#include <linux/types.h>
> > +
> > +static __always_inline void arch_optimize_store_u64(u64 *dst, u64 val)
> > +{
> > + *dst = val;
> > +}
> > +
> > +static __always_inline void arch_optimize_store_drain(void)
> > +{
> > +}
> > +
> > +#endif /* _ASM_GENERIC_STRUCT_PAGE_INIT_H */
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 5a9e6ecfa894..a3211666ccd4 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -37,6 +37,7 @@
> > #include "shuffle.h"
> >
> > #include <asm/setup.h>
> > +#include <asm/struct_page_init.h>
> >
> > #ifndef CONFIG_NUMA
> > unsigned long max_mapnr;
> > @@ -1078,9 +1079,21 @@ static inline bool zone_device_page_init_optimization_enabled(void)
> > return !page_ref_tracepoint_active(page_ref_set);
> > }
> >
> > +/*
> > + * The fast path copies struct page with fixed-offset u64 stores instead of
> > + * a runtime loop. Keep that copy sequence in sync with the struct page
> > + * layouts supported by this build.
> > + *
> > + * The sequence below requires struct page to be u64-aligned and currently
> > + * handles layouts from 7 to 12 u64 words (56 to 96 bytes). If a future
> > + * layout falls outside that range, fail the build so the store sequence is
> > + * updated together with the layout change.
> > + */
> > static inline void struct_page_layout_check(void)
> > {
> > BUILD_BUG_ON(sizeof(struct page) & (sizeof(u64) - 1));
> > + BUILD_BUG_ON(sizeof(struct page) < 56);
> > + BUILD_BUG_ON(sizeof(struct page) > 96);
>
> This would be unnecessary without the open-coded memcpy and is another reason to
> prefer a more generic approach.
>
Agreed. I also think this optimization should be enabled only for
production kernel configs (that is, not when WANT_PAGE_VIRTUAL is
defined), so that we can restrict the size to 56 bytes.
> > }
> >
> > static inline void init_template_head_page(struct page *template,
> > @@ -1108,30 +1121,67 @@ static inline void init_template_tail_page(struct page *template,
> > }
> >
> > /*
> > - * Initialize parts that differ from the template
> > + * 'template' is a reusable page prototype rather than a strictly immutable
> > + * object. Most ZONE_DEVICE fields stay constant across the pages covered by
> > + * the current template, but section bits and page->virtual may still depend
> > + * on the PFN. Refresh those PFN-dependent fields in the template before
> > + * copying it into @page.
> > */
> > -static inline void generic_init_zone_device_page_finish(struct page *page,
> > - unsigned long pfn)
> > +static inline void zone_device_page_update_template(struct page *template,
> > + unsigned long pfn)
> > {
> > #ifdef SECTION_IN_PAGE_FLAGS
> > - set_page_section(page, pfn_to_section_nr(pfn));
> > + set_page_section(template, pfn_to_section_nr(pfn));
> > #endif
> > #ifdef WANT_PAGE_VIRTUAL
> > if (!is_highmem_idx(ZONE_DEVICE))
> > - set_page_address(page, __va(pfn << PAGE_SHIFT));
> > + set_page_address(template, __va(pfn << PAGE_SHIFT));
> > #endif
> > }
> >
> > static void init_zone_device_page_from_template(struct page *page,
> > - unsigned long pfn, const struct page *template)
> > + unsigned long pfn, struct page *template)
> > {
> > const u64 *src = (const u64 *)template;
> > u64 *dst = (u64 *)page;
> > - unsigned int i;
> >
> > - for (i = 0; i < sizeof(struct page) / sizeof(u64); i++)
> > - dst[i] = src[i];
> > - generic_init_zone_device_page_finish(page, pfn);
> > + /*
> > + * 'template' carries the invariant portion of a ZONE_DEVICE struct
> > + * page. Update the PFN-dependent fields in place before copying it
> > + * to the destination page.
> > + */
> > + zone_device_page_update_template(template, pfn);
> > +
> > + /*
> > + * Keep the copy open-coded so the compiler emits fixed-offset stores
> > + * for the hot path instead of a runtime copy loop.
> > + */
> > + switch (sizeof(struct page)) {
> > + case 96:
> > + arch_optimize_store_u64(&dst[11], src[11]);
> > + fallthrough;
> > + case 88:
> > + arch_optimize_store_u64(&dst[10], src[10]);
> > + fallthrough;
> > + case 80:
> > + arch_optimize_store_u64(&dst[9], src[9]);
> > + fallthrough;
> > + case 72:
> > + arch_optimize_store_u64(&dst[8], src[8]);
> > + fallthrough;
> > + case 64:
> > + arch_optimize_store_u64(&dst[7], src[7]);
> > + fallthrough;
> > + case 56:
> > + arch_optimize_store_u64(&dst[6], src[6]);
> > + arch_optimize_store_u64(&dst[5], src[5]);
> > + arch_optimize_store_u64(&dst[4], src[4]);
> > + arch_optimize_store_u64(&dst[3], src[3]);
> > + arch_optimize_store_u64(&dst[2], src[2]);
> > + arch_optimize_store_u64(&dst[1], src[1]);
> > + arch_optimize_store_u64(&dst[0], src[0]);
> > + }
> > +
>
> I don't think unrolling the copy here is the right approach. This belongs in
> some kind of generic streaming memcpy routine.
>
On x86, memcpy_flushcache() does something similar to the above. Can't
that be reused?
> - Alistair
>
> > zone_device_page_init_pageblock(page, pfn);
> > }
> > #else
> > @@ -1201,9 +1251,10 @@ static void __ref memmap_init_compound(struct page *head,
> > __SetPageHead(head);
> >
> > /*
> > - * A tail template can be reused for all tail pages in the same compound page
> > - * because shared state for compound tails is pre-set by prep_compound_tail().
> > - * The per-page page->virtual and section in flags are fixed up after copying.
> > + * All tails of the same compound page share the state established by
> > + * prep_compound_tail(). Reuse one tail template for the whole range
> > + * and refresh only the PFN-dependent fields in that template before
> > + * each copy.
> > */
> > if (use_template)
> > init_template_tail_page(&template, head_pfn + 1, zone_idx, nid,
> > @@ -1269,10 +1320,22 @@ void __ref memmap_init_zone_device(struct zone *zone,
> > if (pfns_per_compound == 1)
> > continue;
> >
> > + /*
> > + * Compound-head setup immediately updates head->flags, so make
> > + * the template copy visible before entering memmap_init_compound().
> > + */
> > + if (use_template)
> > + arch_optimize_store_drain();
> > +
> > memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
> > compound_nr_pages(altmap, pgmap),
> > use_template);
> > }
> > + /*
> > + * Drain any remaining non-temporal stores before returning.
> > + */
> > + if (use_template)
> > + arch_optimize_store_drain();
> >
> > pr_debug("%s initialised %lu pages in %ums\n", __func__,
> > nr_pages, jiffies_to_msecs(jiffies - start));
> > --
> > 2.20.1
> >
>
Balbir
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization
2026-05-15 8:20 [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
` (3 preceding siblings ...)
2026-05-15 8:20 ` [PATCH 4/4] mm: use arch store helpers in zone-device template copies Li Zhe
@ 2026-05-18 6:23 ` Mike Rapoport
2026-05-18 8:57 ` Li Zhe
4 siblings, 1 reply; 15+ messages in thread
From: Mike Rapoport @ 2026-05-18 6:23 UTC (permalink / raw)
To: Li Zhe
Cc: tglx, mingo, bp, dave.hansen, arnd, akpm, david, x86,
linux-kernel, linux-arch, linux-mm
Hi,
On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> memmap_init_zone_device() can spend a substantial amount of time
> initializing large ZONE_DEVICE ranges because it repeats nearly
> identical struct page setup for every PFN.
>
> This series reduces that overhead in four steps.
>
> Testing
> =======
>
> Tests were run in a VM on an Intel Ice Lake server.
>
> Two PMEM configurations were used:
> - a 100 GB fsdax namespace configured with map=dev, which exercises
> the nd_pmem rebind path (pfns_per_compound == 1)
> - a 100 GB devdax namespace configured with align=2097152, which
> exercises the dax_pmem rebind path (pfns_per_compound > 1)
>
> For each configuration, the corresponding driver was unbound and
> rebound 30 times. Memmap initialization latency was collected from the
> pr_debug() output of memmap_init_zone_device().
>
> The first bind is reported separately, and the average of subsequent
> rebinds is used as the steady-state result.
>
> Performance
> ===========
> nd_pmem rebind, 100 GB fsdax namespace, map=dev
> Base(v7.1-rc3):
> First binding: 1486 ms
> Average of subsequent rebinds: 273.52 ms
> Full series:
> First binding: 1272 ms
> Average of subsequent rebinds: 104.59 ms
>
> dax_pmem rebind, 100 GB devdax namespace, align=2097152
> Base(v7.1-rc3):
> First binding: 1515 ms
> Average of subsequent rebinds: 313.45 ms
> Full series:
> First binding: 1286 ms
> Average of subsequent rebinds: 116.93 ms
This is a really good improvement!
It would also be interesting to see how the template approach would improve
"normal" memory map initialization.
> Li Zhe (4):
> mm: factor zone-device page init helpers out of
> __init_zone_device_page
> mm: add a template-based fast path for zone-device page init
> mm: extend the template fast path to zone-device compound tails
> mm: use arch store helpers in zone-device template copies
>
> arch/x86/include/asm/struct_page_init.h | 28 +++
> include/asm-generic/Kbuild | 1 +
> include/asm-generic/struct_page_init.h | 17 ++
> mm/mm_init.c | 260 +++++++++++++++++++++---
> 4 files changed, 280 insertions(+), 26 deletions(-)
> create mode 100644 arch/x86/include/asm/struct_page_init.h
> create mode 100644 include/asm-generic/struct_page_init.h
>
> --
> 2.20.1
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization
2026-05-18 6:23 ` [PATCH 0/4] mm: speed up ZONE_DEVICE memmap initialization Mike Rapoport
@ 2026-05-18 8:57 ` Li Zhe
0 siblings, 0 replies; 15+ messages in thread
From: Li Zhe @ 2026-05-18 8:57 UTC (permalink / raw)
To: rppt
Cc: akpm, arnd, bp, dave.hansen, david, linux-arch, linux-kernel,
linux-mm, lizhe.67, mingo, tglx, x86
On Mon, 18 May 2026 09:23:33 +0300, rppt@kernel.org wrote:
> Hi,
>
> On Fri, May 15, 2026 at 04:20:41PM +0800, Li Zhe wrote:
> > memmap_init_zone_device() can spend a substantial amount of time
> > initializing large ZONE_DEVICE ranges because it repeats nearly
> > identical struct page setup for every PFN.
> >
> > This series reduces that overhead in four steps.
> >
> > Testing
> > =======
> >
> > Tests were run in a VM on an Intel Ice Lake server.
> >
> > Two PMEM configurations were used:
> > - a 100 GB fsdax namespace configured with map=dev, which exercises
> > the nd_pmem rebind path (pfns_per_compound == 1)
> > - a 100 GB devdax namespace configured with align=2097152, which
> > exercises the dax_pmem rebind path (pfns_per_compound > 1)
> >
> > For each configuration, the corresponding driver was unbound and
> > rebound 30 times. Memmap initialization latency was collected from the
> > pr_debug() output of memmap_init_zone_device().
> >
> > The first bind is reported separately, and the average of subsequent
> > rebinds is used as the steady-state result.
> >
> > Performance
> > ===========
> > nd_pmem rebind, 100 GB fsdax namespace, map=dev
> > Base(v7.1-rc3):
> > First binding: 1486 ms
> > Average of subsequent rebinds: 273.52 ms
> > Full series:
> > First binding: 1272 ms
> > Average of subsequent rebinds: 104.59 ms
> >
> > dax_pmem rebind, 100 GB devdax namespace, align=2097152
> > Base(v7.1-rc3):
> > First binding: 1515 ms
> > Average of subsequent rebinds: 313.45 ms
> > Full series:
> > First binding: 1286 ms
> > Average of subsequent rebinds: 116.93 ms
>
> This is a really good improvement!
>
> It would also be interesting to see how the template approach would improve
> "normal" memory map initialization.
I also experimented with this approach earlier. Unfortunately, in the
normal memory map initialization path, functions such as
deferred_free_pages() are invoked shortly after struct page
initialization, and they both read and write members of the struct
page.
Non-temporal stores via MOVNTI are primarily beneficial for streaming
write operations, where the cache lines written are not expected to be
reused by the CPU in the near future. In this case, however, data
written using MOVNTI is immediately accessed again through regular load
and store instructions. This results in an access pattern that resembles
a write-then-reuse workload rather than a pure streaming store.
Consequently, non-temporal stores do not deliver the expected reduction
in cache pollution, and using MOVNTI provides no measurable performance
benefit for this particular workload.
That said, a template-based approach can still accelerate initialization.
Based on measurements from this patchset, it should improve performance
on the generic path by roughly 10%. I would appreciate feedback on
whether such an optimization is still considered useful.
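
For reference, the template idea on its own, without MOVNTI, can be sketched in userspace as below. This is only an illustration under stated assumptions: struct fake_page, the FAKE_* constants and all fake_*() functions are hypothetical stand-ins, and the real struct page layout and section encoding differ. The invariant fields are set up once, and only the PFN-dependent fields are refreshed in the template before each plain memcpy():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page; the real layout differs. */
struct fake_page {
	uint64_t flags;		/* assumed: high bits hold a section number */
	uint64_t refcount;
	uint64_t zone_data;
	void *virtual_addr;	/* PFN-dependent, like page->virtual */
};

/* Assumptions made for this sketch only. */
#define FAKE_SECTION_SHIFT	48
#define FAKE_SECTION_MASK	(~(uint64_t)0 << FAKE_SECTION_SHIFT)
#define FAKE_PAGE_SHIFT		12
#define FAKE_PFNS_PER_SECTION	32768UL

/* Fill in the fields that are identical for every page in the range. */
static void fake_init_template(struct fake_page *tmpl, uint64_t zone_data)
{
	memset(tmpl, 0, sizeof(*tmpl));
	tmpl->refcount = 1;
	tmpl->zone_data = zone_data;
}

/* Refresh the PFN-dependent fields in the template, then copy it out. */
static void fake_init_from_template(struct fake_page *page, unsigned long pfn,
				    struct fake_page *tmpl)
{
	uint64_t section = pfn / FAKE_PFNS_PER_SECTION;

	tmpl->flags = (tmpl->flags & ~FAKE_SECTION_MASK) |
		      (section << FAKE_SECTION_SHIFT);
	tmpl->virtual_addr = (void *)((uintptr_t)pfn << FAKE_PAGE_SHIFT);
	memcpy(page, tmpl, sizeof(*page));
}

/* Initialize 'nr' pages starting at 'start_pfn' from a single template. */
static void fake_init_range(struct fake_page *pages, unsigned long start_pfn,
			    size_t nr, uint64_t zone_data)
{
	struct fake_page tmpl;
	size_t i;

	assert(pages != NULL);
	fake_init_template(&tmpl, zone_data);
	for (i = 0; i < nr; i++)
		fake_init_from_template(&pages[i], start_pfn + i, &tmpl);
}
```

Because the copy here is an ordinary cached memcpy(), the destination lines stay in cache, which suits a path where the freshly initialized pages are read and written again right away.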
Thanks,
Zhe
^ permalink raw reply [flat|nested] 15+ messages in thread