[PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization

* [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization
@ 2026-06-03  8:01 Li Zhe
  2026-06-03  8:01 ` [PATCH v4 1/8] mm: fix stale ZONE_DEVICE refcount comment Li Zhe
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Li Zhe @ 2026-06-03  8:01 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

memmap_init_zone_device() can spend a substantial amount of time
initializing large ZONE_DEVICE ranges because it repeats nearly
identical struct page setup for every PFN.

This series reduces that overhead in eight steps.

The first patch fixes a stale comment in __init_zone_device_page() so
the documented refcount policy matches the current ZONE_DEVICE code.

The second patch factors the reusable pieces out of
__init_zone_device_page() so later patches can share the same logic
without changing the existing slow path.

The third patch adds set_page_section_from_pfn(), so callers that want
to refresh section bits from a PFN no longer need to open-code
SECTION_IN_PAGE_FLAGS handling.

The fourth patch adds a template-based fast path for ZONE_DEVICE head
pages. Instead of rebuilding the same struct page state for every PFN,
it prepares one reusable template through the existing slow path,
refreshes the PFN-dependent fields in that template, and copies it to
each destination page.

The fifth patch extends the same template-based approach to compound
tails, so pfns_per_compound > 1 can also benefit from the fast path.

The sixth patch introduces memcpy_streaming() and
memcpy_streaming_drain() as a generic interface for write-once copies.
Architectures that do not provide a specialized backend, or cases that
cannot safely use one, fall back to memcpy().

The seventh patch extends x86 memcpy_flushcache() small fixed-size
fastpaths so struct-page-sized streaming copies can stay on the inline
path when alignment permits.

The last patch switches the ZONE_DEVICE template-copy path over to
memcpy_streaming(). It keeps pageblock-aligned PFNs on regular memcpy(),
uses memcpy_streaming() for the remaining write-once copies, and drains
streaming stores before later metadata updates that may depend on them.

This is not intended as a steady-state data-path optimization. Its
benefit is in pmem bring-up paths where memmap_init_zone_device()
dominates device online / rebind latency, such as:
  - fsdax or devdax namespace creation and reconfiguration
  - nd_pmem / dax_pmem driver bind or rebind

In those paths, the kernel initializes a large vmemmap range once and
does not immediately benefit from keeping the copied struct page state
hot in cache. Reducing write-allocate traffic in that one-time setup
path can therefore reduce end-to-end device bring-up latency.

The optimized path is disabled when the page_ref_set tracepoint is
enabled, and sanitized builds remain on the slow path so their
instrumented stores are preserved.

Testing
=======

Tests were run in a VM on an Intel Ice Lake server.

Two PMEM configurations were used:
  - a 100 GB fsdax namespace configured with map=dev, which exercises
    the nd_pmem rebind path (pfns_per_compound == 1)
  - a 100 GB devdax namespace configured with align=2097152, which
    exercises the dax_pmem rebind path (pfns_per_compound > 1)

For each configuration, the corresponding driver was unbound and
rebound 30 times. Memmap initialization latency was collected from the
pr_debug() output of memmap_init_zone_device().

The first bind is reported separately, and the average of subsequent
rebinds is used as the steady-state result.

Performance
===========

nd_pmem rebind, 100 GB fsdax namespace, map=dev
  Base(v7.1-rc6):
    First binding: 1466 ms
    Average of subsequent rebinds: 262.12 ms
  Full series:
    First binding: 1359 ms
    Average of subsequent rebinds: 108.36 ms

dax_pmem rebind, 100 GB devdax namespace, align=2097152
  Base(v7.1-rc6):
    First binding: 1430 ms
    Average of subsequent rebinds: 229.12 ms
  Full series:
    First binding: 1273 ms
    Average of subsequent rebinds: 100.17 ms
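
As a quick cross-check, the steady-state averages above correspond to
reductions of roughly 59% (nd_pmem) and 56% (dax_pmem). A minimal sketch
of that arithmetic follows; reduction_pct() and print_rebind_reductions()
are illustrative names and not part of the series.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative helper: percentage saved when latency drops from base_ms to new_ms. */
static double reduction_pct(double base_ms, double new_ms)
{
	assert(base_ms > 0.0);
	return (base_ms - new_ms) / base_ms * 100.0;
}

/* Print the reduction for both rebind averages quoted above. */
static void print_rebind_reductions(void)
{
	printf("nd_pmem:  %.1f%%\n", reduction_pct(262.12, 108.36));
	printf("dax_pmem: %.1f%%\n", reduction_pct(229.12, 100.17));
}
```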

Li Zhe (8):
  mm: fix stale ZONE_DEVICE refcount comment
  mm: factor zone-device page init helpers out of
    __init_zone_device_page
  mm: add a set_page_section_from_pfn() helper
  mm: add a template-based fast path for zone-device page init
  mm: extend the template fast path to zone-device compound tails
  string: introduce memcpy_streaming() helpers
  x86/string: extend memcpy_flushcache() fixed-size fastpaths
  mm: use memcpy_streaming() in zone-device template copies

 arch/x86/include/asm/string_64.h | 140 ++++++++++++++++++--
 include/linux/mm.h               |  19 ++-
 include/linux/string.h           |  20 +++
 mm/mm_init.c                     | 221 +++++++++++++++++++++++++++----
 4 files changed, 360 insertions(+), 40 deletions(-)

---
v3: https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/
v2: https://lore.kernel.org/all/20260521040124.10608-1-lizhe.67@bytedance.com/
v1: https://lore.kernel.org/all/20260515082045.63029-1-lizhe.67@bytedance.com/

Changelogs:

v3->v4:
- Rebase the series from v7.1-rc3 to v7.1-rc6.
- Rework patch 4 so the reusable head-page template is seeded from the
  first real struct page, rather than being initialized directly on a
  stack-resident template object. Also add an explicit !nr_pages early
  return. Suggested by Andrew Morton.
- Rework patch 5 similarly for compound tails: seed the reusable
  tail-page template from the first real tail page, thread
  use_template through compound-page initialization, and reuse that
  prepared tail-page image for the remaining tails. Suggested by Andrew
  Morton.
- Tighten patch 6 so memcpy_streaming() maps to memcpy_flushcache() only
  when the destination alignment and size allow the transfer to stay
  entirely on the non-temporal path; other cases fall back to memcpy().
  Suggested by Andrew Morton.
- Rework patch 7 so the existing 4/8/16-byte cases remain handled
  directly in memcpy_flushcache(), while the new aligned fixed-size
  fastpaths cover only the larger 32/48/64/80/96-byte cases. Suggested
  by Andrew Morton.

For changelogs of earlier revisions, please refer to the v3 cover letter:
https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/

-- 
2.20.1

* [PATCH v4 1/8] mm: fix stale ZONE_DEVICE refcount comment
  2026-06-03  8:01 [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
@ 2026-06-03  8:01 ` Li Zhe
  2026-06-03  8:01 ` [PATCH v4 2/8] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Li Zhe @ 2026-06-03  8:01 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

The comment in __init_zone_device_page() still uses the old
MEMORY_TYPE_* names and implies that FS_DAX pages regain a
refcount of 1 in the free path. That no longer matches the code.

Update the comment to describe the current policy correctly:
MEMORY_DEVICE_GENERIC pages regain a refcount of 1 in the free path,
while the remaining ZONE_DEVICE types start from 0 here and raise the
count again when the allocator or driver hands the page out.

No functional change intended.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/mm_init.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..35de3b6a186d 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1027,13 +1027,9 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	}
 
 	/*
-	 * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
-	 * directly to the driver page allocator which will set the page count
-	 * to 1 when allocating the page.
-	 *
-	 * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
-	 * their refcount reset to one whenever they are freed (ie. after
-	 * their refcount drops to 0).
+	 * MEMORY_DEVICE_GENERIC pages regain a refcount of 1 in the free
+	 * path. The remaining ZONE_DEVICE types start from 0 here and raise
+	 * the count again when the allocator or driver hands the page out.
 	 */
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_FS_DAX:
-- 
2.20.1

* [PATCH v4 2/8] mm: factor zone-device page init helpers out of __init_zone_device_page
  2026-06-03  8:01 [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
  2026-06-03  8:01 ` [PATCH v4 1/8] mm: fix stale ZONE_DEVICE refcount comment Li Zhe
@ 2026-06-03  8:01 ` Li Zhe
  2026-06-03  8:01 ` [PATCH v4 3/8] mm: add a set_page_section_from_pfn() helper Li Zhe
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Li Zhe @ 2026-06-03  8:01 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

__init_zone_device_page() currently mixes refcount policy, core
ZONE_DEVICE page setup, and pageblock metadata handling in a single
helper.

Factor the refcount-reset predicate into pagemap_resets_refcount(), move
the common page initialization into __zone_device_page_init(), split
pageblock handling into zone_device_page_init_pageblock(), and wrap the
existing slow path in zone_device_page_init_slow().

This keeps the slow-path behaviour unchanged and gives later patches
reusable helper boundaries.

No functional change intended.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/mm_init.c | 62 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 19 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 35de3b6a186d..2e5899c5cf35 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -987,11 +987,38 @@ static void __init memmap_init(void)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+/*
+ * Return true when the free path for this pagemap type restores the page
+ * refcount to 1, so memmap_init_zone_device() can keep the count set by
+ * __init_single_page(). Otherwise initialize the refcount to 0 and leave
+ * it to the allocator or pgmap callbacks to raise it when the page is
+ * handed out again.
+ */
+static inline bool pagemap_resets_refcount(const struct dev_pagemap *pgmap)
+{
+	/*
+	 * MEMORY_DEVICE_GENERIC pages regain a refcount of 1 in the free
+	 * path. The remaining ZONE_DEVICE types start from 0 here and raise
+	 * the count again when the allocator or driver hands the page out.
+	 */
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_FS_DAX:
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_COHERENT:
+	case MEMORY_DEVICE_PCI_P2PDMA:
+		return false;
+	case MEMORY_DEVICE_GENERIC:
+		return true;
+	default:
+		WARN_ONCE(1, "Unknown memory type!");
+		return true;
+	}
+}
+
+static void __ref __zone_device_page_init(struct page *page, unsigned long pfn,
 					  unsigned long zone_idx, int nid,
 					  struct dev_pagemap *pgmap)
 {
-
 	__init_single_page(page, pfn, zone_idx, nid);
 
 	/*
@@ -1010,7 +1037,11 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	 */
 	page_folio(page)->pgmap = pgmap;
 	page->zone_device_data = NULL;
+}
 
+static void __ref zone_device_page_init_pageblock(struct page *page,
+						  unsigned long pfn)
+{
 	/*
 	 * Mark the block movable so that blocks are reserved for
 	 * movable at startup. This will force kernel allocations
@@ -1025,23 +1056,16 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 		init_pageblock_migratetype(page, MIGRATE_MOVABLE, false);
 		cond_resched();
 	}
+}
 
-	/*
-	 * MEMORY_DEVICE_GENERIC pages regain a refcount of 1 in the free
-	 * path. The remaining ZONE_DEVICE types start from 0 here and raise
-	 * the count again when the allocator or driver hands the page out.
-	 */
-	switch (pgmap->type) {
-	case MEMORY_DEVICE_FS_DAX:
-	case MEMORY_DEVICE_PRIVATE:
-	case MEMORY_DEVICE_COHERENT:
-	case MEMORY_DEVICE_PCI_P2PDMA:
+static void __ref zone_device_page_init_slow(struct page *page,
+		unsigned long pfn, unsigned long zone_idx, int nid,
+		struct dev_pagemap *pgmap)
+{
+	__zone_device_page_init(page, pfn, zone_idx, nid, pgmap);
+	if (!pagemap_resets_refcount(pgmap))
 		set_page_count(page, 0);
-		break;
-
-	case MEMORY_DEVICE_GENERIC:
-		break;
-	}
+	zone_device_page_init_pageblock(page, pfn);
 }
 
 /*
@@ -1080,7 +1104,7 @@ static void __ref memmap_init_compound(struct page *head,
 	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+		zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
 		prep_compound_tail(page, head, order);
 		set_page_count(page, 0);
 	}
@@ -1116,7 +1140,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+		zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
 
 		if (pfns_per_compound == 1)
 			continue;
-- 
2.20.1

* [PATCH v4 3/8] mm: add a set_page_section_from_pfn() helper
  2026-06-03  8:01 [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
  2026-06-03  8:01 ` [PATCH v4 1/8] mm: fix stale ZONE_DEVICE refcount comment Li Zhe
  2026-06-03  8:01 ` [PATCH v4 2/8] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
@ 2026-06-03  8:01 ` Li Zhe
  2026-06-03  8:01 ` [PATCH v4 4/8] mm: add a template-based fast path for zone-device page init Li Zhe
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Li Zhe @ 2026-06-03  8:01 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

Callers that want to update section bits from a PFN currently need to
open-code:

	set_page_section(page, pfn_to_section_nr(pfn));

and guard that sequence with #ifdef SECTION_IN_PAGE_FLAGS.

Add set_page_section_from_pfn() to wrap that update in one place. When
section bits are stored in page flags, the helper derives the section
number from the PFN and updates the page flags. Otherwise it degrades to
a no-op.
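
The pattern can be sketched in userspace as follows. This is only an
illustration under stated assumptions: struct demo_page, the DEMO_*
shift and mask values, and the demo_* function names are hypothetical
stand-ins, and the kernel derives the real values from its
configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; shift and mask values are assumed for illustration. */
#define DEMO_SECTION_IN_PAGE_FLAGS 1
#define DEMO_PFN_SECTION_SHIFT 15
#define DEMO_SECTIONS_PGSHIFT 40
#define DEMO_SECTIONS_MASK UINT64_C(0xffff)

struct demo_page {
	uint64_t flags;
};

#if DEMO_SECTION_IN_PAGE_FLAGS
/* Clear the old section bits, then store the new section number. */
static inline void demo_set_page_section(struct demo_page *page, uint64_t section)
{
	page->flags &= ~(DEMO_SECTIONS_MASK << DEMO_SECTIONS_PGSHIFT);
	page->flags |= (section & DEMO_SECTIONS_MASK) << DEMO_SECTIONS_PGSHIFT;
}

/* Derive the section number from the PFN and store it in the flags word. */
static inline void demo_set_page_section_from_pfn(struct demo_page *page,
						  uint64_t pfn)
{
	demo_set_page_section(page, pfn >> DEMO_PFN_SECTION_SHIFT);
}
#else
/* Section bits are not kept in the flags word, so this is a no-op. */
static inline void demo_set_page_section_from_pfn(struct demo_page *page,
						  uint64_t pfn)
{
	(void)page;
	(void)pfn;
}
#endif

/* Callers use one call and need no #ifdef of their own. */
static void demo_section_roundtrip(void)
{
	struct demo_page page = { .flags = UINT64_C(0x5) };

	demo_set_page_section_from_pfn(&page, (UINT64_C(3) << DEMO_PFN_SECTION_SHIFT) + 7);
#if DEMO_SECTION_IN_PAGE_FLAGS
	assert(((page.flags >> DEMO_SECTIONS_PGSHIFT) & DEMO_SECTIONS_MASK) == 3);
#endif
	assert((page.flags & 0xff) == 0x5);	/* unrelated low bits survive */
}
```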

Convert set_page_links() to use the new helper so later ZONE_DEVICE
fast-path patches can also update section bits without open-coding
SECTION_IN_PAGE_FLAGS at each callsite.

This keeps the PFN-to-section translation local to the configurations
that actually store section bits in struct page flags, and avoids
exposing that detail to generic callers.

No functional change intended.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/mm.h | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 06bbe9eba636..1b0de1eef4c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2507,11 +2507,26 @@ static inline void set_page_section(struct page *page, unsigned long section)
 	page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
 }
 
+static inline void set_page_section_from_pfn(struct page *page,
+					     unsigned long pfn)
+{
+	set_page_section(page, pfn_to_section_nr(pfn));
+}
+
 static inline unsigned long memdesc_section(memdesc_flags_t mdf)
 {
 	return (mdf.f >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
 }
 #else /* !SECTION_IN_PAGE_FLAGS */
+static inline void set_page_section(struct page *page, unsigned long section)
+{
+}
+
+static inline void set_page_section_from_pfn(struct page *page,
+					     unsigned long pfn)
+{
+}
+
 static inline unsigned long memdesc_section(memdesc_flags_t mdf)
 {
 	return 0;
@@ -2734,9 +2749,7 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 {
 	set_page_zone(page, zone);
 	set_page_node(page, node);
-#ifdef SECTION_IN_PAGE_FLAGS
-	set_page_section(page, pfn_to_section_nr(pfn));
-#endif
+	set_page_section_from_pfn(page, pfn);
 }
 
 /**
-- 
2.20.1

* [PATCH v4 4/8] mm: add a template-based fast path for zone-device page init
  2026-06-03  8:01 [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
                   ` (2 preceding siblings ...)
  2026-06-03  8:01 ` [PATCH v4 3/8] mm: add a set_page_section_from_pfn() helper Li Zhe
@ 2026-06-03  8:01 ` Li Zhe
  2026-06-03  8:01 ` [PATCH v4 5/8] mm: extend the template fast path to zone-device compound tails Li Zhe
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Li Zhe @ 2026-06-03  8:01 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

memmap_init_zone_device() repeats nearly identical head-page
initialization for each PFN. Prepare one reusable ZONE_DEVICE head-page
template through the existing slow path, refresh the PFN-dependent
fields in that template before each copy, and memcpy it into each
destination page.
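
The idea can be modelled in userspace roughly as follows. This is a
sketch under stated assumptions, not the patch itself: struct demo_page
is a heavily reduced stand-in for struct page, DEMO_SECTION_SHIFT is an
assumed value, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SECTION_SHIFT 15	/* assumed PFNs-per-section shift */

/* Hypothetical, heavily reduced stand-in for struct page. */
struct demo_page {
	uint64_t section;	/* PFN-dependent */
	const void *pgmap;	/* invariant across the range */
	void *zone_device_data;	/* invariant across the range */
	int refcount;		/* invariant across the range */
};

/* Full per-page setup, standing in for the existing slow path. */
static void demo_init_slow(struct demo_page *page, uint64_t pfn,
			   const void *pgmap)
{
	memset(page, 0, sizeof(*page));
	page->section = pfn >> DEMO_SECTION_SHIFT;
	page->pgmap = pgmap;
	page->zone_device_data = NULL;
	page->refcount = 0;
}

/* Seed a template from the first real page, then refresh and copy it. */
static void demo_init_range(struct demo_page *pages, uint64_t start_pfn,
			    size_t nr_pages, const void *pgmap)
{
	struct demo_page template;
	size_t i;

	if (!nr_pages)
		return;

	demo_init_slow(&pages[0], start_pfn, pgmap);
	memcpy(&template, &pages[0], sizeof(template));

	for (i = 1; i < nr_pages; i++) {
		/* Only the PFN-dependent field changes between copies. */
		template.section = (start_pfn + i) >> DEMO_SECTION_SHIFT;
		memcpy(&pages[i], &template, sizeof(pages[i]));
	}
}

/* Check that the template path matches the slow path field by field. */
static void demo_template_matches_slow(void)
{
	enum { NR = 8 };
	struct demo_page fast[NR], slow;
	/* Start just below a section boundary so the section changes mid-range. */
	uint64_t start_pfn = (UINT64_C(1) << DEMO_SECTION_SHIFT) - 4;
	int pgmap_token = 0;
	size_t i;

	demo_init_range(fast, start_pfn, NR, &pgmap_token);
	for (i = 0; i < NR; i++) {
		demo_init_slow(&slow, start_pfn + i, &pgmap_token);
		assert(fast[i].section == slow.section);
		assert(fast[i].pgmap == slow.pgmap);
		assert(fast[i].zone_device_data == slow.zone_device_data);
		assert(fast[i].refcount == slow.refcount);
	}
}
```

The template is deliberately mutable: refreshing one field in place and
copying the whole object keeps the per-page work to a single
fixed-size copy.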

The optimized path assigns _refcount through the copied template, so
keep it disabled when the page_ref_set tracepoint is enabled.

This patch accelerates head-page initialization. The pfns_per_compound
== 1 case gets the full benefit here; compound tails are handled in the
next patch.

Tested in a VM with a 100 GB fsdax namespace configured with map=dev on
an Intel Ice Lake server. This test exercises the nd_pmem rebind path
(pfns_per_compound == 1).

Test procedure:
Rebind the nd_pmem driver 30 times and collect the memmap initialization
time from the pr_debug() output of memmap_init_zone_device().

Base(v7.1-rc6):
  First binding: 1466 ms
  Average of subsequent rebinds: 262.12 ms

With this patch and its prerequisites applied:
  First binding: 1432 ms
  Average of subsequent rebinds: 246.30 ms

This reduces the average rebind time from 262.12 ms to 246.30 ms, or
about 6%.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/mm_init.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 74 insertions(+), 2 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 2e5899c5cf35..56fe71e966ae 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1068,6 +1068,51 @@ static void __ref zone_device_page_init_slow(struct page *page,
 	zone_device_page_init_pageblock(page, pfn);
 }
 
+static inline bool zone_device_page_init_optimization_enabled(void)
+{
+	/*
+	 * The template fast path copies a preinitialized struct page image.
+	 * Skip it when the page_ref_set tracepoint is enabled.
+	 */
+	return !page_ref_tracepoint_active(page_ref_set);
+}
+
+static inline void zone_device_template_page_init(struct page *template,
+						  struct page *src)
+{
+	memcpy(template, src, sizeof(*template));
+}
+
+/*
+ * 'template' is a reusable page prototype rather than a strictly immutable
+ * object. Most ZONE_DEVICE fields stay constant across the pages covered by
+ * the current template, but section bits and page->virtual may still depend
+ * on the PFN. Refresh those PFN-dependent fields in the template before
+ * copying it into @page.
+ */
+static inline void zone_device_page_update_template(struct page *template,
+		unsigned long pfn)
+{
+	set_page_section_from_pfn(template, pfn);
+#ifdef WANT_PAGE_VIRTUAL
+	if (!is_highmem_idx(ZONE_DEVICE))
+		set_page_address(template, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
+static void zone_device_page_init_from_template(struct page *page,
+		unsigned long pfn, struct page *template)
+{
+	/*
+	 * 'template' carries the invariant portion of a ZONE_DEVICE struct
+	 * page. Update the PFN-dependent fields in place before copying it
+	 * to the destination page.
+	 */
+	zone_device_page_update_template(template, pfn);
+	memcpy(page, template, sizeof(*page));
+	zone_device_page_init_pageblock(page, pfn);
+}
+
 /*
  * With compound page geometry and when struct pages are stored in ram most
  * tail pages are reused. Consequently, the amount of unique struct pages to
@@ -1116,6 +1161,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long nr_pages,
 				   struct dev_pagemap *pgmap)
 {
+	bool use_template = zone_device_page_init_optimization_enabled();
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
@@ -1123,6 +1169,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
+	struct page template;
 
 	if (WARN_ON_ONCE(!pgmap || zone_idx != ZONE_DEVICE))
 		return;
@@ -1137,10 +1184,35 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
-		struct page *page = pfn_to_page(pfn);
+	if (!nr_pages)
+		return;
+
+	pfn = start_pfn;
+	/*
+	 * Seed the reusable head-page template from the first real struct
+	 * page, because the existing page-init and pageblock helpers expect
+	 * a real memmap entry rather than a stack object.
+	 */
+	if (use_template) {
+		struct page *page = pfn_to_page(start_pfn);
 
 		zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
+		zone_device_template_page_init(&template, page);
+		if (pfns_per_compound != 1)
+			memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+					     compound_nr_pages(altmap, pgmap));
+		pfn += pfns_per_compound;
+	}
+
+	for (; pfn < end_pfn; pfn += pfns_per_compound) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (use_template)
+			zone_device_page_init_from_template(page, pfn,
+							    &template);
+		else
+			zone_device_page_init_slow(page, pfn, zone_idx,
+						   nid, pgmap);
 
 		if (pfns_per_compound == 1)
 			continue;
-- 
2.20.1

* [PATCH v4 5/8] mm: extend the template fast path to zone-device compound tails
  2026-06-03  8:01 [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
                   ` (3 preceding siblings ...)
  2026-06-03  8:01 ` [PATCH v4 4/8] mm: add a template-based fast path for zone-device page init Li Zhe
@ 2026-06-03  8:01 ` Li Zhe
  2026-06-03  8:01 ` [PATCH v4 6/8] string: introduce memcpy_streaming() helpers Li Zhe
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Li Zhe @ 2026-06-03  8:01 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

The template fast path from the previous patch only accelerates head
pages. Compound tails in memmap_init_compound() still go through the
slow path one by one.

Build separate head and tail templates and reuse one prepared tail
template across the tail pages in a compound range. Head pages preserve
the existing refcount policy, while compound tails always start with a
refcount of 0 after prep_compound_tail().

This extends the template-copy fast path to pfns_per_compound > 1
without changing the existing slow path. Tail-page PFN-dependent fields
are refreshed in the reusable tail template before each copy.

Tested in a VM with a 100 GB devdax namespace (align=2097152) on an
Intel Ice Lake server. This test exercises the dax_pmem rebind path and
measures memmap initialization latency.

Test procedure:
Unbind and rebind the dax_pmem driver 30 times and collect the memmap
initialization time from the pr_debug() output of
memmap_init_zone_device().

Base(v7.1-rc6):
  First binding: 1430 ms
  Average of subsequent rebinds: 229.12 ms

With this patch and its prerequisites applied:
  First binding: 1336 ms
  Average of subsequent rebinds: 180.00 ms

This reduces the average rebind time from 229.12 ms to 180.00 ms, or
about 21.4%.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/mm_init.c | 46 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 39 insertions(+), 7 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 56fe71e966ae..ad078ee354fb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1083,6 +1083,16 @@ static inline void zone_device_template_page_init(struct page *template,
 	memcpy(template, src, sizeof(*template));
 }
 
+static inline void zone_device_tail_page_init(struct page *page,
+		unsigned long pfn, unsigned long zone_idx, int nid,
+		struct dev_pagemap *pgmap, const struct page *head,
+		unsigned int order)
+{
+	zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
+	prep_compound_tail(page, head, order);
+	set_page_count(page, 0);
+}
+
 /*
  * 'template' is a reusable page prototype rather than a strictly immutable
  * object. Most ZONE_DEVICE fields stay constant across the pages covered by
@@ -1134,10 +1144,12 @@ static void __ref memmap_init_compound(struct page *head,
 				       unsigned long head_pfn,
 				       unsigned long zone_idx, int nid,
 				       struct dev_pagemap *pgmap,
-				       unsigned long nr_pages)
+				       unsigned long nr_pages,
+				       bool use_template)
 {
 	unsigned long pfn, end_pfn = head_pfn + nr_pages;
 	unsigned int order = pgmap->vmemmap_shift;
+	struct page template;
 
 	/*
 	 * We have to initialize the pages, including setting up page links.
@@ -1146,12 +1158,31 @@ static void __ref memmap_init_compound(struct page *head,
 	 * the pages in the same go.
 	 */
 	__SetPageHead(head);
-	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
+
+	pfn = head_pfn + 1;
+	/*
+	 * All tails of the same compound page share the state established by
+	 * prep_compound_tail(). Reuse one tail template for the whole range and
+	 * refresh only the PFN-dependent fields in that template before each copy.
+	 */
+	if (use_template) {
 		struct page *page = pfn_to_page(pfn);
 
-		zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
-		prep_compound_tail(page, head, order);
-		set_page_count(page, 0);
+		zone_device_tail_page_init(page, pfn, zone_idx, nid,
+					   pgmap, head, order);
+		zone_device_template_page_init(&template, page);
+		pfn++;
+	}
+
+	for (; pfn < end_pfn; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (use_template)
+			zone_device_page_init_from_template(page, pfn,
+							    &template);
+		else
+			zone_device_tail_page_init(page, pfn, zone_idx, nid,
+						   pgmap, head, order);
 	}
 	prep_compound_head(head, order);
 }
@@ -1200,7 +1231,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		zone_device_template_page_init(&template, page);
 		if (pfns_per_compound != 1)
 			memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-					     compound_nr_pages(altmap, pgmap));
+				compound_nr_pages(altmap, pgmap), use_template);
 		pfn += pfns_per_compound;
 	}
 
@@ -1218,7 +1249,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     compound_nr_pages(altmap, pgmap));
+				     compound_nr_pages(altmap, pgmap),
+				     use_template);
 	}
 
 	pr_debug("%s initialised %lu pages in %ums\n", __func__,
-- 
2.20.1

* [PATCH v4 6/8] string: introduce memcpy_streaming() helpers
  2026-06-03  8:01 [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization Li Zhe
                   ` (4 preceding siblings ...)
  2026-06-03  8:01 ` [PATCH v4 5/8] mm: extend the template fast path to zone-device compound tails Li Zhe
@ 2026-06-03  8:01 ` Li Zhe
  2026-06-03  8:01 ` [PATCH v4 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths Li Zhe
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Li Zhe @ 2026-06-03  8:01 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

Introduce a generic memcpy_streaming() interface for write-once copy
sites that can fall back to memcpy() when no architecture-specific
optimization is available, or when an architecture-specific backend
cannot safely handle a given transfer.

Add memcpy_streaming_drain() alongside it so callers can separate the
copy primitive from any required ordering point. On x86, use
memcpy_flushcache() and sfence only for aligned transfers that can stay
entirely on the non-temporal store path; otherwise fall back to memcpy()
so the generic API does not expose flushcache semantics on cached
head/tail fragments.

Callers are responsible for invoking memcpy_streaming_drain() before
later normal stores that must be ordered after the streaming copy.
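
The copy/drain split can be illustrated in userspace as below. This is a
sketch under stated assumptions rather than the kernel implementation:
the demo_* names are hypothetical, the SSE2 intrinsics
_mm_stream_si64(), _mm_stream_si32(), and _mm_sfence() stand in for the
kernel's movnti and sfence sequences, and builds without x86-64 SSE2
simply use memcpy() with a no-op drain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>
#define DEMO_HAVE_NT_STORES 1
#else
#define DEMO_HAVE_NT_STORES 0
#endif

/* Gate: non-empty, 8-byte aligned destination, size a multiple of 4. */
static bool demo_nt_safe(const void *dst, size_t cnt)
{
	return cnt && ((uintptr_t)dst % 8 == 0) && (cnt % 4 == 0);
}

/* Returns true when the transfer passed the gate, false on plain fallback. */
static bool demo_memcpy_streaming(void *dst, const void *src, size_t cnt)
{
	if (!demo_nt_safe(dst, cnt)) {
		memcpy(dst, src, cnt);
		return false;
	}
#if DEMO_HAVE_NT_STORES
	{
		unsigned char *d = dst;
		const unsigned char *s = src;

		for (; cnt >= 8; cnt -= 8, d += 8, s += 8) {
			long long v;

			memcpy(&v, s, sizeof(v));
			_mm_stream_si64((long long *)d, v);
		}
		if (cnt) {	/* exactly 4 bytes remain */
			int v;

			memcpy(&v, s, sizeof(v));
			_mm_stream_si32((int *)d, v);
		}
	}
#else
	memcpy(dst, src, cnt);
#endif
	return true;
}

/* Ordering point between streaming stores and later normal stores. */
static void demo_memcpy_streaming_drain(void)
{
#if DEMO_HAVE_NT_STORES
	_mm_sfence();
#endif
}

/* Copy, drain, and only then update the destination with a normal store. */
static void demo_streaming_copy(void)
{
	uint64_t src[6] = { 1, 2, 3, 4, 5, 6 };
	uint64_t dst[6] = { 0 };
	bool streamed = demo_memcpy_streaming(dst, src, sizeof(src));

	demo_memcpy_streaming_drain();
	dst[0] |= 0x100;
	assert(streamed);
	assert(dst[0] == 0x101 && dst[5] == 6);
}
```

Keeping the drain separate lets a caller issue many copies back to back
and pay for a single ordering point before the first dependent store.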

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 arch/x86/include/asm/string_64.h | 32 ++++++++++++++++++++++++++++++++
 include/linux/string.h           | 20 ++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 4635616863f5..aee63108577f 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -4,6 +4,7 @@
 
 #ifdef __KERNEL__
 #include <linux/jump_label.h>
+#include <linux/align.h>
 
 /* Written 2002 by Andi Kleen */
 
@@ -100,6 +101,37 @@ static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t
 	}
 	__memcpy_flushcache(dst, src, cnt);
 }
+
+/*
+ * Only map memcpy_streaming() to memcpy_flushcache() when the destination
+ * is already 8-byte aligned and the size can be handled without cached
+ * head/tail fragments in __memcpy_flushcache().
+ */
+static __always_inline bool memcpy_flushcache_nt_safe(const void *dst,
+						      size_t cnt)
+{
+	unsigned long d = (unsigned long)dst;
+
+	return cnt && IS_ALIGNED(d, 8) && IS_ALIGNED(cnt, 4);
+}
+
+#define __HAVE_ARCH_MEMCPY_STREAMING 1
+static __always_inline void memcpy_streaming(void *dst, const void *src,
+					     size_t cnt)
+{
+	if (!cnt)
+		return;
+
+	if (memcpy_flushcache_nt_safe(dst, cnt))
+		memcpy_flushcache(dst, src, cnt);
+	else
+		memcpy(dst, src, cnt);
+}
+
+static __always_inline void memcpy_streaming_drain(void)
+{
+	asm volatile("sfence" : : : "memory");
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/string.h b/include/linux/string.h
index b850bd91b3d8..a4c2d4347f58 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -281,6 +281,26 @@ static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
 }
 #endif
 
+#ifndef __HAVE_ARCH_MEMCPY_STREAMING
+/*
+ * memcpy_streaming() is for write-once copy sites that may use
+ * non-temporal stores on some architectures. Callers must follow it
+ * with memcpy_streaming_drain() before later normal stores that need to
+ * be ordered after the streaming copy. Implementations may fall back to
+ * memcpy() when a specialized backend cannot safely handle the given
+ * transfer, and backends that use regular cached stores can make the
+ * drain a no-op.
+ */
+static inline void memcpy_streaming(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+
+static inline void memcpy_streaming_drain(void)
+{
+}
+#endif
+
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *str, char old, char new);
 
-- 
2.20.1


* [PATCH v4 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths
From: Li Zhe @ 2026-06-03  8:01 UTC (permalink / raw)
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

Small constant-sized flushcache copies currently inline only the 4, 8,
and 16-byte cases. Larger constant sizes fall back to
__memcpy_flushcache() even when the destination is naturally aligned.

Factor the movnti sequences into 4/8/16/32/64-byte helpers and extend
the inline fastpath coverage to the additional aligned constant sizes
32, 48, 64, 80, and 96 bytes. Keep the existing 4/8/16-byte cases
handled directly in memcpy_flushcache() so they do not pick up the
extra alignment gating used by the larger fixed-size helpers.

Because memcpy_streaming() maps aligned transfers to
memcpy_flushcache(), these additional fixed-size cases also stay on
the inline movnti path for that helper.
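
For illustration only (not part of this patch): the gate added in the
previous patch, cnt && IS_ALIGNED(d, 8) && IS_ALIGNED(cnt, 4), can be
tried out in user space. This is a minimal sketch; is_aligned(),
nt_safe_sketch() and nt_safe_selfcheck() are hypothetical names rather
than kernel symbols, and the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for IS_ALIGNED(); 'a' must be a power of two. */
static bool is_aligned(uintptr_t x, uintptr_t a)
{
	return (x & (a - 1)) == 0;
}

/*
 * Same predicate as memcpy_flushcache_nt_safe(): non-zero length,
 * 8-byte-aligned destination, length a multiple of 4.
 */
static bool nt_safe_sketch(const void *dst, size_t cnt)
{
	return cnt && is_aligned((uintptr_t)dst, 8) && is_aligned(cnt, 4);
}

/* Sample inputs; only the pointer value is inspected, never dereferenced. */
static void nt_safe_selfcheck(void)
{
	_Alignas(8) unsigned char buf[16];

	assert(nt_safe_sketch(buf, 64));      /* aligned dst, multiple of 4 */
	assert(nt_safe_sketch(buf, 12));      /* a 4-byte tail is still allowed */
	assert(!nt_safe_sketch(buf + 4, 64)); /* dst is only 4-byte aligned */
	assert(!nt_safe_sketch(buf, 6));      /* a cached tail fragment would remain */
	assert(!nt_safe_sketch(buf, 0));      /* empty copy */
}
```

Any transfer that fails this check takes the plain memcpy() branch of
memcpy_streaming().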

Issue the stores in ascending address order so write-combining sees a
forward stream. Keep all other sizes on __memcpy_flushcache(), and keep
zero-length copies returning immediately.
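
For illustration only (not part of this patch): a user-space sketch of
how the larger sizes decompose into 8-byte stores issued in ascending
address order. copy_8() is a plain memcpy() standing in for movntiq, so
nothing here is actually non-temporal, and the destination alignment
check is left out. All names in the sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical trace so the order of the 8-byte stores is observable. */
static size_t store_offsets[12];	/* 96 / 8 stores at most */
static size_t store_count;
static unsigned char *trace_base;

/* Stand-in for one movntiq: a plain 8-byte copy, NOT a non-temporal store. */
static void copy_8(unsigned char *dst, const unsigned char *src)
{
	store_offsets[store_count++] = (size_t)(dst - trace_base);
	memcpy(dst, src, 8);
}

static void copy_16(unsigned char *dst, const unsigned char *src)
{
	copy_8(dst, src);
	copy_8(dst + 8, src + 8);
}

static void copy_32(unsigned char *dst, const unsigned char *src)
{
	copy_16(dst, src);
	copy_16(dst + 16, src + 16);
}

static void copy_64(unsigned char *dst, const unsigned char *src)
{
	copy_32(dst, src);
	copy_32(dst + 32, src + 32);
}

/* Returns 1 if the size was handled inline, 0 if the caller must fall back. */
static int copy_fixed_large(unsigned char *dst, const unsigned char *src,
			    size_t cnt)
{
	trace_base = dst;
	store_count = 0;

	switch (cnt) {
	case 32:
		copy_32(dst, src);
		return 1;
	case 48:
		copy_32(dst, src);
		copy_16(dst + 32, src + 32);
		return 1;
	case 64:
		copy_64(dst, src);
		return 1;
	case 80:
		copy_64(dst, src);
		copy_16(dst + 64, src + 64);
		return 1;
	case 96:
		copy_64(dst, src);
		copy_32(dst + 64, src + 64);
		return 1;
	}

	return 0;
}

/* Check that an 80-byte copy is ten 8-byte stores at offsets 0, 8, ..., 72. */
static void copy_fixed_selfcheck(void)
{
	unsigned char src[96], dst[96] = { 0 };
	size_t i;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (unsigned char)(i + 1);

	assert(copy_fixed_large(dst, src, 80) == 1);
	assert(store_count == 10);
	for (i = 0; i < store_count; i++)
		assert(store_offsets[i] == 8 * i);
	assert(memcmp(dst, src, 80) == 0);
	assert(dst[80] == 0);				/* bytes past cnt untouched */
	assert(copy_fixed_large(dst, src, 40) == 0);	/* not a covered size */
}
```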

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 arch/x86/include/asm/string_64.h | 122 ++++++++++++++++++++++++++-----
 1 file changed, 104 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index aee63108577f..16d1aac2da24 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -83,24 +83,6 @@ int strcmp(const char *cs, const char *ct);
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
 void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
-static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
-{
-	if (__builtin_constant_p(cnt)) {
-		switch (cnt) {
-			case 4:
-				asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
-				return;
-			case 8:
-				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
-				return;
-			case 16:
-				asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
-				asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
-				return;
-		}
-	}
-	__memcpy_flushcache(dst, src, cnt);
-}
 
 /*
  * Only map memcpy_streaming() to memcpy_flushcache() when the destination
@@ -115,6 +97,110 @@ static __always_inline bool memcpy_flushcache_nt_safe(const void *dst,
 	return cnt && IS_ALIGNED(d, 8) && IS_ALIGNED(cnt, 4);
 }
 
+static __always_inline void memcpy_flushcache_4(void *dst, const void *src)
+{
+	asm volatile("movntil %1, %0"
+		     : "=m"(*(u32 *)dst)
+		     : "r"(*(const u32 *)src)
+		     : "memory");
+}
+
+static __always_inline void memcpy_flushcache_8(void *dst, const void *src)
+{
+	asm volatile("movntiq %1, %0"
+		     : "=m"(*(u64 *)dst)
+		     : "r"(*(const u64 *)src)
+		     : "memory");
+}
+
+static __always_inline void memcpy_flushcache_16(void *dst,
+						 const void *src)
+{
+	memcpy_flushcache_8(dst, src);
+	memcpy_flushcache_8(dst + 8, src + 8);
+}
+
+static __always_inline void memcpy_flushcache_32(void *dst,
+						 const void *src)
+{
+	memcpy_flushcache_16(dst, src);
+	memcpy_flushcache_16(dst + 16, src + 16);
+}
+
+static __always_inline void memcpy_flushcache_64(void *dst,
+						 const void *src)
+{
+	memcpy_flushcache_32(dst, src);
+	memcpy_flushcache_32(dst + 32, src + 32);
+}
+
+/*
+ * Keep the additional aligned fixed-size cases on the inline movnti path.
+ * Leave the existing 4/8/16-byte cases handled directly in
+ * memcpy_flushcache() so they do not pick up the extra alignment gating
+ * used by the larger fixed-size helpers.
+ */
+static __always_inline int memcpy_flushcache_large(void *dst,
+						   const void *src,
+						   size_t cnt)
+{
+	unsigned long d = (unsigned long)dst;
+	char *dptr = dst;
+	const char *sptr = src;
+
+	if (!IS_ALIGNED(d, 8))
+		return 0;
+
+	switch (cnt) {
+	case 32:
+		memcpy_flushcache_32(dptr, sptr);
+		return 1;
+	case 48:
+		memcpy_flushcache_32(dptr, sptr);
+		memcpy_flushcache_16(dptr + 32, sptr + 32);
+		return 1;
+	case 64:
+		memcpy_flushcache_64(dptr, sptr);
+		return 1;
+	case 80:
+		memcpy_flushcache_64(dptr, sptr);
+		memcpy_flushcache_16(dptr + 64, sptr + 64);
+		return 1;
+	case 96:
+		memcpy_flushcache_64(dptr, sptr);
+		memcpy_flushcache_32(dptr + 64, sptr + 64);
+		return 1;
+	}
+
+	return 0;
+}
+
+static __always_inline void memcpy_flushcache(void *dst, const void *src,
+					      size_t cnt)
+{
+	if (!cnt)
+		return;
+
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+		case 4:
+			memcpy_flushcache_4(dst, src);
+			return;
+		case 8:
+			memcpy_flushcache_8(dst, src);
+			return;
+		case 16:
+			memcpy_flushcache_16(dst, src);
+			return;
+		}
+
+		if (memcpy_flushcache_large(dst, src, cnt))
+			return;
+	}
+
+	__memcpy_flushcache(dst, src, cnt);
+}
+
 #define __HAVE_ARCH_MEMCPY_STREAMING 1
 static __always_inline void memcpy_streaming(void *dst, const void *src,
 					     size_t cnt)
-- 
2.20.1


* [PATCH v4 8/8] mm: use memcpy_streaming() in zone-device template copies
From: Li Zhe @ 2026-06-03  8:01 UTC (permalink / raw)
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

The template fast path still leaves the actual copy sequence up to the
compiler. Use the streaming-copy helpers introduced in the previous
patches for the ZONE_DEVICE template-copy path so common mm code can
request a write-once copy primitive without embedding arch-specific
store layout in the generic layer.

ZONE_DEVICE memmap initialization is a write-once path: each struct page
is populated once and is not expected to be reused from cache
immediately afterwards. A regular cached copy can therefore incur
write-allocate traffic and pollute the cache without much benefit.
Using memcpy_streaming() lets this path use an architecture-optimized
streaming copy where available, while still degrading to memcpy() on
architectures that do not provide a specialized implementation.

Keep pageblock-aligned PFNs on memcpy() so pageblock initialization can
immediately read back page metadata without introducing a
read-after-streaming dependency. For the remaining PFNs, use
memcpy_streaming() so the hot path can avoid write-allocate traffic
while still leaving unsupported or unsuitable cases to the fallback
implementation.

When the streaming backend uses non-temporal stores, order them before
entering memmap_init_compound(), before prep_compound_head() updates the
overlapping compound metadata, and before returning from
memmap_init_zone_device().
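
For illustration only (not part of this patch): a user-space sketch of
the copy/drain pattern for the non-compound case. struct fake_page, the
512-page pageblock size and every helper name are assumptions made for
the example. streaming_copy() is shaped like the generic fallback
(plain memcpy()), so the drain only counts calls here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct page; the fields are illustrative only. */
struct fake_page {
	unsigned long flags;
	unsigned long pfn_bits;
	unsigned long pad[6];
};

#define FAKE_PAGEBLOCK_NR_PAGES	512UL	/* assumed pageblock size, in pages */

static unsigned long cached_copies, streaming_copies, drain_calls;

/* Shaped like the generic fallback: a plain memcpy(), no non-temporal stores. */
static void streaming_copy(void *dst, const void *src, size_t cnt)
{
	streaming_copies++;
	memcpy(dst, src, cnt);
}

/* Nothing to drain after cached stores; an NT backend would fence here. */
static void streaming_drain(void)
{
	drain_calls++;
}

static void init_from_template(struct fake_page *pages, unsigned long start_pfn,
			       unsigned long nr, struct fake_page *tmpl)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		unsigned long pfn = start_pfn + i;

		tmpl->pfn_bits = pfn;	/* refresh the PFN-dependent field */
		if (pfn % FAKE_PAGEBLOCK_NR_PAGES == 0) {
			/* Read back right away in the real code: cached copy. */
			cached_copies++;
			memcpy(&pages[i], tmpl, sizeof(*tmpl));
		} else {
			streaming_copy(&pages[i], tmpl, sizeof(*tmpl));
		}
	}

	/* Order every streaming copy before the caller uses the array. */
	streaming_drain();
}

static void template_selfcheck(void)
{
	static struct fake_page pages[1024];
	struct fake_page tmpl = { .flags = 0xabcUL };
	unsigned long i;

	init_from_template(pages, 0x10000UL, 1024, &tmpl);

	for (i = 0; i < 1024; i++) {
		assert(pages[i].flags == 0xabcUL);
		assert(pages[i].pfn_bits == 0x10000UL + i);
	}
	assert(cached_copies == 2);	/* PFNs 0x10000 and 0x10200 */
	assert(streaming_copies == 1022);
	assert(drain_calls == 1);
}
```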

Keep sanitized builds on the slow path so KASAN/KMSAN retain their
instrumented stores.

Tested in a VM with a 100 GB fsdax namespace device configured with
map=dev and a 100 GB devdax namespace (align=2097152) on Intel Ice Lake
server.

Test procedure:
Rebind the nd_pmem and dax_pmem drivers 30 times and collect the memmap
initialization time from the pr_debug() output of
memmap_init_zone_device().

Base(v7.1-rc6):
  First binding for nd_pmem driver: 1466 ms
  Average of subsequent rebinds: 262.12 ms

  First binding for dax_pmem driver: 1430 ms
  Average of subsequent rebinds: 229.12 ms

With this series:
  First binding for nd_pmem driver: 1359 ms
  Average of subsequent rebinds: 108.36 ms

  First binding for dax_pmem driver: 1273 ms
  Average of subsequent rebinds: 100.17 ms

This reduces the average rebind time by about 58.7% for nd_pmem and
56.3% for dax_pmem.
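
The percentages follow from the averages above as (base - new) / base.
A minimal C sketch that reproduces the arithmetic (helper names are
hypothetical) gives roughly 58.66% for nd_pmem and 56.28% for dax_pmem:

```c
#include <assert.h>

/* Percentage by which new_ms is lower than base_ms. */
static double percent_reduction(double base_ms, double new_ms)
{
	return (base_ms - new_ms) / base_ms * 100.0;
}

/* Average rebind times, in ms, taken from the results listed above. */
static void percent_selfcheck(void)
{
	double nd = percent_reduction(262.12, 108.36);
	double dax = percent_reduction(229.12, 100.17);

	assert(nd > 58.65 && nd < 58.67);
	assert(dax > 56.27 && dax < 56.29);
}
```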

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/mm_init.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index ad078ee354fb..fbc873284fb8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1070,11 +1070,21 @@ static void __ref zone_device_page_init_slow(struct page *page,
 
 static inline bool zone_device_page_init_optimization_enabled(void)
 {
+	/*
+	 * Keep sanitized builds on the slow path so their stores stay
+	 * instrumented.
+	 */
+	if (IS_ENABLED(CONFIG_KASAN) || IS_ENABLED(CONFIG_KMSAN))
+		return false;
+
 	/*
 	 * The template fast path copies a preinitialized struct page image.
 	 * Skip it when the page_ref_set tracepoint is enabled.
 	 */
-	return !page_ref_tracepoint_active(page_ref_set);
+	if (page_ref_tracepoint_active(page_ref_set))
+		return false;
+
+	return true;
 }
 
 static inline void zone_device_template_page_init(struct page *template,
@@ -1117,9 +1127,19 @@ static void zone_device_page_init_from_template(struct page *page,
 	 * 'template' carries the invariant portion of a ZONE_DEVICE struct
 	 * page. Update the PFN-dependent fields in place before copying it
 	 * to the destination page.
+	 *
+	 * pageblock-aligned pages immediately feed
+	 * init_pageblock_migratetype(), which reads back page metadata via
+	 * helpers like page_zone(page). Avoid a read-after-streaming
+	 * dependency for these rare pages by using regular cached stores
+	 * instead of non-temporal ones.
 	 */
 	zone_device_page_update_template(template, pfn);
-	memcpy(page, template, sizeof(*page));
+	if (unlikely(pageblock_aligned(pfn)))
+		memcpy(page, template, sizeof(*page));
+	else
+		memcpy_streaming(page, template, sizeof(*page));
+
 	zone_device_page_init_pageblock(page, pfn);
 }
 
@@ -1184,6 +1204,15 @@ static void __ref memmap_init_compound(struct page *head,
 			zone_device_tail_page_init(page, pfn, zone_idx, nid,
 						   pgmap, head, order);
 	}
+
+	/*
+	 * When the template path is enabled, order the preceding tail-page copies
+	 * before prep_compound_head() updates the overlapping compound metadata
+	 * in the first tail-page descriptors. If memcpy_streaming() fell back to
+	 * regular cached stores, memcpy_streaming_drain() may be a no-op.
+	 */
+	if (use_template)
+		memcpy_streaming_drain();
 	prep_compound_head(head, order);
 }
 
@@ -1248,10 +1277,26 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		if (pfns_per_compound == 1)
 			continue;
 
+		/*
+		 * When the template path is enabled, order the preceding head-page copy
+		 * before memmap_init_compound(), which immediately updates compound-head
+		 * metadata. If memcpy_streaming() fell back to regular cached stores,
+		 * memcpy_streaming_drain() may be a no-op.
+		 */
+		if (use_template)
+			memcpy_streaming_drain();
+
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
 				     compound_nr_pages(altmap, pgmap),
 				     use_template);
 	}
+	/*
+	 * Ensure any prior template copies are ordered before returning.
+	 * On architectures where memcpy_streaming() used regular cached stores,
+	 * memcpy_streaming_drain() may be a no-op.
+	 */
+	if (use_template)
+		memcpy_streaming_drain();
 
 	pr_debug("%s initialised %lu pages in %ums\n", __func__,
 		nr_pages, jiffies_to_msecs(jiffies - start));
-- 
2.20.1


* Re: [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization
From: Alistair Popple @ 2026-06-04  8:14 UTC (permalink / raw)
  To: Li Zhe
  Cc: akpm, arnd, bp, dave.hansen, david, kees, mingo, rppt, tglx,
	linux-arch, linux-hardening, linux-kernel, linux-mm, x86

On 2026-06-03 at 18:01 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> memmap_init_zone_device() can spend a substantial amount of time
> initializing large ZONE_DEVICE ranges because it repeats nearly
> identical struct page setup for every PFN.
> 
> This series reduces that overhead in eight steps.
> 
> The first patch fixes a stale comment in __init_zone_device_page() so
> the documented refcount policy matches the current ZONE_DEVICE code.
> 
> The second patch factors the reusable pieces out of
> __init_zone_device_page() so later patches can share the same logic
> without changing the existing slow path.
> 
> The third patch adds set_page_section_from_pfn(), so callers that want
> to refresh section bits from a PFN no longer need to open-code
> SECTION_IN_PAGE_FLAGS handling.
> 
> The fourth patch adds a template-based fast path for ZONE_DEVICE head
> pages. Instead of rebuilding the same struct page state for every PFN,
> it prepares one reusable template through the existing slow path,
> refreshes the PFN-dependent fields in that template, and copies it to
> each destination page.
> 
> The fifth patch extends the same template-based approach to compound
> tails, so pfns_per_compound > 1 can also benefit from the fast path.
> 
> The sixth patch introduces memcpy_streaming() and
> memcpy_streaming_drain() as a generic interface for write-once copies.
> Architectures that do not provide a specialized backend, or cases that
> cannot safely use one, fall back to memcpy().
> 
> The seventh patch extends x86 memcpy_flushcache() small fixed-size
> fastpaths so struct-page-sized streaming copies can stay on the inline
> path when alignment permits.
> 
> The last patch switches the ZONE_DEVICE template-copy path over to
> memcpy_streaming(). It keeps pageblock-aligned PFNs on regular memcpy(),
> uses memcpy_streaming() for the remaining write-once copies, and drains
> streaming stores before later metadata updates that may depend on them.
> 
> This is not intended as a steady-state data-path optimization. Its
> benefit is in pmem bring-up paths where memmap_init_zone_device()
> dominates device online / rebind latency, such as:
>   - fsdax or devdax namespace creation and reconfiguration
>   - nd_pmem / dax_pmem driver bind or rebind
> 
> In those paths, the kernel initializes a large vmemmap range once and
> does not immediately benefit from keeping the copied struct page state
> hot in cache. Reducing write-allocate traffic in that one-time setup
> path can therefore reduce end-to-end device bring-up latency.
> 
> The optimized path is disabled when the page_ref_set tracepoint is
> enabled, and sanitized builds remain on the slow path so their
> instrumented stores are preserved.
> 
> Testing
> =======
> 
> Tests were run in a VM on an Intel Ice Lake server.
> 
> Two PMEM configurations were used:
>   - a 100 GB fsdax namespace configured with map=dev, which exercises
>     the nd_pmem rebind path (pfns_per_compound == 1)
>   - a 100 GB devdax namespace configured with align=2097152, which
>     exercises the dax_pmem rebind path (pfns_per_compound > 1)
> 
> For each configuration, the corresponding driver was unbound and
> rebound 30 times. Memmap initialization latency was collected from the
> pr_debug() output of memmap_init_zone_device().
> 
> The first bind is reported separately, and the average of subsequent
> rebinds is used as the steady-state result.
> 
> Performance
> ===========
> 
> nd_pmem rebind, 100 GB fsdax namespace, map=dev
>   Base(v7.1-rc6):
>     First binding: 1466 ms
>     Average of subsequent rebinds: 262.12 ms
>   Full series:
>     First binding: 1359 ms
>     Average of subsequent rebinds: 108.36 ms
> 
> dax_pmem rebind, 100 GB devdax namespace, align=2097152
>   Base(v7.1-rc6):
>     First binding: 1430 ms
>     Average of subsequent rebinds: 229.12 ms
>   Full series:
>     First binding: 1273 ms
>     Average of subsequent rebinds: 100.17 ms

The results here are impressive, but I've been having trouble replicating them
with hmm_test on my local development machines. Both an older AMD machine and
a newer Arrow Lake-based machine show ~3% worse performance with this series
applied for the ZONE_DEVICE_PRIVATE case.

This is based on measuring the memremap_pages() call when inserting test_hmm.ko
in a VM, using the following hack to time ten 64GB memremaps and print the
average. Is there an easy way for me to replicate your results in a VM? Or is
there something in my testing that I'm missing here?

---

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 213504915737..a1d5463dbc86 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -34,7 +34,7 @@
 
 #define DMIRROR_NDEVICES		4
 #define DMIRROR_RANGE_FAULT_TIMEOUT	1000
-#define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
+#define DEVMEM_CHUNK_SIZE		(64 * 1024 * 1024 * 1024UL)
 #define DEVMEM_CHUNKS_RESERVE		16
 
 /*
@@ -565,6 +565,8 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	unsigned long pfn_last;
 	void *ptr;
 	int ret = -ENOMEM;
+	int i;
+	u64 t0, total = 0;
 
 	devmem = kzalloc_obj(*devmem);
 	if (!devmem)
@@ -613,6 +615,22 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		mdevice->devmem_capacity = new_capacity;
 		mdevice->devmem_chunks = new_chunks;
 	}
+
+	for (i = 0; i < 10; i++) {
+		t0 = ktime_get_ns();
+		ptr = memremap_pages(&devmem->pagemap, numa_node_id());
+		total += ktime_get_ns() - t0;
+		if (IS_ERR_OR_NULL(ptr)) {
+			if (ptr)
+				ret = PTR_ERR(ptr);
+			else
+				ret = -EFAULT;
+			goto err_release;
+		}
+		memunmap_pages(&devmem->pagemap);
+	}
+	pr_info("avg memremap %llu ns\n", total / i);
+
 	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
 	if (IS_ERR_OR_NULL(ptr)) {
 		if (ptr)
@@ -629,7 +647,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 	mutex_unlock(&mdevice->devmem_lock);
 
-	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+	pr_info("added new %lu MB chunk (total %u chunks, %lu MB) PFNs [0x%lx 0x%lx)\n",
 		DEVMEM_CHUNK_SIZE / (1024 * 1024),
 		mdevice->devmem_count,
 		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),

> Li Zhe (8):
>   mm: fix stale ZONE_DEVICE refcount comment
>   mm: factor zone-device page init helpers out of
>     __init_zone_device_page
>   mm: add a set_page_section_from_pfn() helper
>   mm: add a template-based fast path for zone-device page init
>   mm: extend the template fast path to zone-device compound tails
>   string: introduce memcpy_streaming() helpers
>   x86/string: extend memcpy_flushcache() fixed-size fastpaths
>   mm: use memcpy_streaming() in zone-device template copies
> 
>  arch/x86/include/asm/string_64.h | 140 ++++++++++++++++++--
>  include/linux/mm.h               |  19 ++-
>  include/linux/string.h           |  20 +++
>  mm/mm_init.c                     | 221 +++++++++++++++++++++++++++----
>  4 files changed, 360 insertions(+), 40 deletions(-)
> 
> ---
> v3: https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/
> v2: https://lore.kernel.org/all/20260521040124.10608-1-lizhe.67@bytedance.com/
> v1: https://lore.kernel.org/all/20260515082045.63029-1-lizhe.67@bytedance.com/
> 
> Changelogs:
> 
> v3->v4:
> - Rebase the series from v7.1-rc3 to v7.1-rc6.
> - Rework patch 4 so the reusable head-page template is seeded from the
>   first real struct page, rather than being initialized directly on a
>   stack-resident template object. Also add an explicit !nr_pages early
>   return. Suggested by Andrew Morton.
> - Rework patch 5 similarly for compound tails: seed the reusable
>   tail-page template from the first real tail page, thread
>   use_template through compound-page initialization, and reuse that
>   prepared tail-page image for the remaining tails. Suggested by Andrew
>   Morton.
> - Tighten patch 6 so memcpy_streaming() maps to memcpy_flushcache() only
>   when the destination alignment and size allow the transfer to stay
>   entirely on the non-temporal path; other cases fall back to memcpy().
>   Suggested by Andrew Morton.
> - Rework patch 7 so the existing 4/8/16-byte cases remain handled
>   directly in memcpy_flushcache(), while the new aligned fixed-size
>   fastpaths cover only the larger 32/48/64/80/96-byte cases. Suggested
>   by Andrew Morton.
> 
> For changelogs of earlier revisions, please refer to the v3 cover letter:
> https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/
> 
> -- 
> 2.20.1
