[PATCH v5 0/8] mm: optimize zone-device memmap initialization

The Linux Kernel Mailing List

* [PATCH v5 0/8] mm: optimize zone-device memmap initialization
@ 2026-07-01  9:05 Li Zhe
  2026-07-01  9:05 ` [PATCH v5 1/8] mm: fix stale ZONE_DEVICE refcount comment Li Zhe
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

memmap_init_zone_device() can take a noticeable amount of time when large
pmem namespaces are bound or rebound, because it initializes nearly
identical struct page descriptors one PFN at a time. This series reduces
that ZONE_DEVICE memmap initialization overhead by reusing prepared
struct page templates and, on x86, using memcpy_nt() for the template
copy path.

The main target is large fsdax/devdax pmem configurations, where the
cost of initializing the memmap shows up directly in nd_pmem/dax_pmem
bind and rebind latency.

Patches 1-3 are preparatory cleanups and helper extraction. Patches 4-5
add the template-copy fast path for head pages and compound tails.
Patches 6-8 introduce memcpy_nt()/memcpy_nt_drain(), extend the x86
fixed-size memcpy_flushcache() inline cases used by that helper, and
switch the template-copy path over to memcpy_nt().

The fast path remains disabled when the page_ref_set tracepoint is
active, and sanitized builds stay on the slow path so their instrumented
stores are preserved. Architectures without a specialized memcpy_nt()
backend continue to fall back to memcpy().

Tested in a VM on an Intel Ice Lake server, with a 100 GB fsdax namespace
configured with map=dev and a 100 GB devdax namespace (align=2097152).

Test procedure:
Rebind the nd_pmem and dax_pmem drivers 30 times each and collect the
memmap initialization time from the pr_debug() output of
memmap_init_zone_device().

Base (v7.2-rc1):
  First binding for nd_pmem driver: 1456 ms
  Average of subsequent rebinds: 244.28 ms

  First binding for dax_pmem driver: 1462 ms
  Average of subsequent rebinds: 273.31 ms

With this series applied:
  First binding for nd_pmem driver: 1272 ms
  Average of subsequent rebinds: 96.79 ms

  First binding for dax_pmem driver: 1354 ms
  Average of subsequent rebinds: 119.04 ms

This reduces the average rebind time by about 60.4% for nd_pmem and
56.4% for dax_pmem.
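
For reference, the percentages above follow from (base - patched) / base
applied to the average rebind times. A minimal sketch of that arithmetic
is below; percent_reduction() is a hypothetical helper used only for
illustration and is not part of the series.

```c
#include <assert.h>

/*
 * Relative reduction, in percent, when a latency drops from base_ms to
 * patched_ms. Hypothetical helper for checking the cover-letter numbers.
 */
static double percent_reduction(double base_ms, double patched_ms)
{
	assert(base_ms > 0.0);
	return (base_ms - patched_ms) / base_ms * 100.0;
}
```

Calling percent_reduction(244.28, 96.79) and
percent_reduction(273.31, 119.04) gives the nd_pmem and dax_pmem figures
quoted above, to one decimal place.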

As an additional data point, I also ran a smaller set of measurements on
the same physical x86_64 host with a 100 GB PMEM region created via the
memmap= kernel command line, configured as fsdax and devdax namespaces
with map=dev and 2 MiB alignment.

For brevity, the individual patches keep only the VM results rather than
including a second set of physical-host measurements throughout the
series. The physical-host numbers below are included only as
supplemental evidence that the same optimization also provides a similar
benefit on a non-virtualized system.

Test procedure:
Reconfigure the namespace mode, rebind the nd_pmem or dax_pmem driver
once, and collect the memmap initialization time from the pr_debug()
output of memmap_init_zone_device().

Base (v7.2-rc1):
  nd_pmem / fsdax: 179 ms
  dax_pmem / devdax: 264 ms

With this series applied:
  nd_pmem / fsdax: 82 ms
  dax_pmem / devdax: 113 ms

This reduces the measured rebind time by about 54.2% for nd_pmem and
57.2% for dax_pmem on that setup, which is broadly consistent with the
VM results above.

As another supplemental data point, I also measured the test_hmm.ko
module on the same physical x86_64 host, using the test_hmm.ko setup
from the previous discussion that times ten 64 GB
memremap_pages()/memunmap_pages() iterations during module insertion[1].
By default, module insertion initializes two DEVICE_PRIVATE dmirror
devices, so two avg memremap values are reported; each value is the
average for one 64 GB chunk.

This is not the primary target workload of the series, but it exercises
the same large ZONE_DEVICE memmap initialization path and shows the same
direction of improvement.

Base (v7.2-rc1):
  avg memremap reported during module insertion: 116689362 ns, 116539263 ns

With this series applied:
  avg memremap reported during module insertion: 54607108 ns, 54458236 ns

This corresponds to about a 53.2% reduction based on the mean of the
reported values, which is again consistent with the pmem bind/rebind
results above.

[1] https://lore.kernel.org/all/aiEoByaQdRR3xtM5@nvdebian.thelocal/

Li Zhe (8):
  mm: fix stale ZONE_DEVICE refcount comment
  mm: factor zone-device page init helpers out of
    __init_zone_device_page
  mm: add a set_page_section_from_pfn() helper
  mm: add a template-based fast path for zone-device page init
  mm: extend the template fast path to zone-device compound tails
  string: introduce memcpy_nt() helpers
  x86/string: extend memcpy_flushcache() fixed-size fastpaths
  mm: use memcpy_nt() in zone-device template copies

 arch/x86/include/asm/string_64.h |  96 +++++++++++++-
 include/linux/mm.h               |  19 ++-
 include/linux/string.h           |  18 +++
 mm/mm_init.c                     | 209 +++++++++++++++++++++++++++----
 4 files changed, 311 insertions(+), 31 deletions(-)

---
v4: https://lore.kernel.org/all/20260603080152.64728-1-lizhe.67@bytedance.com/
v3: https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/
v2: https://lore.kernel.org/all/20260521040124.10608-1-lizhe.67@bytedance.com/
v1: https://lore.kernel.org/all/20260515082045.63029-1-lizhe.67@bytedance.com/

Changelogs:

v4->v5:
- Rebase the series from v7.1-rc6 to v7.2-rc1, and refresh the VM
  performance numbers.
- Simplify patch 6 around a small memcpy_nt()/memcpy_nt_drain()
  interface, rename the previous memcpy_streaming() helpers
  accordingly, make the generic implementation fall back to memcpy(),
  and let x86 reuse the existing memcpy_flushcache() backend instead of
  carrying extra policy/alignment logic in the generic layer. Suggested
  by Borislav Petkov.
- Add physical-host measurements for a 100 GB PMEM region simulated via
  the memmap= kernel command line to the cover letter as supplemental
  evidence that the same optimization also improves fsdax/devdax map=dev
  bind/rebind latency outside the VM, while keeping the per-patch
  performance data limited to the VM measurements for consistency across
  the series. Suggested by Borislav Petkov.
- Add supplemental test_hmm.ko results to the cover letter as another
  physical-host data point, in addition to the pmem bind/rebind
  measurements.

v3->v4:
- Rebase the series from v7.1-rc3 to v7.1-rc6.
- Rework patch 4 so the reusable head-page template is seeded from the
  first real struct page, rather than being initialized directly on a
  stack-resident template object. Also add an explicit !nr_pages early
  return. Suggested by Andrew Morton.
- Rework patch 5 similarly for compound tails: seed the reusable
  tail-page template from the first real tail page, thread
  use_template through compound-page initialization, and reuse that
  prepared tail-page image for the remaining tails. Suggested by Andrew
  Morton.
- Tighten patch 6 so memcpy_streaming() maps to memcpy_flushcache() only
  when the destination alignment and size allow the transfer to stay
  entirely on the non-temporal path; other cases fall back to memcpy().
  Suggested by Andrew Morton.
- Rework patch 7 so the existing 4/8/16-byte cases remain handled
  directly in memcpy_flushcache(), while the new aligned fixed-size
  fastpaths cover only the larger 32/48/64/80/96-byte cases. Suggested
  by Andrew Morton.

For changelogs of earlier revisions, please refer to the v3 cover letter:
https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/

-- 
2.20.1

* [PATCH v5 1/8] mm: fix stale ZONE_DEVICE refcount comment
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
@ 2026-07-01  9:05 ` Li Zhe
  2026-07-01  9:05 ` [PATCH v5 2/8] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

The comment in __init_zone_device_page() still uses the old
MEMORY_TYPE_* names and implies that FS_DAX pages regain a
refcount of 1 in the free path. That no longer matches the code.

Update the comment to describe the current policy correctly:
MEMORY_DEVICE_GENERIC pages regain a refcount of 1 in the free path,
while the remaining ZONE_DEVICE types start from 0 here and raise the
count again when the allocator or driver hands the page out.

No functional change intended.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/mm_init.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 0f64909e8d20..95808ab5cfdb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1030,13 +1030,9 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	page->zone_device_data = NULL;
 
 	/*
-	 * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
-	 * directly to the driver page allocator which will set the page count
-	 * to 1 when allocating the page.
-	 *
-	 * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
-	 * their refcount reset to one whenever they are freed (ie. after
-	 * their refcount drops to 0).
+	 * MEMORY_DEVICE_GENERIC pages regain a refcount of 1 in the free
+	 * path. The remaining ZONE_DEVICE types start from 0 here and raise
+	 * the count again when the allocator or driver hands the page out.
 	 */
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_FS_DAX:
-- 
2.20.1

* [PATCH v5 2/8] mm: factor zone-device page init helpers out of __init_zone_device_page
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
  2026-07-01  9:05 ` [PATCH v5 1/8] mm: fix stale ZONE_DEVICE refcount comment Li Zhe
@ 2026-07-01  9:05 ` Li Zhe
  2026-07-01  9:05 ` [PATCH v5 3/8] mm: add a set_page_section_from_pfn() helper Li Zhe
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

memmap_init_zone_device() currently relies on a single helper,
__init_zone_device_page(), that mixes refcount policy with core
ZONE_DEVICE page setup.

Factor the refcount-reset predicate into pagemap_resets_refcount(), move
the common page initialization into __zone_device_page_init(), and wrap
the existing slow path in zone_device_page_init_slow().

This keeps the slow-path behaviour unchanged and gives later patches
reusable helper boundaries.

No functional change intended.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 mm/mm_init.c | 57 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 19 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 95808ab5cfdb..4c7fad440c2a 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1005,11 +1005,38 @@ static void __init memmap_init(void)
 }
 
 #ifdef CONFIG_ZONE_DEVICE
-static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
+/*
+ * Return true when the free path for this pagemap type restores the page
+ * refcount to 1, so memmap_init_zone_device() can keep the count set by
+ * __init_single_page(). Otherwise initialize the refcount to 0 and leave
+ * it to the allocator or pgmap callbacks to raise it when the page is
+ * handed out again.
+ */
+static inline bool pagemap_resets_refcount(const struct dev_pagemap *pgmap)
+{
+	/*
+	 * MEMORY_DEVICE_GENERIC pages regain a refcount of 1 in the free
+	 * path. The remaining ZONE_DEVICE types start from 0 here and raise
+	 * the count again when the allocator or driver hands the page out.
+	 */
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_FS_DAX:
+	case MEMORY_DEVICE_PRIVATE:
+	case MEMORY_DEVICE_COHERENT:
+	case MEMORY_DEVICE_PCI_P2PDMA:
+		return false;
+	case MEMORY_DEVICE_GENERIC:
+		return true;
+	default:
+		WARN_ONCE(1, "Unknown memory type!");
+		return true;
+	}
+}
+
+static void __ref __zone_device_page_init(struct page *page, unsigned long pfn,
 					  unsigned long zone_idx, int nid,
 					  struct dev_pagemap *pgmap)
 {
-
 	__init_single_page(page, pfn, zone_idx, nid);
 
 	/*
@@ -1028,23 +1055,15 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
 	 */
 	page_folio(page)->pgmap = pgmap;
 	page->zone_device_data = NULL;
+}
 
-	/*
-	 * MEMORY_DEVICE_GENERIC pages regain a refcount of 1 in the free
-	 * path. The remaining ZONE_DEVICE types start from 0 here and raise
-	 * the count again when the allocator or driver hands the page out.
-	 */
-	switch (pgmap->type) {
-	case MEMORY_DEVICE_FS_DAX:
-	case MEMORY_DEVICE_PRIVATE:
-	case MEMORY_DEVICE_COHERENT:
-	case MEMORY_DEVICE_PCI_P2PDMA:
+static void __ref zone_device_page_init_slow(struct page *page,
+		unsigned long pfn, unsigned long zone_idx, int nid,
+		struct dev_pagemap *pgmap)
+{
+	__zone_device_page_init(page, pfn, zone_idx, nid, pgmap);
+	if (!pagemap_resets_refcount(pgmap))
 		set_page_count(page, 0);
-		break;
-
-	case MEMORY_DEVICE_GENERIC:
-		break;
-	}
 }
 
 /*
@@ -1090,7 +1109,7 @@ static void __ref memmap_init_compound(struct page *head,
 	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+		zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
 		prep_compound_tail(page, head, order);
 		set_page_count(page, 0);
 	}
@@ -1126,7 +1145,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+		zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
 
 		if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
 			cond_resched();
-- 
2.20.1

* [PATCH v5 3/8] mm: add a set_page_section_from_pfn() helper
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
  2026-07-01  9:05 ` [PATCH v5 1/8] mm: fix stale ZONE_DEVICE refcount comment Li Zhe
  2026-07-01  9:05 ` [PATCH v5 2/8] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
@ 2026-07-01  9:05 ` Li Zhe
  2026-07-01  9:05 ` [PATCH v5 4/8] mm: add a template-based fast path for zone-device page init Li Zhe
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

Callers that want to update section bits from a PFN currently need to
open-code:

	set_page_section(page, pfn_to_section_nr(pfn));

and guard that sequence with #ifdef SECTION_IN_PAGE_FLAGS.

Add set_page_section_from_pfn() to wrap that update in one place. When
section bits are stored in page flags, the helper derives the section
number from the PFN and updates the page flags. Otherwise it degrades to
a no-op.
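
As a rough user-space illustration of the pattern described above (not
the kernel code), the sketch below packs a PFN-derived section number
into a flags word and compiles the helper down to a no-op when the
section field is not stored in the flags. struct sketch_page, every
SKETCH_* constant, and the sketch_* function names are hypothetical; the
bit positions are made up and do not correspond to any real
architecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, chosen only for illustration. */
#define SKETCH_SECTION_IN_FLAGS   1
#define SKETCH_PFN_SECTION_SHIFT  15	/* assumed pages-per-section shift */
#define SKETCH_SECTIONS_PGSHIFT   45	/* assumed position of the field */
#define SKETCH_SECTIONS_MASK      ((UINT64_C(1) << 19) - 1)

struct sketch_page {
	uint64_t flags;
};

#if SKETCH_SECTION_IN_FLAGS
static void sketch_set_page_section_from_pfn(struct sketch_page *page,
					     uint64_t pfn)
{
	uint64_t section = pfn >> SKETCH_PFN_SECTION_SHIFT;

	assert(page != NULL);
	/* Clear the old field first so the helper can be applied repeatedly. */
	page->flags &= ~(SKETCH_SECTIONS_MASK << SKETCH_SECTIONS_PGSHIFT);
	page->flags |= (section & SKETCH_SECTIONS_MASK) << SKETCH_SECTIONS_PGSHIFT;
}

static uint64_t sketch_page_section(const struct sketch_page *page)
{
	return (page->flags >> SKETCH_SECTIONS_PGSHIFT) & SKETCH_SECTIONS_MASK;
}
#else /* section number is not stored in the flags word */
static void sketch_set_page_section_from_pfn(struct sketch_page *page,
					     uint64_t pfn)
{
	(void)page;
	(void)pfn;
}

static uint64_t sketch_page_section(const struct sketch_page *page)
{
	(void)page;
	return 0;
}
#endif
```

Callers use sketch_set_page_section_from_pfn() unconditionally; only the
helper definition knows whether the section field exists, which is the
property the later template patches rely on.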

Convert set_page_links() to use the new helper so later ZONE_DEVICE
fast-path patches can also update section bits without open-coding
SECTION_IN_PAGE_FLAGS at each callsite.

This keeps the PFN-to-section translation local to the configurations
that actually store section bits in struct page flags, and avoids
exposing that detail to generic callers.

No functional change intended.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/mm.h | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 485df9c2dbdd..f78afa63dd3d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2541,11 +2541,26 @@ static inline void set_page_section(struct page *page, unsigned long section)
 	page->flags.f |= (section & SECTIONS_MASK) << SECTIONS_PGSHIFT;
 }
 
+static inline void set_page_section_from_pfn(struct page *page,
+					     unsigned long pfn)
+{
+	set_page_section(page, pfn_to_section_nr(pfn));
+}
+
 static inline unsigned long memdesc_section(memdesc_flags_t mdf)
 {
 	return (mdf.f >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
 }
 #else /* !SECTION_IN_PAGE_FLAGS */
+static inline void set_page_section(struct page *page, unsigned long section)
+{
+}
+
+static inline void set_page_section_from_pfn(struct page *page,
+					     unsigned long pfn)
+{
+}
+
 static inline unsigned long memdesc_section(memdesc_flags_t mdf)
 {
 	return 0;
@@ -2768,9 +2783,7 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 {
 	set_page_zone(page, zone);
 	set_page_node(page, node);
-#ifdef SECTION_IN_PAGE_FLAGS
-	set_page_section(page, pfn_to_section_nr(pfn));
-#endif
+	set_page_section_from_pfn(page, pfn);
 }
 
 /**
-- 
2.20.1

* [PATCH v5 4/8] mm: add a template-based fast path for zone-device page init
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
                   ` (2 preceding siblings ...)
  2026-07-01  9:05 ` [PATCH v5 3/8] mm: add a set_page_section_from_pfn() helper Li Zhe
@ 2026-07-01  9:05 ` Li Zhe
  2026-07-03 14:06   ` Mike Rapoport
  2026-07-01  9:05 ` [PATCH v5 5/8] mm: extend the template fast path to zone-device compound tails Li Zhe
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

memmap_init_zone_device() repeats nearly identical head-page
initialization for each PFN. Prepare one reusable ZONE_DEVICE head-page
template through the existing slow path, refresh the PFN-dependent
fields in that template before each copy, and memcpy it into each
destination page.

The optimized path assigns _refcount through the copied template, so
keep it disabled when the page_ref_set tracepoint is enabled.

This patch accelerates head-page initialization. The pfns_per_compound
== 1 case gets the full benefit here; compound tails are handled in the
next patch.
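
As a rough user-space illustration of the template approach (not the
kernel code), the sketch below initializes the first entry through a
slow path, seeds a template from it, and then refreshes only a
PFN-derived section field in the template before each memcpy(). struct
sketch_page, the SKETCH_* constants, and all sketch_* functions are
hypothetical stand-ins; the field layout is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct page; field names are illustrative. */
struct sketch_page {
	uint64_t flags;		/* invariant bits plus a PFN-derived section field */
	void *pgmap;		/* identical for every page in the range */
	void *zone_device_data;
	int refcount;
};

#define SKETCH_PFN_SECTION_SHIFT 15	/* assumed pages-per-section shift */
#define SKETCH_SECTIONS_PGSHIFT  45	/* assumed position of the field */
#define SKETCH_SECTIONS_MASK     ((UINT64_C(1) << 19) - 1)
#define SKETCH_ZONE_BITS         UINT64_C(0x4)	/* assumed invariant flag bits */

static void sketch_set_section(struct sketch_page *page, uint64_t pfn)
{
	uint64_t nr = pfn >> SKETCH_PFN_SECTION_SHIFT;

	page->flags &= ~(SKETCH_SECTIONS_MASK << SKETCH_SECTIONS_PGSHIFT);
	page->flags |= (nr & SKETCH_SECTIONS_MASK) << SKETCH_SECTIONS_PGSHIFT;
}

/* Slow path: every field is written individually for every page. */
static void sketch_page_init_slow(struct sketch_page *page, uint64_t pfn,
				  void *pgmap)
{
	memset(page, 0, sizeof(*page));
	page->flags = SKETCH_ZONE_BITS;
	sketch_set_section(page, pfn);
	page->pgmap = pgmap;
	page->zone_device_data = NULL;
	page->refcount = 0;
}

/* Fast path: refresh the PFN-dependent field in the template, then copy. */
static void sketch_page_init_from_template(struct sketch_page *page,
					   uint64_t pfn,
					   struct sketch_page *tmpl)
{
	sketch_set_section(tmpl, pfn);
	memcpy(page, tmpl, sizeof(*page));
}

static void sketch_memmap_init(struct sketch_page *map, uint64_t start_pfn,
			       uint64_t nr_pages, void *pgmap, int use_template)
{
	struct sketch_page tmpl;
	uint64_t i = 0;

	assert(map != NULL);
	if (!nr_pages)
		return;

	/* Seed the template from the first real entry, not a stack object. */
	if (use_template) {
		sketch_page_init_slow(&map[0], start_pfn, pgmap);
		memcpy(&tmpl, &map[0], sizeof(tmpl));
		i = 1;
	}

	for (; i < nr_pages; i++) {
		if (use_template)
			sketch_page_init_from_template(&map[i], start_pfn + i,
						       &tmpl);
		else
			sketch_page_init_slow(&map[i], start_pfn + i, pgmap);
	}
}
```

Running sketch_memmap_init() twice over the same PFN range, once with
use_template set and once without, should leave both arrays with
identical field values; that equivalence is what lets the fast path
replace the per-field stores.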

Tested in a VM with a 100 GB fsdax namespace device configured with
map=dev on an Intel Ice Lake server. This test exercises the nd_pmem
rebind path (pfns_per_compound == 1).

Test procedure:
Rebind the nd_pmem driver 30 times and collect the memmap initialization
time from the pr_debug() output of memmap_init_zone_device().

Base (v7.2-rc1):
  First binding: 1456 ms
  Average of subsequent rebinds: 244.28 ms

With this patch and its prerequisites applied:
  First binding: 1440 ms
  Average of subsequent rebinds: 217.19 ms

This reduces the average rebind time from 244.28 ms to 217.19 ms, or
about 11%.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/mm_init.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 74 insertions(+), 2 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4c7fad440c2a..cc8417951467 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1066,6 +1066,50 @@ static void __ref zone_device_page_init_slow(struct page *page,
 		set_page_count(page, 0);
 }
 
+static inline bool zone_device_page_init_optimization_enabled(void)
+{
+	/*
+	 * The template fast path copies a preinitialized struct page image.
+	 * Skip it when the page_ref_set tracepoint is enabled.
+	 */
+	return !page_ref_tracepoint_active(page_ref_set);
+}
+
+static inline void zone_device_template_page_init(struct page *template,
+						  struct page *src)
+{
+	memcpy(template, src, sizeof(*template));
+}
+
+/*
+ * 'template' is a reusable page prototype rather than a strictly immutable
+ * object. Most ZONE_DEVICE fields stay constant across the pages covered by
+ * the current template, but section bits and page->virtual may still depend
+ * on the PFN. Refresh those PFN-dependent fields in the template before
+ * copying it into @page.
+ */
+static inline void zone_device_page_update_template(struct page *template,
+		unsigned long pfn)
+{
+	set_page_section_from_pfn(template, pfn);
+#ifdef WANT_PAGE_VIRTUAL
+	if (!is_highmem_idx(ZONE_DEVICE))
+		set_page_address(template, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
+static void zone_device_page_init_from_template(struct page *page,
+		unsigned long pfn, struct page *template)
+{
+	/*
+	 * 'template' carries the invariant portion of a ZONE_DEVICE struct
+	 * page. Update the PFN-dependent fields in place before copying it
+	 * to the destination page.
+	 */
+	zone_device_page_update_template(template, pfn);
+	memcpy(page, template, sizeof(*page));
+}
+
 /*
  * With compound page geometry and when struct pages are stored in ram most
  * tail pages are reused. Consequently, the amount of unique struct pages to
@@ -1121,6 +1165,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 				   unsigned long nr_pages,
 				   struct dev_pagemap *pgmap)
 {
+	bool use_template = zone_device_page_init_optimization_enabled();
 	unsigned long pfn, end_pfn = start_pfn + nr_pages;
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	struct vmem_altmap *altmap = pgmap_altmap(pgmap);
@@ -1128,6 +1173,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 	unsigned long zone_idx = zone_idx(zone);
 	unsigned long start = jiffies;
 	int nid = pgdat->node_id;
+	struct page template;
 
 	if (WARN_ON_ONCE(!pgmap || zone_idx != ZONE_DEVICE))
 		return;
@@ -1142,10 +1188,36 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		nr_pages = end_pfn - start_pfn;
 	}
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
-		struct page *page = pfn_to_page(pfn);
+
+	if (!nr_pages)
+		return;
+
+	pfn = start_pfn;
+	/*
+	 * Seed the reusable head-page template from the first real struct
+	 * page, because the existing page-init and pageblock helpers expect
+	 * a real memmap entry rather than a stack object.
+	 */
+	if (use_template) {
+		struct page *page = pfn_to_page(start_pfn);
 
 		zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
+		zone_device_template_page_init(&template, page);
+		if (pfns_per_compound != 1)
+			memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+				compound_nr_pages(start_pfn, altmap, pgmap));
+		pfn += pfns_per_compound;
+	}
+
+	for (; pfn < end_pfn; pfn += pfns_per_compound) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (use_template)
+			zone_device_page_init_from_template(page, pfn,
+							    &template);
+		else
+			zone_device_page_init_slow(page, pfn, zone_idx,
+						   nid, pgmap);
 
 		if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
 			cond_resched();
-- 
2.20.1

* Re: [PATCH v5 4/8] mm: add a template-based fast path for zone-device page init
  2026-07-01  9:05 ` [PATCH v5 4/8] mm: add a template-based fast path for zone-device page init Li Zhe
@ 2026-07-03 14:06   ` Mike Rapoport
  0 siblings, 0 replies; 18+ messages in thread
From: Mike Rapoport @ 2026-07-03 14:06 UTC
  To: Li Zhe
  Cc: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, tglx,
	linux-arch, linux-hardening, linux-kernel, linux-mm, x86

On Wed, Jul 01, 2026 at 05:05:49PM +0800, Li Zhe wrote:
> memmap_init_zone_device() repeats nearly identical head-page
> initialization for each PFN. Prepare one reusable ZONE_DEVICE head-page
> template through the existing slow path, refresh the PFN-dependent
> fields in that template before each copy, and memcpy it into each
> destination page.
> 
> This reduces the average rebind time from 244.28 ms to 217.19 ms, or
> about 11%.
> 
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
>  mm/mm_init.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 74 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 4c7fad440c2a..cc8417951467 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1066,6 +1066,50 @@ static void __ref zone_device_page_init_slow(struct page *page,
>  		set_page_count(page, 0);
>  }
>  
> +static inline bool zone_device_page_init_optimization_enabled(void)
> +{
> +	/*
> +	 * The template fast path copies a preinitialized struct page image.
> +	 * Skip it when the page_ref_set tracepoint is enabled.
> +	 */
> +	return !page_ref_tracepoint_active(page_ref_set);
> +}
> +
> +static inline void zone_device_template_page_init(struct page *template,
> +						  struct page *src)
> +{
> +	memcpy(template, src, sizeof(*template));
> +}
> +
> +/*
> + * 'template' is a reusable page prototype rather than a strictly immutable
> + * object. Most ZONE_DEVICE fields stay constant across the pages covered by
> + * the current template, but section bits and page->virtual may still depend
> + * on the PFN. Refresh those PFN-dependent fields in the template before
> + * copying it into @page.
> + */
> +static inline void zone_device_page_update_template(struct page *template,
> +		unsigned long pfn)
> +{
> +	set_page_section_from_pfn(template, pfn);
> +#ifdef WANT_PAGE_VIRTUAL
> +	if (!is_highmem_idx(ZONE_DEVICE))
> +		set_page_address(template, __va(pfn << PAGE_SHIFT));
> +#endif
> +}
> +
> +static void zone_device_page_init_from_template(struct page *page,
> +		unsigned long pfn, struct page *template)
> +{
> +	/*
> +	 * 'template' carries the invariant portion of a ZONE_DEVICE struct
> +	 * page. Update the PFN-dependent fields in place before copying it
> +	 * to the destination page.
> +	 */
> +	zone_device_page_update_template(template, pfn);
> +	memcpy(page, template, sizeof(*page));
> +}
> +

The whole bunch of template functions looks like it could be useful for
initialization of non-zone-device struct pages as well.

As I mentioned previously, it would be interesting to see whether this
approach speeds up normal memory map initialization as well. If it does,
we could have a single set of template functions.

-- 
Sincerely yours,
Mike.

* [PATCH v5 5/8] mm: extend the template fast path to zone-device compound tails
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
                   ` (3 preceding siblings ...)
  2026-07-01  9:05 ` [PATCH v5 4/8] mm: add a template-based fast path for zone-device page init Li Zhe
@ 2026-07-01  9:05 ` Li Zhe
  2026-07-01  9:05 ` [PATCH v5 6/8] string: introduce memcpy_nt() helpers Li Zhe
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

The template fast path from the previous patch only accelerates head
pages. Compound tails in memmap_init_compound() still go through the
slow path one by one.

Build separate head and tail templates and reuse one prepared tail
template across the tail pages in a compound range. Head pages preserve
the existing refcount policy, while compound tails always start with a
refcount of 0 after prep_compound_tail().

This extends the template-copy fast path to pfns_per_compound > 1
without changing the existing slow path. Tail-page PFN-dependent fields
are refreshed in the reusable tail template before each copy.
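
As a rough user-space illustration of the tail-template idea (not the
kernel code), the sketch below initializes the first tail of a compound
range through a slow path, seeds a tail template from it, and copies
that template into the remaining tails while refreshing only the
PFN-derived field. struct sketch_page and the sketch_* names are
hypothetical, and SKETCH_PFN_SECTION_SHIFT is deliberately tiny so that
a single 512-entry range crosses a section boundary in the example. The
nr_pages > 1 guard is an addition of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, deliberately small section size for the example. */
#define SKETCH_PFN_SECTION_SHIFT 8

/* Hypothetical stand-in for a tail struct page. */
struct sketch_page {
	uint64_t section;			/* PFN-dependent */
	const struct sketch_page *head;		/* same for all tails of one range */
	int refcount;
};

/* Slow path for one tail: generic init, link to head, refcount of 0. */
static void sketch_tail_init_slow(struct sketch_page *page, uint64_t pfn,
				  const struct sketch_page *head)
{
	memset(page, 0, sizeof(*page));
	page->section = pfn >> SKETCH_PFN_SECTION_SHIFT;
	page->head = head;
	page->refcount = 0;	/* compound tails always start from 0 */
}

/* head[0] is the head page; head[1..nr_pages-1] are its tails. */
static void sketch_init_compound(struct sketch_page *head, uint64_t head_pfn,
				 uint64_t nr_pages, int use_template)
{
	struct sketch_page tmpl;
	uint64_t i = 1;

	assert(head != NULL);

	/* Seed the tail template from the first real tail entry. */
	if (use_template && nr_pages > 1) {
		sketch_tail_init_slow(&head[1], head_pfn + 1, head);
		memcpy(&tmpl, &head[1], sizeof(tmpl));
		i = 2;
	}

	for (; i < nr_pages; i++) {
		if (use_template) {
			/* Refresh the PFN-dependent field, then copy. */
			tmpl.section = (head_pfn + i) >> SKETCH_PFN_SECTION_SHIFT;
			memcpy(&head[i], &tmpl, sizeof(tmpl));
		} else {
			sketch_tail_init_slow(&head[i], head_pfn + i, head);
		}
	}
}
```

The template is local to one call because the head pointer it carries is
only valid for the tails of that one compound range; a new range needs a
freshly seeded template.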

Tested in a VM with a 100 GB devdax namespace (align=2097152) on an Intel
Ice Lake server. This test exercises the dax_pmem rebind path and
measures memmap initialization latency.

Test procedure:
Unbind and rebind the dax_pmem driver 30 times and collect the memmap
initialization time from the pr_debug() output of
memmap_init_zone_device().

Base (v7.2-rc1):
  First binding: 1462 ms
  Average of subsequent rebinds: 273.31 ms

With this patch and its prerequisites applied:
  First binding: 1403 ms
  Average of subsequent rebinds: 244.37 ms

This reduces the average rebind time from 273.31 ms to 244.37 ms, or
about 10.6%.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/mm_init.c | 47 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 40 insertions(+), 7 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index cc8417951467..60794050bc07 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1081,6 +1081,16 @@ static inline void zone_device_template_page_init(struct page *template,
 	memcpy(template, src, sizeof(*template));
 }
 
+static inline void zone_device_tail_page_init(struct page *page,
+		unsigned long pfn, unsigned long zone_idx, int nid,
+		struct dev_pagemap *pgmap, const struct page *head,
+		unsigned int order)
+{
+	zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
+	prep_compound_tail(page, head, order);
+	set_page_count(page, 0);
+}
+
 /*
  * 'template' is a reusable page prototype rather than a strictly immutable
  * object. Most ZONE_DEVICE fields stay constant across the pages covered by
@@ -1138,10 +1148,12 @@ static void __ref memmap_init_compound(struct page *head,
 				       unsigned long head_pfn,
 				       unsigned long zone_idx, int nid,
 				       struct dev_pagemap *pgmap,
-				       unsigned long nr_pages)
+				       unsigned long nr_pages,
+				       bool use_template)
 {
 	unsigned long pfn, end_pfn = head_pfn + nr_pages;
 	unsigned int order = pgmap->vmemmap_shift;
+	struct page template;
 
 	/*
 	 * We have to initialize the pages, including setting up page links.
@@ -1150,12 +1162,31 @@ static void __ref memmap_init_compound(struct page *head,
 	 * the pages in the same go.
 	 */
 	__SetPageHead(head);
-	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
+
+	pfn = head_pfn + 1;
+	/*
+	 * All tails of the same compound page share the state established by
+	 * prep_compound_tail(). Reuse one tail template for the whole range and
+	 * refresh only the PFN-dependent fields in that template before each copy.
+	 */
+	if (use_template) {
 		struct page *page = pfn_to_page(pfn);
 
-		zone_device_page_init_slow(page, pfn, zone_idx, nid, pgmap);
-		prep_compound_tail(page, head, order);
-		set_page_count(page, 0);
+		zone_device_tail_page_init(page, pfn, zone_idx, nid,
+					   pgmap, head, order);
+		zone_device_template_page_init(&template, page);
+		pfn++;
+	}
+
+	for (; pfn < end_pfn; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		if (use_template)
+			zone_device_page_init_from_template(page, pfn,
+							    &template);
+		else
+			zone_device_tail_page_init(page, pfn, zone_idx, nid,
+						   pgmap, head, order);
 	}
 	prep_compound_head(head, order);
 }
@@ -1205,7 +1236,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		zone_device_template_page_init(&template, page);
 		if (pfns_per_compound != 1)
 			memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				compound_nr_pages(start_pfn, altmap, pgmap));
+				compound_nr_pages(start_pfn, altmap, pgmap),
+				use_template);
 		pfn += pfns_per_compound;
 	}
 
@@ -1226,7 +1258,8 @@ void __ref memmap_init_zone_device(struct zone *zone,
 			continue;
 
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
-				     compound_nr_pages(pfn, altmap, pgmap));
+				     compound_nr_pages(pfn, altmap, pgmap),
+				     use_template);
 	}
 
 	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);
-- 
2.20.1

* [PATCH v5 6/8] string: introduce memcpy_nt() helpers
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
                   ` (4 preceding siblings ...)
  2026-07-01  9:05 ` [PATCH v5 5/8] mm: extend the template fast path to zone-device compound tails Li Zhe
@ 2026-07-01  9:05 ` Li Zhe
  2026-07-02  2:16   ` Borislav Petkov
  2026-07-01  9:05 ` [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths Li Zhe
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC (permalink / raw)
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

Introduce memcpy_nt() and memcpy_nt_drain() for write-once copy sites
that want a named non-temporal copy primitive plus an explicit ordering
point. On x86, place the arch-visible wrapper in
arch/x86/include/asm/string_64.h and map it to the existing
memcpy_flushcache() backend plus sfence. Architectures that do not
override the helper fall back to memcpy() and a no-op drain in
include/linux/string.h.

The immediate user is the ZONE_DEVICE template-copy path. That path
populates struct page descriptors in a write-once pattern, so most
destination cachelines are not expected to be reused immediately after
the copy. A regular cached memcpy() can therefore incur avoidable
write-allocate traffic and pollute the cache with data that has little
near-term reuse.

This interface lets callers request non-temporal-copy semantics
directly, while x86 simply reuses the existing memcpy_flushcache()
backend instead of adding another generic memcpy-like wrapper with
extra selection policy above it.
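
The copy-then-drain contract described above can be approximated in
userspace. The following is only a sketch under stated assumptions, not
the kernel implementation: nt_copy_sketch() and nt_drain_sketch() are
hypothetical names, the streaming path uses the SSE2 intrinsic
_mm_stream_si64() where the patch uses inline movnti, and anything that
is not x86-64, 8-byte aligned and a multiple of 8 bytes falls back to
memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <xmmintrin.h>	/* _mm_sfence() */
#include <emmintrin.h>	/* _mm_stream_si64() */
#endif

/*
 * Hypothetical userspace stand-in for memcpy_nt(). The streaming path is
 * taken only for 8-byte aligned destinations and sizes that are a multiple
 * of 8; everything else uses a regular cached memcpy().
 */
static void nt_copy_sketch(void *dst, const void *src, size_t cnt)
{
	assert(dst != NULL && src != NULL);
#if defined(__x86_64__)
	if ((uintptr_t)dst % 8 == 0 && cnt % 8 == 0) {
		long long *d = dst;
		const unsigned char *s = src;

		for (size_t i = 0; i < cnt / 8; i++) {
			long long v;

			/* Load through memcpy() so src may be unaligned. */
			memcpy(&v, s + 8 * i, sizeof(v));
			/* Non-temporal 8-byte store, ascending addresses. */
			_mm_stream_si64(&d[i], v);
		}
		return;
	}
#endif
	memcpy(dst, src, cnt);
}

/*
 * Hypothetical stand-in for the ordering point: a store fence on x86-64,
 * nothing elsewhere because the fallback used ordinary stores.
 */
static void nt_drain_sketch(void)
{
#if defined(__x86_64__)
	_mm_sfence();
#endif
}
```

A caller would issue all of its nt_copy_sketch() calls first and then
call nt_drain_sketch() once, before any later normal store that has to
be ordered after the copied data.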

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 arch/x86/include/asm/string_64.h | 16 ++++++++++++++++
 include/linux/string.h           | 18 ++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 4635616863f5..6f36abedc56a 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -100,6 +100,22 @@ static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t
 	}
 	__memcpy_flushcache(dst, src, cnt);
 }
+
+#define __HAVE_ARCH_MEMCPY_NT 1
+/*
+ * Reuse the existing x86 flushcache backend as the nt copy primitive.
+ * Callers pair it with memcpy_nt_drain() when later stores must be
+ * ordered after the copy.
+ */
+static __always_inline void memcpy_nt(void *dst, const void *src, size_t cnt)
+{
+	memcpy_flushcache(dst, src, cnt);
+}
+
+static __always_inline void memcpy_nt_drain(void)
+{
+	asm volatile("sfence" : : : "memory");
+}
 #endif
 
 #endif /* __KERNEL__ */
diff --git a/include/linux/string.h b/include/linux/string.h
index 5702daca4326..5165763ab812 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -278,6 +278,24 @@ static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
 }
 #endif
 
+#ifndef __HAVE_ARCH_MEMCPY_NT
+/*
+ * memcpy_nt() requests a non-temporal copy when the architecture has a
+ * suitable backend. Callers must follow it with memcpy_nt_drain()
+ * before later normal stores that need to be ordered after the copy.
+ * Architectures that do not override it fall back to memcpy() and a
+ * no-op drain.
+ */
+static inline void memcpy_nt(void *dst, const void *src, size_t cnt)
+{
+	memcpy(dst, src, cnt);
+}
+
+static inline void memcpy_nt_drain(void)
+{
+}
+#endif
+
 void *memchr_inv(const void *s, int c, size_t n);
 char *strreplace(char *str, char old, char new);
 
-- 
2.20.1

* Re: [PATCH v5 6/8] string: introduce memcpy_nt() helpers
  2026-07-01  9:05 ` [PATCH v5 6/8] string: introduce memcpy_nt() helpers Li Zhe
@ 2026-07-02  2:16   ` Borislav Petkov
  2026-07-02  7:36     ` Li Zhe
  0 siblings, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2026-07-02  2:16 UTC (permalink / raw)
  To: Li Zhe
  Cc: akpm, apopple, arnd, dave.hansen, david, kees, mingo, rppt, tglx,
	linux-arch, linux-hardening, linux-kernel, linux-mm, x86

On Wed, Jul 01, 2026 at 05:05:51PM +0800, Li Zhe wrote:
> +static __always_inline void memcpy_nt_drain(void)
> +{
> +	asm volatile("sfence" : : : "memory");

Doing a simple grep around the tree will give you the hint to use wmb() and
not reinvent the wheel here. Not to mention that this is bypassing KCSAN
instrumentation.

And then you don't need to go define this drain thing but simply use wmb()
because, oh, look, other architectures already define a write memory barrier
too.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v5 6/8] string: introduce memcpy_nt() helpers
  2026-07-02  2:16   ` Borislav Petkov
@ 2026-07-02  7:36     ` Li Zhe
  0 siblings, 0 replies; 18+ messages in thread
From: Li Zhe @ 2026-07-02  7:36 UTC (permalink / raw)
  To: bp
  Cc: akpm, apopple, arnd, dave.hansen, david, kees, linux-arch,
	linux-hardening, linux-kernel, linux-mm, lizhe.67, mingo, rppt,
	tglx, x86

On Wed, 1 Jul 2026 19:16:19 -0700, bp@alien8.de wrote:

> On Wed, Jul 01, 2026 at 05:05:51PM +0800, Li Zhe wrote:
> > +static __always_inline void memcpy_nt_drain(void)
> > +{
> > +	asm volatile("sfence" : : : "memory");
> 
> Doing a simple grep around the tree will give you the hint to use wmb() and
> not reinvent the wheel here. Not to mention that this is bypassing KCSAN
> instrumentation.
> 
> And then you don't need to go define this drain thing but simply use wmb()
> because, oh, look, other architectures already define a write memory barrier
> too.

Thanks for the review.

Yes, a separate memcpy_nt_drain() wrapper is unnecessary. I'll drop
memcpy_nt_drain() in v6 and use wmb() at the ordering points in patch
8 instead.

Thanks,
Zhe

* [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
                   ` (5 preceding siblings ...)
  2026-07-01  9:05 ` [PATCH v5 6/8] string: introduce memcpy_nt() helpers Li Zhe
@ 2026-07-01  9:05 ` Li Zhe
  2026-07-02  2:23   ` Borislav Petkov
  2026-07-01  9:05 ` [PATCH v5 8/8] mm: use memcpy_nt() in zone-device template copies Li Zhe
  2026-07-01 23:28 ` [PATCH v5 0/8] mm: optimize zone-device memmap initialization Andrew Morton
  8 siblings, 1 reply; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC (permalink / raw)
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

The new x86 memcpy_nt() helper in this series maps to
memcpy_flushcache(), and the ZONE_DEVICE fast path uses that primitive
for constant-sized struct page template copies.

memcpy_flushcache() currently inlines only the 4, 8, and 16-byte
cases. Larger constant-sized copies fall back to __memcpy_flushcache()
even when the destination is naturally aligned. Extend the inline
movnti coverage to 32, 48, 64, 80, and 96 bytes so the struct
page-sized copies used by that path can stay on the inline
non-temporal store path instead of dropping into the out-of-line
helper.

Factor the store sequences into 8/16/32/64-byte helpers, keep the
existing 4/8/16-byte cases handled directly in memcpy_flushcache(),
issue the stores in ascending address order, and leave all other sizes
on __memcpy_flushcache().

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 arch/x86/include/asm/string_64.h | 80 +++++++++++++++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 6f36abedc56a..95ef2d481418 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -4,6 +4,7 @@
 
 #ifdef __KERNEL__
 #include <linux/jump_label.h>
+#include <linux/align.h>
 
 /* Written 2002 by Andi Kleen */
 
@@ -82,8 +83,81 @@ int strcmp(const char *cs, const char *ct);
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
 void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
-static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+
+static __always_inline void memcpy_flushcache_8(void *dst, const void *src)
+{
+	asm volatile("movntiq %1, %0"
+		     : "=m"(*(u64 *)dst)
+		     : "r"(*(const u64 *)src)
+		     : "memory");
+}
+
+static __always_inline void memcpy_flushcache_16(void *dst,
+						 const void *src)
+{
+	memcpy_flushcache_8(dst, src);
+	memcpy_flushcache_8(dst + 8, src + 8);
+}
+
+static __always_inline void memcpy_flushcache_32(void *dst,
+						 const void *src)
+{
+	memcpy_flushcache_16(dst, src);
+	memcpy_flushcache_16(dst + 16, src + 16);
+}
+
+static __always_inline void memcpy_flushcache_64(void *dst,
+						 const void *src)
 {
+	memcpy_flushcache_32(dst, src);
+	memcpy_flushcache_32(dst + 32, src + 32);
+}
+
+/*
+ * Keep the additional aligned fixed-size cases on the inline movnti path.
+ * Leave the existing 4/8/16-byte cases handled directly in
+ * memcpy_flushcache() so their code generation stays unchanged.
+ */
+static __always_inline int memcpy_flushcache_large(void *dst,
+						   const void *src,
+						   size_t cnt)
+{
+	char *dptr = dst;
+	const char *sptr = src;
+
+	if (!IS_ALIGNED((unsigned long)dst, 8))
+		return 0;
+
+	switch (cnt) {
+	case 32:
+		memcpy_flushcache_32(dptr, sptr);
+		return 1;
+	case 48:
+		memcpy_flushcache_32(dptr, sptr);
+		memcpy_flushcache_16(dptr + 32, sptr + 32);
+		return 1;
+	case 64:
+		memcpy_flushcache_64(dptr, sptr);
+		return 1;
+	case 80:
+		memcpy_flushcache_64(dptr, sptr);
+		memcpy_flushcache_16(dptr + 64, sptr + 64);
+		return 1;
+	case 96:
+		memcpy_flushcache_64(dptr, sptr);
+		memcpy_flushcache_32(dptr + 64, sptr + 64);
+		return 1;
+	}
+
+	return 0;
+}
+
+static __always_inline void memcpy_flushcache(void *dst, const void *src,
+					      size_t cnt)
+{
+	if (!cnt)
+		return;
+
 	if (__builtin_constant_p(cnt)) {
 		switch (cnt) {
 			case 4:
@@ -97,7 +171,11 @@ static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t
 				asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
 				return;
 		}
+
+		if (memcpy_flushcache_large(dst, src, cnt))
+			return;
 	}
+
 	__memcpy_flushcache(dst, src, cnt);
 }
 
-- 
2.20.1

* Re: [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths
  2026-07-01  9:05 ` [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths Li Zhe
@ 2026-07-02  2:23   ` Borislav Petkov
  2026-07-02  7:37     ` Li Zhe
  0 siblings, 1 reply; 18+ messages in thread
From: Borislav Petkov @ 2026-07-02  2:23 UTC (permalink / raw)
  To: Li Zhe
  Cc: akpm, apopple, arnd, dave.hansen, david, kees, mingo, rppt, tglx,
	linux-arch, linux-hardening, linux-kernel, linux-mm, x86

On Wed, Jul 01, 2026 at 05:05:52PM +0800, Li Zhe wrote:
> The new x86 memcpy_nt() helper in this series maps to
> memcpy_flushcache(), and the ZONE_DEVICE fast path uses that primitive
> for constant-sized struct page template copies.
> 
> memcpy_flushcache() currently inlines only the 4, 8, and 16-byte
> cases. Larger constant-sized copies fall back to __memcpy_flushcache()
> even when the destination is naturally aligned. Extend the inline
> movnti coverage to 32, 48, 64, 80, and 96 bytes so the struct

Why exactly those sizes and not bigger?

Any numbers to back this up?

> page-sized copies used by that path can stay on the inline
> non-temporal store path instead of dropping into the out-of-line
> helper.
> 
> Factor the store sequences into 8/16/32/64-byte helpers, keep the
> existing 4/8/16-byte cases handled directly in memcpy_flushcache(),
> issue the stores in ascending address order, and leave all other sizes
> on __memcpy_flushcache().
> 
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
>  arch/x86/include/asm/string_64.h | 80 +++++++++++++++++++++++++++++++-
>  1 file changed, 79 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> index 6f36abedc56a..95ef2d481418 100644
> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -4,6 +4,7 @@
>  
>  #ifdef __KERNEL__
>  #include <linux/jump_label.h>
> +#include <linux/align.h>
>  
>  /* Written 2002 by Andi Kleen */
>  
> @@ -82,8 +83,81 @@ int strcmp(const char *cs, const char *ct);
>  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
>  #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
>  void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
> -static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
> +
> +static __always_inline void memcpy_flushcache_8(void *dst, const void *src)

If this is calling MOVNTI, you might as well call it that way. Only leave the
externally exposed wrapper called "flushcache".

> +{
> +	asm volatile("movntiq %1, %0"
> +		     : "=m"(*(u64 *)dst)
> +		     : "r"(*(const u64 *)src)
> +		     : "memory");
> +}
> +
> +static __always_inline void memcpy_flushcache_16(void *dst,
> +						 const void *src)
> +{
> +	memcpy_flushcache_8(dst, src);
> +	memcpy_flushcache_8(dst + 8, src + 8);
> +}
> +
> +static __always_inline void memcpy_flushcache_32(void *dst,
> +						 const void *src)
> +{
> +	memcpy_flushcache_16(dst, src);
> +	memcpy_flushcache_16(dst + 16, src + 16);
> +}
> +
> +static __always_inline void memcpy_flushcache_64(void *dst,
> +						 const void *src)
>  {
> +	memcpy_flushcache_32(dst, src);
> +	memcpy_flushcache_32(dst + 32, src + 32);
> +}
> +
> +/*
> + * Keep the additional aligned fixed-size cases on the inline movnti path.
> + * Leave the existing 4/8/16-byte cases handled directly in
> + * memcpy_flushcache() so their code generation stays unchanged.
> + */
> +static __always_inline int memcpy_flushcache_large(void *dst,
> +						   const void *src,
> +						   size_t cnt)
> +{
> +	char *dptr = dst;
> +	const char *sptr = src;
> +
> +	if (!IS_ALIGNED((unsigned long)dst, 8))
> +		return 0;
> +
> +	switch (cnt) {
> +	case 32:
> +		memcpy_flushcache_32(dptr, sptr);
> +		return 1;
> +	case 48:
> +		memcpy_flushcache_32(dptr, sptr);
> +		memcpy_flushcache_16(dptr + 32, sptr + 32);
> +		return 1;
> +	case 64:
> +		memcpy_flushcache_64(dptr, sptr);
> +		return 1;
> +	case 80:
> +		memcpy_flushcache_64(dptr, sptr);
> +		memcpy_flushcache_16(dptr + 64, sptr + 64);
> +		return 1;
> +	case 96:
> +		memcpy_flushcache_64(dptr, sptr);
> +		memcpy_flushcache_32(dptr + 64, sptr + 64);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static __always_inline void memcpy_flushcache(void *dst, const void *src,
> +					      size_t cnt)
> +{
> +	if (!cnt)
> +		return;
> +
>  	if (__builtin_constant_p(cnt)) {
>  		switch (cnt) {
>  			case 4:
> @@ -97,7 +171,11 @@ static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t
>  				asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
>  				return;
>  		}
> +
> +		if (memcpy_flushcache_large(dst, src, cnt))

Why is this a separate function and not extending the switch-case?

I betcha it has to do with the alignment check but nothing explains to me why.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* Re: [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths
  2026-07-02  2:23   ` Borislav Petkov
@ 2026-07-02  7:37     ` Li Zhe
  2026-07-03 19:56       ` Borislav Petkov
  0 siblings, 1 reply; 18+ messages in thread
From: Li Zhe @ 2026-07-02  7:37 UTC (permalink / raw)
  To: bp
  Cc: akpm, apopple, arnd, dave.hansen, david, kees, linux-arch,
	linux-hardening, linux-kernel, linux-mm, lizhe.67, mingo, rppt,
	tglx, x86

On Wed, 1 Jul 2026 19:23:42 -0700, bp@alien8.de wrote:

> On Wed, Jul 01, 2026 at 05:05:52PM +0800, Li Zhe wrote:
> > The new x86 memcpy_nt() helper in this series maps to
> > memcpy_flushcache(), and the ZONE_DEVICE fast path uses that primitive
> > for constant-sized struct page template copies.
> >
> > memcpy_flushcache() currently inlines only the 4, 8, and 16-byte
> > cases. Larger constant-sized copies fall back to __memcpy_flushcache()
> > even when the destination is naturally aligned. Extend the inline
> > movnti coverage to 32, 48, 64, 80, and 96 bytes so the struct
> 
> Why exactly those sizes and not bigger?
> 
> Any numbers to back this up?

The relevant target sizes here are the x86_64 sizeof(struct page)
values which patch 8 can hit: 64, 80, and 96 bytes.

On x86_64, CONFIG_HAVE_ALIGNED_STRUCT_PAGE keeps struct page 16-byte
aligned. That gives 64 bytes in the common case, 80 bytes when
WANT_PAGE_VIRTUAL or CONFIG_KMSAN adds extra fields, and 96 bytes when
both are present. I stopped there deliberately because this series does
not have a current caller which needs larger constant sizes.

The 32-byte and 48-byte cases were added only as intermediate fixed-size
combinations used to build those larger struct page-sized copies while
keeping them on the inline movnti path.

> > page-sized copies used by that path can stay on the inline
> > non-temporal store path instead of dropping into the out-of-line
> > helper.
> >
> > Factor the store sequences into 8/16/32/64-byte helpers, keep the
> > existing 4/8/16-byte cases handled directly in memcpy_flushcache(),
> > issue the stores in ascending address order, and leave all other sizes
> > on __memcpy_flushcache().
> >
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > ---
> >  arch/x86/include/asm/string_64.h | 80 +++++++++++++++++++++++++++++++-
> >  1 file changed, 79 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
> > index 6f36abedc56a..95ef2d481418 100644
> > --- a/arch/x86/include/asm/string_64.h
> > +++ b/arch/x86/include/asm/string_64.h
> > @@ -4,6 +4,7 @@
> >
> >  #ifdef __KERNEL__
> >  #include <linux/jump_label.h>
> > +#include <linux/align.h>
> >
> >  /* Written 2002 by Andi Kleen */
> >
> > @@ -82,8 +83,81 @@ int strcmp(const char *cs, const char *ct);
> >  #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
> >  #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
> >  void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
> > -static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
> > +
> > +static __always_inline void memcpy_flushcache_8(void *dst, const void *src)
> 
> If this is calling MOVNTI, you might as well call it that way. Only leave the
> externally exposed wrapper called "flushcache".

Thanks for the review.

In the next version I will rename the internal leaf helpers to
movnti_*() names and keep memcpy_flushcache() as the externally exposed
wrapper.

> > +{
> > +	asm volatile("movntiq %1, %0"
> > +		     : "=m"(*(u64 *)dst)
> > +		     : "r"(*(const u64 *)src)
> > +		     : "memory");
> > +}
> > +
> > +static __always_inline void memcpy_flushcache_16(void *dst,
> > +						 const void *src)
> > +{
> > +	memcpy_flushcache_8(dst, src);
> > +	memcpy_flushcache_8(dst + 8, src + 8);
> > +}
> > +
> > +static __always_inline void memcpy_flushcache_32(void *dst,
> > +						 const void *src)
> > +{
> > +	memcpy_flushcache_16(dst, src);
> > +	memcpy_flushcache_16(dst + 16, src + 16);
> > +}
> > +
> > +static __always_inline void memcpy_flushcache_64(void *dst,
> > +						 const void *src)
> >  {
> > +	memcpy_flushcache_32(dst, src);
> > +	memcpy_flushcache_32(dst + 32, src + 32);
> > +}
> > +
> > +/*
> > + * Keep the additional aligned fixed-size cases on the inline movnti path.
> > + * Leave the existing 4/8/16-byte cases handled directly in
> > + * memcpy_flushcache() so their code generation stays unchanged.
> > + */
> > +static __always_inline int memcpy_flushcache_large(void *dst,
> > +						   const void *src,
> > +						   size_t cnt)
> > +{
> > +	char *dptr = dst;
> > +	const char *sptr = src;
> > +
> > +	if (!IS_ALIGNED((unsigned long)dst, 8))
> > +		return 0;
> > +
> > +	switch (cnt) {
> > +	case 32:
> > +		memcpy_flushcache_32(dptr, sptr);
> > +		return 1;
> > +	case 48:
> > +		memcpy_flushcache_32(dptr, sptr);
> > +		memcpy_flushcache_16(dptr + 32, sptr + 32);
> > +		return 1;
> > +	case 64:
> > +		memcpy_flushcache_64(dptr, sptr);
> > +		return 1;
> > +	case 80:
> > +		memcpy_flushcache_64(dptr, sptr);
> > +		memcpy_flushcache_16(dptr + 64, sptr + 64);
> > +		return 1;
> > +	case 96:
> > +		memcpy_flushcache_64(dptr, sptr);
> > +		memcpy_flushcache_32(dptr + 64, sptr + 64);
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static __always_inline void memcpy_flushcache(void *dst, const void *src,
> > +					      size_t cnt)
> > +{
> > +	if (!cnt)
> > +		return;
> > +
> >  	if (__builtin_constant_p(cnt)) {
> >  		switch (cnt) {
> >  			case 4:
> > @@ -97,7 +171,11 @@ static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t
> >  				asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
> >  				return;
> >  		}
> > +
> > +		if (memcpy_flushcache_large(dst, src, cnt))
> 
> Why is this a separate function and not extending the switch-case?
> 
> I betcha it has to do with the alignment check but nothing explains to me why.

Yes. The separate helper is there exactly because the added
32/48/64/80/96-byte cases carry an extra 8-byte destination-alignment
check, while I left the existing 4/8/16-byte inline cases unchanged.

The intent was to keep the larger fixed-size cases on the inline
movnti path only when they already match the alignment assumptions
used by __memcpy_flushcache(); otherwise they fall back to that
out-of-line helper, which already handles the misaligned head/tail
fragments before entering its movnti loops.

The current patch does not explain that clearly enough. In v6 I will
add an explicit comment to memcpy_flushcache_large() and spell the
same reasoning out in the changelog.
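
For readers following the thread, the dispatch described above can be
sketched in portable C. This is an illustration only, under stated
assumptions: store_8() is a plain 8-byte copy standing in for one
movnti store, copy_fixed_aligned() and copy_with_fallback() are
hypothetical names, and memcpy() stands in for the out-of-line
__memcpy_flushcache() fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for one 8-byte movnti store; a plain copy in this sketch. */
static void store_8(unsigned char *dst, const unsigned char *src)
{
	memcpy(dst, src, 8);
}

static void store_16(unsigned char *dst, const unsigned char *src)
{
	store_8(dst, src);
	store_8(dst + 8, src + 8);
}

static void store_32(unsigned char *dst, const unsigned char *src)
{
	store_16(dst, src);
	store_16(dst + 16, src + 16);
}

static void store_64(unsigned char *dst, const unsigned char *src)
{
	store_32(dst, src);
	store_32(dst + 32, src + 32);
}

/*
 * Returns 1 when the copy was handled by the fixed-size path and 0 when
 * the caller must fall back: either dst is not 8-byte aligned or cnt is
 * not one of the sizes built from the helpers above.
 */
static int copy_fixed_aligned(void *dst, const void *src, size_t cnt)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if ((uintptr_t)dst % 8 != 0)
		return 0;

	switch (cnt) {
	case 32:
		store_32(d, s);
		return 1;
	case 48:
		store_32(d, s);
		store_16(d + 32, s + 32);
		return 1;
	case 64:
		store_64(d, s);
		return 1;
	case 80:
		store_64(d, s);
		store_16(d + 64, s + 64);
		return 1;
	case 96:
		store_64(d, s);
		store_32(d + 64, s + 64);
		return 1;
	}
	return 0;
}

/* Hypothetical wrapper: try the fixed-size path, else the general copy. */
static void copy_with_fallback(void *dst, const void *src, size_t cnt)
{
	assert(dst != NULL && src != NULL);
	if (!copy_fixed_aligned(dst, src, cnt))
		memcpy(dst, src, cnt);	/* stands in for the out-of-line helper */
}
```

In this sketch a misaligned destination or an unlisted size is never
rejected; it simply takes the general path, which mirrors the fallback
behaviour described in the reply.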

Thanks,
Zhe

* Re: [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths
  2026-07-02  7:37     ` Li Zhe
@ 2026-07-03 19:56       ` Borislav Petkov
  0 siblings, 0 replies; 18+ messages in thread
From: Borislav Petkov @ 2026-07-03 19:56 UTC (permalink / raw)
  To: Li Zhe
  Cc: akpm, apopple, arnd, dave.hansen, david, kees, linux-arch,
	linux-hardening, linux-kernel, linux-mm, mingo, rppt, tglx, x86

On Thu, Jul 02, 2026 at 03:37:50PM +0800, Li Zhe wrote:
> The relevant target sizes here are the x86_64 sizeof(struct page)
> values which patch 8 can hit: 64, 80, and 96 bytes.
> 
> On x86_64, CONFIG_HAVE_ALIGNED_STRUCT_PAGE keeps struct page 16-byte
> aligned. That gives 64 bytes in the common case, 80 bytes when
> WANT_PAGE_VIRTUAL or CONFIG_KMSAN adds extra fields, and 96 bytes when
> both are present. I stopped there deliberately because this series does
> not have a current caller which needs larger constant sizes.
> 
> The 32-byte and 48-byte cases were added only as intermediate fixed-size
> combinations used to build those larger struct page-sized copies while
> keeping them on the inline movnti path.

This needs to be in a comment over that code so that it is clear why that
distinction has been made.

> Yes. The separate helper is there exactly because the added
> 32/48/64/80/96-byte cases carry an extra 8-byte destination-alignment
> check, while I left the existing 4/8/16-byte inline cases unchanged.
> 
> The intent was to keep the larger fixed-size cases on the inline
> movnti path only when they already match the alignment assumptions
> used by __memcpy_flushcache(); otherwise they fall back to that
> out-of-line helper, which already handles the misaligned head/tail
> fragments before entering its movnti loops.

Why can't the larger MOVNTI sizes deal with misaligned cases too and thus make
the code even simpler and more straightforward this way?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

* [PATCH v5 8/8] mm: use memcpy_nt() in zone-device template copies
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
                   ` (6 preceding siblings ...)
  2026-07-01  9:05 ` [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths Li Zhe
@ 2026-07-01  9:05 ` Li Zhe
  2026-07-01 23:28 ` [PATCH v5 0/8] mm: optimize zone-device memmap initialization Andrew Morton
  8 siblings, 0 replies; 18+ messages in thread
From: Li Zhe @ 2026-07-01  9:05 UTC (permalink / raw)
  To: akpm, apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt,
	tglx
  Cc: linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	lizhe.67

The template fast path currently uses memcpy() for the actual struct
page copy. Switch zone_device_page_init_from_template() to memcpy_nt()
and add memcpy_nt_drain() before memmap_init_compound(), before
prep_compound_head() updates overlapping tail metadata, and before
returning from memmap_init_zone_device().

ZONE_DEVICE memmap initialization is largely write-once: each struct
page is populated once, and most destination cachelines are not expected
to be reused immediately afterwards. On x86, a regular cached memcpy()
can therefore incur write-allocate traffic by pulling destination
cachelines into the cache before writeback, and can populate the cache
with data that has little near-term reuse. Using memcpy_nt() lets this
path request non-temporal stores for that copy pattern, which can reduce
cache pollution and avoid part of the associated write-allocate
overhead, while architectures without a specialized backend still fall
back to memcpy().

When memcpy_nt() maps to non-temporal stores, order those stores before
memmap_init_compound(), before prep_compound_head() updates overlapping
compound metadata, and before returning from memmap_init_zone_device().
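
The placement of the ordering point relative to the template copies can
be sketched in userspace as follows. Everything here is hypothetical
and only illustrates the pattern: struct desc is not struct page,
copy_nt_stub() stands in for memcpy_nt(), order_stub() stands in for
the drain, and the final store to compound_info stands in for a later
normal store that overlaps an already copied descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor; not the kernel's struct page. */
struct desc {
	unsigned long flags;
	unsigned long pfn;		/* the only per-entry field here */
	unsigned long owner;
	unsigned long compound_info;
};

/* Stand-in for memcpy_nt(); a regular copy in this sketch. */
static void copy_nt_stub(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
}

/* Stand-in for the ordering point; a store fence would go here. */
static void order_stub(void)
{
}

static void init_range_from_template(struct desc *base,
				     unsigned long start_pfn,
				     size_t nr, unsigned long owner)
{
	/* Fields shared by every entry are set up once in the template. */
	struct desc tmpl = { .flags = 1UL, .owner = owner };

	assert(base != NULL);

	for (size_t i = 0; i < nr; i++) {
		/* Refresh only the PFN-dependent field, then copy. */
		tmpl.pfn = start_pfn + i;
		copy_nt_stub(&base[i], &tmpl, sizeof(tmpl));
	}

	/* Order all copies before the normal store below. */
	order_stub();

	/* Later normal store that overlaps an already copied entry. */
	if (nr > 1)
		base[1].compound_info = nr;
}
```

The point of the sketch is the sequence: all template copies, then one
ordering point, then the store that touches a previously copied entry.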

Keep sanitized builds on the slow path so KASAN/KMSAN retain their
instrumented stores.

Tested in a VM with a 100 GB fsdax namespace device configured with
map=dev and a 100 GB devdax namespace (align=2097152) on an Intel Ice
Lake server.

Test procedure:
Rebind the nd_pmem and dax_pmem drivers 30 times and collect the memmap
initialization time from the pr_debug() output of
memmap_init_zone_device().

Base (v7.2-rc1):
  First binding for nd_pmem driver: 1456 ms
  Average of subsequent rebinds: 244.28 ms

  First binding for dax_pmem driver: 1462 ms
  Average of subsequent rebinds: 273.31 ms

With this series:
  First binding for nd_pmem driver: 1272 ms
  Average of subsequent rebinds: 96.79 ms

  First binding for dax_pmem driver: 1354 ms
  Average of subsequent rebinds: 119.04 ms

This reduces the average rebind time by about 60.4% for nd_pmem and
56.4% for dax_pmem.
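
The percentages above follow from the averaged rebind times. A minimal
sketch of the arithmetic, assuming a hypothetical helper named
reduction_pct():

```c
#include <assert.h>

/*
 * Relative reduction, in percent, when going from base_ms to new_ms.
 * Hypothetical helper used only to reproduce the changelog arithmetic.
 */
static double reduction_pct(double base_ms, double new_ms)
{
	assert(base_ms > 0.0);
	return (base_ms - new_ms) / base_ms * 100.0;
}
```

With the numbers listed above, reduction_pct(244.28, 96.79) gives about
60.4 and reduction_pct(273.31, 119.04) gives about 56.4. Applied to the
first-bind times, the same helper gives roughly 12.6 for nd_pmem and
7.4 for dax_pmem.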

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 mm/mm_init.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 60794050bc07..eb8859a62f70 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1068,11 +1068,21 @@ static void __ref zone_device_page_init_slow(struct page *page,

 static inline bool zone_device_page_init_optimization_enabled(void)
 {
+	/*
+	 * Keep sanitized builds on the slow path so their stores stay
+	 * instrumented.
+	 */
+	if (IS_ENABLED(CONFIG_KASAN) || IS_ENABLED(CONFIG_KMSAN))
+		return false;
+
 	/*
 	 * The template fast path copies a preinitialized struct page image.
 	 * Skip it when the page_ref_set tracepoint is enabled.
 	 */
-	return !page_ref_tracepoint_active(page_ref_set);
+	if (page_ref_tracepoint_active(page_ref_set))
+		return false;
+
+	return true;
 }

 static inline void zone_device_template_page_init(struct page *template,
@@ -1117,7 +1127,7 @@ static void zone_device_page_init_from_template(struct page *page,
 	 * to the destination page.
 	 */
 	zone_device_page_update_template(template, pfn);
-	memcpy(page, template, sizeof(*page));
+	memcpy_nt(page, template, sizeof(*page));
 }

 /*
@@ -1188,6 +1198,15 @@ static void __ref memmap_init_compound(struct page *head,
 			zone_device_tail_page_init(page, pfn, zone_idx, nid,
 						   pgmap, head, order);
 	}
+
+	/*
+	 * When the template path is enabled, order the preceding tail-page copies
+	 * before prep_compound_head() updates the overlapping compound metadata
+	 * in the first tail-page descriptors. If memcpy_nt() fell back to
+	 * regular cached stores, memcpy_nt_drain() may be a no-op.
+	 */
+	if (use_template)
+		memcpy_nt_drain();
 	prep_compound_head(head, order);
 }

@@ -1257,10 +1276,26 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		if (pfns_per_compound == 1)
 			continue;

+		/*
+		 * When the template path is enabled, order the preceding head-page copy
+		 * before memmap_init_compound(), which immediately updates compound-head
+		 * metadata. If memcpy_nt() fell back to regular cached stores,
+		 * memcpy_nt_drain() may be a no-op.
+		 */
+		if (use_template)
+			memcpy_nt_drain();
+
 		memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
 				     compound_nr_pages(pfn, altmap, pgmap),
 				     use_template);
 	}
+	/*
+	 * Ensure any prior template copies are ordered before returning.
+	 * On architectures where memcpy_nt() used regular cached stores,
+	 * memcpy_nt_drain() may be a no-op.
+	 */
+	if (use_template)
+		memcpy_nt_drain();

 	pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);

-- 
2.20.1

* Re: [PATCH v5 0/8] mm: optimize zone-device memmap initialization
  2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
                   ` (7 preceding siblings ...)
  2026-07-01  9:05 ` [PATCH v5 8/8] mm: use memcpy_nt() in zone-device template copies Li Zhe
@ 2026-07-01 23:28 ` Andrew Morton
  2026-07-02  2:24   ` Borislav Petkov
  2026-07-02  7:39   ` Li Zhe
  8 siblings, 2 replies; 18+ messages in thread
From: Andrew Morton @ 2026-07-01 23:28 UTC (permalink / raw)
  To: Li Zhe
  Cc: apopple, arnd, bp, dave.hansen, david, kees, mingo, rppt, tglx,
	linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	Balbir Singh

On Wed,  1 Jul 2026 17:05:45 +0800 "Li Zhe" <lizhe.67@bytedance.com> wrote:

> memmap_init_zone_device() can take a noticeable amount of time when large
> pmem namespaces are bound or rebound, because it initializes nearly
> identical struct page descriptors one PFN at a time. This series reduces
> that ZONE_DEVICE memmap initialization overhead by reusing prepared
> struct page templates and, on x86, using memcpy_nt() for the template
> copy path.
> 
> The main target is large fsdax/devdax pmem configurations, where the
> cost of initializing the memmap shows up directly in nd_pmem/dax_pmem
> bind and rebind latency.
> 
> Patches 1-3 are preparatory cleanups and helper extraction. Patches 4-5
> add the template-copy fast path for head pages and compound tails.
> Patches 6-8 introduce memcpy_nt()/memcpy_nt_drain(), extend the x86
> fixed-size memcpy_flushcache() inline cases used by that helper, and
> switch the template-copy path over to memcpy_nt().
> 
> The fast path remains disabled when the page_ref_set tracepoint is
> active, and sanitized builds stay on the slow path so their instrumented
> stores are preserved. Architectures without a specialized memcpy_nt()
> backend continue to fall back to memcpy().
> 
> Tested in a VM with a 100 GB fsdax namespace device configured with
> map=dev and a 100 GB devdax namespace (align=2097152) on Intel Ice Lake
> server.

Thanks for persisting with this.

Review is still thin :( I see that Mike, Boris and Alistair have
commented on previous versions.  As did Balbir, who wasn't cc'ed on
this (fixed).

I'll add it to mm.git for testing exposure (because I'm still a sucker
for speedups), but more review is needed, please.



* Re: [PATCH v5 0/8] mm: optimize zone-device memmap initialization
  2026-07-01 23:28 ` [PATCH v5 0/8] mm: optimize zone-device memmap initialization Andrew Morton
@ 2026-07-02  2:24   ` Borislav Petkov
  2026-07-02  7:39   ` Li Zhe
  1 sibling, 0 replies; 18+ messages in thread
From: Borislav Petkov @ 2026-07-02  2:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Li Zhe, apopple, arnd, dave.hansen, david, kees, mingo, rppt,
	tglx, linux-arch, linux-hardening, linux-kernel, linux-mm, x86,
	Balbir Singh

On Wed, Jul 01, 2026 at 04:28:08PM -0700, Andrew Morton wrote:
> I'll add it to mm.git for testing exposure (because I'm still a sucker
> for speedups), but more review is needed, please.

Yeah, it'll need more churn until it looks sensible.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v5 0/8] mm: optimize zone-device memmap initialization
  2026-07-01 23:28 ` [PATCH v5 0/8] mm: optimize zone-device memmap initialization Andrew Morton
  2026-07-02  2:24   ` Borislav Petkov
@ 2026-07-02  7:39   ` Li Zhe
  1 sibling, 0 replies; 18+ messages in thread
From: Li Zhe @ 2026-07-02  7:39 UTC (permalink / raw)
  To: akpm
  Cc: apopple, arnd, balbirs, bp, dave.hansen, david, kees, linux-arch,
	linux-hardening, linux-kernel, linux-mm, lizhe.67, mingo, rppt,
	tglx, x86

On Wed, 1 Jul 2026 16:28:08 -0700, akpm@linux-foundation.org wrote:

> On Wed,  1 Jul 2026 17:05:45 +0800 "Li Zhe" <lizhe.67@bytedance.com> wrote:
> 
> > memmap_init_zone_device() can take a noticeable amount of time when large
> > pmem namespaces are bound or rebound, because it initializes nearly
> > identical struct page descriptors one PFN at a time. This series reduces
> > that ZONE_DEVICE memmap initialization overhead by reusing prepared
> > struct page templates and, on x86, using memcpy_nt() for the template
> > copy path.
> >
> > The main target is large fsdax/devdax pmem configurations, where the
> > cost of initializing the memmap shows up directly in nd_pmem/dax_pmem
> > bind and rebind latency.
> >
> > Patches 1-3 are preparatory cleanups and helper extraction. Patches 4-5
> > add the template-copy fast path for head pages and compound tails.
> > Patches 6-8 introduce memcpy_nt()/memcpy_nt_drain(), extend the x86
> > fixed-size memcpy_flushcache() inline cases used by that helper, and
> > switch the template-copy path over to memcpy_nt().
> >
> > The fast path remains disabled when the page_ref_set tracepoint is
> > active, and sanitized builds stay on the slow path so their instrumented
> > stores are preserved. Architectures without a specialized memcpy_nt()
> > backend continue to fall back to memcpy().
> >
> > Tested in a VM with a 100 GB fsdax namespace device configured with
> > map=dev and a 100 GB devdax namespace (align=2097152) on Intel Ice Lake
> > server.
> 
> Thanks for persisting with this.
> 
> Review is still thin :( I see that Mike, Boris and Alistair have
> commented on previous versions.  As did Balbir, who wasn't cc'ed on
> this (fixed).
> 
> I'll add it to mm.git for testing exposure (because I'm still a sucker
> for speedups), but more review is needed, please.

Thanks, Andrew. I appreciate your help here. I'll address the latest
review comments in the next version.

Thanks,
Zhe


end of thread, other threads:[~2026-07-03 19:56 UTC | newest]

Thread overview: 18+ messages
2026-07-01  9:05 [PATCH v5 0/8] mm: optimize zone-device memmap initialization Li Zhe
2026-07-01  9:05 ` [PATCH v5 1/8] mm: fix stale ZONE_DEVICE refcount comment Li Zhe
2026-07-01  9:05 ` [PATCH v5 2/8] mm: factor zone-device page init helpers out of __init_zone_device_page Li Zhe
2026-07-01  9:05 ` [PATCH v5 3/8] mm: add a set_page_section_from_pfn() helper Li Zhe
2026-07-01  9:05 ` [PATCH v5 4/8] mm: add a template-based fast path for zone-device page init Li Zhe
2026-07-03 14:06   ` Mike Rapoport
2026-07-01  9:05 ` [PATCH v5 5/8] mm: extend the template fast path to zone-device compound tails Li Zhe
2026-07-01  9:05 ` [PATCH v5 6/8] string: introduce memcpy_nt() helpers Li Zhe
2026-07-02  2:16   ` Borislav Petkov
2026-07-02  7:36     ` Li Zhe
2026-07-01  9:05 ` [PATCH v5 7/8] x86/string: extend memcpy_flushcache() fixed-size fastpaths Li Zhe
2026-07-02  2:23   ` Borislav Petkov
2026-07-02  7:37     ` Li Zhe
2026-07-03 19:56       ` Borislav Petkov
2026-07-01  9:05 ` [PATCH v5 8/8] mm: use memcpy_nt() in zone-device template copies Li Zhe
2026-07-01 23:28 ` [PATCH v5 0/8] mm: optimize zone-device memmap initialization Andrew Morton
2026-07-02  2:24   ` Borislav Petkov
2026-07-02  7:39   ` Li Zhe
