From: Alistair Popple <apopple@nvidia.com>
To: Li Zhe <lizhe.67@bytedance.com>
Cc: akpm@linux-foundation.org, arnd@arndb.de, bp@alien8.de,
	 dave.hansen@linux.intel.com, david@kernel.org, kees@kernel.org,
	mingo@redhat.com,  rppt@kernel.org, tglx@kernel.org,
	linux-arch@vger.kernel.org,  linux-hardening@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org
Subject: Re: [PATCH v4 0/8] mm: speed up ZONE_DEVICE memmap initialization
Date: Thu, 4 Jun 2026 18:14:05 +1000
Message-ID: <aiEoByaQdRR3xtM5@nvdebian.thelocal>
In-Reply-To: <20260603080152.64728-1-lizhe.67@bytedance.com>

On 2026-06-03 at 18:01 +1000, Li Zhe <lizhe.67@bytedance.com> wrote...
> memmap_init_zone_device() can spend a substantial amount of time
> initializing large ZONE_DEVICE ranges because it repeats nearly
> identical struct page setup for every PFN.
> 
> This series reduces that overhead in eight steps.
> 
> The first patch fixes a stale comment in __init_zone_device_page() so
> the documented refcount policy matches the current ZONE_DEVICE code.
> 
> The second patch factors the reusable pieces out of
> __init_zone_device_page() so later patches can share the same logic
> without changing the existing slow path.
> 
> The third patch adds set_page_section_from_pfn(), so callers that want
> to refresh section bits from a PFN no longer need to open-code
> SECTION_IN_PAGE_FLAGS handling.
> 
> The fourth patch adds a template-based fast path for ZONE_DEVICE head
> pages. Instead of rebuilding the same struct page state for every PFN,
> it prepares one reusable template through the existing slow path,
> refreshes the PFN-dependent fields in that template, and copies it to
> each destination page.
> 
> The fifth patch extends the same template-based approach to compound
> tails, so pfns_per_compound > 1 can also benefit from the fast path.
> 
> The sixth patch introduces memcpy_streaming() and
> memcpy_streaming_drain() as a generic interface for write-once copies.
> Architectures that do not provide a specialized backend, or cases that
> cannot safely use one, fall back to memcpy().
> 
> The seventh patch extends x86 memcpy_flushcache() small fixed-size
> fastpaths so struct-page-sized streaming copies can stay on the inline
> path when alignment permits.
> 
> The last patch switches the ZONE_DEVICE template-copy path over to
> memcpy_streaming(). It keeps pageblock-aligned PFNs on regular memcpy(),
> uses memcpy_streaming() for the remaining write-once copies, and drains
> streaming stores before later metadata updates that may depend on them.
> 
> This is not intended as a steady-state data-path optimization. Its
> benefit is in pmem bring-up paths where memmap_init_zone_device()
> dominates device online / rebind latency, such as:
>   - fsdax or devdax namespace creation and reconfiguration
>   - nd_pmem / dax_pmem driver bind or rebind
> 
> In those paths, the kernel initializes a large vmemmap range once and
> does not immediately benefit from keeping the copied struct page state
> hot in cache. Reducing write-allocate traffic in that one-time setup
> path can therefore reduce end-to-end device bring-up latency.
> 
> The optimized path is disabled when the page_ref_set tracepoint is
> enabled, and sanitized builds remain on the slow path so their
> instrumented stores are preserved.
> 
> Testing
> =======
> 
> Tests were run in a VM on an Intel Ice Lake server.
> 
> Two PMEM configurations were used:
>   - a 100 GB fsdax namespace configured with map=dev, which exercises
>     the nd_pmem rebind path (pfns_per_compound == 1)
>   - a 100 GB devdax namespace configured with align=2097152, which
>     exercises the dax_pmem rebind path (pfns_per_compound > 1)
> 
> For each configuration, the corresponding driver was unbound and
> rebound 30 times. Memmap initialization latency was collected from the
> pr_debug() output of memmap_init_zone_device().
> 
> The first bind is reported separately, and the average of subsequent
> rebinds is used as the steady-state result.
> 
> Performance
> ===========
> 
> nd_pmem rebind, 100 GB fsdax namespace, map=dev
>   Base(v7.1-rc6):
>     First binding: 1466 ms
>     Average of subsequent rebinds: 262.12 ms
>   Full series:
>     First binding: 1359 ms
>     Average of subsequent rebinds: 108.36 ms
> 
> dax_pmem rebind, 100 GB devdax namespace, align=2097152
>   Base(v7.1-rc6):
>     First binding: 1430 ms
>     Average of subsequent rebinds: 229.12 ms
>   Full series:
>     First binding: 1273 ms
>     Average of subsequent rebinds: 100.17 ms

The results here are impressive, but I've been having trouble replicating them
with hmm_test on my local development machines. Both an older AMD machine and
a newer Arrow Lake based machine show ~3% worse performance with this series
applied when setting up device-private ZONE_DEVICE memory
(MEMORY_DEVICE_PRIVATE).

This is based on measuring the memremap_pages() call when inserting test_hmm.ko
in a VM, using the following hack to time ten 64GB memremaps and report the
average. Is there an easy way for me to replicate your results in a VM? Or is
there something in my testing that I'm missing here?

---

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 213504915737..a1d5463dbc86 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -34,7 +34,7 @@
 
 #define DMIRROR_NDEVICES		4
 #define DMIRROR_RANGE_FAULT_TIMEOUT	1000
-#define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
+#define DEVMEM_CHUNK_SIZE		(64 * 1024 * 1024 * 1024UL)
 #define DEVMEM_CHUNKS_RESERVE		16
 
 /*
@@ -565,6 +565,8 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	unsigned long pfn_last;
 	void *ptr;
 	int ret = -ENOMEM;
+	int i;
+	u64 t0, total = 0;
 
 	devmem = kzalloc_obj(*devmem);
 	if (!devmem)
@@ -613,6 +615,22 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 		mdevice->devmem_capacity = new_capacity;
 		mdevice->devmem_chunks = new_chunks;
 	}
+
+	for (i = 0; i < 10; i++) {
+		t0 = ktime_get_ns();
+		ptr = memremap_pages(&devmem->pagemap, numa_node_id());
+		total += ktime_get_ns() - t0;
+		if (IS_ERR_OR_NULL(ptr)) {
+			if (ptr)
+				ret = PTR_ERR(ptr);
+			else
+				ret = -EFAULT;
+			goto err_release;
+		}
+		memunmap_pages(&devmem->pagemap);
+	}
+	pr_info("avg memremap %llu ns\n", total / i);
+
 	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
 	if (IS_ERR_OR_NULL(ptr)) {
 		if (ptr)
@@ -629,7 +647,7 @@ static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 	mutex_unlock(&mdevice->devmem_lock);
 
-	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+	pr_info("added new %lu MB chunk (total %u chunks, %lu MB) PFNs [0x%lx 0x%lx)\n",
 		DEVMEM_CHUNK_SIZE / (1024 * 1024),
 		mdevice->devmem_count,
 		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),

> Li Zhe (8):
>   mm: fix stale ZONE_DEVICE refcount comment
>   mm: factor zone-device page init helpers out of
>     __init_zone_device_page
>   mm: add a set_page_section_from_pfn() helper
>   mm: add a template-based fast path for zone-device page init
>   mm: extend the template fast path to zone-device compound tails
>   string: introduce memcpy_streaming() helpers
>   x86/string: extend memcpy_flushcache() fixed-size fastpaths
>   mm: use memcpy_streaming() in zone-device template copies
> 
>  arch/x86/include/asm/string_64.h | 140 ++++++++++++++++++--
>  include/linux/mm.h               |  19 ++-
>  include/linux/string.h           |  20 +++
>  mm/mm_init.c                     | 221 +++++++++++++++++++++++++++----
>  4 files changed, 360 insertions(+), 40 deletions(-)
> 
> ---
> v3: https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/
> v2: https://lore.kernel.org/all/20260521040124.10608-1-lizhe.67@bytedance.com/
> v1: https://lore.kernel.org/all/20260515082045.63029-1-lizhe.67@bytedance.com/
> 
> Changelogs:
> 
> v3->v4:
> - Rebase the series from v7.1-rc3 to v7.1-rc6.
> - Rework patch 4 so the reusable head-page template is seeded from the
>   first real struct page, rather than being initialized directly on a
>   stack-resident template object. Also add an explicit !nr_pages early
>   return. Suggested by Andrew Morton.
> - Rework patch 5 similarly for compound tails: seed the reusable
>   tail-page template from the first real tail page, thread
>   use_template through compound-page initialization, and reuse that
>   prepared tail-page image for the remaining tails. Suggested by Andrew
>   Morton.
> - Tighten patch 6 so memcpy_streaming() maps to memcpy_flushcache() only
>   when the destination alignment and size allow the transfer to stay
>   entirely on the non-temporal path; other cases fall back to memcpy().
>   Suggested by Andrew Morton.
> - Rework patch 7 so the existing 4/8/16-byte cases remain handled
>   directly in memcpy_flushcache(), while the new aligned fixed-size
>   fastpaths cover only the larger 32/48/64/80/96-byte cases. Suggested
>   by Andrew Morton.
> 
> For changelogs of earlier revisions, please refer to the v3 cover letter:
> https://lore.kernel.org/all/20260527033636.28231-1-lizhe.67@bytedance.com/
> 
> -- 
> 2.20.1
