Linux-mm Archive on lore.kernel.org

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [patch 07/38] treewide: Consolidate cycles_t
From: Geert Uytterhoeven @ 2026-04-16 11:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Arnd Bergmann, x86, Lu Baolu, iommu, Michael Grzeschik,
	netdev, linux-wireless, Herbert Xu, linux-crypto, Vlastimil Babka,
	linux-mm, David Woodhouse, Bernie Thompson, linux-fbdev,
	Theodore Tso, linux-ext4, Andrew Morton, Uladzislau Rezki,
	Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
	Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
	linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
	Huacai Chen, loongarch, Dinh Nguyen, Jonas Bonn, linux-openrisc,
	Helge Deller, linux-parisc, Michael Ellerman, linuxppc-dev,
	Paul Walmsley, linux-riscv, Heiko Carstens, linux-s390,
	David S. Miller, sparclinux
In-Reply-To: <20260410120318.045532623@kernel.org>

On Fri, 10 Apr 2026 at 14:19, Thomas Gleixner <tglx@kernel.org> wrote:
> Most architectures define cycles_t as unsigned long execpt:
>
>  - x86 requires it to be 64-bit independent of the 32-bit/64-bit build.
>
>  - parisc and mips define it as unsigned int
>
>    parisc has no real reason to do so as there are only a few usage sites
>    which either expand it to a 64-bit value or utilize only the lower
>    32bits.
>
>    mips has no real requirement either.
>
> Move the typedef to types.h and provide a config switch to enforce the
> 64-bit type for x86.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>

>  arch/m68k/include/asm/timex.h      |    2 --

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k

Gr{oetje,eeting}s,

                        Geert


--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply

* Re: [patch 27/38] m68k: Select ARCH_HAS_RANDOM_ENTROPY
From: Geert Uytterhoeven @ 2026-04-16 11:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, linux-m68k, Arnd Bergmann, x86, Lu Baolu, iommu,
	Michael Grzeschik, netdev, linux-wireless, Herbert Xu,
	linux-crypto, Vlastimil Babka, linux-mm, David Woodhouse,
	Bernie Thompson, linux-fbdev, Theodore Tso, linux-ext4,
	Andrew Morton, Uladzislau Rezki, Marco Elver, Dmitry Vyukov,
	kasan-dev, Andrey Ryabinin, Thomas Sailer, linux-hams,
	Jason A. Donenfeld, Richard Henderson, linux-alpha, Russell King,
	linux-arm-kernel, Catalin Marinas, Huacai Chen, loongarch,
	Dinh Nguyen, Jonas Bonn, linux-openrisc, Helge Deller,
	linux-parisc, Michael Ellerman, linuxppc-dev, Paul Walmsley,
	linux-riscv, Heiko Carstens, linux-s390, David S. Miller,
	sparclinux
In-Reply-To: <20260410120319.397219631@kernel.org>

On Fri, 10 Apr 2026 at 14:20, Thomas Gleixner <tglx@kernel.org> wrote:
> The only remaining usage of get_cycles() is to provide
> random_get_entropy().
>
> Switch m68k over to the new scheme of selecting ARCH_HAS_RANDOM_ENTROPY and
> providing random_get_entropy() in asm/random.h.
>
> Remove asm/timex.h as it has no functionality anymore.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>

Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert


--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply

* Re: [patch 05/38] treewide: Remove CLOCK_TICK_RATE
From: Geert Uytterhoeven @ 2026-04-16 11:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Arnd Bergmann, x86, Lu Baolu, iommu, Michael Grzeschik,
	netdev, linux-wireless, Herbert Xu, linux-crypto, Vlastimil Babka,
	linux-mm, David Woodhouse, Bernie Thompson, linux-fbdev,
	Theodore Tso, linux-ext4, Andrew Morton, Uladzislau Rezki,
	Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
	Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
	linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
	Huacai Chen, loongarch, linux-m68k, Dinh Nguyen, Jonas Bonn,
	linux-openrisc, Helge Deller, linux-parisc, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, linux-riscv, Heiko Carstens,
	linux-s390, David S. Miller, sparclinux
In-Reply-To: <20260410120317.910770161@kernel.org>

On Fri, 10 Apr 2026 at 14:18, Thomas Gleixner <tglx@kernel.org> wrote:
> This has been scheduled for removal more than a decade ago and the comments
> related to it have been dutifully ignored. The last dependencies are gone.
>
> Remove it along with various now empty asm/timex.h files.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>

>  arch/m68k/include/asm/timex.h       |   15 ---------------

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k

Gr{oetje,eeting}s,

                        Geert


--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply

* Re: [PATCH v3 3/4] arch, mm: consolidate empty_zero_page
From: Thomas Weißschuh @ 2026-04-16 11:20 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Yoshinori Sato, Rich Felker, John Paul Adrian Glaubitz,
	Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, linux-kernel, linux-sh,
	linux-mm
In-Reply-To: <aeDBw4O655-pXiHy@kernel.org>

On Thu, Apr 16, 2026 at 02:02:27PM +0300, Mike Rapoport wrote:
> On Thu, Apr 16, 2026 at 10:10:06AM +0200, Thomas Weißschuh wrote:
> > On Wed, Feb 11, 2026 at 12:31:40PM +0200, Mike Rapoport wrote:
> > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > > 
> > > Reduce 22 declarations of empty_zero_page to 3 and 23 declarations of
> > > ZERO_PAGE() to 4.
> > > 
> > > Every architecture defines empty_zero_page that way or another, but for the
> > > most of them it is always a page aligned page in BSS and most definitions
> > > of ZERO_PAGE do virt_to_page(empty_zero_page).
> > > 
> > > Move Linus vetted x86 definition of empty_zero_page and ZERO_PAGE() to the
> > > core MM and drop these definitions in architectures that do not implement
> > > colored zero page (MIPS and s390).
> > > 
> > > ZERO_PAGE() remains a macro because turning it to a wrapper for a static
> > > inline causes severe pain in header dependencies.
> > > 
> > > For the most part the change is mechanical, with these being noteworthy:
> > > 
> > > * alpha: aliased empty_zero_page with ZERO_PGE that was also used for boot
> > >   parameters. Switching to a generic empty_zero_page removes the aliasing
> > >   and keeps ZERO_PGE for boot parameters only
> > > * arm64: uses __pa_symbol() in ZERO_PAGE() so that definition of
> > >   ZERO_PAGE() is kept intact.
> > > * m68k/parisc/um: allocated empty_zero_page from memblock,
> > >   although they do not support zero page coloring and having it in BSS
> > >   will work fine.
> > > * sparc64 can have empty_zero_page in BSS rather allocate it, but it
> > >   can't use virt_to_page() for BSS. Keep it's definition of ZERO_PAGE()
> > >   but instead of allocating it, make mem_map_zero point to
> > >   empty_zero_page.
> > > * sh: used empty_zero_page for boot parameters at the very early boot.
> > >   Rename the parameters page to boot_params_page and let sh use the generic
> > >   empty_zero_page.
> > 
> > With this in mainline as commit 6215d9f4470f ("arch, mm: consolidate
> > empty_zero_page") booting sh on QEMU is now broken.
> > The machine hangs before any output.
> 
> Hmm, looks like sh does not like boot_param_page declared as unsigned char *
> This fixes the issue for me:
> 
> diff --git a/arch/sh/include/asm/setup.h b/arch/sh/include/asm/setup.h
> index 63c9efc06348..b7c4469cb61e 100644
> --- a/arch/sh/include/asm/setup.h
> +++ b/arch/sh/include/asm/setup.h
> @@ -3,12 +3,13 @@
>  #define _SH_SETUP_H
>  
>  #include <uapi/asm/setup.h>
> +#include <asm/page.h>
>  
>  /*
>   * This is set up by the setup-routine at boot-time
>   */
> -extern unsigned char *boot_params_page;
> -#define PARAM boot_params_page
> +extern unsigned long boot_params_page[PAGE_SIZE / sizeof(unsigned long)];
> +#define PARAM ((unsigned char *)boot_params_page)
>  
>  #define MOUNT_ROOT_RDONLY (*(unsigned long *) (PARAM+0x000))
>  #define RAMDISK_FLAGS (*(unsigned long *) (PARAM+0x004))

Seems weird but works.

Tested-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>

Thanks!

> > Reproducer:
> > ./tools/testing/kunit/kunit.py run --arch sh --cross_compile sh4-linux- --raw_output=all example
> > 
> > > * hexagon: had an amusing comment about empty_zero_page
> > > 
> > > 	/* A handy thing to have if one has the RAM. Declared in head.S */
> > > 
> > >   that unfortunately had to go :)
> > 
> > (...)


^ permalink raw reply

* [PATCH v8 2/2] kho: make preserved pages compatible with deferred struct page init
From: Michal Clapinski @ 2026-04-16 11:06 UTC (permalink / raw)
  To: Evangelos Petrongonas, Pasha Tatashin, Mike Rapoport,
	Pratyush Yadav, Alexander Graf, Samiullah Khawaja, kexec,
	linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Michal Clapinski
In-Reply-To: <20260416110654.247398-1-mclapinski@google.com>

From: Evangelos Petrongonas <epetron@amazon.de>

When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, struct page
initialization is deferred to parallel kthreads that run later
in the boot process.

During KHO restoration, kho_preserved_memory_reserve() writes metadata
for each preserved memory region. However, if the struct page has not
been initialized, this write targets uninitialized memory, potentially
leading to errors like:
BUG: unable to handle page fault for address: ...

Fix this by introducing kho_get_preserved_page(),  which ensures
all struct pages in a preserved region are initialized by calling
init_deferred_page() which is a no-op when the struct page is already
initialized.

Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
Co-developed-by: Michal Clapinski <mclapinski@google.com>
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 kernel/liveupdate/Kconfig          |  2 --
 kernel/liveupdate/kexec_handover.c | 27 ++++++++++++++++++++++++++-
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index 1a8513f16ef7..c13af38ba23a 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -1,12 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 menu "Live Update and Kexec HandOver"
-	depends on !DEFERRED_STRUCT_PAGE_INIT
 
 config KEXEC_HANDOVER
 	bool "kexec handover"
 	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
-	depends on !DEFERRED_STRUCT_PAGE_INIT
 	select MEMBLOCK_KHO_SCRATCH
 	select KEXEC_FILE
 	select LIBFDT
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index a507366a2cf9..d5718bef6d4d 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -473,6 +473,31 @@ struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages)
 }
 EXPORT_SYMBOL_GPL(kho_restore_pages);
 
+/*
+ * With CONFIG_DEFERRED_STRUCT_PAGE_INIT, struct pages in higher memory regions
+ * may not be initialized yet at the time KHO deserializes preserved memory.
+ * KHO uses the struct page to store metadata and a later initialization would
+ * overwrite it.
+ * Ensure all the struct pages in the preservation are
+ * initialized. kho_preserved_memory_reserve() marks the reservation as noinit
+ * to make sure they don't get re-initialized later.
+ */
+static struct page *__init kho_get_preserved_page(phys_addr_t phys,
+						  unsigned int order)
+{
+	unsigned long pfn = PHYS_PFN(phys);
+	int nid;
+
+	if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
+		return pfn_to_page(pfn);
+
+	nid = early_pfn_to_nid(pfn);
+	for (unsigned long i = 0; i < (1UL << order); i++)
+		init_deferred_page(pfn + i, nid);
+
+	return pfn_to_page(pfn);
+}
+
 static int __init kho_preserved_memory_reserve(phys_addr_t phys,
 					       unsigned int order)
 {
@@ -481,7 +506,7 @@ static int __init kho_preserved_memory_reserve(phys_addr_t phys,
 	u64 sz;
 
 	sz = 1 << (order + PAGE_SHIFT);
-	page = phys_to_page(phys);
+	page = kho_get_preserved_page(phys, order);
 
 	/* Reserve the memory preserved in KHO in memblock */
 	memblock_reserve(phys, sz);
-- 
2.54.0.rc1.555.g9c883467ad-goog



^ permalink raw reply related

* [PATCH v8 1/2] kho: fix deferred initialization of scratch areas
From: Michal Clapinski @ 2026-04-16 11:06 UTC (permalink / raw)
  To: Evangelos Petrongonas, Pasha Tatashin, Mike Rapoport,
	Pratyush Yadav, Alexander Graf, Samiullah Khawaja, kexec,
	linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Michal Clapinski
In-Reply-To: <20260416110654.247398-1-mclapinski@google.com>

Currently, if CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled,
kho_release_scratch() will initialize the struct pages and set migratetype
of KHO scratch. Unless the whole scratch fits below first_deferred_pfn,
some of that will be overwritten either by deferred_init_pages() or
memmap_init_reserved_range().

To fix it, make memmap_init_range(), deferred_init_memmap_chunk() and
memmap_init_reserved_range() recognize KHO scratch regions and set
migratetype of pageblocks in those regions to MIGRATE_CMA.

Signed-off-by: Michal Clapinski <mclapinski@google.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/memblock.h           |  7 +++--
 kernel/liveupdate/kexec_handover.c | 25 ------------------
 mm/memblock.c                      | 41 ++++++++++++++----------------
 mm/mm_init.c                       | 27 ++++++++++++++------
 4 files changed, 43 insertions(+), 57 deletions(-)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6ec5e9ac0699..410f2a399691 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -614,11 +614,14 @@ static inline void memtest_report_meminfo(struct seq_file *m) { }
 #ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
 void memblock_set_kho_scratch_only(void);
 void memblock_clear_kho_scratch_only(void);
-void memmap_init_kho_scratch_pages(void);
+bool memblock_is_kho_scratch_memory(phys_addr_t addr);
 #else
 static inline void memblock_set_kho_scratch_only(void) { }
 static inline void memblock_clear_kho_scratch_only(void) { }
-static inline void memmap_init_kho_scratch_pages(void) {}
+static inline bool memblock_is_kho_scratch_memory(phys_addr_t addr)
+{
+	return false;
+}
 #endif
 
 #endif /* _LINUX_MEMBLOCK_H */
diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 18509d8082ea..a507366a2cf9 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -1576,35 +1576,10 @@ static __init int kho_init(void)
 }
 fs_initcall(kho_init);
 
-static void __init kho_release_scratch(void)
-{
-	phys_addr_t start, end;
-	u64 i;
-
-	memmap_init_kho_scratch_pages();
-
-	/*
-	 * Mark scratch mem as CMA before we return it. That way we
-	 * ensure that no kernel allocations happen on it. That means
-	 * we can reuse it as scratch memory again later.
-	 */
-	__for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
-			     MEMBLOCK_KHO_SCRATCH, &start, &end, NULL) {
-		ulong start_pfn = pageblock_start_pfn(PFN_DOWN(start));
-		ulong end_pfn = pageblock_align(PFN_UP(end));
-		ulong pfn;
-
-		for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages)
-			init_pageblock_migratetype(pfn_to_page(pfn),
-						   MIGRATE_CMA, false);
-	}
-}
-
 void __init kho_memory_init(void)
 {
 	if (kho_in.scratch_phys) {
 		kho_scratch = phys_to_virt(kho_in.scratch_phys);
-		kho_release_scratch();
 
 		if (kho_mem_retrieve(kho_get_fdt()))
 			kho_in.fdt_phys = 0;
diff --git a/mm/memblock.c b/mm/memblock.c
index 4224fdaa8918..fab234f732c3 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -17,6 +17,7 @@
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
 #include <linux/mutex.h>
+#include <linux/page-isolation.h>
 
 #ifdef CONFIG_KEXEC_HANDOVER
 #include <linux/libfdt.h>
@@ -959,28 +960,6 @@ __init void memblock_clear_kho_scratch_only(void)
 {
 	kho_scratch_only = false;
 }
-
-__init void memmap_init_kho_scratch_pages(void)
-{
-	phys_addr_t start, end;
-	unsigned long pfn;
-	int nid;
-	u64 i;
-
-	if (!IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT))
-		return;
-
-	/*
-	 * Initialize struct pages for free scratch memory.
-	 * The struct pages for reserved scratch memory will be set up in
-	 * memmap_init_reserved_pages()
-	 */
-	__for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
-			     MEMBLOCK_KHO_SCRATCH, &start, &end, &nid) {
-		for (pfn = PFN_UP(start); pfn < PFN_DOWN(end); pfn++)
-			init_deferred_page(pfn, nid);
-	}
-}
 #endif
 
 /**
@@ -1971,6 +1950,18 @@ bool __init_memblock memblock_is_map_memory(phys_addr_t addr)
 	return !memblock_is_nomap(&memblock.memory.regions[i]);
 }
 
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+bool __init_memblock memblock_is_kho_scratch_memory(phys_addr_t addr)
+{
+	int i = memblock_search(&memblock.memory, addr);
+
+	if (i == -1)
+		return false;
+
+	return memblock_is_kho_scratch(&memblock.memory.regions[i]);
+}
+#endif
+
 int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
 			 unsigned long *start_pfn, unsigned long *end_pfn)
 {
@@ -2262,6 +2253,12 @@ static void __init memmap_init_reserved_range(phys_addr_t start,
 		 * access it yet.
 		 */
 		__SetPageReserved(page);
+
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+		if (memblock_is_kho_scratch_memory(PFN_PHYS(pfn)) &&
+		    pageblock_aligned(pfn))
+			init_pageblock_migratetype(page, MIGRATE_CMA, false);
+#endif
 	}
 }
 
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f9f8e1af921c..890c3ae21ba0 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -916,8 +916,15 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 		 * over the place during system boot.
 		 */
 		if (pageblock_aligned(pfn)) {
-			init_pageblock_migratetype(page, migratetype,
-					isolate_pageblock);
+			int mt = migratetype;
+
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+			if (memblock_is_kho_scratch_memory(page_to_phys(page)))
+				mt = MIGRATE_CMA;
+#endif
+
+			init_pageblock_migratetype(page, mt,
+						   isolate_pageblock);
 			cond_resched();
 		}
 		pfn++;
@@ -1970,7 +1977,7 @@ unsigned long __init node_map_pfn_alignment(void)
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 static void __init deferred_free_pages(unsigned long pfn,
-		unsigned long nr_pages)
+		unsigned long nr_pages, enum migratetype mt)
 {
 	struct page *page;
 	unsigned long i;
@@ -1983,8 +1990,7 @@ static void __init deferred_free_pages(unsigned long pfn,
 	/* Free a large naturally-aligned chunk if possible */
 	if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
 		for (i = 0; i < nr_pages; i += pageblock_nr_pages)
-			init_pageblock_migratetype(page + i, MIGRATE_MOVABLE,
-					false);
+			init_pageblock_migratetype(page + i, mt, false);
 		__free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
 		return;
 	}
@@ -1994,8 +2000,7 @@ static void __init deferred_free_pages(unsigned long pfn,
 
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if (pageblock_aligned(pfn))
-			init_pageblock_migratetype(page, MIGRATE_MOVABLE,
-					false);
+			init_pageblock_migratetype(page, mt, false);
 		__free_pages_core(page, 0, MEMINIT_EARLY);
 	}
 }
@@ -2051,6 +2056,7 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 	u64 i = 0;
 
 	for_each_free_mem_range(i, nid, 0, &start, &end, NULL) {
+		enum migratetype mt = MIGRATE_MOVABLE;
 		unsigned long spfn = PFN_UP(start);
 		unsigned long epfn = PFN_DOWN(end);
 
@@ -2060,12 +2066,17 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 		spfn = max(spfn, start_pfn);
 		epfn = min(epfn, end_pfn);
 
+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
+		if (memblock_is_kho_scratch_memory(PFN_PHYS(spfn)))
+			mt = MIGRATE_CMA;
+#endif
+
 		while (spfn < epfn) {
 			unsigned long mo_pfn = ALIGN(spfn + 1, MAX_ORDER_NR_PAGES);
 			unsigned long chunk_end = min(mo_pfn, epfn);
 
 			nr_pages += deferred_init_pages(zone, spfn, chunk_end);
-			deferred_free_pages(spfn, chunk_end - spfn);
+			deferred_free_pages(spfn, chunk_end - spfn, mt);
 
 			spfn = chunk_end;
 
-- 
2.54.0.rc1.555.g9c883467ad-goog



^ permalink raw reply related

* [PATCH v8 0/2] kho: add support for deferred struct page init
From: Michal Clapinski @ 2026-04-16 11:06 UTC (permalink / raw)
  To: Evangelos Petrongonas, Pasha Tatashin, Mike Rapoport,
	Pratyush Yadav, Alexander Graf, Samiullah Khawaja, kexec,
	linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Michal Clapinski

When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, struct page
initialization is deferred to parallel kthreads that run later in
the boot process.

Currently, KHO is incompatible with DEFERRED.
This series fixes that incompatibility.
---
v8:
- moved overriding the migratetype from init_pageblock_migratetype
  to callsites
v7:
- reimplemented the initialization of kho scratch again
v6:
- reimplemented the initialization of kho scratch
v5:
- rebased
v4:
- added a new commit to fix deferred init of kho scratch
- switched to ulong when refering to pfn
v3:
- changed commit msg
- don't invoke early_pfn_to_nid if CONFIG_DEFERRED_STRUCT_PAGE_INIT=n
v2:
- updated a comment

I took Evangelos's test code:
https://git.infradead.org/?p=users/vpetrog/linux.git;a=shortlog;h=refs/heads/kho-deferred-struct-page-init
and then modified it to this monster test that does 2 allocations:
at core_initcall (early) and at module_init (late). Then kexec, then
2 more allocations at these points, then restore the original 2, then
kexec, then restore the other 2. Basically I test preservation of early
and late allocation both on cold and on warm boot.
Tested it both with and without DEFERRED.

This patch probably doesn't apply onto anything currently.
It's based on mm-new with
"memblock: move reserve_bootmem_range() to memblock.c and make it static"
cherrypicked from rppt/memblock.

Evangelos Petrongonas (1):
  kho: make preserved pages compatible with deferred struct page init

Michal Clapinski (1):
  kho: fix deferred initialization of scratch areas

 include/linux/memblock.h           |  7 ++--
 kernel/liveupdate/Kconfig          |  2 --
 kernel/liveupdate/kexec_handover.c | 52 +++++++++++++++---------------
 mm/memblock.c                      | 41 +++++++++++------------
 mm/mm_init.c                       | 27 +++++++++++-----
 5 files changed, 69 insertions(+), 60 deletions(-)

-- 
2.54.0.rc1.555.g9c883467ad-goog

^ permalink raw reply

* Re: [PATCH v3 3/4] arch, mm: consolidate empty_zero_page
From: Mike Rapoport @ 2026-04-16 11:02 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Yoshinori Sato, Rich Felker, John Paul Adrian Glaubitz,
	Andrew Morton, David Hildenbrand, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, linux-kernel, linux-sh,
	linux-mm
In-Reply-To: <20260416100221-57063053-1c9e-4450-8b0c-d9783657fa47@linutronix.de>

On Thu, Apr 16, 2026 at 10:10:06AM +0200, Thomas Weißschuh wrote:
> Hi Mike,
> 
> On Wed, Feb 11, 2026 at 12:31:40PM +0200, Mike Rapoport wrote:
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > 
> > Reduce 22 declarations of empty_zero_page to 3 and 23 declarations of
> > ZERO_PAGE() to 4.
> > 
> > Every architecture defines empty_zero_page that way or another, but for the
> > most of them it is always a page aligned page in BSS and most definitions
> > of ZERO_PAGE do virt_to_page(empty_zero_page).
> > 
> > Move Linus vetted x86 definition of empty_zero_page and ZERO_PAGE() to the
> > core MM and drop these definitions in architectures that do not implement
> > colored zero page (MIPS and s390).
> > 
> > ZERO_PAGE() remains a macro because turning it to a wrapper for a static
> > inline causes severe pain in header dependencies.
> > 
> > For the most part the change is mechanical, with these being noteworthy:
> > 
> > * alpha: aliased empty_zero_page with ZERO_PGE that was also used for boot
> >   parameters. Switching to a generic empty_zero_page removes the aliasing
> >   and keeps ZERO_PGE for boot parameters only
> > * arm64: uses __pa_symbol() in ZERO_PAGE() so that definition of
> >   ZERO_PAGE() is kept intact.
> > * m68k/parisc/um: allocated empty_zero_page from memblock,
> >   although they do not support zero page coloring and having it in BSS
> >   will work fine.
> > * sparc64 can have empty_zero_page in BSS rather allocate it, but it
> >   can't use virt_to_page() for BSS. Keep it's definition of ZERO_PAGE()
> >   but instead of allocating it, make mem_map_zero point to
> >   empty_zero_page.
> > * sh: used empty_zero_page for boot parameters at the very early boot.
> >   Rename the parameters page to boot_params_page and let sh use the generic
> >   empty_zero_page.
> 
> With this in mainline as commit 6215d9f4470f ("arch, mm: consolidate
> empty_zero_page") booting sh on QEMU is now broken.
> The machine hangs before any output.

Hmm, looks like sh does not like boot_param_page declared as unsigned char *
This fixes the issue for me:

diff --git a/arch/sh/include/asm/setup.h b/arch/sh/include/asm/setup.h
index 63c9efc06348..b7c4469cb61e 100644
--- a/arch/sh/include/asm/setup.h
+++ b/arch/sh/include/asm/setup.h
@@ -3,12 +3,13 @@
 #define _SH_SETUP_H
 
 #include <uapi/asm/setup.h>
+#include <asm/page.h>
 
 /*
  * This is set up by the setup-routine at boot-time
  */
-extern unsigned char *boot_params_page;
-#define PARAM boot_params_page
+extern unsigned long boot_params_page[PAGE_SIZE / sizeof(unsigned long)];
+#define PARAM ((unsigned char *)boot_params_page)
 
 #define MOUNT_ROOT_RDONLY (*(unsigned long *) (PARAM+0x000))
 #define RAMDISK_FLAGS (*(unsigned long *) (PARAM+0x004))
 
> Reproducer:
> ./tools/testing/kunit/kunit.py run --arch sh --cross_compile sh4-linux- --raw_output=all example
> 
> > * hexagon: had an amusing comment about empty_zero_page
> > 
> > 	/* A handy thing to have if one has the RAM. Declared in head.S */
> > 
> >   that unfortunately had to go :)
> 
> (...)
> 
> 
> Thomas

-- 
Sincerely yours,
Mike.


^ permalink raw reply related

* Re: [RFC PATCH v2] mm/fake-numa: fix under-allocation detection in uniform split
From: Sang-Heon Jeon @ 2026-04-16 10:36 UTC (permalink / raw)
  To: akpm, rppt, djbw, mingo; +Cc: linux-mm, Donghyeon Lee, Munhui Chae
In-Reply-To: <20260416102558.575210-1-ekffu200098@gmail.com>

Hello,

On Thu, Apr 16, 2026 at 7:26 PM Sang-Heon Jeon <ekffu200098@gmail.com> wrote:
>
> When split NUMA node uniformly, split_nodes_size_interleave_uniform()
> returns the next absolute node ID, not the number of nodes created.
>
> The existing under-allocation detection logic compares next absolute node
> ID (ret) and request count (n), which only works when nid starts at 0.
>
> For example, on a system with 2 physical NUMA nodes (node 0: 2GB, node
> 1: 128MB) and numa=fake=8U, 8 fake nodes are successfully created from
> node 0 and split_nodes_size_interleave_uniform() returns 8. For node 1,
> fake node nid starts at 8, but only 4 fake nodes are created due to
> current FAKE_NODE_MIN_SIZE being 32MB, and
> split_nodes_size_interleave_uniform() returns 12. By existing
> under-allocation detection logic, "ret < n" (12 < 8) is false, so the
> under-allocation will not be detected.
>
> Fix under-allocation detection logic to compare the number of actually
> created nodes (ret - nid) against the request count (n).
>
> Also, fix the outdated comment to match the actual return value.
>
> Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com>
> Reported-by: Donghyeon Lee <asd142513@gmail.com>
> Reported-by: Munhui Chae <mochae@student.42seoul.kr>
> Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") # 4.19
> ---
>  mm/numa_emulation.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
> index 703c8fa05048..c1d0a76aef64 100644
> --- a/mm/numa_emulation.c
> +++ b/mm/numa_emulation.c
> @@ -214,7 +214,7 @@ static u64 uniform_size(u64 max_addr, u64 base, u64 hole, int nr_nodes)
>   * Sets up fake nodes of `size' interleaved over physical nodes ranging from
>   * `addr' to `max_addr'.
>   *
> - * Returns zero on success or negative on error.
> + * Returns absolute node ID on success or negative on error.
>   */
>  static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei,
>                                               struct numa_meminfo *pi,
> @@ -416,7 +416,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
>                                         n, &pi.blk[0], nid);
>                         if (ret < 0)
>                                 break;
> -                       if (ret < n) {
> +                       if (ret - nid < n) {
>                                 pr_info("%s: phys: %d only got %d of %ld nodes, failing\n",
>                                                 __func__, i, ret, n);
>                                 ret = -1;
> --
> 2.43.0
>

The change log from the previous patch was accidentally omitted, so I
added it here.

---
Changes from v1 [1]
- Merge patchset into once.
- Change base from linux-next to mm-unstable

[1] https://lore.kernel.org/all/20260413154438.396031-1-ekffu200098@gmail.com/
---

Best Regards,
Sang-Heon Jeon


^ permalink raw reply

* Re: [patch 18/38] lib/tests: Replace get_cycles() with ktime_get()
From: Geert Uytterhoeven @ 2026-04-16 10:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Andrew Morton, Uladzislau Rezki, linux-mm, Arnd Bergmann,
	x86, Lu Baolu, iommu, Michael Grzeschik, netdev, linux-wireless,
	Herbert Xu, linux-crypto, Vlastimil Babka, David Woodhouse,
	Bernie Thompson, linux-fbdev, Theodore Tso, linux-ext4,
	Marco Elver, Dmitry Vyukov, kasan-dev, Andrey Ryabinin,
	Thomas Sailer, linux-hams, Jason A. Donenfeld, Richard Henderson,
	linux-alpha, Russell King, linux-arm-kernel, Catalin Marinas,
	Huacai Chen, loongarch, linux-m68k, Dinh Nguyen, Jonas Bonn,
	linux-openrisc, Helge Deller, linux-parisc, Michael Ellerman,
	linuxppc-dev, Paul Walmsley, linux-riscv, Heiko Carstens,
	linux-s390, David S. Miller, sparclinux
In-Reply-To: <20260410120318.794680738@kernel.org>

Hi Thomas,

On Fri, 10 Apr 2026 at 14:20, Thomas Gleixner <tglx@kernel.org> wrote:
> get_cycles() is the historical access to a fine grained time source, but it
> is a suboptimal choice for two reasons:
>
>    - get_cycles() is not guaranteed to be supported and functional on all
>      systems/platforms. If not supported or not functional it returns 0,
>      which makes benchmarking moot.
>
>    - get_cycles() returns the raw counter value of whatever the
>      architecture platform provides. The original x86 Time Stamp Counter
>      (TSC) was despite its name tied to the actual CPU core frequency.
>      That's not longer the case. So the counter value is only meaningful
>      when the CPU operates at the same frequency as the TSC or the value is
>      adjusted to the actual CPU frequency. Other architectures and
>      platforms provide similar disjunct counters via get_cycles(), so the
>      result is operations per BOGO-cycles, which is not really meaningful.
>
> Use ktime_get() instead which provides nanosecond timestamps with the
> granularity of the underlying hardware counter, which is not different to
> the variety of get_cycles() implementations.
>
> This provides at least understandable metrics, i.e. operations/nanoseconds,
> and is available on all platforms. As with get_cycles() the result might
> have to be put into relation with the CPU operating frequency, but that's
> not any different.
>
> This is part of a larger effort to remove get_cycles() usage from
> non-architecture code.
>
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>

Thanks for your patch!

> --- a/lib/interval_tree_test.c
> +++ b/lib/interval_tree_test.c
> @@ -65,13 +65,13 @@ static void init(void)
>  static int basic_check(void)
>  {
>         int i, j;
> -       cycles_t time1, time2, time;
> +       ktime_t time1, time2, time;
>
>         printk(KERN_ALERT "interval tree insert/remove");
>
>         init();
>
> -       time1 = get_cycles();
> +       time1 = ktime_get();
>
>         for (i = 0; i < perf_loops; i++) {
>                 for (j = 0; j < nnodes; j++)
> @@ -80,11 +80,11 @@ static int basic_check(void)
>                         interval_tree_remove(nodes + j, &root);
>         }
>
> -       time2 = get_cycles();
> +       time2 = ktime_get();
>         time = time2 - time1;
>
>         time = div_u64(time, perf_loops);
> -       printk(" -> %llu cycles\n", (unsigned long long)time);
> +       printk(" -> %llu nsecs\n", (unsigned long long)time);

While cycles_t was unsigned long or long long, ktime_t is always s64,
so "%lld", and the cast can be dropped (everywhere).

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds


^ permalink raw reply

* Re: [PATCH 0/3] mm: split the file's i_mmap tree for NUMA
From: Mateusz Guzik @ 2026-04-16 10:29 UTC (permalink / raw)
  To: Huang Shijie
  Cc: akpm, viro, brauner, linux-mm, linux-kernel, linux-arm-kernel,
	linux-fsdevel, muchun.song, osalvador, linux-trace-kernel,
	linux-perf-users, linux-parisc, nvdimm, zhongyuan, fangbaoshun,
	yingzhiwei
In-Reply-To: <ad4EvoDcAKE2Sl4+@hsj-2U-Workstation>

On Tue, Apr 14, 2026 at 11:11 AM Huang Shijie <huangsj@hygon.cn> wrote:
>
> On Mon, Apr 13, 2026 at 05:33:21PM +0200, Mateusz Guzik wrote:
> > On Mon, Apr 13, 2026 at 02:20:39PM +0800, Huang Shijie wrote:
> > >   In NUMA, there are maybe many NUMA nodes and many CPUs.
> > > For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > > In the UnixBench tests, there is a test "execl" which tests
> > > the execve system call.
> > >
> > >   When we test our server with "./Run -c 384 execl",
> > > the test result is not good enough. The i_mmap locks contended heavily on
> > > "libc.so" and "ld.so". For example, the i_mmap tree for "libc.so" can have
> > > over 6000 VMAs, all the VMAs can be in different NUMA mode.
> > > The insert/remove operations do not run quickly enough.
> > >
> > > patch 1 & patch 2 are try to hide the direct access of i_mmap.
> > > patch 3 splits the i_mmap into sibling trees, and we can get better
> > > performance with this patch set:
> > >     we can get 77% performance improvement(10 times average)
> > >
> >
> > To my reading you kept the lock as-is and only distributed the protected
> > state.
> >
> > While I don't doubt the improvement, I'm confident should you take a
> > look at the profile you are going to find this still does not scale with
> > rwsem being one of the problems (there are other global locks, some of
> > which have experimental patches for).
> IMHO, when the number of VMAs in the i_mmap is very large, only optimise the rwsem
> lock does not help too much for our NUMA case.
>
> In our NUMA server, the remote access could be the major issue.
>

I'm confused how this is not supposed to help. You moved your data to
be stored per-domain. With my proposal the lock itself will also get
that treatment.

Modulo the issue of what to do with code wanting to iterate the entire
thing, this is blatantly faster.

>
> >
> > Apart from that this does nothing to help high core systems which are
> > all one node, which imo puts another question mark on this specific
> > proposal.
> Yes, this patch set only focus on the NUMA case.
> The one-node case should use the original i_mmap.
>
> Maybe I can add a new config, CONFIG_SPILT_I_MMAP. The config is disabled
> by default, and enabled when the NUMA node is not one.
>
> >
> > Of course one may question whether a RB tree is the right choice here,
> > it may be the lock-protected cost can go way down with merely a better
> > data structure.
> >
> > Regardless of that, for actual scalability, there will be no way around
> > decentralazing locking around this and partitioning per some core count
> > (not just by numa awareness).
> >
> > Decentralizing locking is definitely possible, but I have not looked
> > into specifics of how problematic it is. Best case scenario it will
> > merely with separate locks. Worst case scenario something needs a fully
> > stabilized state for traversal, in that case another rw lock can be
> Yes.
>
> The traversal may need to hold many locks.
>

The very paragraph you partially quoted answers what to do in that
case: wrap everything with a new rwsem taken for reading when
adding/removing entries and taken for writing when iterating the
entire thing. Then the iteration sticks to one lock.

The new rw lock puts an upper ceiling on scalability of the thing, but
it is way higher than the current state.

Given the extra overhead associated with it one could consider
sticking to one centralized state by default and switching to
distributed state if there is enough contention.

> > slapped around this, creating locking order read lock -> per-subset
> > write lock -- this will suffer scalability due to the read locking, but
> > it will still scale drastically better as apart from that there will be
> > no serialization. In this setting the problematic consumer will write
> > lock the new thing to stabilize the state.
> >
> > So my non-maintainer opinion is that the patchset is not worth it as it
> > fails to address anything for significantly more common and already
> > affected setups.
> This patch set is to reduce the remote access latency for insert/remove VMA
> in NUMA.
>

And I am saying the mmap semaphore is a significant problem already on
high-core no-numa setups. Addressing scalability in that case would
sort out the problem in your setup and to a significantly higher
extent.

> >
> > Have you looked into splitting the lock?
> >
> I ever tried.
>
> But there are two disadvantages:
>   1.) The traversal may need to hold many locks which makes the
>       code very horrible.
>

I already above this is avoidable.

>   2.) Even we split the locks. Each lock protects a tree, when the tree becomes
>       big enough, the VMA insert/remove will also become slow in NUMA.
>       The reason is that the tree has VMAs in different NUMA nodes.
>

This is orthogonal to my proposal. In fact, if one is to pretend this
is never a factor with your patch, I would like to point out it will
remain not a factor if the per-numa struct gets its own lock.


^ permalink raw reply

* [RFC PATCH v2] mm/fake-numa: fix under-allocation detection in uniform split
From: Sang-Heon Jeon @ 2026-04-16 10:25 UTC (permalink / raw)
  To: akpm, rppt, djbw, mingo
  Cc: linux-mm, Sang-Heon Jeon, Donghyeon Lee, Munhui Chae

When split NUMA node uniformly, split_nodes_size_interleave_uniform()
returns the next absolute node ID, not the number of nodes created.

The existing under-allocation detection logic compares next absolute node
ID (ret) and request count (n), which only works when nid starts at 0.

For example, on a system with 2 physical NUMA nodes (node 0: 2GB, node
1: 128MB) and numa=fake=8U, 8 fake nodes are successfully created from
node 0 and split_nodes_size_interleave_uniform() returns 8. For node 1,
fake node nid starts at 8, but only 4 fake nodes are created due to
current FAKE_NODE_MIN_SIZE being 32MB, and
split_nodes_size_interleave_uniform() returns 12. By existing
under-allocation detection logic, "ret < n" (12 < 8) is false, so the
under-allocation will not be detected.

Fix under-allocation detection logic to compare the number of actually
created nodes (ret - nid) against the request count (n).

Also, fix the outdated comment to match the actual return value.

Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com>
Reported-by: Donghyeon Lee <asd142513@gmail.com>
Reported-by: Munhui Chae <mochae@student.42seoul.kr>
Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") # 4.19
---
 mm/numa_emulation.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
index 703c8fa05048..c1d0a76aef64 100644
--- a/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -214,7 +214,7 @@ static u64 uniform_size(u64 max_addr, u64 base, u64 hole, int nr_nodes)
  * Sets up fake nodes of `size' interleaved over physical nodes ranging from
  * `addr' to `max_addr'.
  *
- * Returns zero on success or negative on error.
+ * Returns absolute node ID on success or negative on error.
  */
 static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei,
 					      struct numa_meminfo *pi,
@@ -416,7 +416,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 					n, &pi.blk[0], nid);
 			if (ret < 0)
 				break;
-			if (ret < n) {
+			if (ret - nid < n) {
 				pr_info("%s: phys: %d only got %d of %ld nodes, failing\n",
 						__func__, i, ret, n);
 				ret = -1;
-- 
2.43.0

^ permalink raw reply related

* Re: [syzbot] [mm?] KASAN: use-after-free Read in copy_folio_from_iter_atomic (2)
From: syzbot @ 2026-04-16 10:23 UTC (permalink / raw)
  To: akpm, baolin.wang, hughd, linux-kernel, linux-mm, syzkaller-bugs,
	tahernady45
In-Reply-To: <69ca48ca.050a0220.183828.001a.GAE@google.com>

syzbot has found a reproducer for the following issue on:

HEAD commit:    e6efabc0afca Add linux-next specific files for 20260414
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=17908702580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=56c2b36de3316f1b
dashboard link: https://syzkaller.appspot.com/bug?extid=6cc93ec9a4035badb85f
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14c798ce580000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1142c4ce580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/e7099cbf73e4/disk-e6efabc0.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/439c402df1b9/vmlinux-e6efabc0.xz
kernel image: https://storage.googleapis.com/syzbot-assets/fc0c0175fc76/bzImage-e6efabc0.xz
mounted in repro: https://storage.googleapis.com/syzbot-assets/94fa7bad6be9/mount_0.gz
  fsck result: failed (log: https://syzkaller.appspot.com/x/fsck.log?x=17017b16580000)

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6cc93ec9a4035badb85f@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in memcpy_from_iter lib/iov_iter.c:85 [inline]
BUG: KASAN: use-after-free in iterate_bvec include/linux/iov_iter.h:123 [inline]
BUG: KASAN: use-after-free in iterate_and_advance2 include/linux/iov_iter.h:306 [inline]
BUG: KASAN: use-after-free in iterate_and_advance include/linux/iov_iter.h:330 [inline]
BUG: KASAN: use-after-free in __copy_from_iter lib/iov_iter.c:261 [inline]
BUG: KASAN: use-after-free in copy_folio_from_iter_atomic+0xbb5/0x1ad0 lib/iov_iter.c:491
Read of size 4096 at addr ffff888037cb3000 by task kworker/u8:7/1020

CPU: 0 UID: 0 PID: 1020 Comm: kworker/u8:7 Not tainted syzkaller #0 PREEMPT_{RT,(full)} 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026
Workqueue: loop0 loop_workfn
Call Trace:
 <TASK>
 dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
 print_address_description+0x55/0x1e0 mm/kasan/report.c:378
 print_report+0x58/0x70 mm/kasan/report.c:482
 kasan_report+0x117/0x150 mm/kasan/report.c:595
 check_region_inline mm/kasan/generic.c:-1 [inline]
 kasan_check_range+0x264/0x2c0 mm/kasan/generic.c:200
 __asan_memcpy+0x29/0x70 mm/kasan/shadow.c:105
 memcpy_from_iter lib/iov_iter.c:85 [inline]
 iterate_bvec include/linux/iov_iter.h:123 [inline]
 iterate_and_advance2 include/linux/iov_iter.h:306 [inline]
 iterate_and_advance include/linux/iov_iter.h:330 [inline]
 __copy_from_iter lib/iov_iter.c:261 [inline]
 copy_folio_from_iter_atomic+0xbb5/0x1ad0 lib/iov_iter.c:491
 generic_perform_write+0x5b1/0x8b0 mm/filemap.c:4342
 shmem_file_write_iter+0xfb/0x120 mm/shmem.c:3478
 lo_rw_aio+0xc80/0xf00 include/linux/percpu-rwsem.h:-1
 do_req_filebacked drivers/block/loop.c:433 [inline]
 loop_handle_cmd drivers/block/loop.c:1925 [inline]
 loop_process_work+0x637/0x11b0 drivers/block/loop.c:1960
 process_one_work kernel/workqueue.c:3308 [inline]
 process_scheduled_works+0xb68/0x1910 kernel/workqueue.c:3399
 worker_thread+0xa90/0x1040 kernel/workqueue.c:3485
 kthread+0x388/0x470 kernel/kthread.c:436
 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x38 pfn:0x37cb3
flags: 0x80000000000000(node=0|zone=1)
raw: 0080000000000000 ffffea0000d17548 ffffea0001353448 0000000000000000
raw: 0000000000000038 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as freed
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xdc0(GFP_KERNEL|__GFP_ZERO), pid 6098, tgid 6098 (syz.0.91), ts 154383709391, free_ts 154428922615
 set_page_owner include/linux/page_owner.h:32 [inline]
 post_alloc_hook+0x231/0x280 mm/page_alloc.c:1858
 prep_new_page mm/page_alloc.c:1866 [inline]
 get_page_from_freelist+0x27d6/0x2850 mm/page_alloc.c:3946
 __alloc_frozen_pages_noprof+0x18d/0x380 mm/page_alloc.c:5226
 alloc_pages_mpol+0xd1/0x380 mm/mempolicy.c:2490
 alloc_frozen_pages_noprof mm/mempolicy.c:2561 [inline]
 alloc_pages_noprof+0xd2/0x2f0 mm/mempolicy.c:2581
 lbmLogInit fs/jfs/jfs_logmgr.c:1813 [inline]
 lmLogInit+0x357/0x1a00 fs/jfs/jfs_logmgr.c:1267
 open_inline_log fs/jfs/jfs_logmgr.c:1173 [inline]
 lmLogOpen+0x4e1/0xfa0 fs/jfs/jfs_logmgr.c:1067
 jfs_mount_rw+0xee/0x670 fs/jfs/jfs_mount.c:257
 jfs_fill_super+0x754/0xd80 fs/jfs/super.c:532
 get_tree_bdev_flags+0x431/0x4f0 fs/super.c:1694
 vfs_get_tree+0x92/0x2a0 fs/super.c:1754
 fc_mount fs/namespace.c:1193 [inline]
 do_new_mount_fc fs/namespace.c:3758 [inline]
 do_new_mount+0x341/0xd30 fs/namespace.c:3834
 do_mount fs/namespace.c:4167 [inline]
 __do_sys_mount fs/namespace.c:4399 [inline]
 __se_sys_mount+0x31d/0x420 fs/namespace.c:4376
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
page last free pid 5945 tgid 5945 stack trace:
 reset_page_owner include/linux/page_owner.h:25 [inline]
 __free_pages_prepare mm/page_alloc.c:1402 [inline]
 __free_frozen_pages+0xf9b/0x10f0 mm/page_alloc.c:2943
 lbmLogShutdown fs/jfs/jfs_logmgr.c:1861 [inline]
 lmLogShutdown+0x44e/0x850 fs/jfs/jfs_logmgr.c:1681
 lmLogClose+0x28a/0x520 fs/jfs/jfs_logmgr.c:1457
 jfs_umount+0x2fb/0x3d0 fs/jfs/jfs_umount.c:124
 jfs_put_super+0x8c/0x190 fs/jfs/super.c:194
 generic_shutdown_super+0x13d/0x2d0 fs/super.c:646
 kill_block_super+0x44/0x90 fs/super.c:1725
 deactivate_locked_super+0xbc/0x130 fs/super.c:476
 cleanup_mnt+0x437/0x4d0 fs/namespace.c:1312
 task_work_run+0x1d9/0x270 kernel/task_work.c:233
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 __exit_to_user_mode_loop kernel/entry/common.c:67 [inline]
 exit_to_user_mode_loop+0xed/0x480 kernel/entry/common.c:98
 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline]
 syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:238 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:328 [inline]
 do_syscall_64+0x33e/0xf80 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Memory state around the buggy address:
 ffff888037cb2f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff888037cb2f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888037cb3000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                   ^
 ffff888037cb3080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff888037cb3100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================


---
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.


^ permalink raw reply

* Re: [RFC PATCH 1/2] mm/fake-numa: fix under-allocation detection logic in uniform split
From: Sang-Heon Jeon @ 2026-04-16 10:18 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: akpm, djbw, mingo, linux-mm, Donghyeon Lee, Munhui Chae
In-Reply-To: <ad40-uVPKtaP6wuw@kernel.org>

Hello, Mike

On Tue, Apr 14, 2026 at 9:37 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi,
>
> On Tue, Apr 14, 2026 at 12:44:37AM +0900, Sang-Heon Jeon wrote:
> > When split NUMA node uniformly, split_nodes_size_interleave_uniform()
> > returns the next absolute node ID, not the number of nodes created.
> >
> > The previous under-allocation detection logic compares next absolute node
>
> I'd replace "previous" with "existing"

That's good :) I'll replace it.

> > ID (ret) and request count (n), which only works when nid starts at 0.
> >
> > Fix under-allocation detection logic to compare the number of actually
> > created nodes (ret - nid) against the request count (n).
>
> It would be nice to have an example of memory configuration and
> numa=fake=nU that demonstrates the issue.

Sure, I'll add it.

> > Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com>
> > Reported-by: Donghyeon Lee <asd142513@gmail.com>
> > Reported-by: Munhui Chae <mochae@student.42seoul.kr>
> > Fixes: cc9aec03e58f ("x86/numa_emulation: Introduce uniform split capability") # 4.19
> > ---
> >  mm/numa_emulation.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
> > index 703c8fa05048..e7f856c8f2a1 100644
> > --- a/mm/numa_emulation.c
> > +++ b/mm/numa_emulation.c
> > @@ -416,7 +416,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
> >                                       n, &pi.blk[0], nid);
> >                       if (ret < 0)
> >                               break;
> > -                     if (ret < n) {
> > +                     if (ret - nid < n) {
> >                               pr_info("%s: phys: %d only got %d of %ld nodes, failing\n",
> >                                               __func__, i, ret, n);
> >                               ret = -1;
> > --
> > 2.43.0
> >
>
> --
> Sincerely yours,
> Mike.

Thanks for your kind review.

Best Regards,
Sang-Heon Jeon


^ permalink raw reply

* Re: [RFC PATCH 2/2] mm/fake-numa: fix outdated return value comment
From: Sang-Heon Jeon @ 2026-04-16 10:16 UTC (permalink / raw)
  To: Mike Rapoport; +Cc: akpm, djbw, mingo, linux-mm
In-Reply-To: <ad41JQ_MjuPNhC3M@kernel.org>

Hello,

On Tue, Apr 14, 2026 at 9:38 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi,
>
> On Tue, Apr 14, 2026 at 12:44:38AM +0900, Sang-Heon Jeon wrote:
> > split_nodes_size_interleave_uniform() returns absolute node ID when
> > succeed, not zero.
> >
> > Fix the outdated comment to match the actual return value.
>
> This should be merged in the previous patch.

I got it. I'll include it in the v2 patch.

> > Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com>
> > ---
> >  mm/numa_emulation.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
> > index e7f856c8f2a1..c1d0a76aef64 100644
> > --- a/mm/numa_emulation.c
> > +++ b/mm/numa_emulation.c
> > @@ -214,7 +214,7 @@ static u64 uniform_size(u64 max_addr, u64 base, u64 hole, int nr_nodes)
> >   * Sets up fake nodes of `size' interleaved over physical nodes ranging from
> >   * `addr' to `max_addr'.
> >   *
> > - * Returns zero on success or negative on error.
> > + * Returns absolute node ID on success or negative on error.
> >   */
> >  static int __init split_nodes_size_interleave_uniform(struct numa_meminfo *ei,
> >                                             struct numa_meminfo *pi,
> > --
> > 2.43.0
> >
>
> --
> Sincerely yours,
> Mike.

Best Regards,
Sang-Heon Jeon


^ permalink raw reply

* Re: [PATCH] MAINTAINERS: Add page cache reviewer
From: Lorenzo Stoakes @ 2026-04-16 10:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andrew Morton, Matthew Wilcox, linux-mm
In-Reply-To: <20260415174039.13016-2-jack@suse.cz>

On Wed, Apr 15, 2026 at 07:40:40PM +0200, Jan Kara wrote:
> Add myself as a page cache reviewer since I tend to review changes in
> these areas anyway.
>
> Signed-off-by: Jan Kara <jack@suse.cz>

FWIW:

Acked-by: Lorenzo Stoakes <ljs@kernel.org>

> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d1cc0e12fe1f..2af750f5f06d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -19961,6 +19961,7 @@ F:	kernel/padata.c
>
>  PAGE CACHE
>  M:	Matthew Wilcox (Oracle) <willy@infradead.org>
> +R:	Jan Kara <jack@suse.cz>
>  L:	linux-fsdevel@vger.kernel.org
>  S:	Supported
>  T:	git git://git.infradead.org/users/willy/pagecache.git
> --
> 2.51.0
>
>
>


^ permalink raw reply

* Re: [LSF/MM/BPF TOPIC] Filesystem inode reclaim
From: Jan Kara @ 2026-04-16 10:06 UTC (permalink / raw)
  To: Shakeel Butt; +Cc: Jan Kara, linux-fsdevel, linux-mm, Matthew Wilcox, lsf-pc
In-Reply-To: <ad_Eo7iQVT-HUkx1@linux.dev>

On Wed 15-04-26 10:45:11, Shakeel Butt wrote:
> On Tue, Apr 14, 2026 at 11:15:48AM +0200, Jan Kara wrote:
> > > > I have been mulling over possible solutions since I don't think each
> > > > filesystem should be inventing a complex inode lifetime management scheme
> > > > as XFS has invented to solve these issues. Here's what I think we could do:
> > > > 
> > > > 1) Filesystems will be required to mark inodes that have non-trivial
> > > > cleanup work to do on reclaim with an inode flag I_RECLAIM_HARD (or
> > > > whatever :)). Usually I expect this to happen on first inode modification
> > > > or so. This will require some per-fs work but it shouldn't be that
> > > > difficult and filesystems can be adapted one-by-one as they decide to
> > > > address these warnings from reclaim.
> > > > 
> > > > 2) Inodes without I_RECLAIM_HARD will be reclaimed as usual directly from
> > > > kswapd / direct reclaim. I'm keeping this variant of inode reclaim for
> > > > performance reasons. I expect this to be a significant portion of inodes
> > > > on average and in particular for some workloads which scan a lot of inodes
> > > > (find through the whole fs or similar) the efficiency of inode reclaim is
> > > > one of the determining factors for their performance.
> > > > 
> > > > 3) Inodes with I_RECLAIM_HARD will be moved by the shrinker to a separate
> > > > per-sb list s_hard_reclaim_inodes and we'll queue work (per-sb work struct)
> > > > to process them.
> > > 
> > > This async worker is an interesting idea. I have been brain-storming for similar
> > > problems and I was going towards more kswapds or async/background reclaimers and
> > > such reclaimers can do more intensive cleanup work. Basically aim to avoid
> > > direct reclaimers as much as possible.
> > 
> > So similarly as we eventually moved direct page writeback from kswapd
> > reclaim, I think it makes sense to remove difficult inode reclaim from
> > kswapd as well. In particular because I think such separation makes it
> > clearer that while you do complex inode reclaim and allocate memory from
> > there, there's still kswapd that can free some memory for you to make
> > forward progress. And you better need to be sure that there's enough "easy
> > to free" memory to allow for forward progress of difficult reclaim.
> 
> Another important point that we need memory guarantee for forward progress of
> the difficult reclaim.

Yes, although I don't expect we can get it in a direct way (we have only
very vague idea how much memory is needed for reclaiming such inodes) but
just by making sure the amount of hard to reclaim inodes cannot grow too much.

> > > > 4) The work will walk s_hard_reclaim_inodes list and call evict() for each
> > > > inode, doing the hard work.
> > > > 
> > > > This way, kswapd / direct reclaim doesn't wait for hard to reclaim inodes
> > > > and they can work on freeing memory needed for freeing of hard to reclaim
> > > > inodes. So warnings about GFP_NOFAIL allocations aren't only papered over,
> > > > they should really be addressed.
> > > > 
> > > > One possible concern is that s_hard_reclaim_inodes list could grow out of
> > > > control for some workloads (in particular because there could be multiple
> > > > CPUs generating hard to reclaim inodes while the cleanup would be
> > > > single-threaded).
> > > 
> > > Why single-threaded? What will be the issue to have multiple such workers
> > > doing independent cleanups? Also these workers will need memory
> > > guarantees as well (something like PF_MEMALLOC) to not cause their
> > > allocations stuck in reclaim.
> > 
> > Well, single-threaded isn't a requirement but in the beginning I plan to do
> > it like that for simplicity similarly as currently there's only one flush
> > work doing writeback (although we are just discussing moving to more for
> > that). Also the inode cleanup will contend on fs-wide resources such as
> > journal so although some scaling can bring you benefits it will be
> > difficult to scale beyond certain limits (again heavily fs dependent).
> 
> Difficult reclaim uses fs-wide resources (and locks) and thus we can not
> depend on it to be effective under extreme memory pressure, right?

Correct.

> Or do we want it to be reliable under extreme memory pressure where we
> will need to provide memory and cpu guarantees to it?

At least I don't have that expectation :)

> One more question, I assume it is fs-dependent but is it possible to avoid
> allocations (and thus reclaim) under fs-wide locks? One challenge/issue we at
> Meta are seeing is (btrfs) lock holders getting stuck in reclaim causing
> isolation issues.

I don't think it is practically feasible. Often before you acquire locks
and start working, you don't know how much memory you'll need. For simple
operations you can go with worst case estimates and preallocation before
acquiring locks (like we do e.g. with radix tree manipulations) but for
complex mutations of data structures involving journalling etc. it isn't
really practical anymore - too much code to execute, too many possibilities
to consider, too many interactions with other parts of the system.

I understand the priority inversion issues that are arising from this for
memcg reclaim. But I think the "measure now and punish later" model that is
used e.g. for dirty page throttling or blk-iocost throttling of metadata IO
is an approach which has much higher chances of success than trying to move
the allocations out of locks.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply

* Re: [RFC] making nested spin_trylock() work on UP?
From: Vlastimil Babka (SUSE) @ 2026-04-16 10:05 UTC (permalink / raw)
  To: Harry Yoo (Oracle), Matthew Wilcox
  Cc: Vlastimil Babka, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Sebastian Andrzej Siewior, LKML, linux-mm@kvack.org,
	Linus Torvalds, Waiman Long, Mel Gorman, Steven Rostedt,
	Alexei Starovoitov, Hao Li, Andrew Morton, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan,
	Christoph Lameter, David Rientjes, Roman Gushchin
In-Reply-To: <ad_cqe51pvr1WaDg@hyeyoo>

On 4/15/26 20:44, Harry Yoo (Oracle) wrote:
> [+Cc Alexei for _nolock() APIs]
> [+Cc SLAB ALLOCATOR and PAGE ALLOCATOR folks]
> 
> I was testing kmalloc_nolock() on UP and I think
> I'm dealt with a similar issue...
> 
> On Sat, Feb 14, 2026 at 06:28:43AM +0000, Matthew Wilcox wrote:
>> On Fri, Feb 13, 2026 at 12:57:43PM +0100, Vlastimil Babka wrote:
>> > The page allocator has been using a locking scheme for its percpu page
>> > caches (pcp) for years now, based on spin_trylock() with no _irqsave() part.
>> > The point is that if we interrupt the locked section, we fail the trylock
>> > and just fallback to something that's more expensive, but it's rare so we
>> > don't need to pay the irqsave cost all the time in the fastpaths.
>> > 
>> > It's similar to but not exactly local_trylock_t (which is also newer anyway)
>> > because in some cases we do lock the pcp of a non-local cpu to flush it, in
>> > a way that's cheaper than IPI or queue_work_on().
>> > 
>> > The complication of this scheme has been UP non-debug spinlock
>> > implementation which assumes spin_trylock() can't fail on UP and has no
>> > state to track it. It just doesn't anticipate this usage scenario.
> 
> This is not the only scenario that doesn't work.
> 
> I was testing "calling {kmalloc,kfree}_nolock() in an NMI handler
> when the CPU is calling kmalloc() & kfree()" [1] scenario.
> 
> Weirdly it's broken (dmesg at the end of the email) on UP since v6.18,
> where {kmalloc,kfree}_nolock() APIs were introduced.
> 
> [1] https://lore.kernel.org/linux-mm/20260406090907.11710-3-harry@kernel.org
> 
>> > So to
>> > work around that we disable IRQs on UP, complicating the implementation.
>> > Also recently we found years old bug in the implementation - see
>> > 038a102535eb ("mm/page_alloc: prevent pcp corruption with SMP=n").
> 
> In the case mentioned above, disabling IRQs doesn't work as the handler
> can be called in an NMI context.

IIRC for the BPF usecases of kmalloc_nolock() think there could be also some
kprobe context somewhere in the locked section.

> {kmalloc,kfree}_nolock()->spin_trylock_irqsave() can succeed on UP
> when the CPU already acquired the spinlock w/ IRQs disabled.
> 
>> > So my question is if we could have spinlock implementation supporting this
>> > nested spin_trylock() usage, or if the UP optimization is still considered
>> > too important to lose it. I was thinking:
>> > 
>> > - remove the UP implementation completely - would it increase the overhead
>> > on SMP=n systems too much and do we still care?
>> > 
>> > - make the non-debug implementation a bit like the debug one so we do have
>> > the 'locked' state (see include/linux/spinlock_up.h and lock->slock). This
>> > also adds some overhead but not as much as the full SMP implementation?
>> 
>> What if we use an atomic_t on UP to simulate there being a spinlock,
>> but only for pcp?  Your demo shows pcp_spin_trylock() continuing to
>> exist, so how about doing something like:
>> 
>> #ifdef CONFIG_SMP
>> #define pcp_spin_trylock(ptr)						\
>> ({									\
>> 	struct per_cpu_pages *__ret;					\
>> 	__ret = pcpu_spin_trylock(struct per_cpu_pages, lock, ptr);	\
>> 	__ret;								\
>> })
>> #else
>> static atomic_t pcp_UP_lock = ATOMIC_INIT(0);
>> #define pcp_spin_trylock(ptr)						\
>> ({									\
>> 	struct per_cpu_pages *__ret = NULL;				\
>> 	if (atomic_try_cmpxchg(&pcp_UP_lock, 0, 1))			\
>> 		__ret = (void *)&pcp_UP_lock;				\
>> 	__ret;								\
>> });
>> #endif
>>
>> (obviously you need pcp_spin_lock/pcp_spin_unlock also defined)
>> 
>> That only costs us 4 extra bytes on UP, rather than 4 bytes per spinlock.
>> And some people still use routers with tiny amounts of memory and a
>> single CPU, or retrocomputers with single CPUs.
> 
> I think we need a special spinlock type that wraps something like this
> and use them when spinlocks can be trylock'd in an unknown context:
> pcp lock, zone lock, per-node partial slab list lock,
> per-node barn lock, etc.

Soudns like a lot of hassle for a niche config (SMP=n) where nobody would
use e.g. bpf tracing anyway. We already have this in kmalloc_nolock():

        /*
         * See the comment for the same check in
         * alloc_frozen_pages_nolock_noprof()
         */
        if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
                return NULL;

It would be trivial to extend this to !SMP. However it wouldn't cover the
kprobe context. Any idea Alexei?

> dmesg here, HEAD is a commit that adds the test case, on top of
> commit af92793e52c3a ("slab: Introduce kmalloc_nolock() and
> kfree_nolock()."):
>> 
>> [    3.658916] ------------[ cut here ]------------
>> [    3.659492] perf: interrupt took too long (5015 > 5005), lowering kernel.perf_event_max_sample_rate to 39000
>> [    3.660800] kernel BUG at mm/slub.c:4382!
> 
> This is BUG_ON(new.frozen) in freeze_slab(), which implies that
> somebody else has taken it off list and froze it already (which should
> have been prevented by the spinlock)
> 
>> [    3.661674] Oops: invalid opcode: 0000 [#1] NOPTI
>> [    3.662427] CPU: 0 UID: 0 PID: 256 Comm: kunit_try_catch Tainted: G            E    N  6.17.0-rc3+ #24 PREEMPTLAZY
>> [    3.663270] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
>> [    3.663658] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
>> [    3.664571] RIP: 0010:___slab_alloc (mm/slub.c:4382 (discriminator 1) mm/slub.c:4599 (discriminator 1)) 
>> [ 3.664949] Code: 4c 24 78 e8 32 cc ff ff 84 c0 0f 85 09 fa ff ff 49 8b 4c 24 28 4d 8b 6c 24 20 48 89 c8 48 89 4c 24 78 48 c1 e8 18 84 c0 79 b3 <0f> 0b 41 8b 46 10 a9 87 04 00 00 74 a1 a8 80 75 24 49 89 dd e9 09
> 



^ permalink raw reply

* Re: [PATCH] mm/memfd_luo: report error when restoring a folio fails mid-loop
From: Pratyush Yadav @ 2026-04-16  9:44 UTC (permalink / raw)
  To: David Carlier
  Cc: Pasha Tatashin, Mike Rapoport, Pratyush Yadav, Andrew Morton,
	Chenghao Duan, linux-mm, linux-kernel
In-Reply-To: <20260415052300.362539-1-devnexen@gmail.com>

On Wed, Apr 15 2026, David Carlier wrote:

> memfd_luo_retrieve_folios() initialises err to -EIO, but the per-iteration
> calls to mem_cgroup_charge(), shmem_add_to_page_cache() and
> shmem_inode_acct_blocks() reuse and overwrite err.  Once any iteration
> completes successfully, err becomes zero.
>
> If a later iteration's kho_restore_folio() returns NULL, the failure path
> jumps to put_folios without resetting err, so the function returns 0.
> The caller memfd_luo_retrieve() then takes the success path, sets
> args->file and reports the restore as successful, leaving userspace with
> a partially populated memfd and no indication that anything went wrong.
>
> Set err to -EIO in the kho_restore_folio() failure branch so the error
> is propagated to the caller.
>
> Signed-off-by: David Carlier <devnexen@gmail.com>

Reviewed-by: Pratyush Yadav <pratyush@kernel.org>

Please add these when applying:

Cc: stable@vger.kernel.org
Fixes: b3749f174d68 ("mm: memfd_luo: allow preserving memfd")

[...]

-- 
Regards,
Pratyush Yadav


^ permalink raw reply

* Re: [PATCH v7 2/3] kho: fix deferred init of kho scratch
From: Pratyush Yadav @ 2026-04-16  9:44 UTC (permalink / raw)
  To: Zi Yan
  Cc: Pratyush Yadav, Mike Rapoport, Michał Cłapiński,
	Evangelos Petrongonas, Pasha Tatashin, Alexander Graf,
	Samiullah Khawaja, kexec, linux-mm, linux-kernel, Andrew Morton
In-Reply-To: <DFD0C0F5-F23B-44CA-B2C7-9D03F2397DCF@nvidia.com>

On Tue, Apr 07 2026, Zi Yan wrote:

> On 7 Apr 2026, at 8:21, Pratyush Yadav wrote:
[...]
>> Hmm, I don't like that how complex this is. It adds another layer of
>> complexity to the initialization of the migratetype, and you have to dig
>> through all the possible call sites to be sure that we catch all the
>> cases. Makes it harder to wrap your head around it. Plus, makes it more
>> likely for bugs to slip through if later refactors change some page init
>> flow.
>>
>> Is the cost to look through the scratch array really that bad? I would
>> suspect we'd have at most 4-6 per-node scratches, and one global one
>> lowmem. So I'd expect around 10 items to look through, and it will
>> probably be in the cache anyway.
>
> It is not only about the cost of going through the scratch array, but also
> about adding kho code to the generic init_pageblock_migratetype().
> This means all callers of init_pageblock_migratetype(), no matter if
> they are involved with kho or not, need to do the check. It is a good
> practice to do the check when necessary, otherwise, this catch-all check
> might hide some bugs in the future.

We can move the check to memmap init, so it will still be done for most
pageblocks I reckon. The only other callers I see are zone device and
CMA.

Anyway, I get your point and am fine with moving it out to memmap init
functions.

[...]

-- 
Regards,
Pratyush Yadav


^ permalink raw reply

* Re: [PATCH v7 2/3] kho: fix deferred init of kho scratch
From: Pratyush Yadav @ 2026-04-16  9:41 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pratyush Yadav, Michał Cłapiński, Zi Yan,
	Evangelos Petrongonas, Pasha Tatashin, Alexander Graf,
	Samiullah Khawaja, kexec, linux-mm, linux-kernel, Andrew Morton
In-Reply-To: <adfqkOWVFgeAkItF@kernel.org>

On Thu, Apr 09 2026, Mike Rapoport wrote:

> On Tue, Apr 07, 2026 at 12:21:56PM +0000, Pratyush Yadav wrote:
>> On Sun, Mar 22 2026, Mike Rapoport wrote:
>> 
>> > On Thu, Mar 19, 2026 at 07:17:48PM +0100, Michał Cłapiński wrote:
[...]
>> Can we just get rid of this entirely? And just update
>> memmap_init_zone_range() to also look for scratch and set the
>> migratetype correctly from the get go? That's more consistent IMO. The
>> two main places that initialize the struct page,
>> memmap_init_zone_range() and deferred_init_memmap_chunk(), check for
>> scratch and set the migratetype correctly.
>
> We could. E.g. let memmap_init() check the memblock flags and pass the
> migratetype to memmap_init_zone_range().
>
> I wanted to avoid as much KHO code in mm/ as possible, but if it is must
> have in deferred_init_memmap_chunk() we could add some to memmap_init() as
> well.

KHO fundamentally alters mm init, so I think it would be hard to keep it
to a neat corner unfortunately... We have been somewhat successful so
far, but that has come at the cost of performance. Once we start trying
to improve performance, I reckon more and more of it will spill into mm
init.

>  
>> > @@ -2061,12 +2060,15 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
>> >  		spfn = max(spfn, start_pfn);
>> >  		epfn = min(epfn, end_pfn);
>> >  
>> > +		if (memblock_is_kho_scratch_memory(PFN_PHYS(spfn)))
>> > +			mt = MIGRATE_CMA;
>> 
>> Would it make sense for for_each_free_mem_range() to also return the
>> flags for the region? Then you won't have to do another search. It adds
>> yet another parameter to it so no strong opinion, but something to
>> consider.
>
> I hesitated a lot about this.
> Have you seen memblock::__next_mem_range() signature? ;-)

Fair enough :-O

>
> I decided to start with something correct, but slowish and leave the churn
> and speed for later.

-- 
Regards,
Pratyush Yadav


^ permalink raw reply

* Re: [PATCH v3 2/7] mm/memfd_luo: optimize shmem_recalc_inode calls in retrieve path
From: Pratyush Yadav @ 2026-04-16  9:35 UTC (permalink / raw)
  To: Chenghao Duan
  Cc: Pratyush Yadav, pasha.tatashin, rppt, akpm, linux-kernel,
	linux-mm, jianghaoran
In-Reply-To: <20260410014537.GA28528@chenghao-pc>

On Fri, Apr 10 2026, Chenghao Duan wrote:

> On Thu, Apr 02, 2026 at 11:02:04AM +0000, Pratyush Yadav wrote:
>> On Thu, Mar 26 2026, Chenghao Duan wrote:
>> 
>> > Move shmem_recalc_inode() out of the loop in memfd_luo_retrieve_folios()
>> > to improve performance when restoring large memfds.
>> >
>> > Currently, shmem_recalc_inode() is called for each folio during restore,
>> > which is O(n) expensive operations. This patch collects the number of
>> > successfully added folios and calls shmem_recalc_inode() once after the
>> > loop completes, reducing complexity to O(1).
>> >
>> > Additionally, fix the error path to also call shmem_recalc_inode() for
>> > the folios that were successfully added before the error occurred.
>> >
>> > Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> > Signed-off-by: Chenghao Duan <duanchenghao@kylinos.cn>
>> 
>> Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
>> 
>> BTW, can we also do the same for shmem_inode_acct_blocks() it the call
>> to it can also be aggregated in the same way? You don't have to do it in
>> this series, but possibly as a follow up.
>> 
> Thanks for pointing that out.
>
> We can move shmem_recalc_inode() outside the loop for aggregation, as it
> performs a single inode state update for the folios.
>
> In contrast, shmem_inode_acct_blocks() is a validation step for resource
> reservation, quota, and block accounting. Since it can fail on its own,
> aggregating it outside the loop would complicate the error handling and
> rollback paths, especially when only a subset of folios have been restored.

Okay, fair enough. Thanks for looking into it.

-- 
Regards,
Pratyush Yadav


^ permalink raw reply

* Re: [PATCH v5 09/14] mm/mglru: use the common routine for dirty/writeback reactivation
From: Barry Song @ 2026-04-16  9:18 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
	Baolin Wang
In-Reply-To: <20260413-mglru-reclaim-v5-9-8eaeacbddc44@tencent.com>

On Mon, Apr 13, 2026 at 12:48 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently MGLRU will move the dirty writeback folios to the second
> oldest gen instead of reactivate them like the classical LRU. This
> might help to reduce the LRU contention as it skipped the isolation.
> But as a result we will see these folios at the LRU tail more frequently
> leading to inefficient reclaim.
>
> Besides, the dirty / writeback check after isolation in
> shrink_folio_list is more accurate and covers more cases. So instead,
> just drop the special handling for dirty writeback, use the common
> routine and re-activate it like the classical LRU.
>
> This should in theory improve the scan efficiency. These folios will be
> rotated back to LRU tail once writeback is done so there is no risk of
> hotness inversion. And now each reclaim loop will have a higher
> success rate. This also prepares for unifying the writeback and
> throttling mechanism with classical LRU, we keep these folios far from
> tail so detecting the tail batch will have a similar pattern with
> classical LRU.
>
> The micro optimization that avoids LRU contention by skipping the
> isolation is gone, which should be fine. Compared to IO and writeback
> cost, the isolation overhead is trivial.
>
> And using the common routine also keeps the folio's referenced bits
> (tier bits), which could improve metrics in the long term. Also no
> more need to clean reclaim bit as the common routine will make use
> of it.
>
> Note the common routine updates a few throttling and writeback counters,
> which are not used, and never have been for the MGLRU case. We will
> start making use of these in later commits.
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>

I like the move to the common path, activating the folio and
relying on folio rotation afterwards. The original MGLRU code
seems over-designed and may hurt performance. It seems to
overthink things.

Reviewed-by: Barry Song <baohua@kernel.org>


^ permalink raw reply

* Re: [RFC 3/3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
From: Christian Brauner @ 2026-04-16  9:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: akpm, david, mhocko, linux-mm, linux-kernel, surenb, timmurray
In-Reply-To: <20260413223948.556351-4-minchan@kernel.org>

On Mon, Apr 13, 2026 at 03:39:48PM -0700, Minchan Kim wrote:
> Currently, process_mrelease() requires userspace to send a SIGKILL signal
> prior to invocation. This separation introduces a race window where the
> victim task may receive the signal and enter the exit path before the
> reaper can invoke process_mrelease().
> 
> In this case, the victim task frees its memory via the standard, unoptimized
> exit path, bypassing the expedited clean file folio reclamation optimization
> introduced in the previous patch (which relies on the MMF_UNSTABLE flag).
> 
> This patch introduces the PROCESS_MRELEASE_REAP_KILL UAPI flag to support
> an integrated auto-kill mode. When specified, process_mrelease() directly
> injects a SIGKILL into the target task.
> 
> Crucially, this patch utilizes a dedicated signal code (KILL_MRELEASE)
> during signal injection, belonging to a new SIGKILL si_codes section.
> This special code ensures that the kernel's signal delivery path reliably
> intercepts the request and marks the target address space as unstable
> (MMF_UNSTABLE). This mechanism guarantees that the MMF_UNSTABLE flag is set
> before either the victim task or the reaper proceeds, ensuring that the
> expedited reclamation optimization is utilized regardless of scheduling
> order.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/uapi/asm-generic/siginfo.h |  6 ++++++
>  include/uapi/linux/mman.h          |  4 ++++
>  kernel/signal.c                    |  4 ++++
>  mm/oom_kill.c                      | 20 +++++++++++++++++++-
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index 5a1ca43b5fc6..0f59b791dab4 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -252,6 +252,12 @@ typedef struct siginfo {
>  #define BUS_MCEERR_AO	5
>  #define NSIGBUS		5
>  
> +/*
> + * SIGKILL si_codes
> + */
> +#define KILL_MRELEASE	1	/* sent by process_mrelease */
> +#define NSIGKILL	1
> +
>  /*
>   * SIGTRAP si_codes
>   */
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index e89d00528f2f..4266976b45ad 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -56,4 +56,8 @@ struct cachestat {
>  	__u64 nr_recently_evicted;
>  };
>  
> +/* Flags for process_mrelease */
> +#define PROCESS_MRELEASE_REAP_KILL	(1 << 0)
> +#define PROCESS_MRELEASE_VALID_FLAGS	(PROCESS_MRELEASE_REAP_KILL)
> +
>  #endif /* _UAPI_LINUX_MMAN_H */
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d65d0fe24bfb..c21b2176dc5e 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1134,6 +1134,10 @@ static int __send_signal_locked(int sig, struct kernel_siginfo *info,
>  
>  out_set:
>  	signalfd_notify(t, sig);
> +
> +	if (sig == SIGKILL && !is_si_special(info) &&
> +	    info->si_code == KILL_MRELEASE && t->mm)
> +		mm_flags_set(MMF_UNSTABLE, t->mm);
>  	sigaddset(&pending->signal, sig);
>  
>  	/* Let multiprocess signals appear after on-going forks */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5c6c95c169ee..0b5da5208707 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -20,6 +20,8 @@
>  
>  #include <linux/oom.h>
>  #include <linux/mm.h>
> +#include <uapi/linux/mman.h>
> +#include <linux/capability.h>
>  #include <linux/err.h>
>  #include <linux/gfp.h>
>  #include <linux/sched.h>
> @@ -1218,13 +1220,29 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	bool reap = false;
>  	long ret = 0;
>  
> -	if (flags)
> +	if (flags & ~PROCESS_MRELEASE_VALID_FLAGS)
>  		return -EINVAL;
>  
>  	task = pidfd_get_task(pidfd, &f_flags);
>  	if (IS_ERR(task))
>  		return PTR_ERR(task);
>  
> +	if (flags & PROCESS_MRELEASE_REAP_KILL) {
> +		struct kernel_siginfo info;
> +
> +		if (!capable(CAP_KILL)) {

Why? Just call a function that uses check_kill_permission() before
firing the signal? What's the rational for doing it this way?

Tbh, I really hate that process_mrelease() now has a kill side effect
with non-standard permission handling as well.

Seems like bad api design. Why can't you just raise the MMF_UNSTABLE bit
before the SIGKILL as that's the problem you're trying to solve.

> +			ret = -EPERM;
> +			goto put_task;
> +		}
> +		clear_siginfo(&info);
> +		info.si_signo = SIGKILL;
> +		info.si_code = KILL_MRELEASE;
> +		info.si_pid = task_tgid_vnr(current);
> +		info.si_uid = from_kuid_munged(current_user_ns(), current_uid());

This should not be open-coded like this.

> +
> +		do_send_sig_info(SIGKILL, &info, task, PIDTYPE_TGID);
> +	}
> +
>  	/*
>  	 * Make sure to choose a thread which still has a reference to mm
>  	 * during the group exit
> -- 
> 2.54.0.rc0.605.g598a273b03-goog
> 


^ permalink raw reply

* Re: [PATCH v2 1/3] vmalloc: add __GFP_SKIP_KASAN support
From: David Hildenbrand @ 2026-04-16  9:10 UTC (permalink / raw)
  To: Muhammad Usama Anjum, Arnd Bergmann, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Kees Cook,
	Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Uladzislau Rezki, linux-arch,
	linux-kernel, linux-mm, Andrey Konovalov, Marco Elver,
	Vincenzo Frascino, Peter Collingbourne, Catalin Marinas,
	Will Deacon, Ryan.Roberts
In-Reply-To: <20260324132631.482520-2-usama.anjum@arm.com>

On 3/24/26 14:26, Muhammad Usama Anjum wrote:
> For allocations that will be accessed only with match-all pointers
> (e.g., kernel stacks), setting tags is wasted work. If the caller
> already set __GFP_SKIP_KASAN, don’t skip zeroing the pages and
> don’t set KASAN_VMALLOC_PROT_NORMAL so kasan_unpoison_vmalloc()
> returns early without tagging.
> 
> Before this patch, __GFP_SKIP_KASAN wasn't being used with vmalloc
> APIs. So it wasn't being checked. Now its being checked and acted
> upon. Other KASAN modes are unchanged because __GFP_SKIP_KASAN isn't
> defined there.
> 
> This is a preparatory patch for optimizing kernel stack allocations.
> 
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
> ---
> Changes since v1:
> - Simplify skip conditions based on the fact that __GFP_SKIP_KASAN
>   is zero in non-hw-tags mode.
> - Add __GFP_SKIP_KASAN to GFP_VMALLOC_SUPPORTED list of flags
> ---
>  mm/vmalloc.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c607307c657a6..69ae205effb46 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3939,7 +3939,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  				__GFP_NOFAIL | __GFP_ZERO |\
>  				__GFP_NORETRY | __GFP_RETRY_MAYFAIL |\
>  				GFP_NOFS | GFP_NOIO | GFP_KERNEL_ACCOUNT |\
> -				GFP_USER | __GFP_NOLOCKDEP)
> +				GFP_USER | __GFP_NOLOCKDEP | __GFP_SKIP_KASAN)
>  
>  static gfp_t vmalloc_fix_flags(gfp_t flags)
>  {
> @@ -3980,6 +3980,8 @@ static gfp_t vmalloc_fix_flags(gfp_t flags)
>   *
>   * %__GFP_NOWARN can be used to suppress failure messages.
>   *
> + * %__GFP_SKIP_KASAN can be used to skip poisoning
> + *
>   * Can not be called from interrupt nor NMI contexts.
>   * Return: the address of the area or %NULL on failure
>   */
> @@ -4041,7 +4043,9 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
>  	 * kasan_unpoison_vmalloc().
>  	 */
>  	if (pgprot_val(prot) == pgprot_val(PAGE_KERNEL)) {
> -		if (kasan_hw_tags_enabled()) {
> +		bool skip_kasan = gfp_mask & __GFP_SKIP_KASAN;
> +
> +		if (kasan_hw_tags_enabled() && !skip_kasan) {

This code gets ever more ugly. :)

After I spotted the horrible ___GFP_SKIP_ZERO that shouldn't even exist,
I thought about teaching vmalloc.c to use a sub-allocator interface to
the buddy instead, where we would essentially say "leave zeroing and
KASAN to the sub-allocator": vmalloc.

Then, we'd get rid of ___GFP_SKIP_ZERO and just use __GFP_SKIP_KASAN to
decide ourselves here what to do with KASAN.

I tried to implement that, but that SW KASAN / !KASAN handling messes
with my brain. :)

In particular, the order for HW KASAN is currently:

a) Allocate pages *and map them*.

b) Zero the pages

That means that we have temporarily unzeroed pages mapped there. I don't
know if that's problematic, but it's one of the differences to SW KASAN
/ ! KASAN handling here.

-- 
Cheers,

David


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox