* [PATCH v2 10/69] mm/mm_init: Remove set_pageblock_order() call from sparse_init()
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
free_area_init() already sets pageblock_order before sparse_init() runs
for CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, so sparse_init() does not need to
call set_pageblock_order() again.
With that call removed, set_pageblock_order() is only used in mm/mm_init.c.
Make it static.
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
v1->v2:
- Move the removal of set_pageblock_order() into this patch
- Update the commit message accordingly
- Add Reviewed-by from Mike Rapoport
---
mm/internal.h | 1 -
mm/mm_init.c | 4 ++--
mm/sparse.c | 3 ---
3 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 28d179cbc451..6bd9aa37b952 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1436,7 +1436,6 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long,
unsigned long, unsigned long,
unsigned long, unsigned long);
-extern void set_pageblock_order(void);
unsigned long reclaim_pages(struct list_head *folio_list);
unsigned int reclaim_clean_pages_from_list(struct zone *zone,
struct list_head *folio_list);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 75f98abfed97..6646d4b47796 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1508,7 +1508,7 @@ static inline void setup_usemap(struct zone *zone) {}
#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
/* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
-void __init set_pageblock_order(void)
+static void __init set_pageblock_order(void)
{
unsigned int order = PAGE_BLOCK_MAX_ORDER;
@@ -1534,7 +1534,7 @@ void __init set_pageblock_order(void)
* include/linux/pageblock-flags.h for the values of pageblock_order based on
* the kernel config
*/
-void __init set_pageblock_order(void)
+static inline void __init set_pageblock_order(void)
{
}
diff --git a/mm/sparse.c b/mm/sparse.c
index 85557ef387c7..324213d8bdcb 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -343,9 +343,6 @@ void __init sparse_init(void)
pnum_begin = first_present_section_nr();
nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
- /* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
- set_pageblock_order();
-
for_each_present_section_nr(pnum_begin + 1, pnum_end) {
int nid = sparse_early_nid(__nr_to_section(pnum_end));
--
2.54.0
* [PATCH v2 09/69] mm/mm_init: Defer hugetlb reservation until after zone initialization
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
hugetlb_cma_reserve() and hugetlb_bootmem_alloc() currently run before
free_area_init(), so HugeTLB reservation happens before zone state is
initialized.
Move the reservation step after free_area_init() so the relevant zone
information is available before HugeTLB reserves memory. This is needed
for later hugetlb changes that validate boot-time HugeTLB reservations
against zone boundaries.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
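Illustration only, not part of the patch: a userspace sketch of the call
order in mm_core_init_early() once this change is applied. The stub_*
functions and sketch_* helpers are hypothetical stand-ins that only record
their names; they do not model what the real init functions do, and the
per-node sparse_vmemmap_init_nid_late() loop is collapsed into one stub.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_MAX_CALLS 8

static const char *sketch_calls[SKETCH_MAX_CALLS];
static int sketch_nr_calls;

/* Record the name of each stubbed init step in call order. */
static void sketch_record(const char *name)
{
	assert(sketch_nr_calls < SKETCH_MAX_CALLS);
	sketch_calls[sketch_nr_calls++] = name;
}

/* Stubs standing in for the real init functions; they only log their names. */
static void stub_free_area_init(void)        { sketch_record("free_area_init"); }
static void stub_hugetlb_cma_reserve(void)   { sketch_record("hugetlb_cma_reserve"); }
static void stub_hugetlb_bootmem_alloc(void) { sketch_record("hugetlb_bootmem_alloc"); }
static void stub_sparse_init(void)           { sketch_record("sparse_init"); }
static void stub_vmemmap_init_late(void)     { sketch_record("sparse_vmemmap_init_nid_late"); }
static void stub_memmap_init(void)           { sketch_record("memmap_init"); }

/* Mirrors the order of calls in the hunk below, with stubs only. */
static void sketch_mm_core_init_early(void)
{
	sketch_nr_calls = 0;
	stub_free_area_init();          /* zone state is set up first */
	stub_hugetlb_cma_reserve();     /* reservation now sees zone state */
	stub_hugetlb_bootmem_alloc();
	stub_sparse_init();
	stub_vmemmap_init_late();       /* stands in for the per-node loop */
	stub_memmap_init();
}

/* Return the position of 'name' in the recorded order, or -1 if absent. */
static int sketch_call_index(const char *name)
{
	for (int i = 0; i < sketch_nr_calls; i++)
		if (strcmp(sketch_calls[i], name) == 0)
			return i;
	return -1;
}
```

Calling sketch_mm_core_init_early() and then comparing
sketch_call_index("free_area_init") with
sketch_call_index("hugetlb_cma_reserve") shows the ordering constraint this
patch establishes.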
mm/mm_init.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c14491c2dad3..75f98abfed97 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2688,11 +2688,11 @@ void __init mm_core_init_early(void)
{
int nid;
+ free_area_init();
+
hugetlb_cma_reserve();
hugetlb_bootmem_alloc();
- free_area_init();
-
sparse_init();
for_each_node_state(nid, N_MEMORY)
sparse_vmemmap_init_nid_late(nid);
--
2.54.0
* [PATCH v2 08/69] mm/mm_init: Defer sparse_init() until after zone initialization
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
free_area_init() is responsible for initializing pgdat and zone state.
Calling sparse_init() from there mixes in later vmemmap and struct page
setup, which makes the initialization flow less clear.
Defer sparse_init(), sparse_vmemmap_init_nid_late(), and memmap_init()
until after free_area_init() completes, when zone initialization is fully
done. This keeps free_area_init() focused on zone setup and ensures that
sparse_init() runs with the relevant zone state already available.
This is also a prerequisite for later hugetlb vmemmap changes that need
zone information during early sparse vmemmap setup.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Restore the set_pageblock_order() change suggested by Mike Rapoport
- Add Mike Rapoport's Reviewed-by
---
mm/mm_init.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 12fe21c4e26c..c14491c2dad3 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1826,7 +1826,6 @@ static void __init free_area_init(void)
bool descending;
arch_zone_limits_init(max_zone_pfn);
- sparse_init();
start_pfn = PHYS_PFN(memblock_start_of_DRAM());
descending = arch_has_descending_max_zone_pfns();
@@ -1915,11 +1914,7 @@ static void __init free_area_init(void)
}
}
- for_each_node_state(nid, N_MEMORY)
- sparse_vmemmap_init_nid_late(nid);
-
calc_nr_kernel_pages();
- memmap_init();
/* disable hash distribution for systems with a single node */
fixup_hashdist();
@@ -2691,10 +2686,17 @@ void __init __weak mem_init(void)
void __init mm_core_init_early(void)
{
+ int nid;
+
hugetlb_cma_reserve();
hugetlb_bootmem_alloc();
free_area_init();
+
+ sparse_init();
+ for_each_node_state(nid, N_MEMORY)
+ sparse_vmemmap_init_nid_late(nid);
+ memmap_init();
}
/*
--
2.54.0
* [PATCH v2 07/69] mm/sparse: Move subsection_map_init() into sparse_init()
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
Subsection map initialization, done by sparse_init_subsection_map(), is
part of sparse memory initialization, but it is currently called from
free_area_init().
Move it into sparse_init() so the sparse-specific setup stays together
instead of being split across the generic free_area_init() path.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Add Acked-by from Mike Rapoport
---
mm/internal.h | 5 ++---
mm/mm_init.c | 10 ++--------
mm/sparse-vmemmap.c | 11 ++++++++++-
mm/sparse.c | 1 +
4 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..28d179cbc451 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1003,10 +1003,9 @@ static inline void sparse_init(void) {}
* mm/sparse-vmemmap.c
*/
#ifdef CONFIG_SPARSEMEM_VMEMMAP
-void sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages);
+void sparse_init_subsection_map(void);
#else
-static inline void sparse_init_subsection_map(unsigned long pfn,
- unsigned long nr_pages)
+static inline void sparse_init_subsection_map(void)
{
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 96e0f2d8c3ea..12fe21c4e26c 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1876,18 +1876,12 @@ static void __init free_area_init(void)
(u64)zone_movable_pfn[i] << PAGE_SHIFT);
}
- /*
- * Print out the early node map, and initialize the
- * subsection-map relative to active online memory ranges to
- * enable future "sub-section" extensions of the memory map.
- */
+ /* Print out the early node map. */
pr_info("Early memory node ranges\n");
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
pr_info(" node %3d: [mem %#018Lx-%#018Lx]\n", nid,
(u64)start_pfn << PAGE_SHIFT,
((u64)end_pfn << PAGE_SHIFT) - 1);
- sparse_init_subsection_map(start_pfn, end_pfn - start_pfn);
- }
/* Initialise every node */
mminit_verify_pageflags_layout();
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 112ccf9c71ca..fcf0ce5212f1 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -596,7 +596,7 @@ static void subsection_mask_set(unsigned long *map, unsigned long pfn,
bitmap_set(map, idx, end - idx + 1);
}
-void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages)
+static void __init sparse_init_subsection_map_range(unsigned long pfn, unsigned long nr_pages)
{
int end_sec_nr = pfn_to_section_nr(pfn + nr_pages - 1);
unsigned long nr, start_sec_nr = pfn_to_section_nr(pfn);
@@ -619,6 +619,15 @@ void __init sparse_init_subsection_map(unsigned long pfn, unsigned long nr_pages
}
}
+void __init sparse_init_subsection_map(void)
+{
+ int i, nid;
+ unsigned long start, end;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
+ sparse_init_subsection_map_range(start, end - start);
+}
+
#ifdef CONFIG_MEMORY_HOTPLUG
/* Mark all memory sections within the pfn range as online */
diff --git a/mm/sparse.c b/mm/sparse.c
index c92bbc3f3aa3..85557ef387c7 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -361,5 +361,6 @@ void __init sparse_init(void)
}
/* cover the last node */
sparse_init_nid(nid_begin, pnum_begin, pnum_end, map_count);
+ sparse_init_subsection_map();
vmemmap_populate_print_last();
}
--
2.54.0
* [PATCH v2 06/69] mm/sparse: Panic on memmap and usemap allocation failure
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
When vmemmap or usemap allocation fails, sparse_init_nid() currently
marks the section non-present and continues. Later boot-time code can
still walk PFNs in that section without checking for this partial setup,
which leads to invalid accesses. subsection_map_init() can also touch an
unallocated usemap.
Auditing and fixing all early PFN walkers for this case is not worth the
complexity. These allocation failures are expected to be fatal anyway,
and other memory models already treat them that way.
Make memmap and usemap allocation failures panic immediately instead of
trying to recover and crashing later in less obvious ways. This is also
consistent with how other memory model configurations handle memmap
allocation failures.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Add Acked-by from Mike Rapoport
- I refrained from adding panic() to memmap_alloc() as it wouldn't simplify
the code. However, panic() is still required in sparse_init_nid() because
the architecture-specific vmemmap_populate() bypasses memmap_alloc().
---
mm/sparse.c | 44 +++++++++-----------------------------------
1 file changed, 9 insertions(+), 35 deletions(-)
diff --git a/mm/sparse.c b/mm/sparse.c
index 16ac6df3c89f..c92bbc3f3aa3 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -239,15 +239,8 @@ struct page __init *__populate_section_memmap(unsigned long pfn,
struct dev_pagemap *pgmap)
{
unsigned long size = section_map_size();
- struct page *map;
- phys_addr_t addr = __pa(MAX_DMA_ADDRESS);
- map = memmap_alloc(size, size, addr, nid, false);
- if (!map)
- panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%pa\n",
- __func__, size, PAGE_SIZE, nid, &addr);
-
- return map;
+ return memmap_alloc(size, size, __pa(MAX_DMA_ADDRESS), nid, false);
}
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
@@ -300,17 +293,14 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
unsigned long map_count)
{
unsigned long pnum;
- struct page *map;
- struct mem_section *ms;
- if (sparse_usage_init(nid, map_count)) {
- pr_err("%s: node[%d] usemap allocation failed", __func__, nid);
- goto failed;
- }
+ if (sparse_usage_init(nid, map_count))
+ panic("Failed to allocate usemap for node %d\n", nid);
sparse_vmemmap_init_nid_early(nid);
for_each_present_section_nr(pnum_begin, pnum) {
+ struct mem_section *ms;
unsigned long pfn = section_nr_to_pfn(pnum);
if (pnum >= pnum_end)
@@ -318,34 +308,18 @@ static void __init sparse_init_nid(int nid, unsigned long pnum_begin,
ms = __nr_to_section(pnum);
if (!preinited_vmemmap_section(ms)) {
+ struct page *map;
+
map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
- nid, NULL, NULL);
- if (!map) {
- pr_err("%s: node[%d] memory map backing failed. Some memory will not be available.",
- __func__, nid);
- pnum_begin = pnum;
- sparse_usage_fini();
- goto failed;
- }
+ nid, NULL, NULL);
+ if (!map)
+ panic("Failed to allocate memmap for section %lu\n", pnum);
memmap_boot_pages_add(DIV_ROUND_UP(PAGES_PER_SECTION * sizeof(struct page),
PAGE_SIZE));
sparse_init_early_section(nid, map, pnum, 0);
}
}
sparse_usage_fini();
- return;
-failed:
- /*
- * We failed to allocate, mark all the following pnums as not present,
- * except the ones already initialized earlier.
- */
- for_each_present_section_nr(pnum_begin, pnum) {
- if (pnum >= pnum_end)
- break;
- ms = __nr_to_section(pnum);
- if (!preinited_vmemmap_section(ms))
- ms->section_mem_map = 0;
- }
}
/*
--
2.54.0
* [PATCH v2 05/69] mm/mm_init: Simplify deferred_free_pages() migratetype init
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
deferred_free_pages() open-codes two loops to initialize the pageblock
migratetype for a range of pages.
Replace them with pageblock_migratetype_init_range() to remove the
duplication and make the code clearer. Because deferred_free_pages() may
be called from atomic context, add an 'atomic' parameter so the helper
skips cond_resched() for that caller.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1->v2:
- Add Acked-by from Mike Rapoport
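Illustration only, not part of the patch: a userspace sketch of why the
strided loop in pageblock_migratetype_init_range() visits the same
pageblock starts as the old per-page loop. SKETCH_PAGEBLOCK_NR_PAGES and
the sketch_* helpers are assumptions made for the example, including the
assumption that pageblock_align() rounds a PFN up to the next pageblock
boundary.

```c
#include <assert.h>

#define SKETCH_PAGEBLOCK_NR_PAGES 512UL  /* assumed: order-9 pageblocks */

/* Round pfn up to the next pageblock boundary (assumed pageblock_align() behaviour). */
static unsigned long sketch_pageblock_align(unsigned long pfn)
{
	/* The mask trick below only works for power-of-two block sizes. */
	assert((SKETCH_PAGEBLOCK_NR_PAGES & (SKETCH_PAGEBLOCK_NR_PAGES - 1)) == 0);
	return (pfn + SKETCH_PAGEBLOCK_NR_PAGES - 1) & ~(SKETCH_PAGEBLOCK_NR_PAGES - 1);
}

/* Old shape: walk every pfn and act only on the pageblock-aligned ones. */
static unsigned long sketch_count_per_page(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long end = pfn + nr_pages, hits = 0;

	for (; pfn < end; pfn++)
		if ((pfn & (SKETCH_PAGEBLOCK_NR_PAGES - 1)) == 0)
			hits++;
	return hits;
}

/* New shape: jump straight from one pageblock start to the next. */
static unsigned long sketch_count_strided(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long end = pfn + nr_pages, hits = 0;

	for (pfn = sketch_pageblock_align(pfn); pfn < end;
	     pfn += SKETCH_PAGEBLOCK_NR_PAGES)
		hits++;
	return hits;
}
```

Both counters agree for aligned and unaligned ranges, which is the property
the conversion relies on; the strided form simply skips the PFNs that would
never match.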
---
mm/mm_init.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5a910cc5534c..96e0f2d8c3ea 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -674,15 +674,15 @@ static inline void fixup_hashdist(void)
static inline void fixup_hashdist(void) {}
#endif /* CONFIG_NUMA */
-#ifdef CONFIG_ZONE_DEVICE
+#if defined(CONFIG_ZONE_DEVICE) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
static __meminit void pageblock_migratetype_init_range(unsigned long pfn,
- unsigned long nr_pages, int migratetype)
+ unsigned long nr_pages, int migratetype, bool atomic)
{
const unsigned long end = pfn + nr_pages;
for (pfn = pageblock_align(pfn); pfn < end; pfn += pageblock_nr_pages) {
init_pageblock_migratetype(pfn_to_page(pfn), migratetype, false);
- if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
+ if (!atomic && IS_ALIGNED(pfn, PAGES_PER_SECTION))
cond_resched();
}
}
@@ -1142,7 +1142,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
compound_nr_pages(pfn, altmap, pgmap));
}
- pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE);
+ pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false);
pr_debug("%s initialised %lu pages in %ums\n", __func__,
nr_pages, jiffies_to_msecs(jiffies - start));
@@ -1993,12 +1993,12 @@ static void __init deferred_free_pages(unsigned long pfn,
if (!nr_pages)
return;
+ pageblock_migratetype_init_range(pfn, nr_pages, mt, true);
+
page = pfn_to_page(pfn);
/* Free a large naturally-aligned chunk if possible */
if (nr_pages == MAX_ORDER_NR_PAGES && IS_MAX_ORDER_ALIGNED(pfn)) {
- for (i = 0; i < nr_pages; i += pageblock_nr_pages)
- init_pageblock_migratetype(page + i, mt, false);
__free_pages_core(page, MAX_PAGE_ORDER, MEMINIT_EARLY);
return;
}
@@ -2006,11 +2006,8 @@ static void __init deferred_free_pages(unsigned long pfn,
/* Accept chunks smaller than MAX_PAGE_ORDER upfront */
accept_memory(PFN_PHYS(pfn), nr_pages * PAGE_SIZE);
- for (i = 0; i < nr_pages; i++, page++, pfn++) {
- if (pageblock_aligned(pfn))
- init_pageblock_migratetype(page, mt, false);
- __free_pages_core(page, 0, MEMINIT_EARLY);
- }
+ for (i = 0; i < nr_pages; i++)
+ __free_pages_core(page + i, 0, MEMINIT_EARLY);
}
/* Completion tracking for deferred_init_memmap() threads */
--
2.54.0
* [PATCH v2 04/69] mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
Gigantic bootmem HugeTLB pages are currently initialized from hugetlb_init(),
but page_alloc_init_late() runs earlier and walks pageblocks to determine
zone contiguity.
If a bootmem HugeTLB region is marked noinit, set_zone_contiguous() can
observe still-uninitialized struct pages through __pageblock_pfn_to_page().
This may not trigger an immediate failure, but it can make
set_zone_contiguous() compute the wrong zone contiguity state. If extra
poisoned-page checks are added in this path, such as PF_POISONED_CHECK()
in page_zone_id(), it can also trigger an early boot panic.
Initialize gigantic bootmem HugeTLB struct pages from page_alloc_init_late(),
before zone contiguity is evaluated, so later page allocator setup only
sees valid struct page state. This also makes the initialization order
more natural, as struct pages should be initialized before later code
inspects them.
Fixes: fde1c4ecf916 ("mm: hugetlb: skip initialization of gigantic tail struct pages if freed by HVO")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 5 +++++
mm/hugetlb.c | 3 +--
mm/mm_init.c | 1 +
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 93418625d3c5..52a2c30f866c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -173,6 +173,7 @@ extern int movable_gigantic_pages __read_mostly;
extern int sysctl_hugetlb_shm_group __read_mostly;
extern struct list_head huge_boot_pages[MAX_NUMNODES];
+void hugetlb_struct_page_init(void);
void hugetlb_bootmem_alloc(void);
extern nodemask_t hugetlb_bootmem_nodes;
void hugetlb_bootmem_set_nodes(void);
@@ -1307,6 +1308,10 @@ static inline bool hugetlbfs_pagecache_present(
static inline void hugetlb_bootmem_alloc(void)
{
}
+
+static inline void hugetlb_struct_page_init(void)
+{
+}
#endif /* CONFIG_HUGETLB_PAGE */
static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d22683ab30a1..b4999653a156 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3370,7 +3370,7 @@ static void __init gather_bootmem_prealloc_parallel(unsigned long start,
gather_bootmem_prealloc_node(nid);
}
-static void __init gather_bootmem_prealloc(void)
+void __init hugetlb_struct_page_init(void)
{
struct padata_mt_job job = {
.thread_fn = gather_bootmem_prealloc_parallel,
@@ -4163,7 +4163,6 @@ static int __init hugetlb_init(void)
}
hugetlb_init_hstates();
- gather_bootmem_prealloc();
report_hugepages();
hugetlb_sysfs_init();
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fde49f7bba6c..5a910cc5534c 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2335,6 +2335,7 @@ void __init page_alloc_init_late(void)
/* Reinit limits that are based on free pages after the kernel is up */
files_maxfiles_init();
#endif
+ hugetlb_struct_page_init();
/* Accounting of total+free memory is stable at this point. */
mem_init_print_info();
--
2.54.0
* [PATCH v2 03/69] powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
vmemmap_populate_compound_pages() uses addr_pfn to determine the PFN
offset within a compound page and to decide whether the current
vmemmap slot should be populated as a head page mapping or should reuse
a tail page mapping.
However, addr_pfn is advanced manually in parallel with addr. The loop
itself progresses in vmemmap address space, so each PAGE_SIZE step in
addr covers PAGE_SIZE / sizeof(struct page) struct page slots. Since
addr_pfn is compared against nr_pages in data-PFN units, it should
advance by the same number of PFNs. The existing manual increments do
not match that and therefore do not reliably track the PFN
corresponding to the current addr.
As a result, pfn_offset can be computed from the wrong PFN and the code
can make the head/tail decision for the wrong compound-page position.
Fix this by deriving addr_pfn directly from the current vmemmap address
instead of carrying it as loop state.
Fixes: f2b79c0d7968 ("powerpc/book3s64/radix: add support for vmemmap optimization for radix")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
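Illustration only, not part of the patch: a userspace sketch of the
arithmetic described in the commit message. The SKETCH_* constants and
sketch_* helpers are assumptions chosen for the example (64K base pages, a
64-byte struct page, and a made-up vmemmap base address). The sketch shows
that one PAGE_SIZE step in vmemmap address space spans
PAGE_SIZE / sizeof(struct page) data PFNs, so a counter bumped by 1 per
step drifts away from the PFN derived from the address.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; real kernels derive these per config. */
#define SKETCH_PAGE_SIZE        UINT64_C(65536)        /* 64K base pages */
#define SKETCH_STRUCT_PAGE_SIZE UINT64_C(64)           /* sizeof(struct page) */
#define SKETCH_VMEMMAP_BASE     UINT64_C(0x100000000)  /* hypothetical vmemmap start */

/* Userspace analogue of page_to_pfn((struct page *)addr) for a linear vmemmap. */
static uint64_t sketch_vmemmap_addr_to_pfn(uint64_t addr)
{
	assert(addr >= SKETCH_VMEMMAP_BASE);
	return (addr - SKETCH_VMEMMAP_BASE) / SKETCH_STRUCT_PAGE_SIZE;
}

/* PFN after 'steps' PAGE_SIZE steps when a counter is bumped by 1 per step. */
static uint64_t sketch_manual_pfn(uint64_t start_pfn, uint64_t steps)
{
	return start_pfn + steps;
}

/* PFN after the same steps when derived from the current vmemmap address. */
static uint64_t sketch_derived_pfn(uint64_t start_pfn, uint64_t steps)
{
	uint64_t addr = SKETCH_VMEMMAP_BASE + start_pfn * SKETCH_STRUCT_PAGE_SIZE;

	return sketch_vmemmap_addr_to_pfn(addr + steps * SKETCH_PAGE_SIZE);
}
```

Under these assumptions each step covers 65536 / 64 = 1024 PFNs, so after
a single step the manual counter and the derived PFN already disagree.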
arch/powerpc/mm/book3s64/radix_pgtable.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 10aced261cff..cf692b2b5f7b 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1314,7 +1314,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
* covering out both edges.
*/
unsigned long addr;
- unsigned long addr_pfn = start_pfn;
unsigned long next;
pgd_t *pgd;
p4d_t *p4d;
@@ -1335,7 +1334,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
if (pmd_leaf(READ_ONCE(*pmd))) {
/* existing huge mapping. Skip the range */
- addr_pfn += (PMD_SIZE >> PAGE_SHIFT);
next = pmd_addr_end(addr, end);
continue;
}
@@ -1348,11 +1346,11 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
* page whose VMEMMAP_RESERVE_NR pages were mapped and
* this request fall in those pages.
*/
- addr_pfn += 1;
next = addr + PAGE_SIZE;
continue;
} else {
unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
+ unsigned long addr_pfn = page_to_pfn((struct page *)addr);
unsigned long pfn_offset = addr_pfn - ALIGN_DOWN(addr_pfn, nr_pages);
pte_t *tail_page_pte;
@@ -1376,7 +1374,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
if (!pte)
return -ENOMEM;
- addr_pfn += 2;
next = addr + 2 * PAGE_SIZE;
continue;
}
@@ -1392,7 +1389,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
- addr_pfn += 1;
next = addr + PAGE_SIZE;
continue;
}
@@ -1402,7 +1398,6 @@ int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return -ENOMEM;
vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
- addr_pfn += 1;
next = addr + PAGE_SIZE;
continue;
}
--
2.54.0
* [PATCH v2 02/69] mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios()
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
__hugetlb_vmemmap_optimize_folios() uses incorrect arguments when handling
bootmem HugeTLB folios.
The section number passed to register_page_bootmem_memmap() is derived from
the vmemmap virtual address of folio->page instead of the folio PFN, so the
bootmem memmap metadata can be registered against the wrong section. The
helper is also given HUGETLB_VMEMMAP_RESERVE_SIZE even though it expects a
page count, not a size in bytes. In addition, the write-protect range is
based on pages_per_huge_page(h), which does not cover the full HugeTLB
vmemmap area and can leave part of the shared tail vmemmap mapping writable.
Fix the section lookup to use folio_pfn(folio), use
HUGETLB_VMEMMAP_RESERVE_PAGES when registering the reserved memmap pages, and
use hugetlb_vmemmap_size(h) for the write-protect range.
Fixes: 752fe17af693 ("mm/hugetlb: add pre-HVO framework")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
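Illustration only, not part of the patch: a userspace sketch of the unit
mix-ups fixed here. The SKETCH_* constants and sketch_* helpers are
assumptions for the example (4K base pages, a 64-byte struct page, and a
reserved vmemmap size of one base page); they are not the kernel's
definitions. The sketch contrasts a size in bytes with a count of struct
pages, and shows how far a write-protect range extends when its start is a
vmemmap byte address.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only (4K pages, 64-byte struct page). */
#define SKETCH_PAGE_SIZE        UINT64_C(4096)
#define SKETCH_STRUCT_PAGE_SIZE UINT64_C(64)

/* Reserved vmemmap per HugeTLB folio, in bytes and in struct pages (assumed). */
#define SKETCH_RESERVE_SIZE  SKETCH_PAGE_SIZE
#define SKETCH_RESERVE_PAGES (SKETCH_RESERVE_SIZE / SKETCH_STRUCT_PAGE_SIZE)

/* Bytes of vmemmap describing one huge page made of 'nr_base_pages' base pages. */
static uint64_t sketch_vmemmap_size(uint64_t nr_base_pages)
{
	assert(nr_base_pages > 0);
	return nr_base_pages * SKETCH_STRUCT_PAGE_SIZE;
}

/* End of the write-protect range when the start is a vmemmap byte address. */
static uint64_t sketch_wrprotect_end(uint64_t vmemmap_start, uint64_t nr_base_pages)
{
	return vmemmap_start + sketch_vmemmap_size(nr_base_pages);
}

/* End address produced by adding a page count to a byte address instead. */
static uint64_t sketch_wrprotect_end_wrong_units(uint64_t vmemmap_start,
						 uint64_t nr_base_pages)
{
	return vmemmap_start + nr_base_pages;
}
```

For a 1 GiB huge page (262144 base pages) the vmemmap area is 16 MiB under
these assumptions, while adding the page count to the byte address covers
only 256 KiB of it.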
mm/hugetlb_vmemmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 62e61af18c9a..4f58cd940f61 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -635,12 +635,12 @@ static void __hugetlb_vmemmap_optimize_folios(struct hstate *h,
* mirrored tail page structs RO.
*/
spfn = (unsigned long)&folio->page;
- epfn = spfn + pages_per_huge_page(h);
+ epfn = spfn + hugetlb_vmemmap_size(h);
vmemmap_wrprotect_hvo(spfn, epfn, folio_nid(folio),
HUGETLB_VMEMMAP_RESERVE_SIZE);
- register_page_bootmem_memmap(pfn_to_section_nr(spfn),
+ register_page_bootmem_memmap(pfn_to_section_nr(folio_pfn(folio)),
&folio->page,
- HUGETLB_VMEMMAP_RESERVE_SIZE);
+ HUGETLB_VMEMMAP_RESERVE_PAGES);
continue;
}
--
2.54.0
* [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX
From: Muchun Song @ 2026-05-13 13:04 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In this series, HVO is redefined as Hugepage Vmemmap Optimization: a
general vmemmap optimization model for large hugepage-backed mappings,
rather than a HugeTLB-only implementation detail.
The existing code grew around the original HugeTLB-specific HVO path,
while device DAX developed similar but separate vmemmap optimization
handling. As a result, the current implementation carries duplicated
logic, boot-time special cases, and subsystem-specific interfaces around
what is fundamentally the same sparse-vmemmap optimization.
This series generalizes that optimization into a common framework used
by both HugeTLB and device DAX.
The first few patches include some minor bug fixes found during AI-aided
review of the current code. These fixes are not the main goal of the
series, but the later refactoring and unification work depends on them,
so they are included here as preparatory changes.
The series then reworks the relevant early boot and sparse
initialization paths, introduces a generic section-based sparse-vmemmap
optimization infrastructure, switches HugeTLB and device DAX over to the
shared implementation, and removes the old special-case code.
At a high level, the series does the following:
- apply a small set of preparatory bug fixes
- reorder early boot and sparse initialization so optimized vmemmap
  setup has the required zone and pageblock state
- introduce generic section-based vmemmap optimization infrastructure
- switch HugeTLB and device DAX to the shared implementation
- consolidate HVO enablement and naming
- remove obsolete HugeTLB-specific boot-time and architecture-specific
  optimization code
- rewrite the documentation around the unified design
This brings a few concrete benefits:
- HugeTLB and device DAX share one vmemmap optimization framework,
  reducing duplicated logic and long-term maintenance overhead
- when CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, optimized struct
  pages can skip the usual memmap_init() initialization work, which
  helps reduce boot-time overhead
- all architectures that support HVO benefit from the generic
  sparse-vmemmap optimization path without extra architecture-specific
  preinit handling
- device DAX improves its struct page savings further by dropping the
  extra reserved tail page
- shared vmemmap tail pages are mapped read-only, improving robustness
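As a rough illustration of the struct page savings mentioned above, here is
a small userspace sketch. It is not code from the series: the SKETCH_*
constants and sketch_* helpers are assumptions (4K base pages and a 64-byte
struct page), and the number of vmemmap pages kept per hugepage is left as
a parameter because the exact counts before and after the series are not
stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only (4K pages, 64-byte struct page). */
#define SKETCH_PAGE_SIZE        UINT64_C(4096)
#define SKETCH_STRUCT_PAGE_SIZE UINT64_C(64)

/* Number of vmemmap pages needed to hold the struct pages of one hugepage. */
static uint64_t sketch_vmemmap_pages(uint64_t nr_base_pages)
{
	uint64_t bytes = nr_base_pages * SKETCH_STRUCT_PAGE_SIZE;

	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* vmemmap pages that can be freed if 'kept' pages stay individually backed. */
static uint64_t sketch_vmemmap_saved(uint64_t nr_base_pages, uint64_t kept)
{
	uint64_t total = sketch_vmemmap_pages(nr_base_pages);

	assert(kept <= total);
	return total - kept;
}
```

With these assumptions a 2 MiB hugepage (512 base pages) needs 8 vmemmap
pages, so keeping one fewer page per hugepage frees one more page out of
those 8.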
I have only built and tested this series on x86. I do not currently have
a powerpc test environment, so any testing or feedback on powerpc would
be much appreciated.
Changes since v1:
- rebased onto current next tree
- added the preparatory minor bug fixes found during AI-aided review
- added further refactoring on top of the new infrastructure
Muchun Song (69):
mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages
mm/hugetlb_vmemmap: Fix __hugetlb_vmemmap_optimize_folios()
powerpc/mm: Fix wrong addr_pfn tracking in compound vmemmap population
mm/hugetlb: Initialize gigantic bootmem hugepage struct pages earlier
mm/mm_init: Simplify deferred_free_pages() migratetype init
mm/sparse: Panic on memmap and usemap allocation failure
mm/sparse: Move subsection_map_init() into sparse_init()
mm/mm_init: Defer sparse_init() until after zone initialization
mm/mm_init: Defer hugetlb reservation until after zone initialization
mm/mm_init: Remove set_pageblock_order() call from sparse_init()
mm/sparse: Move sparse_vmemmap_init_nid_late() into sparse_init_nid()
mm/hugetlb_cma: Validate hugetlb CMA range by zone at reserve time
mm/hugetlb: Refactor early boot gigantic hugepage allocation
mm/hugetlb: Free cross-zone bootmem gigantic pages after allocation
mm/hugetlb_vmemmap: Move bootmem HVO setup to early init
mm/hugetlb: Remove obsolete bootmem cross-zone checks
mm/sparse-vmemmap: Remove sparse_vmemmap_init_nid_late()
mm/hugetlb: Remove unused bootmem cma field
mm/mm_init: Make __init_page_from_nid() static
mm/sparse-vmemmap: Drop VMEMMAP_POPULATE_PAGEREF
mm: Rename vmemmap optimization macros around folio semantics
mm/sparse: Drop power-of-2 size requirement for struct mem_section
mm/sparse-vmemmap: track compound page order in struct mem_section
mm/mm_init: Skip initializing shared vmemmap tail pages
mm/sparse-vmemmap: Initialize shared tail vmemmap pages on allocation
mm/sparse-vmemmap: Support section-based vmemmap accounting
mm/sparse-vmemmap: Support section-based vmemmap optimization
mm/hugetlb: Use generic vmemmap optimization macros
mm/sparse: Mark memblocks present earlier
mm/hugetlb: Switch HugeTLB to section-based vmemmap optimization
mm/sparse: Remove section_map_size()
mm/mm_init: Factor out pfn_to_zone() as a shared helper
mm/sparse: Remove SPARSEMEM_VMEMMAP_PREINIT
mm/sparse: Inline usemap allocation into sparse_init_nid()
mm/hugetlb: Remove HUGE_BOOTMEM_HVO
mm/hugetlb: Remove HUGE_BOOTMEM_CMA
mm/sparse-vmemmap: Factor out shared vmemmap page allocation
mm/sparse-vmemmap: Introduce CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION
mm/sparse-vmemmap: Switch DAX to vmemmap_shared_tail_page()
powerpc/mm: Switch DAX to vmemmap_shared_tail_page()
mm/sparse-vmemmap: Drop the extra tail page from DAX reservation
mm/sparse-vmemmap: Switch DAX to section-based vmemmap optimization
mm/sparse-vmemmap: Unify DAX and HugeTLB population paths
mm/sparse-vmemmap: Remove the unused ptpfn argument
powerpc/mm: Make vmemmap_populate_compound_pages() static
mm/sparse-vmemmap: Map shared vmemmap tail pages read-only
powerpc/mm: Map shared vmemmap tail pages read-only
mm/sparse-vmemmap: Inline vmemmap_populate_address() into its caller
mm/hugetlb_vmemmap: Remove vmemmap_wrprotect_hvo()
mm/sparse: Simplify section_nr_vmemmap_pages()
mm/sparse-vmemmap: Introduce vmemmap_nr_struct_pages()
powerpc/mm: Drop powerpc vmemmap_can_optimize()
mm/sparse-vmemmap: Drop vmemmap_can_optimize()
mm/sparse-vmemmap: Drop @pgmap from vmemmap population APIs
mm/sparse: Decouple section activation from ZONE_DEVICE
mm: Redefine HVO as Hugepage Vmemmap Optimization
mm/sparse-vmemmap: Consolidate HVO enable checks
mm/hugetlb: Make HVO optimizable checks depend on generic logic
mm/sparse-vmemmap: Localize init_compound_tail()
mm/mm_init: Check zone consistency on optimized vmemmap sections
mm/hugetlb: Drop boot-time HVO handling for gigantic folios
mm/hugetlb: Simplify hugetlb_folio_init_vmemmap()
mm/hugetlb: Initialize the full bootmem hugepage in hugetlb code
mm/mm_init: Factor out compound page initialization
mm/mm_init: Make __init_single_page() static
mm/cma: Move CMA pageblock initialization into cma_activate_area()
mm/cma: Move init_cma_pageblock() into cma.c
mm/mm_init: Initialize pageblock migratetype in memmap init helpers
Documentation/mm: Rewrite vmemmap_dedup.rst for unified HVO
.../admin-guide/kernel-parameters.txt | 2 +-
Documentation/admin-guide/mm/hugetlbpage.rst | 4 +-
.../admin-guide/mm/memory-hotplug.rst | 2 +-
Documentation/admin-guide/sysctl/vm.rst | 3 +-
Documentation/arch/powerpc/index.rst | 1 -
Documentation/arch/powerpc/vmemmap_dedup.rst | 101 ----
Documentation/mm/vmemmap_dedup.rst | 217 ++------
arch/arm64/mm/mmu.c | 5 +-
arch/loongarch/mm/init.c | 5 +-
arch/powerpc/include/asm/book3s/64/radix.h | 12 -
arch/powerpc/mm/book3s64/radix_pgtable.c | 154 +-----
arch/powerpc/mm/hugetlbpage.c | 11 +-
arch/powerpc/mm/init_64.c | 1 +
arch/powerpc/mm/mem.c | 5 +-
arch/riscv/mm/init.c | 5 +-
arch/s390/mm/init.c | 5 +-
arch/x86/Kconfig | 1 -
arch/x86/entry/vdso/vdso32/fake_32bit_build.h | 1 -
arch/x86/mm/init_64.c | 5 +-
drivers/dax/Kconfig | 1 +
fs/Kconfig | 6 +-
include/linux/hugetlb.h | 23 +-
include/linux/memory_hotplug.h | 12 +-
include/linux/mm.h | 44 +-
include/linux/mm_types.h | 3 +-
include/linux/mmzone.h | 151 ++++--
include/linux/page-flags-layout.h | 2 +
include/linux/page-flags.h | 31 +-
kernel/bounds.c | 5 +
mm/Kconfig | 9 +-
mm/bootmem_info.c | 5 +-
mm/cma.c | 18 +-
mm/hugetlb.c | 337 ++++--------
mm/hugetlb_cma.c | 41 +-
mm/hugetlb_cma.h | 4 +-
mm/hugetlb_vmemmap.c | 266 +--------
mm/hugetlb_vmemmap.h | 64 +--
mm/internal.h | 72 ++-
mm/memory-failure.c | 6 +-
mm/memory_hotplug.c | 22 +-
mm/memremap.c | 4 +-
mm/mm_init.c | 241 ++++-----
mm/sparse-vmemmap.c | 511 ++++++------------
mm/sparse.c | 129 +----
mm/util.c | 2 +-
scripts/gdb/linux/mm.py | 6 +-
46 files changed, 743 insertions(+), 1812 deletions(-)
delete mode 100644 Documentation/arch/powerpc/vmemmap_dedup.rst
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
--
2.54.0
* [PATCH v2 01/69] mm/hugetlb: Fix boot panic with CONFIG_DEBUG_VM and HVO bootmem pages
From: Muchun Song @ 2026-05-13 13:04 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
Commit 622026e87c40 ("mm/hugetlb: remove fake head pages") switched
HVO to reuse per-zone shared tail pages from zone->vmemmap_tails[].
Those shared tail pages were initialized in hugetlb_vmemmap_init(), but
bootmem HugeTLB folios are prepared earlier from gather_bootmem_prealloc().
With hugetlb_free_vmemmap=on, prep_and_add_bootmem_folios() can access
pageblock flags on bootmem HugeTLB pages whose mirrored tail struct pages
already point to the shared tail page. On CONFIG_DEBUG_VM kernels,
get_pfnblock_bitmap_bitidx() then dereferences the still-uninitialized
shared tail page and can panic during boot.
Initialize zone->vmemmap_tails[] from gather_bootmem_prealloc(), before
bootmem HugeTLB folios are processed, and drop the later initialization
from hugetlb_vmemmap_init().
This bug only affects CONFIG_DEBUG_VM kernels, where the relevant
assertion is evaluated.
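For reference, the inner loop bound in this patch is the number of struct
page entries held by one vmemmap page, so every entry of a shared tail page
gets initialized. A minimal userspace sketch of that arithmetic follows; the
4096-byte page and 64-byte struct page used in the example are assumptions
typical of x86-64, and struct_pages_per_vmemmap_page() is a hypothetical
helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): number of struct page entries
 * that fit in one page of vmemmap. The kernel loop uses the equivalent
 * expression PAGE_SIZE / sizeof(struct page).
 */
static size_t struct_pages_per_vmemmap_page(size_t page_size,
					    size_t struct_page_size)
{
	/* The division is only meaningful when the sizes divide evenly. */
	assert(struct_page_size != 0 && page_size % struct_page_size == 0);
	return page_size / struct_page_size;
}
```

With the assumed sizes, struct_pages_per_vmemmap_page(4096, 64) gives 64
entries per shared tail page.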
Fixes: 622026e87c40 ("mm/hugetlb: remove fake head pages")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 19 +++++++++++++++++++
mm/hugetlb_vmemmap.c | 17 -----------------
2 files changed, 19 insertions(+), 17 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 31b34ca0f402..d22683ab30a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3382,6 +3382,25 @@ static void __init gather_bootmem_prealloc(void)
.max_threads = num_node_state(N_MEMORY),
.numa_aware = true,
};
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+ struct zone *zone;
+
+ for_each_zone(zone) {
+ for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
+ struct page *tail, *p;
+ unsigned int order;
+
+ tail = zone->vmemmap_tails[i];
+ if (!tail)
+ continue;
+
+ order = i + VMEMMAP_TAIL_MIN_ORDER;
+ p = page_to_virt(tail);
+ for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
+ init_compound_tail(p + j, NULL, order, zone);
+ }
+ }
+#endif
padata_do_multithreaded(&job);
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4a077d231d3a..62e61af18c9a 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -870,27 +870,10 @@ static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
static int __init hugetlb_vmemmap_init(void)
{
const struct hstate *h;
- struct zone *zone;
/* HUGETLB_VMEMMAP_RESERVE_SIZE should cover all used struct pages */
BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
- for_each_zone(zone) {
- for (int i = 0; i < NR_VMEMMAP_TAILS; i++) {
- struct page *tail, *p;
- unsigned int order;
-
- tail = zone->vmemmap_tails[i];
- if (!tail)
- continue;
-
- order = i + VMEMMAP_TAIL_MIN_ORDER;
- p = page_to_virt(tail);
- for (int j = 0; j < PAGE_SIZE / sizeof(struct page); j++)
- init_compound_tail(p + j, NULL, order, zone);
- }
- }
-
for_each_hstate(h) {
if (hugetlb_vmemmap_optimizable(h)) {
register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
--
2.54.0
* [PATCH v2 5/5] KVM: PPC: Document KVM_PPC_GET_COMPAT_CAPS ioctl
From: Amit Machhiwal @ 2026-05-13 10:07 UTC (permalink / raw)
To: linuxppc-dev, Madhavan Srinivasan
Cc: Amit Machhiwal, Vaibhav Jain, Paolo Bonzini, Jonathan Corbet,
Shuah Khan, kvm, linux-kernel, linux-doc
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>
Add documentation for the KVM_PPC_GET_COMPAT_CAPS ioctl to the KVM API
documentation.
The ioctl exposes host processor compatibility modes supported for
nested KVM guests on PowerPC systems.
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
Documentation/virt/kvm/api.rst | 35 ++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 52bbbb553ce1..1b533f674a09 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6555,6 +6555,41 @@ KVM_S390_KEYOP_SSKE
.. _kvm_run:
+4.145 KVM_PPC_GET_COMPAT_CAPS
+-----------------------------
+:Capability: KVM_CAP_PPC_COMPAT_CAPS
+:Architectures: powerpc
+:Type: vm ioctl
+:Parameters: struct kvm_ppc_compat_caps (out)
+:Returns:
+ 0 on successful completion,
+ -EFAULT if ``struct kvm_ppc_compat_caps`` cannot be written
+
+IBM POWER system server-based processors provide a compatibility mode feature
+where an Nth generation processor can operate in modes consistent with earlier
+generations such as (N-1) and (N-2).
+
+This ioctl provides userspace with information about the CPU compatibility modes
+supported by the current host processor for booting the nested KVM guests on
+PowerNV (KVM nested APIv1) and PowerVM (KVM nested APIv2) platforms.
+
+::
+
+ struct kvm_ppc_compat_caps {
+ __u64 flags; /* Reserved for future use */
+ __u64 compat_capabilities; /* Capabilities supported by the host */
+ };
+
+The ``compat_capabilities`` bit field describes the processor compatibility
+modes supported by the host. For example, the following bits indicate support
+for specific processor modes.
+
+::
+
+  H_GUEST_CAP_POWER9 (bit 1): KVM guests can run in Power9 processor mode
+  H_GUEST_CAP_POWER10 (bit 2): KVM guests can run in Power10 processor mode
+  H_GUEST_CAP_POWER11 (bit 3): KVM guests can run in Power11 processor mode
+
5. The kvm_run structure
========================
--
2.50.1 (Apple Git-155)
* [PATCH v2 4/5] KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM on PowerNV
From: Amit Machhiwal @ 2026-05-13 10:07 UTC (permalink / raw)
To: linuxppc-dev, Madhavan Srinivasan
Cc: Amit Machhiwal, Vaibhav Jain, Nicholas Piggin, Michael Ellerman,
Christophe Leroy (CS GROUP), kvm, linux-kernel
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>
Currently, when booting a compatibility-mode KVM guest (L1) on a PowerNV
hypervisor (L0), the guest runs with the expected processor
compatibility level. However, when booting a nested KVM guest (L2)
inside the L1, QEMU derives the CPU model from the raw host PVR and
attempts to run the nested guest at that level, instead of honoring the
compatibility mode of the L1.
Extend host CPU compatibility capability reporting to support nested
virtualization on PowerNV systems (PAPR nested API v1).
For nested API v2 (PowerVM), compatibility capabilities are obtained
from the hypervisor via the H_GUEST_GET_CAPABILITIES hcall. This
information is not available on PowerNV systems.
For nested API v1, derive the compatibility capabilities from the L1
guest by reading the "cpu-version" property from the device tree, which
reflects the effective (logical) processor compatibility level. Map this
value to the corresponding compatibility capability bitmap.
Introduce a helper to translate CPU version values into compatibility
capability bits and integrate it into kvmppc_get_compat_cpu_caps().
This allows userspace to query host CPU compatibility modes on both
PowerVM and PowerNV platforms via the KVM_PPC_GET_COMPAT_CAPS ioctl.
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
arch/powerpc/kvm/book3s_hv.c | 37 +++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 38de7040e2b7..18774c49af85 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6522,15 +6522,50 @@ static bool kvmppc_hash_v3_possible(void)
return true;
}
+static int kvmppc_map_compat_capabilities(u32 cpu_version,
+ unsigned long *capabilities)
+{
+ switch (cpu_version) {
+ case PVR_ARCH_31_P11:
+ *capabilities |= H_GUEST_CAP_POWER11;
+ break;
+ case PVR_ARCH_31:
+ *capabilities |= H_GUEST_CAP_POWER10;
+ break;
+ case PVR_ARCH_300:
+ *capabilities |= H_GUEST_CAP_POWER9;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
static int kvmppc_get_compat_cpu_caps(struct kvm_ppc_compat_caps *host_caps)
{
+ struct device_node *np;
unsigned long capabilities = 0;
+ const __be32 *prop = NULL;
long rc = -EINVAL;
+ u32 cpu_version;
if (kvmhv_on_pseries()) {
- if (kvmhv_is_nestedv2())
+ if (kvmhv_is_nestedv2()) {
rc = plpar_guest_get_capabilities(0, &capabilities);
+ } else {
+ for_each_node_by_type(np, "cpu") {
+ prop = of_get_property(np, "cpu-version", NULL);
+ if (prop) {
+ cpu_version = be32_to_cpup(prop);
+ break;
+ }
+ }
+ if (!prop)
+ return -EINVAL;
+ rc = kvmppc_map_compat_capabilities(cpu_version,
+ &capabilities);
+ }
host_caps->compat_capabilities = capabilities;
}
--
2.50.1 (Apple Git-155)
* [PATCH v2 3/5] KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM on PowerVM
From: Amit Machhiwal @ 2026-05-13 10:07 UTC (permalink / raw)
To: linuxppc-dev, Madhavan Srinivasan
Cc: Amit Machhiwal, Vaibhav Jain, Nicholas Piggin, Michael Ellerman,
Christophe Leroy (CS GROUP), kvm, linux-kernel
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>
On POWER systems, the host CPU may run in a compatibility mode (e.g., a
Power11 processor operating in Power10 compatibility mode). In such
cases, the effective CPU level exposed to guests differs from the
physical processor generation.
When running nested KVM guests, QEMU derives the host CPU type using
mfpvr(), which reflects the physical processor version. This can result
in a mismatch between the CPU model selected by QEMU and the
compatibility mode enforced by the host, leading to guest boot failures.
For example, booting a nested guest on a Power11 LPAR configured in
Power10 compatibility mode fails with:
KVM-NESTEDv2: couldn't set guest wide elements
[..KVM reg dump..]
This occurs because QEMU selects a CPU model corresponding to the
physical processor (via mfpvr()), while the host operates in a lower
compatibility mode. As a result, KVM rejects the requested compatibility
level during guest initialization.
Add support for retrieving host CPU compatibility capabilities for
nested guests on PowerVM (PAPR nested API v2). The hypervisor provides
the effective compatibility levels via the H_GUEST_GET_CAPABILITIES
hcall, which reflects the processor modes negotiated between the Power
hypervisor (L0) and the host partition (L1).
On pseries systems, obtain the capability bitmap using
plpar_guest_get_capabilities() and return it via struct
kvm_ppc_compat_caps. This information is then exposed to userspace
through the KVM_PPC_GET_COMPAT_CAPS ioctl.
Hook the implementation into the Book3S HV kvmppc_ops so that it can be
invoked by the generic KVM ioctl handling code.
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
arch/powerpc/kvm/book3s_hv.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 249d1f2e4e2c..38de7040e2b7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6522,6 +6522,21 @@ static bool kvmppc_hash_v3_possible(void)
return true;
}
+
+static int kvmppc_get_compat_cpu_caps(struct kvm_ppc_compat_caps *host_caps)
+{
+ unsigned long capabilities = 0;
+ long rc = -EINVAL;
+
+ if (kvmhv_on_pseries()) {
+ if (kvmhv_is_nestedv2())
+ rc = plpar_guest_get_capabilities(0, &capabilities);
+ host_caps->compat_capabilities = capabilities;
+ }
+
+ return rc;
+}
+
static struct kvmppc_ops kvm_ops_hv = {
.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
@@ -6564,6 +6579,7 @@ static struct kvmppc_ops kvm_ops_hv = {
.hash_v3_possible = kvmppc_hash_v3_possible,
.create_vcpu_debugfs = kvmppc_arch_create_vcpu_debugfs_hv,
.create_vm_debugfs = kvmppc_arch_create_vm_debugfs_hv,
+ .get_compat_cpu_ver = kvmppc_get_compat_cpu_caps,
};
static int kvm_init_subcore_bitmap(void)
--
2.50.1 (Apple Git-155)
* [PATCH v2 2/5] KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
From: Amit Machhiwal @ 2026-05-13 10:07 UTC (permalink / raw)
To: linuxppc-dev, Madhavan Srinivasan
Cc: Amit Machhiwal, Vaibhav Jain, Paolo Bonzini, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP), kvm, linux-kernel
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>
Introduce a new capability and ioctl to expose CPU compatibility modes
supported by the host processor for nested guests.
On IBM POWER systems, newer processor generations (N) can operate in
compatibility modes corresponding to earlier generations, like (N-1) and
(N-2). This is particularly relevant for nested virtualization, where
nested KVM guests may need to run with a specific processor compatibility
level.
Introduce KVM_CAP_PPC_COMPAT_CAPS capability and the corresponding
KVM_PPC_GET_COMPAT_CAPS vm ioctl. The ioctl returns a bitmap describing
the compatibility modes supported by the host in respective bit numbers,
allowing userspace (e.g., QEMU) to select an appropriate compatibility
level when configuring nested KVM guests.
The ioctl handling is added in kvm_arch_vm_ioctl() and retrieves host
CPU compatibility capabilities via a PowerPC-specific backend
implementation when available. If the backend does not provide the
callback, the ioctl returns 0 without copying anything to userspace, so
userspace that zero-initializes the structure sees no capabilities set
and can fall back gracefully.
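A minimal userspace sketch of calling the new ioctl, for illustration only.
The structure and ioctl number are mirrored locally from this patch and are
assumed to remain unchanged if it is merged; query_compat_caps() is a
hypothetical helper name. The structure is zeroed before the call because
the handler may return 0 without writing to it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local mirror of the uAPI additions proposed in this patch (assumption). */
struct kvm_ppc_compat_caps {
	uint64_t flags;			/* Reserved for future use */
	uint64_t compat_capabilities;	/* Capabilities supported by the host */
};

#define KVMIO 0xAE
#define KVM_PPC_GET_COMPAT_CAPS _IOR(KVMIO, 0xe4, struct kvm_ppc_compat_caps)

/*
 * Hypothetical helper: query the compat capability bitmap on a KVM VM fd.
 * Returns 0 on success or a negative errno value on failure.
 */
static int query_compat_caps(int vm_fd, struct kvm_ppc_compat_caps *caps)
{
	assert(caps != NULL);

	/* Zero first: an empty bitmap then means "no information". */
	memset(caps, 0, sizeof(*caps));

	if (ioctl(vm_fd, KVM_PPC_GET_COMPAT_CAPS, caps) < 0)
		return -errno;

	return 0;
}
```

A caller would typically check KVM_CAP_PPC_COMPAT_CAPS with
KVM_CHECK_EXTENSION first and skip the query when it reports 0.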
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
arch/powerpc/include/asm/kvm_ppc.h | 1 +
arch/powerpc/include/uapi/asm/kvm.h | 6 ++++++
arch/powerpc/kvm/powerpc.c | 21 +++++++++++++++++++++
include/uapi/linux/kvm.h | 4 ++++
4 files changed, 32 insertions(+)
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0953f2daa466..cadfb839e836 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -319,6 +319,7 @@ struct kvmppc_ops {
bool (*hash_v3_possible)(void);
int (*create_vm_debugfs)(struct kvm *kvm);
int (*create_vcpu_debugfs)(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry);
+ int (*get_compat_cpu_ver)(struct kvm_ppc_compat_caps *host_caps);
};
extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 077c5437f521..081d6c7f7f70 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -437,6 +437,12 @@ struct kvm_ppc_cpu_char {
__u64 behaviour_mask; /* valid bits in behaviour */
};
+/* For KVM_PPC_GET_COMPAT_CAPS */
+struct kvm_ppc_compat_caps {
+ __u64 flags; /* Reserved for future use */
+ __u64 compat_capabilities; /* Capabilities supported by the host */
+};
+
/*
* Values for character and character_mask.
* These are identical to the values used by H_GET_CPU_CHARACTERISTICS.
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 00302399fc37..91a0228942e1 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -697,6 +697,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
}
}
break;
+#if defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
+ case KVM_CAP_PPC_COMPAT_CAPS:
+ r = 0;
+ if (kvmhv_on_pseries())
+ r = 1;
+ break;
+#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
default:
r = 0;
break;
@@ -2463,6 +2470,20 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg)
r = kvm->arch.kvm_ops->svm_off(kvm);
break;
}
+ case KVM_PPC_GET_COMPAT_CAPS: {
+ struct kvm_ppc_compat_caps host_caps;
+
+ r = 0;
+ memset(&host_caps, 0, sizeof(host_caps));
+ if (!kvm->arch.kvm_ops->get_compat_cpu_ver)
+ goto out;
+
+ r = kvm->arch.kvm_ops->get_compat_cpu_ver(&host_caps);
+ if (!r && copy_to_user(argp, &host_caps,
+ sizeof(host_caps)))
+ r = -EFAULT;
+ break;
+ }
default: {
struct kvm *kvm = filp->private_data;
r = kvm->arch.kvm_ops->arch_vm_ioctl(filp, ioctl, arg);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6c8afa2047bf..1788a0068662 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -996,6 +996,7 @@ struct kvm_enable_cap {
#define KVM_CAP_S390_USER_OPEREXEC 246
#define KVM_CAP_S390_KEYOP 247
#define KVM_CAP_S390_VSIE_ESAMODE 248
+#define KVM_CAP_PPC_COMPAT_CAPS 249
struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1349,6 +1350,9 @@ struct kvm_s390_keyop {
#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr)
+/* Available with KVM_CAP_PPC_COMPAT_CAPS */
+#define KVM_PPC_GET_COMPAT_CAPS _IOR(KVMIO, 0xe4, struct kvm_ppc_compat_caps)
+
/*
* ioctls for vcpu fds
*/
--
2.50.1 (Apple Git-155)
* [PATCH v2 1/5] KVM: PPC: Book3S HV: Validate arch_compat against host compatibility mode
From: Amit Machhiwal @ 2026-05-13 10:07 UTC (permalink / raw)
To: linuxppc-dev, Madhavan Srinivasan
Cc: Amit Machhiwal, Vaibhav Jain, Nicholas Piggin, Michael Ellerman,
Christophe Leroy (CS GROUP), kvm, linux-kernel
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>
On IBM POWER systems, newer processor generations can operate in
compatibility modes corresponding to earlier generations. This becomes
relevant for nested virtualization, where nested KVM guests may need to
run with a specific processor compatibility level.
Currently, when running a nested KVM guest (L2) inside a Power11 pSeries
logical partition (L1) booted in Power10 compatibility mode, the guest
fails to boot while setting 'arch_compat'. This happens because the CPU
class is derived from the hardware PVR (via mfpvr()), which reflects the
physical processor generation (Power11), rather than the effective
compatibility mode (Power10).
As a result, userspace may request a Power11 arch_compat for the L2
guest. However, the L1 partition, running in Power10 compatibility, has
only negotiated support up to Power10 with the Power Hypervisor (L0).
When H_SET_STATE is invoked with a Power11 Logical PVR, the hypervisor
rejects the request, leading to a late guest boot failure:
KVM-NESTEDv2: couldn't set guest wide elements
[..KVM reg dump..]
This situation should be detected earlier. Rejecting unsupported
'arch_compat' values in 'kvmppc_set_arch_compat()' avoids issuing an
invalid H_SET_STATE hcall and provides a clearer failure mode.
Add a check to reject Power11 'arch_compat' requests when the host is
running in Power10 compatibility mode, returning -EINVAL early instead
of deferring the failure to the hypervisor.
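A minimal userspace sketch of that check, for illustration only. The
PVR_ARCH_* values are quoted from memory of arch/powerpc/include/asm/reg.h
and should be treated as assumptions, as should the example mask/value
pairs; struct cpu_spec_sketch, host_is_p10_compat() and check_arch_compat()
are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed logical PVR values (see asm/reg.h in the kernel tree). */
#define PVR_ARCH_31	UINT32_C(0x0f000006)
#define PVR_ARCH_31_P11	UINT32_C(0x0f000007)

/* Hypothetical stand-in for the two cur_cpu_spec fields the patch reads. */
struct cpu_spec_sketch {
	uint32_t pvr_mask;
	uint32_t pvr_value;
};

/* True when the host cpu_spec matches the Power10 logical PVR. */
static bool host_is_p10_compat(const struct cpu_spec_sketch *spec)
{
	assert(spec != NULL);
	return (PVR_ARCH_31 & spec->pvr_mask) == spec->pvr_value;
}

/*
 * Mirrors the new PVR_ARCH_31_P11 case: Power10 and Power11 share a PCR
 * bit, so a Power11 request on a Power10-compat host is rejected here.
 */
static int check_arch_compat(const struct cpu_spec_sketch *spec,
			     uint32_t arch_compat)
{
	if (arch_compat == PVR_ARCH_31_P11 && host_is_p10_compat(spec))
		return -EINVAL;

	return 0;
}
```

With a host spec of { 0xffffffff, 0x0f000006 }, a PVR_ARCH_31_P11 request
is rejected while a PVR_ARCH_31 request passes.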
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
arch/powerpc/kvm/book3s_hv.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 61dbeea317f3..249d1f2e4e2c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -446,7 +446,19 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
guest_pcr_bit = PCR_ARCH_300;
break;
case PVR_ARCH_31:
+ guest_pcr_bit = PCR_ARCH_31;
+ break;
case PVR_ARCH_31_P11:
+ /*
+ * Need to check this for ISA 3.1, as Power10 and
+ * Power11 share the same PCR. For any subsequent ISA
+ * versions, this will be taken care of by the guest vs
+ * host PCR comparison below.
+ */
+ if ((PVR_ARCH_31 & cur_cpu_spec->pvr_mask) ==
+ cur_cpu_spec->pvr_value) {
+ return -EINVAL;
+ }
guest_pcr_bit = PCR_ARCH_31;
break;
default:
--
2.50.1 (Apple Git-155)
* [PATCH v2 0/5] KVM: PPC: Handle CPU compatibility mode for nested guests
From: Amit Machhiwal @ 2026-05-13 10:07 UTC (permalink / raw)
To: linuxppc-dev, Madhavan Srinivasan
Cc: Amit Machhiwal, Vaibhav Jain, Paolo Bonzini, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP), Jonathan Corbet,
Shuah Khan, kvm, linux-kernel, linux-doc
On POWER systems, newer processor generations can operate in compatibility
modes corresponding to earlier generations (e.g., a Power11 system running
in Power10 compatibility mode). In such cases, the effective CPU level
exposed to guests differs from the physical processor generation.
This creates a problem for nested virtualization. When booting a nested KVM
guest (L2) inside a host KVM guest (L1) running in a compatibility mode,
userspace (e.g., QEMU) may derive the CPU model from the raw hardware PVR
and attempt to configure the nested guest accordingly. However, the L1
partition is constrained by the compatibility level negotiated with the
hypervisor (L0), and requests exceeding that level are rejected, leading to
guest boot failures such as:
KVM-NESTEDv2: couldn't set guest wide elements
This series addresses the issue in two steps:
1. Detect and reject invalid compatibility requests early in KVM to avoid
late failures.
2. Provide a mechanism for userspace to query the effective CPU
compatibility modes supported by the host, so it can select an
appropriate CPU model for nested guests.
To achieve this, the series introduces a new KVM capability and ioctl
(KVM_CAP_PPC_COMPAT_CAPS / KVM_PPC_GET_COMPAT_CAPS) that expose the
compatibility modes supported by the host.
The implementation supports both:
- PowerVM (nested API v2), where compatibility information is obtained
via the H_GUEST_GET_CAPABILITIES hypercall.
- PowerNV (nested API v1), where compatibility is derived from the device
tree ("cpu-version") representing the effective processor compatibility
level.
This allows userspace (e.g., QEMU) to select a CPU model consistent with
the host compatibility mode, avoiding mismatches and enabling successful
nested guest boot.
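As a rough sketch of how userspace might consume the returned bitmap, the
helper below picks the newest compatibility mode reported by the host. It
assumes the bit positions listed in the documentation patch use IBM MSB-0
numbering (bit n is 1 << (63 - n)), which is how the H_GUEST_CAP_* values
are believed to be defined; the COMPAT_CAP_* macros and
newest_compat_mode() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: IBM MSB-0 bit numbering, i.e. "bit n" is 1 << (63 - n). */
#define COMPAT_CAP_BIT(n)	(UINT64_C(1) << (63 - (n)))
#define COMPAT_CAP_POWER9	COMPAT_CAP_BIT(1)
#define COMPAT_CAP_POWER10	COMPAT_CAP_BIT(2)
#define COMPAT_CAP_POWER11	COMPAT_CAP_BIT(3)

static_assert(COMPAT_CAP_POWER9 == UINT64_C(0x4000000000000000),
	      "bit 1 in MSB-0 numbering is 1 << 62");

/* Returns 11, 10 or 9 for the newest reported mode, 0 if none is set. */
static int newest_compat_mode(uint64_t compat_capabilities)
{
	static const struct {
		uint64_t bit;
		int generation;
	} modes[] = {
		{ COMPAT_CAP_POWER11, 11 },
		{ COMPAT_CAP_POWER10, 10 },
		{ COMPAT_CAP_POWER9, 9 },
	};

	/* Table is ordered newest first, so the first match wins. */
	for (size_t i = 0; i < sizeof(modes) / sizeof(modes[0]); i++) {
		if (compat_capabilities & modes[i].bit)
			return modes[i].generation;
	}

	return 0;
}
```

A result of 0 would leave userspace on its existing PVR-based model
selection.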
Changes in v2:
- Squashed patches 2 and 3 from v1 (capability introduction and ioctl
wiring) into a single patch for better logical grouping
- Changed kvm_ppc_compat_caps.flags from __u32 to __u64 for consistency
and future extensibility
- Addressed other review comments
- Improved commit messages with clearer explanations of the changes
Patch summary:
[1/5] Validate arch_compat against host compatibility mode
[2/5] Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
[3/5] Implement capability retrieval for PowerVM (API v2)
[4/5] Add PowerNV support (API v1)
[5/5] Document the new ioctl
Tested on:
- Power11 pSeries LPAR in Power10 compatibility mode (nested API v2)
- Power10 PowerNV system (and QEMU TCG PowerNV 11) with nested
virtualization (API v1) with various combinations of KVM L1/L2 guests
in various supported compatibility modes.
With this series, nested guests boot successfully in configurations where
they previously failed due to compatibility mismatches.
Related QEMU series:
A corresponding QEMU series adds support for querying and using these
compatibility capabilities when configuring nested KVM guests:
https://lore.kernel.org/all/20260502140021.69712-1-amachhiw@linux.ibm.com/
v1: https://lore.kernel.org/linuxppc-dev/20260430054906.94431-1-amachhiw@linux.ibm.com/
Amit Machhiwal (5):
KVM: PPC: Book3S HV: Validate arch_compat against host compatibility
mode
KVM: PPC: Introduce KVM_CAP_PPC_COMPAT_CAPS and wire up ioctl
KVM: PPC: Book3S HV: Implement compat CPU capability retrieval for KVM
on PowerVM
KVM: PPC: Book3S HV: Add support for compat CPU capabilities for KVM
on PowerNV
KVM: PPC: Document KVM_PPC_GET_COMPAT_CAPS ioctl
Documentation/virt/kvm/api.rst | 35 ++++++++++++++++
arch/powerpc/include/asm/kvm_ppc.h | 1 +
arch/powerpc/include/uapi/asm/kvm.h | 6 +++
arch/powerpc/kvm/book3s_hv.c | 63 +++++++++++++++++++++++++++++
arch/powerpc/kvm/powerpc.c | 21 ++++++++++
include/uapi/linux/kvm.h | 4 ++
6 files changed, 130 insertions(+)
base-commit: 1d5dcaa3bd65f2e8c9baa14a393d3a2dc5db7524
--
2.50.1 (Apple Git-155)
* Re: [PATCH] powerpc/64s: Fix the vector number in comments for h_facility_unavailable
From: Vaibhav Jain @ 2026-05-13 9:05 UTC (permalink / raw)
To: Gautam Menghani, maddy, mpe, npiggin, chleroy
Cc: Gautam Menghani, linuxppc-dev, linux-kernel
In-Reply-To: <20260511080412.50722-1-Gautam.Menghani@ibm.com>
Hey Gautam,
Thanks for the patch. Since this patch doesn't make any functional or
code change, can you please add a 'trivial' suffix to the patch title, like
[1], or some other suffix indicating it's a non-functional change? That
way maintainers can easily pull the patch without worrying much about a
regression.
[1]
https://git.kernel.org/powerpc/c/d2827e5e2e0f0941a651f4b1ca5e9b778c4b5293
Gautam Menghani <Gautam.Menghani@ibm.com> writes:
> From: Gautam Menghani <gautam@linux.ibm.com>
>
> The comments explaining the h_facility_unavailable interrupt have mentioned
> the vector number as 0xf60 instead of 0xf80. Fix this typo.
>
> Signed-off-by: Gautam Menghani <gautam@linux.ibm.com>
> ---
> arch/powerpc/kernel/exceptions-64s.S | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index b7229430ca94..2696fbbca3b6 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -2498,7 +2498,7 @@ EXC_COMMON_BEGIN(facility_unavailable_common)
>
>
> /**
> - * Interrupt 0xf60 - Hypervisor Facility Unavailable Interrupt.
> + * Interrupt 0xf80 - Hypervisor Facility Unavailable Interrupt.
> * This is a synchronous interrupt in response to
> * executing an instruction without access to the facility that can only
> * be resolved in HV mode (e.g., HFSCR).
> --
> 2.53.0
>
>
--
Cheers
~ Vaibhav
* Re: [PATCH 8/8] powerpc/mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE
From: Mike Rapoport @ 2026-05-13 8:41 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: David S. Miller, Andreas Larsson, Andrew Morton,
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, sparclinux, linux-kernel,
linux-mm, linux-s390, linuxppc-dev
In-Reply-To: <20260511-bootmem_info_prep-v1-8-3fb0be6fc688@kernel.org>
On Mon, May 11, 2026 at 04:05:36PM +0200, David Hildenbrand (Arm) wrote:
> register_page_bootmem_info_node() essentially only calls
> register_page_bootmem_memmap(). However, on powerpc that function is a
> nop. So there is no benefit in using CONFIG_HAVE_BOOTMEM_INFO_NODE
> anymore; let's just drop it.
>
> We can stop including bootmem_info.h.
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> arch/powerpc/mm/init_64.c | 8 --------
> mm/Kconfig | 2 +-
> 2 files changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index b6f3ae03ca9e..64f0df5bb5cd 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -41,7 +41,6 @@
> #include <linux/libfdt.h>
> #include <linux/memremap.h>
> #include <linux/memory.h>
> -#include <linux/bootmem_info.h>
>
> #include <asm/pgalloc.h>
> #include <asm/page.h>
> @@ -388,13 +387,6 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
>
> #endif
>
> -#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
> -void register_page_bootmem_memmap(unsigned long section_nr,
> - struct page *start_page, unsigned long size)
> -{
> -}
> -#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
> -
> #endif /* CONFIG_SPARSEMEM_VMEMMAP */
>
> #ifdef CONFIG_PPC_BOOK3S_64
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e221fa1dc54d..97b079372325 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -537,7 +537,7 @@ endchoice
>
> config MEMORY_HOTREMOVE
> bool "Allow for memory hot remove"
> - select HAVE_BOOTMEM_INFO_NODE if (X86_64 || PPC64)
> + select HAVE_BOOTMEM_INFO_NODE if X86_64
> depends on MEMORY_HOTPLUG
> select MIGRATION
>
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply
* Re: [PATCH 7/8] s390/mm: use free_reserved_page() in vmem_free_pages()
From: Mike Rapoport @ 2026-05-13 8:40 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: David S. Miller, Andreas Larsson, Andrew Morton,
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, sparclinux, linux-kernel,
linux-mm, linux-s390, linuxppc-dev
In-Reply-To: <20260511-bootmem_info_prep-v1-7-3fb0be6fc688@kernel.org>
On Mon, May 11, 2026 at 04:05:35PM +0200, David Hildenbrand (Arm) wrote:
> We never select CONFIG_HAVE_BOOTMEM_INFO_NODE on s390. Therefore,
> free_bootmem_page() nowadays always translates to free_reserved_page().
>
> Let's use free_reserved_page() to replace the free_bootmem_page() loop.
> We can stop including bootmem_info.h.
>
> Likely, vmemmap freeing code could be factored out into the core in the
> future.
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> arch/s390/mm/vmem.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
> index eeadff45e0e1..d8b2a60e0c33 100644
> --- a/arch/s390/mm/vmem.c
> +++ b/arch/s390/mm/vmem.c
> @@ -4,7 +4,6 @@
> */
>
> #include <linux/memory_hotplug.h>
> -#include <linux/bootmem_info.h>
> #include <linux/cpufeature.h>
> #include <linux/memblock.h>
> #include <linux/pfn.h>
> @@ -51,7 +50,7 @@ static void vmem_free_pages(unsigned long addr, int order, struct vmem_altmap *a
> if (PageReserved(page)) {
> /* allocated from memblock */
> while (nr_pages--)
> - free_bootmem_page(page++);
> + free_reserved_page(page++);
> } else {
> free_pages(addr, order);
> }
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply
* Re: [PATCH 6/8] mm/bootmem_info: stop marking mem_section_usage as MIX_SECTION_INFO
From: Mike Rapoport @ 2026-05-13 8:38 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: David S. Miller, Andreas Larsson, Andrew Morton,
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, sparclinux, linux-kernel,
linux-mm, linux-s390, linuxppc-dev
In-Reply-To: <20260511-bootmem_info_prep-v1-6-3fb0be6fc688@kernel.org>
On Mon, May 11, 2026 at 04:05:34PM +0200, David Hildenbrand (Arm) wrote:
> We never free the ms->usage data for boot memory sections (see
> section_deactivate()). And to tell whether ms->usage was allocated
> from memblock, we simply look at PG_reserved.
>
> Consequently, there is no need to mark ms->usage as MIX_SECTION_INFO.
> Let's just stop doing that.
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> mm/bootmem_info.c | 12 +-----------
> 1 file changed, 1 insertion(+), 11 deletions(-)
>
> diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
> index cce1d560f094..0fa78db7fbc0 100644
> --- a/mm/bootmem_info.c
> +++ b/mm/bootmem_info.c
> @@ -38,10 +38,8 @@ void put_page_bootmem(struct page *page)
>
> static void __init register_page_bootmem_info_section(unsigned long start_pfn)
> {
> - unsigned long mapsize, section_nr, i;
> + unsigned long section_nr;
> struct mem_section *ms;
> - struct mem_section_usage *usage;
> - struct page *page;
>
> start_pfn = SECTION_ALIGN_DOWN(start_pfn);
> section_nr = pfn_to_section_nr(start_pfn);
> @@ -50,14 +48,6 @@ static void __init register_page_bootmem_info_section(unsigned long start_pfn)
> if (!preinited_vmemmap_section(ms))
> register_page_bootmem_memmap(section_nr, pfn_to_page(start_pfn),
> PAGES_PER_SECTION);
> -
> - usage = ms->usage;
> - page = virt_to_page(usage);
> -
> - mapsize = PAGE_ALIGN(mem_section_usage_size()) >> PAGE_SHIFT;
> -
> - for (i = 0; i < mapsize; i++, page++)
> - get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
> }
>
> void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply
* Re: [PATCH 5/8] mm/bootmem_info: stop marking the pgdat as NODE_INFO
From: Mike Rapoport @ 2026-05-13 8:35 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: David S. Miller, Andreas Larsson, Andrew Morton,
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, sparclinux, linux-kernel,
linux-mm, linux-s390, linuxppc-dev
In-Reply-To: <20260511-bootmem_info_prep-v1-5-3fb0be6fc688@kernel.org>
On Mon, May 11, 2026 at 04:05:33PM +0200, David Hildenbrand (Arm) wrote:
> We removed the last user of NODE_INFO in commit 119c31caa59e ("mm/sparse:
> remove !CONFIG_SPARSEMEM_VMEMMAP leftovers for CONFIG_MEMORY_HOTPLUG").
>
> But it really was never used besides for safety checks ever since it was
> introduced in commit 04753278769f ("memory hotplug: register section/node
> id to free"), where we had the comment:
>
> 5) The node information like pgdat has similar issues. But, this
> will be able to be solved too by this.
> (Not implemented yet, but, remembering node id in the pages.)
>
> Of course, that never happened, and we are not planning on freeing the
> node data (pgdat/pglist_data), during memory hotunplug.
>
> So let's just stop marking the pgdat as NODE_INFO.
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> mm/bootmem_info.c | 9 +--------
> 1 file changed, 1 insertion(+), 8 deletions(-)
>
> diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
> index 74c1116626c8..cce1d560f094 100644
> --- a/mm/bootmem_info.c
> +++ b/mm/bootmem_info.c
> @@ -62,15 +62,8 @@ static void __init register_page_bootmem_info_section(unsigned long start_pfn)
>
> void __init register_page_bootmem_info_node(struct pglist_data *pgdat)
> {
> - unsigned long i, pfn, end_pfn, nr_pages;
> + unsigned long pfn, end_pfn;
> int node = pgdat->node_id;
> - struct page *page;
> -
> - nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
> - page = virt_to_page(pgdat);
> -
> - for (i = 0; i < nr_pages; i++, page++)
> - get_page_bootmem(node, page, NODE_INFO);
>
> pfn = pgdat->node_start_pfn;
> end_pfn = pgdat_end_pfn(pgdat);
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply
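The loop removed in the patch above sizes its iteration with PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT, i.e. the number of whole pages needed to cover the structure. A minimal user-space sketch of that arithmetic follows. The SKETCH_* macros and both function names are hypothetical stand-ins, and the 4 KiB page size is an assumption (the real PAGE_SHIFT depends on architecture and configuration).

```c
/* Assumption: 4 KiB pages. The kernel's PAGE_SHIFT is configuration dependent. */
#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Round size up to the next page boundary, in the spirit of PAGE_ALIGN(). */
static unsigned long sketch_page_align(unsigned long size)
{
	return (size + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/* Number of whole pages needed to cover size bytes. */
static unsigned long sketch_pages_spanned(unsigned long size)
{
	return sketch_page_align(size) >> SKETCH_PAGE_SHIFT;
}
```

Under the 4 KiB assumption, a 4096-byte object spans one page and a 4097-byte object spans two.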
* Re: [PATCH 4/8] mm/bootmem_info: remove call to kmemleak_free_part_phys()
From: Mike Rapoport @ 2026-05-13 8:31 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: David S. Miller, Andreas Larsson, Andrew Morton,
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, sparclinux, linux-kernel,
linux-mm, linux-s390, linuxppc-dev
In-Reply-To: <20260511-bootmem_info_prep-v1-4-3fb0be6fc688@kernel.org>
On Mon, May 11, 2026 at 04:05:32PM +0200, David Hildenbrand (Arm) wrote:
> The call to kmemleak_free_part_phys() was added in 2022 in
> commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in
> put_page_bootmem").
>
> In 2025, commit b2aad24b5333 ("mm/memmap: prevent double scanning of memmap
> by kmemleak") started to use MEMBLOCK_ALLOC_NOLEAKTRACE when allocating
> the memmap to skip the kmemleak_alloc_phys() in the buddy.
>
> So remove the call to kmemleak_free_part_phys(). If it were still
> required for other purposes, either free_reserved_page() or selected
> users should take care of it.
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/linux/bootmem_info.h | 1 -
> mm/bootmem_info.c | 1 -
> 2 files changed, 2 deletions(-)
>
> diff --git a/include/linux/bootmem_info.h b/include/linux/bootmem_info.h
> index 492ceeb1cdf8..f724340755e5 100644
> --- a/include/linux/bootmem_info.h
> +++ b/include/linux/bootmem_info.h
> @@ -82,7 +82,6 @@ static inline void get_page_bootmem(unsigned long info, struct page *page,
>
> static inline void free_bootmem_page(struct page *page)
> {
> - kmemleak_free_part_phys(PFN_PHYS(page_to_pfn(page)), PAGE_SIZE);
> free_reserved_page(page);
> }
> #endif
> diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
> index 6e2aaab3dca9..74c1116626c8 100644
> --- a/mm/bootmem_info.c
> +++ b/mm/bootmem_info.c
> @@ -32,7 +32,6 @@ void put_page_bootmem(struct page *page)
>
> if (page_ref_dec_return(page) == 1) {
> set_page_private(page, 0);
> - kmemleak_free_part_phys(PFN_PHYS(page_to_pfn(page)), PAGE_SIZE);
> free_reserved_page(page);
> }
> }
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply
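The put_page_bootmem() hunk above frees the page when page_ref_dec_return() returns 1 rather than 0. A minimal user-space sketch of that get/put pattern follows. It assumes the page starts with a baseline reference count of 1, which the "== 1" comparison suggests; struct sketch_page, sketch_get() and sketch_put() are hypothetical names, and the freed flag only marks where the kernel would call free_reserved_page().

```c
#include <stdbool.h>

struct sketch_page {
	int refcount;          /* stands in for the struct page reference count */
	unsigned long private; /* stands in for page->private */
	bool freed;            /* set where the kernel would call free_reserved_page() */
};

/* Like get_page_bootmem(): record the private value and take a reference. */
static void sketch_get(struct sketch_page *page, unsigned long private)
{
	page->private = private;
	page->refcount++;
}

/* Like put_page_bootmem(): drop a reference, release at the baseline count. */
static void sketch_put(struct sketch_page *page)
{
	if (--page->refcount == 1) {
		page->private = 0;
		page->freed = true;
	}
}
```

With a page initialised to a count of 1, two sketch_get() calls followed by two sketch_put() calls release it only on the second put.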
* Re: [PATCH 3/8] mm/bootmem_info: stop using PG_private
From: Mike Rapoport @ 2026-05-13 8:29 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: David S. Miller, Andreas Larsson, Andrew Morton,
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, sparclinux, linux-kernel,
linux-mm, linux-s390, linuxppc-dev
In-Reply-To: <20260511-bootmem_info_prep-v1-3-3fb0be6fc688@kernel.org>
On Mon, May 11, 2026 at 04:05:31PM +0200, David Hildenbrand (Arm) wrote:
> Nobody checks PG_private for these pages, and we can happily use
> set_page_private() without setting PG_private. So let's just stop
> setting/clearing PG_private.
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> mm/bootmem_info.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
> index a0a1ecdec8d0..6e2aaab3dca9 100644
> --- a/mm/bootmem_info.c
> +++ b/mm/bootmem_info.c
> @@ -19,7 +19,6 @@ void get_page_bootmem(unsigned long info, struct page *page,
> {
> BUG_ON(type > 0xf);
> BUG_ON(info > (ULONG_MAX >> 4));
> - SetPagePrivate(page);
> set_page_private(page, info << 4 | type);
> page_ref_inc(page);
> }
> @@ -32,7 +31,6 @@ void put_page_bootmem(struct page *page)
> type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
>
> if (page_ref_dec_return(page) == 1) {
> - ClearPagePrivate(page);
> set_page_private(page, 0);
> kmemleak_free_part_phys(PFN_PHYS(page_to_pfn(page)), PAGE_SIZE);
> free_reserved_page(page);
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply
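The get_page_bootmem() hunk above stores info << 4 | type through set_page_private(), which packs a 4-bit type into the low bits of one word and the info into the remaining bits. A minimal user-space sketch of that packing follows. pack_private(), unpack_type() and unpack_info() are hypothetical names for illustration, not kernel helpers, and the asserts stand in for the two BUG_ON() checks.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical constants: the low 4 bits carry the type. */
#define SKETCH_TYPE_BITS 4
#define SKETCH_TYPE_MASK ((1UL << SKETCH_TYPE_BITS) - 1)

/* Pack info and type into one word, as info << 4 | type. */
static unsigned long pack_private(unsigned long info, unsigned long type)
{
	assert(type <= SKETCH_TYPE_MASK);                /* mirrors BUG_ON(type > 0xf) */
	assert(info <= (ULONG_MAX >> SKETCH_TYPE_BITS)); /* mirrors BUG_ON(info > (ULONG_MAX >> 4)) */
	return info << SKETCH_TYPE_BITS | type;
}

/* Recover the type from the low 4 bits. */
static unsigned long unpack_type(unsigned long priv)
{
	return priv & SKETCH_TYPE_MASK;
}

/* Recover the info from the remaining high bits. */
static unsigned long unpack_info(unsigned long priv)
{
	return priv >> SKETCH_TYPE_BITS;
}
```

For example, pack_private(3, 2) yields 50 (3 << 4 is 48, plus 2), from which unpack_type() recovers 2 and unpack_info() recovers 3. None of this depends on PG_private being set, which is consistent with the patch dropping SetPagePrivate().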
* Re: [PATCH 2/8] mm/bootmem_info: drop initialization of page->lru
From: Mike Rapoport @ 2026-05-13 8:27 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: David S. Miller, Andreas Larsson, Andrew Morton,
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, sparclinux, linux-kernel,
linux-mm, linux-s390, linuxppc-dev
In-Reply-To: <20260511-bootmem_info_prep-v1-2-3fb0be6fc688@kernel.org>
On Mon, May 11, 2026 at 04:05:30PM +0200, David Hildenbrand (Arm) wrote:
> In the past, we used to store the type in page->lru.next, introduced by
> commit 5f24ce5fd34c ("thp: remove PG_buddy"). The location changed over
> the years; ever since commit 0386aaa6e9c8 ("bootmem: stop using
> page->index"), we store it alongside the info in page->private.
>
> Consequently, there is no need to reset page->lru anymore.
>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> mm/bootmem_info.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/mm/bootmem_info.c b/mm/bootmem_info.c
> index 3d7675a3ae04..a0a1ecdec8d0 100644
> --- a/mm/bootmem_info.c
> +++ b/mm/bootmem_info.c
> @@ -34,7 +34,6 @@ void put_page_bootmem(struct page *page)
> if (page_ref_dec_return(page) == 1) {
> ClearPagePrivate(page);
> set_page_private(page, 0);
> - INIT_LIST_HEAD(&page->lru);
> kmemleak_free_part_phys(PFN_PHYS(page_to_pfn(page)), PAGE_SIZE);
> free_reserved_page(page);
> }
>
> --
> 2.43.0
>
--
Sincerely yours,
Mike.
^ permalink raw reply