* [PATCH v2 58/69] mm/hugetlb: Make HVO optimizable checks depend on generic logic
From: Muchun Song @ 2026-05-13 13:20 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
Make hugetlb_vmemmap_optimizable() reuse the generic
order_vmemmap_optimizable() logic, and switch the HugeTLB call sites that
only need a yes/no answer from hugetlb_vmemmap_optimizable_size() to the
boolean helper.
This keeps HugeTLB-specific optimizable checks aligned with the generic
vmemmap optimization rules and avoids open-coding the size-based test.
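A minimal user-space sketch of the resulting relationship, for
illustration only. The hvo_ names, the 4 KiB page size, the 64-byte
struct page, and the assumption that one vmemmap page is kept per
optimized folio are not taken from this patch. hvo_order_optimizable()
is only a stand-in for order_vmemmap_optimizable(), whose real
definition lives elsewhere in the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration; the kernel derives these per architecture. */
#define HVO_PAGE_SIZE           4096UL
#define HVO_STRUCT_PAGE_SIZE    64UL
/* Assumption: one page of vmemmap is kept for each optimized folio. */
#define HVO_OPTIMIZED_SIZE      HVO_PAGE_SIZE

/* Hypothetical, simplified hstate: only the order matters here. */
struct hvo_hstate {
	unsigned int order;
};

static unsigned long hvo_vmemmap_size(const struct hvo_hstate *h)
{
	assert(h->order < 40);
	return (1UL << h->order) * HVO_STRUCT_PAGE_SIZE;
}

/* Stand-in for the generic order-based rule (an assumption, not the real one). */
static bool hvo_order_optimizable(unsigned int order)
{
	return (1UL << order) * HVO_STRUCT_PAGE_SIZE > HVO_OPTIMIZED_SIZE;
}

/* The boolean check is primary: config gate first, then the generic rule. */
static bool hvo_optimizable(const struct hvo_hstate *h, bool hvo_enabled)
{
	if (!hvo_enabled)
		return false;
	return hvo_order_optimizable(h->order);
}

/* The size is derived from the boolean check, not the other way around. */
static unsigned long hvo_optimizable_size(const struct hvo_hstate *h, bool hvo_enabled)
{
	if (!hvo_optimizable(h, hvo_enabled))
		return 0;
	return hvo_vmemmap_size(h) - HVO_OPTIMIZED_SIZE;
}
```

Under these assumed values, an order-9 page has 32 KiB of vmemmap, of
which 28 KiB is optimizable, while an order-6 page has exactly one page
of vmemmap and nothing to optimize.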
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 4 ++--
mm/hugetlb_vmemmap.h | 43 ++++++++++++++++++++---------------------
3 files changed, 24 insertions(+), 25 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 82dbb9ebead8..2383adb22ce1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -778,7 +778,7 @@ static inline unsigned long huge_page_mask(struct hstate *h)
return h->mask;
}
-static inline unsigned int huge_page_order(struct hstate *h)
+static inline unsigned int huge_page_order(const struct hstate *h)
{
return h->order;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 54ef7d12c585..bd136fc6aec0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3351,7 +3351,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
&node_states[N_MEMORY], NULL);
if (!folio && !list_empty(&folio_list) &&
- hugetlb_vmemmap_optimizable_size(h)) {
+ hugetlb_vmemmap_optimizable(h)) {
prep_and_add_allocated_folios(h, &folio_list);
INIT_LIST_HEAD(&folio_list);
folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, nid,
@@ -3420,7 +3420,7 @@ static void __init hugetlb_pages_alloc_boot_node(unsigned long start, unsigned l
for (i = 0; i < num; ++i) {
struct folio *folio;
- if (hugetlb_vmemmap_optimizable_size(h) &&
+ if (hugetlb_vmemmap_optimizable(h) &&
(si_mem_available() == 0) && !list_empty(&folio_list)) {
prep_and_add_allocated_folios(h, &folio_list);
INIT_LIST_HEAD(&folio_list);
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index dfd48be6b231..1765f8274220 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -18,22 +18,6 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m);
-
-static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
-{
- return pages_per_huge_page(h) * sizeof(struct page);
-}
-
-/*
- * Return how many vmemmap size associated with a HugeTLB page that can be
- * optimized and can be freed to the buddy allocator.
- */
-static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
-{
- int size = hugetlb_vmemmap_size(h) - OPTIMIZED_FOLIO_VMEMMAP_SIZE;
-
- return size > 0 ? size : 0;
-}
#else
static inline int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
{
@@ -56,11 +40,6 @@ static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list
{
}
-static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
-{
- return 0;
-}
-
static inline void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_page *m)
{
}
@@ -68,6 +47,26 @@ static inline void hugetlb_vmemmap_optimize_bootmem_page(struct huge_bootmem_pag
static inline bool hugetlb_vmemmap_optimizable(const struct hstate *h)
{
- return hugetlb_vmemmap_optimizable_size(h) != 0;
+ if (!IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP))
+ return false;
+
+ return order_vmemmap_optimizable(huge_page_order(h));
+}
+
+static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
+{
+ return pages_per_huge_page(h) * sizeof(struct page);
+}
+
+/*
+ * Return the size of the vmemmap area associated with a HugeTLB page
+ * that can be optimized.
+ */
+static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
+{
+ if (!hugetlb_vmemmap_optimizable(h))
+ return 0;
+
+ return hugetlb_vmemmap_size(h) - OPTIMIZED_FOLIO_VMEMMAP_SIZE;
}
#endif /* _LINUX_HUGETLB_VMEMMAP_H */
--
2.54.0
* [PATCH v2 59/69] mm/sparse-vmemmap: Localize init_compound_tail()
From: Muchun Song @ 2026-05-13 13:20 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
init_compound_tail() is only used in mm/sparse-vmemmap.c, so there is no
need to keep it in mm/internal.h.
The helper is only used for SPARSEMEM_VMEMMAP_OPTIMIZATION, where passing
NULL as the compound head is valid. Keeping it visible outside makes that
usage look more generally applicable than it really is, which increases
the chance of misuse.
Move it into mm/sparse-vmemmap.c so the helper stays tied to the only
context where its NULL head argument is valid.
No functional change intended.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 9 ---------
mm/sparse-vmemmap.c | 12 +++++++++++-
2 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index afdae79640b5..aff7cebb1da4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -907,15 +907,6 @@ static inline void prep_compound_tail(struct page *tail,
set_page_private(tail, 0);
}
-static inline void init_compound_tail(struct page *tail,
- const struct page *head, unsigned int order, struct zone *zone)
-{
- atomic_set(&tail->_mapcount, -1);
- set_page_node(tail, zone_to_nid(zone));
- set_page_zone(tail, zone_idx(zone));
- prep_compound_tail(tail, head, order);
-}
-
void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
extern bool free_pages_prepare(struct page *page, unsigned int order);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 667424aadd6b..38777e4952e1 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -265,6 +265,16 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
return 0;
}
+static void init_compound_tail(struct page *page, unsigned int order, struct zone *zone)
+{
+ BUILD_BUG_ON(!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP_OPTIMIZATION));
+
+ atomic_set(&page->_mapcount, -1);
+ set_page_node(page, zone_to_nid(zone));
+ set_page_zone(page, zone_idx(zone));
+ prep_compound_tail(page, NULL, order);
+}
+
struct page __ref *vmemmap_shared_tail_page(unsigned int order, struct zone *zone)
{
void *addr;
@@ -286,7 +296,7 @@ struct page __ref *vmemmap_shared_tail_page(unsigned int order, struct zone *zon
page = (struct page *)addr + i;
if (zone_is_zone_device(zone))
__SetPageReserved(page);
- init_compound_tail(page, NULL, order, zone);
+ init_compound_tail(page, order, zone);
}
page = virt_to_page(addr);
--
2.54.0
* [PATCH v2 60/69] mm/mm_init: Check zone consistency on optimized vmemmap sections
From: Muchun Song @ 2026-05-13 13:20 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
For vmemmap-optimized sections, the shared tail struct pages are reused
across compound pages and should already carry the expected zone and
node.
Warn in __init_single_page() if such a shared tail page is seen with a
different zone or node, which would indicate inconsistent initialization.
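The shape of the check can be sketched in user space as follows. This
is a simplified model, not the kernel code: struct zc_page, the 8-bit
zone field, and the value 64 for the number of struct pages per vmemmap
page are assumptions, and zc_zone_id() only loosely mirrors what
page_zone_id() encodes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified struct page: only the fields the check reads. */
struct zc_page {
	unsigned int zone;
	unsigned int nid;
};

/* Assumption: 64 struct pages fit in one vmemmap page (4 KiB / 64 bytes). */
#define ZC_OPTIMIZED_NR_STRUCT_PAGES 64

/* Pack node and zone into one comparable value. */
static unsigned int zc_zone_id(const struct zc_page *page)
{
	assert(page->zone < 256);
	return (page->nid << 8) | page->zone;
}

/*
 * Return true when the check would warn: the section is optimizable and the
 * struct page one vmemmap page further on (a shared tail in the real layout)
 * disagrees with the page being initialized about zone or node.
 */
static bool zc_zone_mismatch(const struct zc_page *page, bool section_optimizable)
{
	if (!section_optimizable)
		return false;

	return zc_zone_id(page + ZC_OPTIMIZED_NR_STRUCT_PAGES) != zc_zone_id(page);
}
```

The early return keeps the model from looking past the page when the
section is not optimized, which matches the order of the two conditions
in the warning.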
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/mm_init.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4ea39392993b..95422e92ede8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -605,6 +605,9 @@ void __meminit __init_single_page(struct page *page, unsigned long pfn,
if (!is_highmem_idx(zone))
set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
+ VM_WARN_ON_ONCE(order_vmemmap_optimizable(pfn_to_section_order(pfn)) &&
+ page_zone_id(page + OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES) !=
+ page_zone_id(page));
}
#ifdef CONFIG_NUMA
--
2.54.0
* [PATCH v2 61/69] mm/hugetlb: Drop boot-time HVO handling for gigantic folios
From: Muchun Song @ 2026-05-13 13:20 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
HugeTLB HVO is currently supported on x86-64, riscv64, and LoongArch.
On x86-64 and riscv64, gigantic HugeTLB pages are larger than the
section size, so the existing section-based vmemmap optimization
infrastructure is already sufficient to cover the whole folio. On
LoongArch, HugeTLB HVO is supported without gigantic HugeTLB pages.
Therefore, boot-time HugeTLB HVO folios can rely on the section-based
vmemmap optimization infrastructure directly, without the extra bulk
optimization and fallback handling.
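The counting rule that the boot-time path now relies on,
vmemmap_nr_struct_pages(), can be modeled in user space roughly as
below. The nsp_ names are hypothetical, the section shift of 15 and the
64 struct pages per vmemmap page are assumed x86-64-like values, the
order is passed in directly instead of being looked up from the section,
and nsp_order_optimizable() is only a stand-in for
order_vmemmap_optimizable().

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: 128 MiB sections with 4 KiB pages, 64-byte struct page. */
#define NSP_PFN_SECTION_SHIFT       15
#define NSP_PAGES_PER_SECTION       (1UL << NSP_PFN_SECTION_SHIFT)
#define NSP_OPTIMIZED_NR_PAGES      64UL

#define NSP_IS_ALIGNED(x, a)        (((x) & ((a) - 1)) == 0)

/* Stand-in for the generic rule (an assumption, not the real definition). */
static bool nsp_order_optimizable(unsigned int order)
{
	return (1UL << order) > NSP_OPTIMIZED_NR_PAGES;
}

/* Number of struct pages that really need initializing for [pfn, pfn + nr_pages). */
static unsigned long nsp_nr_struct_pages(unsigned int order, unsigned long pfn,
					 unsigned long nr_pages)
{
	const unsigned long pages_per_compound = 1UL << order;

	/* Not optimized: every page has its own struct page. */
	if (!nsp_order_optimizable(order))
		return nr_pages;

	/* Several compound pages per section: a fixed count for each of them. */
	if (order < NSP_PFN_SECTION_SHIFT) {
		assert(NSP_IS_ALIGNED(pfn | nr_pages, pages_per_compound));
		return NSP_OPTIMIZED_NR_PAGES * nr_pages / pages_per_compound;
	}

	/* Compound page spans whole sections: only its first section counts. */
	assert(NSP_IS_ALIGNED(pfn | nr_pages, NSP_PAGES_PER_SECTION));
	assert((pfn % pages_per_compound) + nr_pages <= pages_per_compound);

	return NSP_IS_ALIGNED(pfn, pages_per_compound) ? NSP_OPTIMIZED_NR_PAGES : 0;
}
```

With these assumed values, a whole order-18 (1 GiB) folio needs 64
struct pages, a later section inside the same folio needs none, and a
section filled with order-9 pages needs 64 * 32768 / 512 = 4096.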
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 25 ++++++-------------------
mm/hugetlb_vmemmap.c | 21 ++++++---------------
mm/internal.h | 25 +++++++++++++++++++++++--
mm/sparse.c | 23 -----------------------
4 files changed, 35 insertions(+), 59 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bd136fc6aec0..3cb8fffb9e3e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3201,21 +3201,7 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
unsigned long flags;
struct folio *folio, *tmp_f;
- /* Send list for bulk vmemmap optimization processing */
- hugetlb_vmemmap_optimize_folios(h, folio_list);
-
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
- if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
- /*
- * If HVO fails, initialize all tail struct pages
- * We do not worry about potential long lock hold
- * time as this is early in boot and there should
- * be no contention.
- */
- hugetlb_folio_init_tail_vmemmap(folio, h,
- OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES,
- pages_per_huge_page(h));
- }
hugetlb_bootmem_init_migratetype(folio, h);
/* Subdivide locks to achieve better parallel performance */
spin_lock_irqsave(&hugetlb_lock, flags);
@@ -3238,6 +3224,8 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
struct page *page = virt_to_page(m);
struct folio *folio = (void *)page;
+ unsigned long pfn = PHYS_PFN(__pa(m));
+ unsigned long nr_pages = pages_per_huge_page(m->hstate);
h = m->hstate;
/*
@@ -3251,13 +3239,12 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(folio_ref_count(folio) != 1);
- hugetlb_folio_init_vmemmap(folio, h,
- OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES);
+ hugetlb_folio_init_vmemmap(folio, h, vmemmap_nr_struct_pages(pfn, nr_pages));
init_new_hugetlb_folio(folio);
- if (order_vmemmap_optimizable(pfn_to_section_order(folio_pfn(folio)))) {
+ if (order_vmemmap_optimizable(pfn_to_section_order(pfn))) {
folio_set_hugetlb_vmemmap_optimized(folio);
- section_set_order_range(folio_pfn(folio), folio_nr_pages(folio), 0);
+ section_set_order_range(pfn, nr_pages, 0);
}
if (hugetlb_early_cma(h))
@@ -3274,7 +3261,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
* (via hugetlb_bootmem_init_migratetype), so skip it here.
*/
if (!folio_test_hugetlb_cma(folio))
- adjust_managed_page_count(page, pages_per_huge_page(h));
+ adjust_managed_page_count(page, nr_pages);
cond_resched();
}
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 1305bee1195a..d20d2ce13906 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -599,23 +599,17 @@ static int hugetlb_vmemmap_split_folio(const struct hstate *h, struct folio *fol
void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
{
struct folio *folio;
- unsigned long nr_to_optimize = 0;
LIST_HEAD(vmemmap_pages);
unsigned long flags = VMEMMAP_REMAP_NO_TLB_FLUSH;
- list_for_each_entry(folio, folio_list, lru) {
- int ret;
-
- /*
- * Bootmem gigantic folios may already be marked optimized when
- * their vmemmap layout was prepared earlier, so skip them here.
- */
- if (folio_test_hugetlb_vmemmap_optimized(folio))
- continue;
+ if (!vmemmap_should_optimize(h))
+ return;
- nr_to_optimize++;
+ if (list_empty(folio_list))
+ return;
- ret = hugetlb_vmemmap_split_folio(h, folio);
+ list_for_each_entry(folio, folio_list, lru) {
+ int ret = hugetlb_vmemmap_split_folio(h, folio);
/*
* Splitting the PMD requires allocating a page, thus let's fail
@@ -627,9 +621,6 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
break;
}
- if (!nr_to_optimize)
- return;
-
flush_tlb_all();
list_for_each_entry(folio, folio_list, lru) {
diff --git a/mm/internal.h b/mm/internal.h
index aff7cebb1da4..416afdf7b2ec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -949,6 +949,29 @@ void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
unsigned long, enum meminit_context, struct vmem_altmap *, int,
bool);
+static inline int vmemmap_nr_struct_pages(unsigned long pfn, unsigned long nr_pages)
+{
+ const unsigned int order = pfn_to_section_order(pfn);
+ const unsigned long pages_per_compound = 1UL << order;
+
+ if (!order_vmemmap_optimizable(order))
+ return nr_pages;
+
+ if (order < PFN_SECTION_SHIFT) {
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
+ return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES * nr_pages / pages_per_compound;
+ }
+
+ VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
+ /* Ensure the requested range does not cross a compound page boundary. */
+ VM_WARN_ON_ONCE((pfn % pages_per_compound) + nr_pages > pages_per_compound);
+
+ if (IS_ALIGNED(pfn, pages_per_compound))
+ return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES;
+
+ return 0;
+}
+
/*
* mm/sparse.c
*/
@@ -988,8 +1011,6 @@ static inline void __section_mark_present(struct mem_section *ms,
ms->section_mem_map |= SECTION_MARKED_PRESENT;
}
-int vmemmap_nr_struct_pages(unsigned long pfn, unsigned long nr_pages);
-
static inline int section_nr_vmemmap_pages(unsigned long pfn, unsigned long nr_pages)
{
VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SUBSECTION));
diff --git a/mm/sparse.c b/mm/sparse.c
index 598da1651e49..21a0eb636fea 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -236,29 +236,6 @@ void __weak __meminit vmemmap_populate_print_last(void)
{
}
-int __meminit vmemmap_nr_struct_pages(unsigned long pfn, unsigned long nr_pages)
-{
- const unsigned int order = pfn_to_section_order(pfn);
- const unsigned long pages_per_compound = 1UL << order;
-
- if (!order_vmemmap_optimizable(order))
- return nr_pages;
-
- if (order < PFN_SECTION_SHIFT) {
- VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, pages_per_compound));
- return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES * nr_pages / pages_per_compound;
- }
-
- VM_WARN_ON_ONCE(!IS_ALIGNED(pfn | nr_pages, PAGES_PER_SECTION));
- /* Ensure the requested range does not cross a compound page boundary. */
- VM_WARN_ON_ONCE((pfn % pages_per_compound) + nr_pages > pages_per_compound);
-
- if (IS_ALIGNED(pfn, pages_per_compound))
- return OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES;
-
- return 0;
-}
-
/*
* Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end)
* And number of present sections in this node is map_count.
--
2.54.0
* [PATCH v2 62/69] mm/hugetlb: Simplify hugetlb_folio_init_vmemmap()
From: Muchun Song @ 2026-05-13 13:20 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
hugetlb_folio_init_vmemmap() currently splits the open-coded
compound-page setup across two helpers even though the tail-page
initialization is only used here.
Fold the tail-page initialization into the main helper and pass
the precomputed page metadata in from the caller. This makes
the initialization flow easier to follow.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 50 +++++++++++++++++---------------------------------
1 file changed, 17 insertions(+), 33 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3cb8fffb9e3e..950b0fa3bc27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3126,33 +3126,8 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
return true;
}
-/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
-static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
- struct hstate *h,
- unsigned long start_page_number,
- unsigned long end_page_number)
-{
- enum zone_type zone = folio_zonenum(folio);
- int nid = folio_nid(folio);
- struct page *page = folio_page(folio, start_page_number);
- unsigned long head_pfn = folio_pfn(folio);
- unsigned long pfn, end_pfn = head_pfn + end_page_number;
- unsigned int order = huge_page_order(h);
-
- /*
- * As we marked all tail pages with memblock_reserved_mark_noinit(),
- * we must initialize them ourselves here.
- */
- for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) {
- __init_single_page(page, pfn, zone, nid);
- prep_compound_tail(page, &folio->page, order);
- set_page_count(page, 0);
- }
-}
-
-static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
- struct hstate *h,
- unsigned long nr_pages)
+static void __init hugetlb_folio_init_vmemmap(struct page *head, unsigned long pfn,
+ enum zone_type zone, int nid, unsigned int order, unsigned int nr_pages)
{
int ret;
@@ -3161,12 +3136,19 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
* walking pages twice by initializing/preparing+freezing them in the
* same go.
*/
- __folio_clear_reserved(folio);
- __folio_set_head(folio);
- ret = folio_ref_freeze(folio, 1);
+ __ClearPageReserved(head);
+ ret = page_ref_freeze(head, 1);
VM_BUG_ON(!ret);
- hugetlb_folio_init_tail_vmemmap(folio, h, 1, nr_pages);
- prep_compound_head(&folio->page, huge_page_order(h));
+
+ __SetPageHead(head);
+ for (int i = 1; i < nr_pages; i++) {
+ struct page *page = head + i;
+
+ __init_single_page(page, pfn + i, zone, nid);
+ prep_compound_tail(page, head, order);
+ set_page_count(page, 0);
+ }
+ prep_compound_head(head, order);
}
/*
@@ -3226,6 +3208,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
struct folio *folio = (void *)page;
unsigned long pfn = PHYS_PFN(__pa(m));
unsigned long nr_pages = pages_per_huge_page(m->hstate);
+ enum zone_type zone = folio_zonenum(folio);
h = m->hstate;
/*
@@ -3239,7 +3222,8 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
VM_BUG_ON(!hstate_is_gigantic(h));
WARN_ON(folio_ref_count(folio) != 1);
- hugetlb_folio_init_vmemmap(folio, h, vmemmap_nr_struct_pages(pfn, nr_pages));
+ hugetlb_folio_init_vmemmap(page, pfn, zone, nid, huge_page_order(h),
+ vmemmap_nr_struct_pages(pfn, nr_pages));
init_new_hugetlb_folio(folio);
if (order_vmemmap_optimizable(pfn_to_section_order(pfn))) {
--
2.54.0
* [PATCH v2 63/69] mm/hugetlb: Initialize the full bootmem hugepage in hugetlb code
From: Muchun Song @ 2026-05-13 13:20 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
Boot-time gigantic hugepages currently leave the head struct page to
the generic memmap initialization path while the HugeTLB code
initializes the remaining struct pages itself.
Mark the full hugepage noinit and initialize the head struct page in
hugetlb_folio_init_vmemmap() as well, so the whole compound-page setup
is handled in one place.
This can also reduce memblock metadata overhead when many boot-time
HugeTLB pages are reserved, because physically contiguous hugepages
can be covered by fewer noinit regions.
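The region-count effect can be illustrated with a small user-space
model. It assumes that adjacent ranges carrying the same flags are
merged into one region, which is how memblock is generally expected to
behave; the ni_ names, the 16-entry limit, and the 4 KiB page size are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NI_PAGE_SIZE 4096ULL

struct ni_range {
	unsigned long long base;
	unsigned long long size;
};

/* Count regions after merging ranges that touch; input must be sorted by base. */
static size_t ni_count_merged(const struct ni_range *r, size_t n)
{
	size_t regions = 0;
	unsigned long long prev_end = 0;

	for (size_t i = 0; i < n; i++) {
		if (regions == 0 || r[i].base != prev_end)
			regions++;
		prev_end = r[i].base + r[i].size;
	}
	return regions;
}

/*
 * Count the noinit regions produced by n physically contiguous huge pages.
 * skip_head != 0 models the old behavior, where the first page of each huge
 * page was left out of the noinit range.
 */
static size_t ni_noinit_regions(unsigned long long base, unsigned long long huge_size,
				size_t n, int skip_head)
{
	struct ni_range r[16];
	unsigned long long off = skip_head ? NI_PAGE_SIZE : 0;

	assert(n <= 16);
	for (size_t i = 0; i < n; i++) {
		r[i].base = base + i * huge_size + off;
		r[i].size = huge_size - off;
	}
	return ni_count_merged(r, n);
}
```

In this model, four contiguous 1 GiB pages produce four noinit regions
when the head page is excluded, because each excluded page leaves a gap,
and a single region when the full huge page is marked.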
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 20 ++++----------------
1 file changed, 4 insertions(+), 16 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 950b0fa3bc27..10f04fa95d43 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3112,15 +3112,7 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
list_add_tail(&m->list, &huge_boot_pages[nid]);
m->flags |= HUGE_BOOTMEM_ZONES_VALID;
hugetlb_vmemmap_optimize_bootmem_page(m);
- /*
- * Only initialize the head struct page in memmap_init_reserved_pages,
- * rest of the struct pages will be initialized by the HugeTLB
- * subsystem itself.
- * The head struct page is used to get folio information by the HugeTLB
- * subsystem like zone id and node id.
- */
- memblock_reserved_mark_noinit(__pa((void *)m + PAGE_SIZE),
- huge_page_size(h) - PAGE_SIZE);
+ memblock_reserved_mark_noinit(__pa(m), huge_page_size(h));
}
return true;
@@ -3129,16 +3121,13 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
static void __init hugetlb_folio_init_vmemmap(struct page *head, unsigned long pfn,
enum zone_type zone, int nid, unsigned int order, unsigned int nr_pages)
{
- int ret;
-
/*
* This is an open-coded prep_compound_page() whereby we avoid
* walking pages twice by initializing/preparing+freezing them in the
* same go.
*/
- __ClearPageReserved(head);
- ret = page_ref_freeze(head, 1);
- VM_BUG_ON(!ret);
+ __init_single_page(head, pfn, zone, nid);
+ set_page_count(head, 0);
__SetPageHead(head);
for (int i = 1; i < nr_pages; i++) {
@@ -3208,7 +3197,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
struct folio *folio = (void *)page;
unsigned long pfn = PHYS_PFN(__pa(m));
unsigned long nr_pages = pages_per_huge_page(m->hstate);
- enum zone_type zone = folio_zonenum(folio);
+ enum zone_type zone = zone_idx(pfn_to_zone(pfn, nid));
h = m->hstate;
/*
@@ -3220,7 +3209,6 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
prev_h = h;
VM_BUG_ON(!hstate_is_gigantic(h));
- WARN_ON(folio_ref_count(folio) != 1);
hugetlb_folio_init_vmemmap(page, pfn, zone, nid, huge_page_order(h),
vmemmap_nr_struct_pages(pfn, nr_pages));
--
2.54.0
* [PATCH v2 64/69] mm/mm_init: Factor out compound page initialization
From: Muchun Song @ 2026-05-13 13:20 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
The compound struct page initialization needed by boot-time gigantic HugeTLB
folios is currently open-coded in the HugeTLB code, while ZONE_DEVICE has its
own separate initialization path in mm_init.c.
Factor the common compound memmap setup into memmap_init_compound_page_frozen()
so both paths can share the same frozen page initialization logic. This removes
duplicated open-coded compound page setup and keeps the initialization rules
in one place.
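The shared flow can be sketched in user space as below. This is a
simplified model only: struct cp_page and the cp_ helpers are
hypothetical, and the number of struct pages to initialize is passed in
by the caller here, whereas the patch computes it with
vmemmap_nr_struct_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified struct page. */
struct cp_page {
	int refcount;
	bool head;
	const struct cp_page *compound_head;	/* NULL unless this is a tail */
	unsigned int order;			/* meaningful on the head only */
};

/* Initialize one page with a zero ("frozen") reference count. */
static void cp_init_single_frozen(struct cp_page *page)
{
	page->refcount = 0;
	page->head = false;
	page->compound_head = NULL;
	page->order = 0;
}

/*
 * Initialize a compound page in a single pass: the head first, then only the
 * tails that have their own struct page, and finally the head's order.
 */
static void cp_init_compound_frozen(struct cp_page *head, unsigned int order,
				    size_t nr_struct_pages)
{
	assert(order < 32);
	assert(nr_struct_pages >= 1 && nr_struct_pages <= ((size_t)1 << order));

	cp_init_single_frozen(head);
	head->head = true;
	for (size_t i = 1; i < nr_struct_pages; i++) {
		cp_init_single_frozen(head + i);
		head[i].compound_head = head;
	}
	head->order = order;
}
```

Passing a count smaller than 1 << order models a vmemmap-optimized
compound page, where the remaining tails share struct pages that were
set up elsewhere and must not be touched by this loop.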
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 25 +-----------
mm/internal.h | 2 +
mm/mm_init.c | 111 +++++++++++++++++++-------------------------------
3 files changed, 45 insertions(+), 93 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 10f04fa95d43..7e9f49882395 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3118,28 +3118,6 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
return true;
}
-static void __init hugetlb_folio_init_vmemmap(struct page *head, unsigned long pfn,
- enum zone_type zone, int nid, unsigned int order, unsigned int nr_pages)
-{
- /*
- * This is an open-coded prep_compound_page() whereby we avoid
- * walking pages twice by initializing/preparing+freezing them in the
- * same go.
- */
- __init_single_page(head, pfn, zone, nid);
- set_page_count(head, 0);
-
- __SetPageHead(head);
- for (int i = 1; i < nr_pages; i++) {
- struct page *page = head + i;
-
- __init_single_page(page, pfn + i, zone, nid);
- prep_compound_tail(page, head, order);
- set_page_count(page, 0);
- }
- prep_compound_head(head, order);
-}
-
/*
* memblock-allocated pageblocks might not have the migrate type set
* if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
@@ -3210,8 +3188,7 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
VM_BUG_ON(!hstate_is_gigantic(h));
- hugetlb_folio_init_vmemmap(page, pfn, zone, nid, huge_page_order(h),
- vmemmap_nr_struct_pages(pfn, nr_pages));
+ memmap_init_compound_page_frozen(page, pfn, zone, nid, huge_page_order(h));
init_new_hugetlb_folio(folio);
if (order_vmemmap_optimizable(pfn_to_section_order(pfn))) {
diff --git a/mm/internal.h b/mm/internal.h
index 416afdf7b2ec..2c67ae25124b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1793,6 +1793,8 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte
void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid);
+void __meminit memmap_init_compound_page_frozen(struct page *head, unsigned long pfn,
+ enum zone_type zone, int nid, unsigned int order);
/* shrinker related functions */
unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 95422e92ede8..9b23c31db8c6 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1018,79 +1018,46 @@ static void __init memmap_init(void)
init_unavailable_range(hole_pfn, end_pfn, zone_id, nid);
}
-#ifdef CONFIG_ZONE_DEVICE
-static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
- unsigned long zone_idx, int nid,
- struct dev_pagemap *pgmap)
+static void __meminit init_single_page_frozen(struct page *page, unsigned long pfn,
+ enum zone_type zone, int nid)
{
+ __init_single_page(page, pfn, zone, nid);
+ if (zone_is_zone_device(&NODE_DATA(nid)->node_zones[zone])) {
+ /*
+ * ZONE_DEVICE pages are not managed by the page allocator, mark
+ * them reserved to prevent them from being touched elsewhere.
+ *
+ * We can use the non-atomic __set_bit operation for setting
+ * the flag as we are still initializing the pages.
+ */
+ __SetPageReserved(page);
- __init_single_page(page, pfn, zone_idx, nid);
-
- /*
- * Mark page reserved as it will need to wait for onlining
- * phase for it to be fully associated with a zone.
- *
- * We can use the non-atomic __set_bit operation for setting
- * the flag as we are still initializing the pages.
- */
- __SetPageReserved(page);
-
- /*
- * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
- * and zone_device_data. It is a bug if a ZONE_DEVICE page is
- * ever freed or placed on a driver-private list.
- */
- page_folio(page)->pgmap = pgmap;
- page->zone_device_data = NULL;
-
- /*
- * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
- * directly to the driver page allocator which will set the page count
- * to 1 when allocating the page.
- *
- * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
- * their refcount reset to one whenever they are freed (ie. after
- * their refcount drops to 0).
- */
- switch (pgmap->type) {
- case MEMORY_DEVICE_FS_DAX:
- case MEMORY_DEVICE_PRIVATE:
- case MEMORY_DEVICE_COHERENT:
- case MEMORY_DEVICE_PCI_P2PDMA:
- set_page_count(page, 0);
- break;
-
- case MEMORY_DEVICE_GENERIC:
- break;
+ /*
+ * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
+ * and zone_device_data. It is a bug if a ZONE_DEVICE page is
+ * ever freed or placed on a driver-private list.
+ */
+ page->zone_device_data = NULL;
}
+ set_page_count(page, 0);
}
-static void __ref memmap_init_compound(struct page *head,
- unsigned long head_pfn,
- unsigned long zone_idx, int nid,
- struct dev_pagemap *pgmap,
- unsigned long nr_pages)
+void __meminit memmap_init_compound_page_frozen(struct page *head, unsigned long pfn,
+ enum zone_type zone, int nid, unsigned int order)
{
- unsigned long pfn, end_pfn = head_pfn + nr_pages;
- unsigned int order = pgmap->vmemmap_shift;
+ int nr_pages = vmemmap_nr_struct_pages(pfn, 1UL << order);
- /*
- * We have to initialize the pages, including setting up page links.
- * prep_compound_page() does not take care of that, so instead we
- * open-code prep_compound_page() so we can take care of initializing
- * the pages in the same go.
- */
- __SetPageHead(head);
- for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
- struct page *page = pfn_to_page(pfn);
+ init_single_page_frozen(head, pfn, zone, nid);
- __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
- prep_compound_tail(page, head, order);
- set_page_count(page, 0);
+ __SetPageHead(head);
+ for (int i = 1; i < nr_pages; i++) {
+ init_single_page_frozen(head + i, pfn + i, zone, nid);
+ prep_compound_tail(head + i, head, order);
}
prep_compound_head(head, order);
}
+#ifdef CONFIG_ZONE_DEVICE
void __ref memmap_init_zone_device(struct zone *zone,
unsigned long start_pfn,
unsigned long nr_pages,
@@ -1118,18 +1085,24 @@ void __ref memmap_init_zone_device(struct zone *zone,
}
for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
- struct page *page = pfn_to_page(pfn);
-
- __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+ struct page *head = pfn_to_page(pfn);
if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
cond_resched();
- if (pfns_per_compound == 1)
- continue;
-
- memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
- vmemmap_nr_struct_pages(pfn, pfns_per_compound));
+ if (pgmap->vmemmap_shift)
+ memmap_init_compound_page_frozen(head, pfn, zone_idx, nid,
+ pgmap->vmemmap_shift);
+ else
+ init_single_page_frozen(head, pfn, zone_idx, nid);
+ /*
+ * ZONE_DEVICE pages other than MEMORY_DEVICE_GENERIC are released
+ * directly to the driver page allocator which will set the page
+ * count to 1 when allocating the page.
+ */
+ if (pgmap->type == MEMORY_DEVICE_GENERIC)
+ init_page_count(head);
+ ((struct folio *)head)->pgmap = pgmap;
}
pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false, false);
--
2.54.0
* [PATCH v2 65/69] mm/mm_init: Make __init_single_page() static
From: Muchun Song @ 2026-05-13 13:20 UTC
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
__init_single_page() is only used within mm/mm_init.c, so make it
static.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/internal.h | 2 --
mm/mm_init.c | 2 +-
2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 2c67ae25124b..80b9ab594dc5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1791,8 +1791,6 @@ static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte
return vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte);
}
-void __meminit __init_single_page(struct page *page, unsigned long pfn,
- unsigned long zone, int nid);
void __meminit memmap_init_compound_page_frozen(struct page *head, unsigned long pfn,
enum zone_type zone, int nid, unsigned int order);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 9b23c31db8c6..1e11fd683292 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -589,7 +589,7 @@ static void __init find_zone_movable_pfns_for_nodes(void)
node_states[N_MEMORY] = saved_node_state;
}
-void __meminit __init_single_page(struct page *page, unsigned long pfn,
+static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
mm_zero_struct_page(page);
--
2.54.0
^ permalink raw reply related
* [PATCH v2 66/69] mm/cma: Move CMA pageblock initialization into cma_activate_area()
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
Move CMA pageblock initialization for early-reserved pages into
cma_activate_area() so CMA pageblock setup is handled in one place.
This keeps init_cma_pageblock() in the CMA core instead of pushing
special handling for early CMA allocations into its callers.
As a side effect, this also fixes the zone->cma_pages accounting race for
early-reserved HugeTLB CMA pages. The accounting is no longer updated from
parallel hugetlb_struct_page_init() workers and is instead performed
serially from cma_activate_area().
Fixes: d2d786714080 ("mm/hugetlb: enable bootmem allocation from CMA areas")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/cma.c | 7 +++++--
mm/hugetlb.c | 8 +++-----
2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index 0369f04c7ba5..c1896c0db63d 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -162,6 +162,10 @@ static void __init cma_activate_area(struct cma *cma)
count = early_pfn[r] - cmr->base_pfn;
bitmap_count = cma_bitmap_pages_to_bits(cma, count);
bitmap_set(cmr->bitmap, 0, bitmap_count);
+
+ for (pfn = cmr->base_pfn; pfn < early_pfn[r];
+ pfn += pageblock_nr_pages)
+ init_cma_pageblock(pfn_to_page(pfn));
}
WARN_ON_ONCE(!pfn_valid(cmr->base_pfn));
@@ -1098,8 +1102,7 @@ bool cma_intersects(struct cma *cma, unsigned long start, unsigned long end)
*
* The caller is responsible for initializing the page structures
* in the area properly, since this just points to memblock-allocated
- * memory. The caller should subsequently use init_cma_pageblock to
- * set the migrate type and CMA stats the pageblocks that were reserved.
+ * memory.
*
* If the CMA area fails to activate later, memory obtained through
* this interface is not handed to the page allocator, this is
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7e9f49882395..df798f9386d6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3136,9 +3136,7 @@ static void __init hugetlb_bootmem_init_migratetype(struct folio *folio,
WARN_ON_ONCE(!pageblock_aligned(folio_pfn(folio)));
for (i = 0; i < nr_pages; i += pageblock_nr_pages) {
- if (folio_test_hugetlb_cma(folio))
- init_cma_pageblock(folio_page(folio, i));
- else
+ if (!folio_test_hugetlb_cma(folio))
init_pageblock_migratetype(folio_page(folio, i),
MIGRATE_MOVABLE, false);
}
@@ -3206,8 +3204,8 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
* in order to fix confusing memory reports from free(1) and
* other side-effects, like CommitLimit going negative.
*
- * For CMA pages, this is done in init_cma_pageblock
- * (via hugetlb_bootmem_init_migratetype), so skip it here.
+ * For CMA pages, this is done in cma_activate_area(), so skip
+ * it here.
*/
if (!folio_test_hugetlb_cma(folio))
adjust_managed_page_count(page, nr_pages);
--
2.54.0
^ permalink raw reply related
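As a rough illustration of the loop that patch 66/69 adds to cma_activate_area(), the standalone sketch below walks an early-reserved PFN range one pageblock at a time and updates the counters from a single caller, which is the property the commit message relies on to avoid the accounting race. PAGEBLOCK_NR_PAGES, struct zone_stats and account_early_cma_range() are names invented for this example; they are not kernel definitions, and the real helper also sets the migratetype.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pageblock_nr_pages (assumes order-9 pageblocks). */
#define PAGEBLOCK_NR_PAGES 512UL

/* Hypothetical stand-in for the per-zone counters the kernel helper updates. */
struct zone_stats {
	unsigned long managed_pages;
	unsigned long cma_pages;
};

/*
 * Walk [base_pfn, early_pfn) one pageblock at a time, mirroring the shape of
 * the new loop, and account each pageblock serially.
 * Returns the number of pageblocks visited.
 */
static unsigned long account_early_cma_range(struct zone_stats *zs,
					     unsigned long base_pfn,
					     unsigned long early_pfn)
{
	unsigned long pfn, nr_blocks = 0;

	/* Assumption of this sketch: the range starts pageblock-aligned. */
	assert(base_pfn % PAGEBLOCK_NR_PAGES == 0);

	for (pfn = base_pfn; pfn < early_pfn; pfn += PAGEBLOCK_NR_PAGES) {
		zs->managed_pages += PAGEBLOCK_NR_PAGES;
		zs->cma_pages += PAGEBLOCK_NR_PAGES;
		nr_blocks++;
	}
	return nr_blocks;
}
```

Because the counters are only touched from this one loop, no locking is needed in the sketch; the parallel hugetlb workers described in the commit message no longer participate in the accounting.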
* [PATCH v2 67/69] mm/cma: Move init_cma_pageblock() into cma.c
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
Move init_cma_pageblock() from mm_init.c into cma.c so it lives
alongside the CMA code that uses it.
No functional change intended.
---
mm/cma.c | 8 ++++++++
mm/internal.h | 4 ----
mm/mm_init.c | 9 ---------
3 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index c1896c0db63d..2843c4f59c4e 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -30,6 +30,7 @@
#include <linux/io.h>
#include <linux/kmemleak.h>
#include <trace/events/cma.h>
+#include <linux/page-isolation.h>
#include "internal.h"
#include "cma.h"
@@ -137,6 +138,13 @@ bool cma_validate_zones(struct cma *cma)
return true;
}
+static void __init init_cma_pageblock(struct page *page)
+{
+ init_pageblock_migratetype(page, MIGRATE_CMA, false);
+ adjust_managed_page_count(page, pageblock_nr_pages);
+ page_zone(page)->cma_pages += pageblock_nr_pages;
+}
+
static void __init cma_activate_area(struct cma *cma)
{
unsigned long pfn, end_pfn, early_pfn[CMA_MAX_RANGES];
diff --git a/mm/internal.h b/mm/internal.h
index 80b9ab594dc5..25b6e767cea0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1126,7 +1126,6 @@ struct cma;
#ifdef CONFIG_CMA
bool cma_validate_zones(struct cma *cma);
void *cma_reserve_early(struct cma *cma, unsigned long size);
-void init_cma_pageblock(struct page *page);
#else
static inline bool cma_validate_zones(struct cma *cma)
{
@@ -1136,9 +1135,6 @@ static inline void *cma_reserve_early(struct cma *cma, unsigned long size)
{
return NULL;
}
-static inline void init_cma_pageblock(struct page *page)
-{
-}
#endif
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 1e11fd683292..ff6e9fb468bd 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2200,15 +2200,6 @@ void __init init_cma_reserved_pageblock(struct page *page)
adjust_managed_page_count(page, pageblock_nr_pages);
page_zone(page)->cma_pages += pageblock_nr_pages;
}
-/*
- * Similar to above, but only set the migrate type and stats.
- */
-void __init init_cma_pageblock(struct page *page)
-{
- init_pageblock_migratetype(page, MIGRATE_CMA, false);
- adjust_managed_page_count(page, pageblock_nr_pages);
- page_zone(page)->cma_pages += pageblock_nr_pages;
-}
#endif
void set_zone_contiguous(struct zone *zone)
--
2.54.0
^ permalink raw reply related
* [PATCH v2 68/69] mm/mm_init: Initialize pageblock migratetype in memmap init helpers
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
Move MIGRATE_MOVABLE pageblock initialization into the memmap init
helpers in mm/mm_init.c.
Let init_single_page_frozen() initialize the pageblock migratetype for
single-page folios, and let memmap_init_compound_page_frozen() handle
the whole range for compound pages. With pageblock initialization
centralized there, drop the duplicate hugetlb bootmem-specific
initialization in mm/hugetlb.c.
The old hugetlb_bootmem_init_migratetype() skipped CMA folios (via
folio_test_hugetlb_cma()), but the new helpers always set
MIGRATE_MOVABLE. This is safe because cma_activate_area() will later
override the migratetype for CMA pageblocks, so the initial
MIGRATE_MOVABLE setting does not matter for CMA pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/hugetlb.c | 25 -------------------------
mm/mm_init.c | 20 ++++++++++++--------
2 files changed, 12 insertions(+), 33 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index df798f9386d6..fa269560f657 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3118,30 +3118,6 @@ static bool __init alloc_bootmem_huge_page(struct hstate *h, int nid)
return true;
}
-/*
- * memblock-allocated pageblocks might not have the migrate type set
- * if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
- * here, or MIGRATE_CMA if this was a page allocated through an early CMA
- * reservation.
- *
- * In case of vmemmap optimized folios, the tail vmemmap pages are mapped
- * read-only, but that's ok - for sparse vmemmap this does not write to
- * the page structure.
- */
-static void __init hugetlb_bootmem_init_migratetype(struct folio *folio,
- struct hstate *h)
-{
- unsigned long nr_pages = pages_per_huge_page(h), i;
-
- WARN_ON_ONCE(!pageblock_aligned(folio_pfn(folio)));
-
- for (i = 0; i < nr_pages; i += pageblock_nr_pages) {
- if (!folio_test_hugetlb_cma(folio))
- init_pageblock_migratetype(folio_page(folio, i),
- MIGRATE_MOVABLE, false);
- }
-}
-
static void __init prep_and_add_bootmem_folios(struct hstate *h,
struct list_head *folio_list)
{
@@ -3149,7 +3125,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
struct folio *folio, *tmp_f;
list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
- hugetlb_bootmem_init_migratetype(folio, h);
/* Subdivide locks to achieve better parallel performance */
spin_lock_irqsave(&hugetlb_lock, flags);
account_new_hugetlb_folio(h, folio);
diff --git a/mm/mm_init.c b/mm/mm_init.c
index ff6e9fb468bd..17ae0eb1ccfb 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1018,7 +1018,7 @@ static void __init memmap_init(void)
init_unavailable_range(hole_pfn, end_pfn, zone_id, nid);
}
-static void __meminit init_single_page_frozen(struct page *page, unsigned long pfn,
+static void __meminit __init_single_page_frozen(struct page *page, unsigned long pfn,
enum zone_type zone, int nid)
{
__init_single_page(page, pfn, zone, nid);
@@ -1042,19 +1042,28 @@ static void __meminit init_single_page_frozen(struct page *page, unsigned long p
set_page_count(page, 0);
}
+static void __meminit init_single_page_frozen(struct page *page, unsigned long pfn,
+ enum zone_type zone, int nid)
+{
+ __init_single_page_frozen(page, pfn, zone, nid);
+ pageblock_migratetype_init_range(pfn, 1, MIGRATE_MOVABLE, false, false);
+}
+
void __meminit memmap_init_compound_page_frozen(struct page *head, unsigned long pfn,
enum zone_type zone, int nid, unsigned int order)
{
int nr_pages = vmemmap_nr_struct_pages(pfn, 1UL << order);
- init_single_page_frozen(head, pfn, zone, nid);
+ __init_single_page_frozen(head, pfn, zone, nid);
__SetPageHead(head);
for (int i = 1; i < nr_pages; i++) {
- init_single_page_frozen(head + i, pfn + i, zone, nid);
+ __init_single_page_frozen(head + i, pfn + i, zone, nid);
prep_compound_tail(head + i, head, order);
}
prep_compound_head(head, order);
+
+ pageblock_migratetype_init_range(pfn, 1UL << order, MIGRATE_MOVABLE, false, false);
}
#ifdef CONFIG_ZONE_DEVICE
@@ -1087,9 +1096,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
struct page *head = pfn_to_page(pfn);
- if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
- cond_resched();
-
if (pgmap->vmemmap_shift)
memmap_init_compound_page_frozen(head, pfn, zone_idx, nid,
pgmap->vmemmap_shift);
@@ -1105,8 +1111,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
((struct folio *)head)->pgmap = pgmap;
}
- pageblock_migratetype_init_range(start_pfn, nr_pages, MIGRATE_MOVABLE, false, false);
-
pr_debug("%s initialised %lu pages in %ums\n", __func__,
nr_pages, jiffies_to_msecs(jiffies - start));
}
--
2.54.0
^ permalink raw reply related
* [PATCH v2 69/69] Documentation/mm: Rewrite vmemmap_dedup.rst for unified HVO
From: Muchun Song @ 2026-05-13 13:20 UTC (permalink / raw)
To: Andrew Morton, David Hildenbrand, Muchun Song, Oscar Salvador,
Michael Ellerman, Madhavan Srinivasan
Cc: Lorenzo Stoakes, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Nicholas Piggin,
Christophe Leroy, Ackerley Tng, Frank van der Linden,
aneesh.kumar, joao.m.martins, linux-mm, linuxppc-dev,
linux-kernel, Muchun Song
In-Reply-To: <20260513132044.41690-1-songmuchun@bytedance.com>
HVO is no longer specific to HugeTLB. The optimization has been
generalized for other large compound-page users, including device DAX,
but vmemmap_dedup.rst still describes the old split model.
Rewrite the document around the shared HVO design and behavior, and
drop the obsolete powerpc-specific document that only covered the old
device DAX path.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
Documentation/arch/powerpc/index.rst | 1 -
Documentation/arch/powerpc/vmemmap_dedup.rst | 101 ---------
Documentation/mm/vmemmap_dedup.rst | 217 ++++---------------
3 files changed, 42 insertions(+), 277 deletions(-)
delete mode 100644 Documentation/arch/powerpc/vmemmap_dedup.rst
diff --git a/Documentation/arch/powerpc/index.rst b/Documentation/arch/powerpc/index.rst
index 40419bea8e10..4dcf6b0f218c 100644
--- a/Documentation/arch/powerpc/index.rst
+++ b/Documentation/arch/powerpc/index.rst
@@ -36,7 +36,6 @@ powerpc
ultravisor
vas-api
vcpudispatch_stats
- vmemmap_dedup
vpa-dtl
features
diff --git a/Documentation/arch/powerpc/vmemmap_dedup.rst b/Documentation/arch/powerpc/vmemmap_dedup.rst
deleted file mode 100644
index dc4db59fdf87..000000000000
--- a/Documentation/arch/powerpc/vmemmap_dedup.rst
+++ /dev/null
@@ -1,101 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-==========
-Device DAX
-==========
-
-The device-dax interface uses the tail deduplication technique explained in
-Documentation/mm/vmemmap_dedup.rst
-
-On powerpc, vmemmap deduplication is only used with radix MMU translation. Also
-with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap
-deduplication.
-
-With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap
-page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no
-vmemmap deduplication possible.
-
-With 1G PUD level mapping, we require 16384 struct pages and a single 64K
-vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
-require 16 64K pages in vmemmap to map the struct page for 1G PUD level mapping.
-
-Here's how things look like on device-dax after the sections are populated::
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PUD | +-----------+ | | |
- | level | | . | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | . | ------------------------+ |
- | | +-----------+ |
- | | | 15 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
-
-
-With 4K page size, 2M PMD level mapping requires 512 struct pages and a single
-4K vmemmap page contains 64 struct pages(4K/sizeof(struct page)). Hence we
-require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping.
-
-Here's how things look like on device-dax after the sections are populated::
-
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PMD | +-----------+ | | |
- | level | | 5 | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | 6 | ------------------------+ |
- | | +-----------+ |
- | | | 7 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
-
-With 1G PUD level mapping, we require 262144 struct pages and a single 4K
-vmemmap page can contain 64 struct pages (4K/sizeof(struct page)). Hence we
-require 4096 4K pages in vmemmap to map the struct pages for 1G PUD level
-mapping.
-
-Here's how things look like on device-dax after the sections are populated::
-
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PUD | +-----------+ | | |
- | level | | . | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | . | ------------------------+ |
- | | +-----------+ |
- | | | 4095 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index 44e80bd2e398..c3a68a923b0d 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -1,107 +1,34 @@
.. SPDX-License-Identifier: GPL-2.0
-=========================================
-A vmemmap diet for HugeTLB and Device DAX
-=========================================
+===================================================
+Fundamentals of Hugepage Vmemmap Optimization (HVO)
+===================================================
-HugeTLB
-=======
-
-This section is to explain how Hugepage Vmemmap Optimization (HVO) for HugeTLB works.
-
-The ``struct page`` structures are used to describe a physical page frame. By
-default, there is a one-to-one mapping from a page frame to its corresponding
+The ``struct page`` structures are used to describe a physical base page frame.
+By default, there is a one-to-one mapping from a page frame to its corresponding
``struct page``.
-HugeTLB pages consist of multiple base page size pages and is supported by many
-architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
-details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
-currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
-consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
-For each base page, there is a corresponding ``struct page``.
-
-Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
-contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
-this upper limit. The only 'useful' information in the remaining ``struct page``
-is the compound_info field, and this field is the same for all tail pages.
-
-By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
-to the buddy allocator for other uses.
-
-Different architectures support different HugeTLB pages. For example, the
-following table is the HugeTLB page size supported by x86 and arm64
-architectures. Because arm64 supports 4k, 16k, and 64k base pages and
-supports contiguous entries, so it supports many kinds of sizes of HugeTLB
-page.
-
-+--------------+-----------+-----------------------------------------------+
-| Architecture | Page Size | HugeTLB Page Size |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-| x86-64 | 4KB | 2MB | 1GB | | |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-| | 4KB | 64KB | 2MB | 32MB | 1GB |
-| +-----------+-----------+-----------+-----------+-----------+
-| arm64 | 16KB | 2MB | 32MB | 1GB | |
-| +-----------+-----------+-----------+-----------+-----------+
-| | 64KB | 2MB | 512MB | 16GB | |
-+--------------+-----------+-----------+-----------+-----------+-----------+
-
-When the system boot up, every HugeTLB page has more than one ``struct page``
-structs which size is (unit: pages)::
-
- struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
-
-Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
-of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
-relationship::
-
- HugeTLB_Size = n * PAGE_SIZE
-
-Then::
-
- struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
- = n * sizeof(struct page) / PAGE_SIZE
+When huge pages (large compound pages) are used, they consist of multiple base
+page size pages. For each base page, there is a corresponding ``struct page``.
+However, only a few ``struct page``
+structures actually contain unique information about the huge page.
+The only useful information in the remaining tail ``struct page`` structures
+is the ``->compound_info`` field, which locates the head page structure, and
+this field is the same for all tail pages.
-We can use huge mapping at the pud/pmd level for the HugeTLB page.
+We can remove redundant ``struct page`` structures for huge pages to save memory.
+This optimization is referred to as Hugepage Vmemmap Optimization (HVO).
-For the HugeTLB page of the pmd level mapping, then::
+The optimization is only applied when the size of the ``struct page`` is a
+power-of-2. In this case, all tail pages of the same order are identical. See
+``compound_head()``. This allows us to remap the tail pages of the vmemmap to a
+shared page.
- struct_size = n * sizeof(struct page) / PAGE_SIZE
- = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
- = sizeof(struct page) / sizeof(pte_t)
- = 64 / 8
- = 8 (pages)
+Take a system with 2 MB huge pages and a base page size of 4 KB as an
+example. Here is how things look before optimization::
-Where n is how many pte entries which one page can contains. So the value of
-n is (PAGE_SIZE / sizeof(pte_t)).
-
-This optimization only supports 64-bit system, so the value of sizeof(pte_t)
-is 8. And this optimization also applicable only when the size of ``struct page``
-is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g.
-x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, the
-size of ``struct page`` structs of it is 8 page frames which size depends on the
-size of the base page.
-
-For the HugeTLB page of the pud level mapping, then::
-
- struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
- = PAGE_SIZE / 8 * 8 (pages)
- = PAGE_SIZE (pages)
-
-Where the struct_size(pmd) is the size of the ``struct page`` structs of a
-HugeTLB page of the pmd level mapping.
-
-E.g.: A 2MB HugeTLB page on x86_64 consists in 8 page frames while 1GB
-HugeTLB page consists in 4096.
-
-Next, we take the pmd level mapping of the HugeTLB page as an example to
-show the internal implementation of this optimization. There are 8 pages
-``struct page`` structs associated with a HugeTLB page which is pmd mapped.
-
-Here is how things look before optimization::
-
- HugeTLB struct pages(8 pages) page frame(8 pages)
+ 2MB Hugepage struct pages (8 pages) page frame (8 pages)
+-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
| | | 0 | -------------> | 0 |
| | +-----------+ +-----------+
@@ -112,9 +39,9 @@ Here is how things look before optimization::
| | | 3 | -------------> | 3 |
| | +-----------+ +-----------+
| | | 4 | -------------> | 4 |
- | PMD | +-----------+ +-----------+
- | level | | 5 | -------------> | 5 |
- | mapping | +-----------+ +-----------+
+ | | +-----------+ +-----------+
+ | | | 5 | -------------> | 5 |
+ | | +-----------+ +-----------+
| | | 6 | -------------> | 6 |
| | +-----------+ +-----------+
| | | 7 | -------------> | 7 |
@@ -124,34 +51,27 @@ Here is how things look before optimization::
| |
+-----------+
-The first page of ``struct page`` (page 0) associated with the HugeTLB page
-contains the 4 ``struct page`` necessary to describe the HugeTLB. The remaining
-pages of ``struct page`` (page 1 to page 7) are tail pages.
-
-The optimization is only applied when the size of the struct page is a power
-of 2. In this case, all tail pages of the same order are identical. See
-compound_head(). This allows us to remap the tail pages of the vmemmap to a
-shared, read-only page. The head page is also remapped to a new page. This
-allows the original vmemmap pages to be freed.
+We remap the tail pages (page 1 to page 7) of the vmemmap to a shared, read-only
+page (per-zone).
Here is how things look after remapping::
- HugeTLB struct pages(8 pages) page frame (new)
+ 2MB Hugepage struct pages (8 pages) page frame (1 page)
+-----------+ ---virt_to_page---> +-----------+ mapping to +----------------+
| | | 0 | -------------> | 0 |
| | +-----------+ +----------------+
| | | 1 | ------┐
| | +-----------+ |
- | | | 2 | ------┼ +----------------------------+
+ | | | 2 | ------┼
+ | | +-----------+ |
+ | | | 3 | ------┼ +----------------------------+
| | +-----------+ | | A single, per-zone page |
- | | | 3 | ------┼------> | frame shared among all |
+ | | | 4 | ------┼------> | frame shared among all |
| | +-----------+ | | hugepages of the same size |
- | | | 4 | ------┼ +----------------------------+
+ | | | 5 | ------┼ +----------------------------+
+ | | +-----------+ |
+ | | | 6 | ------┼
| | +-----------+ |
- | | | 5 | ------┼
- | PMD | +-----------+ |
- | level | | 6 | ------┼
- | mapping | +-----------+ |
| | | 7 | ------┘
| | +-----------+
| |
@@ -159,65 +79,12 @@ Here is how things look after remapping::
| |
+-----------+
-When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
-vmemmap pages and restore the previous mapping relationship.
-
-For the HugeTLB page of the pud level mapping. It is similar to the former.
-We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
-
-Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
-(e.g. aarch64) provides a contiguous bit in the translation table entries
-that hints to the MMU to indicate that it is one of a contiguous set of
-entries that can be cached in a single TLB entry.
-
-The contiguous bit is used to increase the mapping size at the pmd and pte
-(last) level. So this type of HugeTLB page can be optimized only when its
-size of the ``struct page`` structs is greater than **1** page.
-
-Device DAX
-==========
-
-The device-dax interface uses the same tail deduplication technique explained
-in the previous chapter, except when used with the vmemmap in
-the device (altmap).
-
-The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
-PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
-For powerpc equivalent details see Documentation/arch/powerpc/vmemmap_dedup.rst
-
-The differences with HugeTLB are relatively minor.
-
-It only use 3 ``struct page`` for storing all information as opposed
-to 4 on HugeTLB pages.
-
-There's no remapping of vmemmap given that device-dax memory is not part of
-System RAM ranges initialized at boot. Thus the tail page deduplication
-happens at a later stage when we populate the sections. HugeTLB reuses the
-the head vmemmap page representing, whereas device-dax reuses the tail
-vmemmap page. This results in only half of the savings compared to HugeTLB.
-
-Deduplicated tail pages are not mapped read-only.
+Therefore, a hugepage can be optimized with HVO whenever the total size of its
+``struct page`` structures is at least two base pages. In this example (4 KB
+base pages and a 64-byte ``struct page``), the smallest hugepage that qualifies
+is 512 KB, and its order corresponds to ``OPTIMIZABLE_FOLIO_MIN_ORDER``. Any
+hugepage whose order is greater than or equal to
+``OPTIMIZABLE_FOLIO_MIN_ORDER`` can therefore be optimized.
-Here's how things look like on device-dax after the sections are populated::
-
- +-----------+ ---virt_to_page---> +-----------+ mapping to +-----------+
- | | | 0 | -------------> | 0 |
- | | +-----------+ +-----------+
- | | | 1 | -------------> | 1 |
- | | +-----------+ +-----------+
- | | | 2 | ----------------^ ^ ^ ^ ^ ^
- | | +-----------+ | | | | |
- | | | 3 | ------------------+ | | | |
- | | +-----------+ | | | |
- | | | 4 | --------------------+ | | |
- | PMD | +-----------+ | | |
- | level | | 5 | ----------------------+ | |
- | mapping | +-----------+ | |
- | | | 6 | ------------------------+ |
- | | +-----------+ |
- | | | 7 | --------------------------+
- | | +-----------+
- | |
- | |
- | |
- +-----------+
+Meanwhile, each optimized hugepage still has
+``OPTIMIZED_FOLIO_VMEMMAP_NR_STRUCT_PAGES`` usable ``struct page`` structures.
--
2.54.0
^ permalink raw reply related
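The size threshold described in the rewritten vmemmap_dedup.rst (patch 69/69) can be checked with a small worked calculation. The sketch below assumes 4 KB base pages and a 64-byte struct page, matching the 2 MB example in the document; the EX_* constants and function names are invented for illustration and are not the kernel's macros or helpers.

```c
#include <assert.h>

/* Assumed values for the example: 4 KB base pages, 64-byte struct page. */
#define EX_PAGE_SIZE        4096UL
#define EX_STRUCT_PAGE_SIZE 64UL

/* Base pages of vmemmap needed to hold the struct pages of one hugepage. */
static unsigned long vmemmap_pages_for_order(unsigned int order)
{
	unsigned long struct_bytes = (1UL << order) * EX_STRUCT_PAGE_SIZE;

	/* Round up to whole base pages. */
	return (struct_bytes + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE;
}

/* Smallest order whose struct pages span at least two base pages. */
static unsigned int min_optimizable_order(void)
{
	unsigned int order = 0;

	while ((1UL << order) * EX_STRUCT_PAGE_SIZE < 2 * EX_PAGE_SIZE)
		order++;
	return order;
}

/*
 * vmemmap pages that can be remapped to the shared frame: every page except
 * the first one. The shared per-zone frame itself is not counted here.
 */
static unsigned long vmemmap_pages_saved(unsigned int order)
{
	assert(order >= min_optimizable_order());
	return vmemmap_pages_for_order(order) - 1;
}
```

Under these assumptions the minimum order works out to 7, which is 128 base pages or 512 KB, and a 2 MB hugepage (order 9) needs 8 vmemmap pages of which 7 can share one frame, in line with the diagrams in the document.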
* Re: [PATCH 01/19] btrfs: require at least 4 devices for RAID 6
From: H. Peter Anvin @ 2026-05-13 16:14 UTC (permalink / raw)
To: Christoph Hellwig, Andrew Morton
Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Huacai Chen,
WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, Herbert Xu, Dan Williams, Chris Mason,
David Sterba, Arnd Bergmann, Song Liu, Yu Kuai, Li Nan,
linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
linux-raid
In-Reply-To: <20260512052230.2947683-2-hch@lst.de>
On May 11, 2026 10:20:41 PM PDT, Christoph Hellwig <hch@lst.de> wrote:
>While the RAID6 algorithm could in theory support 3 devices by just
>copying the data disk to the two parity disks, this version is not only
>useless because it is a suboptimal version of 3-way mirroring, but also
>broken with various crashes and incorrect parity generation in various
>architecture-optimized implementations. Disallow it similar to mdraid
>which requires at least 4 devices for RAID 6.
>
>Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6")
>Signed-off-by: Christoph Hellwig <hch@lst.de>
>---
> fs/btrfs/volumes.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>index a88e68f90564..0b54b97bdad8 100644
>--- a/fs/btrfs/volumes.c
>+++ b/fs/btrfs/volumes.c
>@@ -159,7 +159,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
> .sub_stripes = 1,
> .dev_stripes = 1,
> .devs_max = 0,
>- .devs_min = 3,
>+ .devs_min = 4,
> .tolerated_failures = 2,
> .devs_increment = 1,
> .ncopies = 1,
Yes, if anyone cares about < 4 disks for the RAID-6 case (or < 3 for the RAID-4/5 case), just use the RAID-1 code.
^ permalink raw reply
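To illustrate the remark in the quoted commit message that a 3-device RAID 6 amounts to copying the single data disk to both parity disks, here is a minimal byte-wise sketch of the P and Q syndromes over GF(2^8) with the usual RAID 6 polynomial 0x11d and generator {02}. It is not the kernel's raid6 library code; gf_mul2() and raid6_pq_byte() are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by the generator {02} in GF(2^8), reducing by polynomial 0x11d. */
static uint8_t gf_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/*
 * Compute P (plain XOR) and Q (sum of g^i * D_i) for one byte position across
 * ndata data disks, using Horner's rule from the highest disk down to disk 0.
 */
static void raid6_pq_byte(const uint8_t *data, size_t ndata,
			  uint8_t *p, uint8_t *q)
{
	uint8_t wp = 0, wq = 0;

	assert(ndata > 0);
	for (size_t i = ndata; i-- > 0; ) {
		wp ^= data[i];
		wq = gf_mul2(wq) ^ data[i];
	}
	*p = wp;
	*q = wq;
}
```

With a single data disk the loop runs once, so both P and Q equal the data byte (g^0 is 1), which is why that layout degenerates into a three-way mirror.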
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Jason Gunthorpe @ 2026-05-13 17:24 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Aneesh Kumar K.V (Arm), iommu, linux-arm-kernel, linux-kernel,
linux-coco, Robin Murphy, Marek Szyprowski, Will Deacon,
Marc Zyngier, Steven Price, Suzuki K Poulose, Catalin Marinas,
Jiri Pirko, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
Xu Yilun, linuxppc-dev, linux-s390, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy (CS GROUP),
Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <agSKQrSIhizCXKwx@google.com>
On Wed, May 13, 2026 at 02:27:14PM +0000, Mostafa Saleh wrote:
> > + /*
> > + * if platform supports memory encryption,
> > + * restricted mem pool is decrypted by default
> > + */
> > + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> > + mem->unencrypted = true;
> > + set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> > + rmem->size >> PAGE_SHIFT);
> > + } else {
> > + mem->unencrypted = false;
> > + }
>
> This breaks pKVM as it doesn’t set CC_ATTR_MEM_ENCRYPT, so all virtio
> traffic now fails.
How will pKVM signal what kind of memory the DMA needs then?
Does it use set_memory_decrypted()? How can it use
set_memory_decrypted() without offering CC_ATTR_MEM_ENCRYPT?
> Also, by design, some drivers are clueless about bouncing, so
Oh? What does this mean? We take quite a dim view of drivers mis-using
the DMA API..
> I believe that the pool should have a way to control it’s property
> (encrypted or decrypted) and that takes priority over whatever
> attributes comes from allocation.
We should get here because dma_capable() fails, and then swiotlb needs
to return something that makes dma_capable() succeed. Yes, it should
return details about the thing it decided, but it shouldn't have been
pre-created with some idea how to make dma_capable() work.
If dma_capable() can fail, then swiotlb should know exactly what to do
to fix it.
If pKVM wants to use the hacky scheme where you force a swiotlb pool
configuration during arch init with force swiotlb, that's a somewhat
different flow and, sure, the forced pool should do whatever it is
forced to.
But let's try to keep them separated in the discussion.
> And that brings us to the same point whether it’s better to return
> the memory along with it’s state or we pass the requested state.
> I think for other cases it’s fine for the device/DMA-API to dictate
> the attrs, but not in restricted-dma case, the firmware just knows better.
The memory type must be returned back at some level so downstream
things can do the right transformation of the phys_addr_t.
One of the aspirational CC things that should work is: a T=1 device
tries to DMA from a decrypted page, finds the address is above the DMA
limit of the device, so it bounces it with SWIOTLB to an encrypted low
address page, and then the DMA API internal flow switches from working
with a decrypted to an encrypted phys_addr_t.
If we can make that work then maybe the flows are designed correctly.
Jason
^ permalink raw reply
* Re: [PATCH v2 0/3] KVM: Fix and clean up kvm_vcpu_map[_readonly]() usages
From: Sean Christopherson @ 2026-05-13 17:33 UTC (permalink / raw)
To: Peter Fang
Cc: David Woodhouse, Paolo Bonzini, Madhavan Srinivasan,
Nicholas Piggin, Fred Griffoul, Yosry Ahmed, Ritesh Harjani,
Michael Ellerman, Christophe Leroy (CS GROUP), Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
kvm, linuxppc-dev, linux-kernel
In-Reply-To: <20260507081855.GA362060@pedri>
On Thu, May 07, 2026, Peter Fang wrote:
> On Mon, May 04, 2026 at 10:59:06AM -0700, Sean Christopherson wrote:
> > On Mon, Apr 27, 2026, Peter Fang wrote:
> >
> > > Thanks David!
> > >
> > > I think I'd need at least input from the maintainers on this but just by
> > > code inspection, the kvm_vcpu_map() usage in sev.c seems a bit tricky.
> > > Unmapping doesn't happen until right before switching to the guest, so
> > > this might fall into the "keep the mapping around for a longer time"
> > > category [1].
> >
> > It definitely falls into that category. But that code is also rather gross, i.e.
> > could use some cleanup no matter what, so I don't think it's a good argument for
> > keeping kvm_vcpu_map() around.
> >
> > To avoid a bunch of pointless work and churn, let's hold off on hardening and/or
> > renaming kvm_vcpu_map() for now. I'll take this v2 as-is; even though taking a
> > gpa instead of a gfn will conflict with the nVMX series, it's dead simple and a
> > worthwhile cleanup even if some of the conversions get discarded shortly after.
I had a change of heart after looking at the applied code, and after going through
Fred's gpc+nVMX series. I don't want to have a discrepancy between kvm_vcpu_map()
and __kvm_vcpu_map(), even for a "short" amount of time, and I do think it makes
sense to pursue switching to gpcs for the nested code. But, I also agree with the
changelog's statement that __kvm_vcpu_map() fundamentally operates on gfns, i.e. I
don't want to "fix" the discrepancy.
The other thing that swayed me is patch 2; I have a separate patch (amusingly
related to gpc stuff) to extract gpa_to_gfn() (and others) into kvm_types.h, and so
I don't want to take patch 2 either.
Long story short, I'm going to grab only patch 1.
^ permalink raw reply
* Re: [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX
From: Andrew Morton @ 2026-05-13 17:46 UTC (permalink / raw)
To: Muchun Song
Cc: David Hildenbrand, Muchun Song, Oscar Salvador, Michael Ellerman,
Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Nicholas Piggin, Christophe Leroy, Ackerley Tng,
Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
linuxppc-dev, linux-kernel
In-Reply-To: <20260513130542.35604-1-songmuchun@bytedance.com>
On Wed, 13 May 2026 21:04:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
> In this series, HVO is redefined as Hugepage Vmemmap Optimization: a
> general vmemmap optimization model for large hugepage-backed mappings,
> rather than a HugeTLB-only implementation detail.
>
> The existing code grew around the original HugeTLB-specific HVO path,
> while device DAX developed similar but separate vmemmap optimization
> handling. As a result, the current implementation carries duplicated
> logic, boot-time special cases, and subsystem-specific interfaces around
> what is fundamentally the same sparse-vmemmap optimization.
>
> This series generalizes that optimization into a common framework used
> by both HugeTLB and device DAX.
>
> The first few patches include some minor bug fixes found during AI-aided
> review of the current code. These fixes are not the main goal of the
> series, but the later refactoring and unification work depends on them,
> so they are included here as preparatory changes.
>
> The series then reworks the relevant early boot and sparse
> initialization paths, introduces a generic section-based sparse-vmemmap
> optimization infrastructure, switches HugeTLB and device DAX over to the
> shared implementation, and removes the old special-case code.
>
> ...
>
> 46 files changed, 743 insertions(+), 1812 deletions(-)
Gulp.
I think the first 15ish patches (little fixes and cleanups and
refactorings) are ready to go in immediately?
Perhaps you could prepare such things as a separate series. Or tell me
which ones are suitable and I'll fudge up a [0/N]?
^ permalink raw reply
* Re: [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX
From: Oscar Salvador @ 2026-05-13 18:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Muchun Song, David Hildenbrand, Muchun Song, Michael Ellerman,
Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Nicholas Piggin, Christophe Leroy, Ackerley Tng,
Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
linuxppc-dev, linux-kernel
In-Reply-To: <20260513104640.b0f02b844c57f92bc954878e@linux-foundation.org>
On Wed, May 13, 2026 at 10:46:40AM -0700, Andrew Morton wrote:
> On Wed, 13 May 2026 21:04:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>
> > In this series, HVO is redefined as Hugepage Vmemmap Optimization: a
> > general vmemmap optimization model for large hugepage-backed mappings,
> > rather than a HugeTLB-only implementation detail.
> >
> > The existing code grew around the original HugeTLB-specific HVO path,
> > while device DAX developed similar but separate vmemmap optimization
> > handling. As a result, the current implementation carries duplicated
> > logic, boot-time special cases, and subsystem-specific interfaces around
> > what is fundamentally the same sparse-vmemmap optimization.
> >
> > This series generalizes that optimization into a common framework used
> > by both HugeTLB and device DAX.
> >
> > The first few patches include some minor bug fixes found during AI-aided
> > review of the current code. These fixes are not the main goal of the
> > series, but the later refactoring and unification work depends on them,
> > so they are included here as preparatory changes.
> >
> > The series then reworks the relevant early boot and sparse
> > initialization paths, introduces a generic section-based sparse-vmemmap
> > optimization infrastructure, switches HugeTLB and device DAX over to the
> > shared implementation, and removes the old special-case code.
> >
> > ...
> >
> > 46 files changed, 743 insertions(+), 1812 deletions(-)
>
> Gulp.
>
> I think the first 15ish patches (little fixes and cleanups and
> refactorings) are ready to go in immediately?
I plan to have a (partial) look at this tomorrow/Friday, but splitting
this series into fixes-that-can-go-straight-away and the feature itself would make more
sense and help ease the review.
Head tends to spin a bit when the patchset grows beyond a certain number of patches :-D.
Would that be possible, Muchun?
--
Oscar Salvador
SUSE Labs
^ permalink raw reply
* Re: [PATCH 01/19] btrfs: require at least 4 devices for RAID 6
From: David Sterba @ 2026-05-13 20:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Catalin Marinas, Will Deacon, Ard Biesheuvel,
Huacai Chen, WANG Xuerui, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Paul Walmsley,
Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Herbert Xu, Dan Williams,
Chris Mason, David Sterba, Arnd Bergmann, Song Liu, Yu Kuai,
Li Nan, linux-kernel, linux-arm-kernel, loongarch, linuxppc-dev,
linux-riscv, linux-s390, linux-crypto, linux-btrfs, linux-arch,
linux-raid
In-Reply-To: <20260513054742.GA1018@lst.de>
On Wed, May 13, 2026 at 07:47:42AM +0200, Christoph Hellwig wrote:
> On Tue, May 12, 2026 at 01:42:31PM +0200, David Sterba wrote:
> > On Tue, May 12, 2026 at 07:20:41AM +0200, Christoph Hellwig wrote:
> > > While the RAID6 algorithm could in theory support 3 devices by just
> > > copying the data disk to the two parity disks, this version is not only
> > > useless because it is a suboptimal version of 3-way mirroring, but also
> > > broken with various crashes and incorrect parity generation in various
> > > architecture-optimized implementations. Disallow it similar to mdraid
> > > which requires at least 4 devices for RAID 6.
> > >
> > > Fixes: 53b381b3abeb ("Btrfs: RAID5 and RAID6")
> > > Signed-off-by: Christoph Hellwig <hch@lst.de>
> >
> > This patch should have been sent separately as it has user visible
> > impact and can potentially break some setups.
>
> It _is_ sent out separate.
It's a public interface change of btrfs, but it sits in a patch series cleaning
up some library code; I noticed it by accident.
> > The degenerate modes of
> > raid0, 5, or 6 are explicit as a possible middle step when converting
> > profiles. We can use a fallback implementation for this case if the
> > accelerated implementations cannot do it.
>
> This is not about a degenerated mode. For a degenerated RAID 6, parity
> generation uses the RAID 5 XOR routines as the second parity will be
> missing. This is about generating two parities for a single data disk,
> which must be explicitly selected.
The calculation is a different matter than what I'm concerned about: changing
the minimum number of devices from 3 to 4 is a breaking change. If the library won't
provide the xor/parity functions then we'll have to add a fallback for
this special case.
^ permalink raw reply
* Re: [PATCH v4 02/13] dma-direct: use DMA_ATTR_CC_SHARED in alloc/free paths
From: Mostafa Saleh @ 2026-05-13 13:58 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260512090408.794195-3-aneesh.kumar@kernel.org>
On Tue, May 12, 2026 at 02:33:57PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Propagate force_dma_unencrypted() into DMA_ATTR_CC_SHARED in the
> dma-direct allocation path and use the attribute to drive the related
> decisions.
>
> This updates dma_direct_alloc(), dma_direct_free(), and
> dma_direct_alloc_pages() to fold the forced unencrypted case into attrs.
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> kernel/dma/direct.c | 44 ++++++++++++++++++++++++++++++++++++--------
> 1 file changed, 36 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index b958f150718a..0c2e1f8436ce 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -201,16 +201,31 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
> {
> bool remap = false, set_uncached = false;
> - bool mark_mem_decrypt = true;
> + bool mark_mem_decrypt = false;
> struct page *page;
> void *ret;
>
> + /*
> + * DMA_ATTR_CC_SHARED is not a caller-visible dma_alloc_*()
> + * attribute. The direct allocator uses it internally after it has
> + * decided that the backing pages must be shared/decrypted, so the
> + * rest of the allocation path can consistently select DMA addresses,
> + * choose compatible pools and restore encryption on free.
> + */
> + if (attrs & DMA_ATTR_CC_SHARED)
> + return NULL;
> +
> + if (force_dma_unencrypted(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> + mark_mem_decrypt = true;
> + }
> +
> size = PAGE_ALIGN(size);
> if (attrs & DMA_ATTR_NO_WARN)
> gfp |= __GFP_NOWARN;
>
> - if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> - !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev))
> + if (((attrs & (DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_CC_SHARED)) ==
> + DMA_ATTR_NO_KERNEL_MAPPING) && !is_swiotlb_for_alloc(dev))
> return dma_direct_alloc_no_mapping(dev, size, dma_handle, gfp);
>
> if (!dev_is_dma_coherent(dev)) {
> @@ -244,7 +259,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> * Remapping or decrypting memory may block, allocate the memory from
> * the atomic pools instead if we aren't allowed block.
> */
> - if ((remap || force_dma_unencrypted(dev)) &&
> + if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
> dma_direct_use_pool(dev, gfp))
> return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>
> @@ -318,11 +333,20 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> void dma_direct_free(struct device *dev, size_t size,
> void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs)
> {
> - bool mark_mem_encrypted = true;
> + bool mark_mem_encrypted = false;
> unsigned int page_order = get_order(size);
>
> - if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> - !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev)) {
> + /*
> + * if the device had requested for an unencrypted buffer,
> + * convert it to encrypted on free
> + */
> + if (force_dma_unencrypted(dev)) {
> + attrs |= DMA_ATTR_CC_SHARED;
> + mark_mem_encrypted = true;
> + }
> +
> + if (((attrs & (DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_CC_SHARED)) ==
> + DMA_ATTR_NO_KERNEL_MAPPING) && !is_swiotlb_for_alloc(dev)) {
> /* cpu_addr is a struct page cookie, not a kernel address */
> dma_free_contiguous(dev, cpu_addr, size);
> return;
> @@ -365,10 +389,14 @@ void dma_direct_free(struct device *dev, size_t size,
> struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> dma_addr_t *dma_handle, enum dma_data_direction dir, gfp_t gfp)
> {
> + unsigned long attrs = 0;
> struct page *page;
> void *ret;
>
> - if (force_dma_unencrypted(dev) && dma_direct_use_pool(dev, gfp))
> + if (force_dma_unencrypted(dev))
> + attrs |= DMA_ATTR_CC_SHARED;
> +
> + if ((attrs & DMA_ATTR_CC_SHARED) && dma_direct_use_pool(dev, gfp))
> return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
What about dma_direct_free_pages()? Nothing inside uses attrs, but
that’s quite similar to dma_direct_alloc_pages().
Also, at this point, shouldn’t this patch also remove
force_dma_unencrypted() calls from dma_set_decrypted() and
dma_set_encrypted()?
Thanks,
Mostafa
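For readers following the quoted hunks: the combined mask test that replaces the old !force_dma_unencrypted(dev) check is true only when the no-kernel-mapping bit is set and the shared bit is clear. A small stand-alone sketch, assuming made-up bit values (the SKETCH_ATTR_* constants and both function names are hypothetical, not the kernel's definitions), and leaving out the separate !is_swiotlb_for_alloc(dev) term:

```c
#include <stdbool.h>

/* Hypothetical bit values, for illustration only. */
#define SKETCH_ATTR_NO_KERNEL_MAPPING	(1UL << 0)
#define SKETCH_ATTR_CC_SHARED		(1UL << 1)

/* Single compare: NO_KERNEL_MAPPING set and CC_SHARED clear. */
static bool sketch_no_mapping_path(unsigned long attrs)
{
	return (attrs & (SKETCH_ATTR_NO_KERNEL_MAPPING | SKETCH_ATTR_CC_SHARED)) ==
	       SKETCH_ATTR_NO_KERNEL_MAPPING;
}

/* Equivalent two-test form, spelled out. */
static bool sketch_no_mapping_path_long(unsigned long attrs)
{
	return (attrs & SKETCH_ATTR_NO_KERNEL_MAPPING) &&
	       !(attrs & SKETCH_ATTR_CC_SHARED);
}
```

Both forms agree for every attrs value; unrelated bits do not affect the result because they are masked off before the compare.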
>
> if (is_swiotlb_for_alloc(dev)) {
> --
> 2.43.0
>
^ permalink raw reply
* Re: [PATCH v4 01/13] dma-direct: swiotlb: handle swiotlb alloc/free outside __dma_direct_alloc_pages
From: Mostafa Saleh @ 2026-05-13 13:57 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260512090408.794195-2-aneesh.kumar@kernel.org>
On Tue, May 12, 2026 at 02:33:56PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Move swiotlb allocation out of __dma_direct_alloc_pages() and handle it in
> dma_direct_alloc() / dma_direct_alloc_pages().
>
> This is needed for follow-up changes that simplify the handling of
> memory encryption/decryption based on the DMA attribute flags.
>
> swiotlb backing pages are already mapped decrypted by
> swiotlb_update_mem_attributes() and rmem_swiotlb_device_init(), so
> dma-direct should not call dma_set_decrypted() on allocation nor
> dma_set_encrypted() on free for swiotlb-backed memory.
>
> Update alloc/free paths to detect swiotlb-backed pages and skip
> encrypt/decrypt transitions for those paths. Keep the existing highmem
> rejection in dma_direct_alloc_pages() for swiotlb allocations.
>
> Only for "restricted-dma-pool", we currently set `for_alloc = true`, while
> rmem_swiotlb_device_init() decrypts the whole pool up front. This pool is
> typically used together with "shared-dma-pool", where the shared region is
> accessed after remap/ioremap and the returned address is suitable for
> decrypted memory access. So existing code paths remain valid.
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> kernel/dma/direct.c | 44 +++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 37 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index ec887f443741..b958f150718a 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -125,9 +125,6 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
>
> WARN_ON_ONCE(!PAGE_ALIGNED(size));
>
> - if (is_swiotlb_for_alloc(dev))
> - return dma_direct_alloc_swiotlb(dev, size);
> -
> gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
> page = dma_alloc_contiguous(dev, size, gfp);
> if (page) {
> @@ -204,6 +201,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
> {
> bool remap = false, set_uncached = false;
> + bool mark_mem_decrypt = true;
> struct page *page;
> void *ret;
>
> @@ -250,11 +248,21 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> dma_direct_use_pool(dev, gfp))
> return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>
> + if (is_swiotlb_for_alloc(dev)) {
> + page = dma_direct_alloc_swiotlb(dev, size);
> + if (page) {
> + mark_mem_decrypt = false;
> + goto setup_page;
> + }
> + return NULL;
> + }
> +
> /* we always manually zero the memory once we are done */
> page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO, true);
> if (!page)
> return NULL;
>
> +setup_page:
> /*
> * dma_alloc_contiguous can return highmem pages depending on a
> * combination the cma= arguments and per-arch setup. These need to be
> @@ -281,7 +289,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> goto out_free_pages;
> } else {
> ret = page_address(page);
> - if (dma_set_decrypted(dev, ret, size))
> + if (mark_mem_decrypt && dma_set_decrypted(dev, ret, size))
I am OK with that approach, but Jason was mentioning we shouldn’t
special-case swiotlb and should instead make the allocator return the memory state
(similar to the dma_page [1]). I am also OK if you want to merge that
part of my series with this.
[1] https://lore.kernel.org/linux-iommu/20260408194750.2280873-1-smostafa@google.com/
> goto out_leak_pages;
> }
>
> @@ -298,7 +306,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> return ret;
>
> out_encrypt_pages:
> - if (dma_set_encrypted(dev, page_address(page), size))
> + if (mark_mem_decrypt && dma_set_encrypted(dev, page_address(page), size))
> return NULL;
> out_free_pages:
> __dma_direct_free_pages(dev, page, size);
> @@ -310,6 +318,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> void dma_direct_free(struct device *dev, size_t size,
> void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs)
> {
> + bool mark_mem_encrypted = true;
> unsigned int page_order = get_order(size);
>
> if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> @@ -338,12 +347,15 @@ void dma_direct_free(struct device *dev, size_t size,
> dma_free_from_pool(dev, cpu_addr, PAGE_ALIGN(size)))
> return;
>
> + if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
> + mark_mem_encrypted = false;
> +
> if (is_vmalloc_addr(cpu_addr)) {
> vunmap(cpu_addr);
> } else {
> if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_CLEAR_UNCACHED))
> arch_dma_clear_uncached(cpu_addr, size);
> - if (dma_set_encrypted(dev, cpu_addr, size))
> + if (mark_mem_encrypted && dma_set_encrypted(dev, cpu_addr, size))
> return;
> }
>
> @@ -359,6 +371,19 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> if (force_dma_unencrypted(dev) && dma_direct_use_pool(dev, gfp))
> return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>
> + if (is_swiotlb_for_alloc(dev)) {
> + page = dma_direct_alloc_swiotlb(dev, size);
> + if (!page)
> + return NULL;
> +
> + if (PageHighMem(page)) {
My understanding is that rmem_swiotlb_device_init() asserts that there
is no PageHighMem()? Also a similar check doesn’t exist in
dma_direct_alloc().
Thanks,
Mostafa
> + swiotlb_free(dev, page, size);
> + return NULL;
> + }
> + ret = page_address(page);
> + goto setup_page;
> + }
> +
> page = __dma_direct_alloc_pages(dev, size, gfp, false);
> if (!page)
> return NULL;
> @@ -366,6 +391,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> ret = page_address(page);
> if (dma_set_decrypted(dev, ret, size))
> goto out_leak_pages;
> +setup_page:
> memset(ret, 0, size);
> *dma_handle = phys_to_dma_direct(dev, page_to_phys(page));
> return page;
> @@ -378,13 +404,17 @@ void dma_direct_free_pages(struct device *dev, size_t size,
> enum dma_data_direction dir)
> {
> void *vaddr = page_address(page);
> + bool mark_mem_encrypted = true;
>
> /* If cpu_addr is not from an atomic pool, dma_free_from_pool() fails */
> if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
> dma_free_from_pool(dev, vaddr, size))
> return;
>
> - if (dma_set_encrypted(dev, vaddr, size))
> + if (swiotlb_find_pool(dev, page_to_phys(page)))
> + mark_mem_encrypted = false;
> +
> + if (mark_mem_encrypted && dma_set_encrypted(dev, vaddr, size))
> return;
> __dma_direct_free_pages(dev, page, size);
> }
> --
> 2.43.0
>
^ permalink raw reply
* Re: [PATCH v4 03/13] dma-pool: track decrypted atomic pools and select them via attrs
From: Mostafa Saleh @ 2026-05-13 14:00 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260512090408.794195-4-aneesh.kumar@kernel.org>
On Tue, May 12, 2026 at 02:33:58PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Teach the atomic DMA pool code to distinguish between encrypted and
> decrypted pools, and make pool allocation select the matching pool based
> on DMA attributes.
>
> Introduce a dma_gen_pool wrapper that records whether a pool is
> decrypted, initialize that state when the atomic pools are created, and
> use it when expanding and resizing the pools. Update dma_alloc_from_pool()
> to take attrs and skip pools whose encrypted/decrypted state does not
> match DMA_ATTR_CC_SHARED. Update dma_free_from_pool() accordingly.
>
> Also pass DMA_ATTR_CC_SHARED from the swiotlb atomic allocation path
> so decrypted swiotlb allocations are taken from the correct atomic pool.
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> drivers/iommu/dma-iommu.c | 2 +-
> include/linux/dma-map-ops.h | 2 +-
> kernel/dma/direct.c | 11 ++-
> kernel/dma/pool.c | 163 +++++++++++++++++++++++-------------
> kernel/dma/swiotlb.c | 7 +-
> 5 files changed, 122 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 54d96e847f16..c2595bee3d41 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1673,7 +1673,7 @@ void *iommu_dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
> if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
> !gfpflags_allow_blocking(gfp) && !coherent)
> page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
> - gfp, NULL);
> + gfp, attrs, NULL);
> else
> cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
> if (!cpu_addr)
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 6a1832a73cad..696b2c3a2305 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -212,7 +212,7 @@ void *dma_common_pages_remap(struct page **pages, size_t size, pgprot_t prot,
> void dma_common_free_remap(void *cpu_addr, size_t size);
>
> struct page *dma_alloc_from_pool(struct device *dev, size_t size,
> - void **cpu_addr, gfp_t flags,
> + void **cpu_addr, gfp_t flags, unsigned long attrs,
> bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t));
> bool dma_free_from_pool(struct device *dev, void *start, size_t size);
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 0c2e1f8436ce..dc2907439b3d 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -162,7 +162,7 @@ static bool dma_direct_use_pool(struct device *dev, gfp_t gfp)
> }
>
> static void *dma_direct_alloc_from_pool(struct device *dev, size_t size,
> - dma_addr_t *dma_handle, gfp_t gfp)
> + dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
> {
> struct page *page;
> u64 phys_limit;
> @@ -172,7 +172,8 @@ static void *dma_direct_alloc_from_pool(struct device *dev, size_t size,
> return NULL;
>
> gfp |= dma_direct_optimal_gfp_mask(dev, &phys_limit);
> - page = dma_alloc_from_pool(dev, size, &ret, gfp, dma_coherent_ok);
> + page = dma_alloc_from_pool(dev, size, &ret, gfp, attrs,
> + dma_coherent_ok);
> if (!page)
> return NULL;
> *dma_handle = phys_to_dma_direct(dev, page_to_phys(page));
> @@ -261,7 +262,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> */
> if ((remap || (attrs & DMA_ATTR_CC_SHARED)) &&
> dma_direct_use_pool(dev, gfp))
> - return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
> + return dma_direct_alloc_from_pool(dev, size, dma_handle,
> + gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> page = dma_direct_alloc_swiotlb(dev, size);
> @@ -397,7 +399,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> attrs |= DMA_ATTR_CC_SHARED;
>
> if ((attrs & DMA_ATTR_CC_SHARED) && dma_direct_use_pool(dev, gfp))
> - return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
> + return dma_direct_alloc_from_pool(dev, size, dma_handle,
> + gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> page = dma_direct_alloc_swiotlb(dev, size);
> diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
> index 2b2fbb709242..75f0eba48a23 100644
> --- a/kernel/dma/pool.c
> +++ b/kernel/dma/pool.c
> @@ -12,12 +12,18 @@
> #include <linux/set_memory.h>
> #include <linux/slab.h>
> #include <linux/workqueue.h>
> +#include <linux/cc_platform.h>
>
> -static struct gen_pool *atomic_pool_dma __ro_after_init;
> +struct dma_gen_pool {
> + bool unencrypted;
> + struct gen_pool *pool;
> +};
> +
> +static struct dma_gen_pool atomic_pool_dma __ro_after_init;
> static unsigned long pool_size_dma;
> -static struct gen_pool *atomic_pool_dma32 __ro_after_init;
> +static struct dma_gen_pool atomic_pool_dma32 __ro_after_init;
> static unsigned long pool_size_dma32;
> -static struct gen_pool *atomic_pool_kernel __ro_after_init;
> +static struct dma_gen_pool atomic_pool_kernel __ro_after_init;
> static unsigned long pool_size_kernel;
>
> /* Size can be defined by the coherent_pool command line */
> @@ -76,7 +82,7 @@ static bool cma_in_zone(gfp_t gfp)
> return true;
> }
>
> -static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
> +static int atomic_pool_expand(struct dma_gen_pool *dma_pool, size_t pool_size,
> gfp_t gfp)
> {
> unsigned int order;
> @@ -113,12 +119,15 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
> * Memory in the atomic DMA pools must be unencrypted, the pools do not
> * shrink so no re-encryption occurs in dma_direct_free().
> */
> - ret = set_memory_decrypted((unsigned long)page_to_virt(page),
> + if (dma_pool->unencrypted) {
> + ret = set_memory_decrypted((unsigned long)page_to_virt(page),
> 1 << order);
> - if (ret)
> - goto remove_mapping;
> - ret = gen_pool_add_virt(pool, (unsigned long)addr, page_to_phys(page),
> - pool_size, NUMA_NO_NODE);
> + if (ret)
> + goto remove_mapping;
> + }
> +
> + ret = gen_pool_add_virt(dma_pool->pool, (unsigned long)addr,
> + page_to_phys(page), pool_size, NUMA_NO_NODE);
> if (ret)
> goto encrypt_mapping;
>
> @@ -126,11 +135,15 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
> return 0;
>
> encrypt_mapping:
> - ret = set_memory_encrypted((unsigned long)page_to_virt(page),
> - 1 << order);
> - if (WARN_ON_ONCE(ret)) {
> - /* Decrypt succeeded but encrypt failed, purposely leak */
> - goto out;
> + if (dma_pool->unencrypted) {
> + int rc;
> +
> + rc = set_memory_encrypted((unsigned long)page_to_virt(page),
> + 1 << order);
> + if (WARN_ON_ONCE(rc)) {
> + /* Decrypt succeeded but encrypt failed, purposely leak */
> + goto out;
> + }
> }
> remove_mapping:
> #ifdef CONFIG_DMA_DIRECT_REMAP
> @@ -142,46 +155,52 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
> return ret;
> }
>
> -static void atomic_pool_resize(struct gen_pool *pool, gfp_t gfp)
> +static void atomic_pool_resize(struct dma_gen_pool *dma_pool, gfp_t gfp)
> {
> - if (pool && gen_pool_avail(pool) < atomic_pool_size)
> - atomic_pool_expand(pool, gen_pool_size(pool), gfp);
> + if (dma_pool->pool && gen_pool_avail(dma_pool->pool) < atomic_pool_size)
> + atomic_pool_expand(dma_pool, gen_pool_size(dma_pool->pool), gfp);
> }
>
> static void atomic_pool_work_fn(struct work_struct *work)
> {
> if (IS_ENABLED(CONFIG_ZONE_DMA))
> - atomic_pool_resize(atomic_pool_dma,
> + atomic_pool_resize(&atomic_pool_dma,
> GFP_KERNEL | GFP_DMA);
> if (IS_ENABLED(CONFIG_ZONE_DMA32))
> - atomic_pool_resize(atomic_pool_dma32,
> + atomic_pool_resize(&atomic_pool_dma32,
> GFP_KERNEL | GFP_DMA32);
> - atomic_pool_resize(atomic_pool_kernel, GFP_KERNEL);
> + atomic_pool_resize(&atomic_pool_kernel, GFP_KERNEL);
> }
>
> -static __init struct gen_pool *__dma_atomic_pool_init(size_t pool_size,
> - gfp_t gfp)
> +static __init struct dma_gen_pool *__dma_atomic_pool_init(struct dma_gen_pool *dma_pool,
> + size_t pool_size, gfp_t gfp)
> {
> - struct gen_pool *pool;
> int ret;
>
> - pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> - if (!pool)
> + dma_pool->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> + if (!dma_pool->pool)
> return NULL;
>
> - gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
> + gen_pool_set_algo(dma_pool->pool, gen_pool_first_fit_order_align, NULL);
> +
> + /* if platform is using memory encryption atomic pools are by default decrypted. */
> + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> + dma_pool->unencrypted = true;
> + else
> + dma_pool->unencrypted = false;
I believe that’s a good start, although we might need more fine-grained
policies in the future, as CC guests might need encrypted pools.
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Thanks,
Mostafa
>
> - ret = atomic_pool_expand(pool, pool_size, gfp);
> + ret = atomic_pool_expand(dma_pool, pool_size, gfp);
> if (ret) {
> - gen_pool_destroy(pool);
> + gen_pool_destroy(dma_pool->pool);
> + dma_pool->pool = NULL;
> pr_err("DMA: failed to allocate %zu KiB %pGg pool for atomic allocation\n",
> pool_size >> 10, &gfp);
> return NULL;
> }
>
> pr_info("DMA: preallocated %zu KiB %pGg pool for atomic allocations\n",
> - gen_pool_size(pool) >> 10, &gfp);
> - return pool;
> + gen_pool_size(dma_pool->pool) >> 10, &gfp);
> + return dma_pool;
> }
>
> #ifdef CONFIG_ZONE_DMA32
> @@ -207,21 +226,22 @@ static int __init dma_atomic_pool_init(void)
>
> /* All memory might be in the DMA zone(s) to begin with */
> if (has_managed_zone(ZONE_NORMAL)) {
> - atomic_pool_kernel = __dma_atomic_pool_init(atomic_pool_size,
> - GFP_KERNEL);
> - if (!atomic_pool_kernel)
> + __dma_atomic_pool_init(&atomic_pool_kernel, atomic_pool_size, GFP_KERNEL);
> + if (!atomic_pool_kernel.pool)
> ret = -ENOMEM;
> }
> +
> if (has_managed_dma()) {
> - atomic_pool_dma = __dma_atomic_pool_init(atomic_pool_size,
> - GFP_KERNEL | GFP_DMA);
> - if (!atomic_pool_dma)
> + __dma_atomic_pool_init(&atomic_pool_dma, atomic_pool_size,
> + GFP_KERNEL | GFP_DMA);
> + if (!atomic_pool_dma.pool)
> ret = -ENOMEM;
> }
> +
> if (has_managed_dma32) {
> - atomic_pool_dma32 = __dma_atomic_pool_init(atomic_pool_size,
> - GFP_KERNEL | GFP_DMA32);
> - if (!atomic_pool_dma32)
> + __dma_atomic_pool_init(&atomic_pool_dma32, atomic_pool_size,
> + GFP_KERNEL | GFP_DMA32);
> + if (!atomic_pool_dma32.pool)
> ret = -ENOMEM;
> }
>
> @@ -230,19 +250,44 @@ static int __init dma_atomic_pool_init(void)
> }
> postcore_initcall(dma_atomic_pool_init);
>
> -static inline struct gen_pool *dma_guess_pool(struct gen_pool *prev, gfp_t gfp)
> +static inline struct dma_gen_pool *__dma_guess_pool(struct dma_gen_pool *first,
> + struct dma_gen_pool *second, struct dma_gen_pool *third)
> +{
> + if (first->pool)
> + return first;
> + if (second && second->pool)
> + return second;
> + if (third && third->pool)
> + return third;
> + return NULL;
> +}
> +
> +static inline struct dma_gen_pool *dma_guess_pool(struct dma_gen_pool *prev,
> + gfp_t gfp)
> {
> - if (prev == NULL) {
> + if (!prev) {
> if (gfp & GFP_DMA)
> - return atomic_pool_dma ?: atomic_pool_dma32 ?: atomic_pool_kernel;
> + return __dma_guess_pool(&atomic_pool_dma,
> + &atomic_pool_dma32,
> + &atomic_pool_kernel);
> +
> if (gfp & GFP_DMA32)
> - return atomic_pool_dma32 ?: atomic_pool_dma ?: atomic_pool_kernel;
> - return atomic_pool_kernel ?: atomic_pool_dma32 ?: atomic_pool_dma;
> + return __dma_guess_pool(&atomic_pool_dma32,
> + &atomic_pool_dma,
> + &atomic_pool_kernel);
> +
> + return __dma_guess_pool(&atomic_pool_kernel,
> + &atomic_pool_dma32,
> + &atomic_pool_dma);
> }
> - if (prev == atomic_pool_kernel)
> - return atomic_pool_dma32 ? atomic_pool_dma32 : atomic_pool_dma;
> - if (prev == atomic_pool_dma32)
> - return atomic_pool_dma;
> +
> + if (prev == &atomic_pool_kernel)
> + return __dma_guess_pool(&atomic_pool_dma32,
> + &atomic_pool_dma, NULL);
> +
> + if (prev == &atomic_pool_dma32)
> + return __dma_guess_pool(&atomic_pool_dma, NULL, NULL);
> +
> return NULL;
> }
>
> @@ -272,16 +317,20 @@ static struct page *__dma_alloc_from_pool(struct device *dev, size_t size,
> }
>
> struct page *dma_alloc_from_pool(struct device *dev, size_t size,
> - void **cpu_addr, gfp_t gfp,
> + void **cpu_addr, gfp_t gfp, unsigned long attrs,
> bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t))
> {
> - struct gen_pool *pool = NULL;
> + struct dma_gen_pool *dma_pool = NULL;
> struct page *page;
> bool pool_found = false;
>
> - while ((pool = dma_guess_pool(pool, gfp))) {
> + while ((dma_pool = dma_guess_pool(dma_pool, gfp))) {
> +
> + if (dma_pool->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> + continue;
> +
nit: If no pool matches the requested encryption state, pool_found stays
false, so the failure message printed afterwards is slightly misleading.
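To make the nit concrete, here is a rough user-space sketch (not kernel
code; every name in it is made up for illustration) that keeps "no pool
exists", "pools exist but none has the requested state" and "a matching
pool is exhausted" apart, so that each case could get its own message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pool; not the kernel's struct dma_gen_pool. */
struct model_pool {
	bool present;      /* the pool was created */
	bool unencrypted;  /* the pool memory is shared/decrypted */
	size_t avail;      /* bytes still free in the pool */
};

enum model_result {
	MODEL_OK,
	MODEL_NO_POOL,    /* no pool exists at all */
	MODEL_NO_MATCH,   /* pools exist, but none has the requested state */
	MODEL_EXHAUSTED,  /* a matching pool exists but cannot fit the request */
};

static enum model_result model_alloc(struct model_pool *pools, size_t n,
				     size_t size, bool want_unencrypted)
{
	bool pool_seen = false;     /* any pool exists */
	bool pool_matched = false;  /* a pool with the right state exists */
	size_t i;

	assert(pools != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!pools[i].present)
			continue;
		pool_seen = true;

		/* Skip pools whose encryption state does not match. */
		if (pools[i].unencrypted != want_unencrypted)
			continue;
		pool_matched = true;

		if (pools[i].avail >= size) {
			pools[i].avail -= size;
			return MODEL_OK;
		}
	}

	if (!pool_seen)
		return MODEL_NO_POOL;
	return pool_matched ? MODEL_EXHAUSTED : MODEL_NO_MATCH;
}
```

Whether the real code wants a separate warning for the mismatch case is
a judgment call; the sketch only shows that the two conditions can be
tracked independently.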
> pool_found = true;
> - page = __dma_alloc_from_pool(dev, size, pool, cpu_addr,
> + page = __dma_alloc_from_pool(dev, size, dma_pool->pool, cpu_addr,
> phys_addr_ok);
> if (page)
> return page;
> @@ -296,12 +345,14 @@ struct page *dma_alloc_from_pool(struct device *dev, size_t size,
>
> bool dma_free_from_pool(struct device *dev, void *start, size_t size)
> {
> - struct gen_pool *pool = NULL;
> + struct dma_gen_pool *dma_pool = NULL;
> +
> + while ((dma_pool = dma_guess_pool(dma_pool, 0))) {
>
> - while ((pool = dma_guess_pool(pool, 0))) {
> - if (!gen_pool_has_addr(pool, (unsigned long)start, size))
> + if (!gen_pool_has_addr(dma_pool->pool, (unsigned long)start, size))
> continue;
> - gen_pool_free(pool, (unsigned long)start, size);
> +
> + gen_pool_free(dma_pool->pool, (unsigned long)start, size);
> return true;
> }
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 1abd3e6146f4..ab4eccbaa076 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -612,6 +612,7 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> u64 phys_limit, gfp_t gfp)
> {
> struct page *page;
> + unsigned long attrs = 0;
>
> /*
> * Allocate from the atomic pools if memory is encrypted and
> @@ -623,8 +624,12 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
> return NULL;
>
> + /* swiotlb considered decrypted by default */
> + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> + attrs = DMA_ATTR_CC_SHARED;
> +
> return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
> - dma_coherent_ok);
> + attrs, dma_coherent_ok);
> }
>
> gfp &= ~GFP_ZONEMASK;
> --
> 2.43.0
>
* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-13 14:27 UTC (permalink / raw)
To: Aneesh Kumar K.V (Arm)
Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260512090408.794195-5-aneesh.kumar@kernel.org>
On Tue, May 12, 2026 at 02:33:59PM +0530, Aneesh Kumar K.V (Arm) wrote:
> Teach swiotlb to distinguish between encrypted and decrypted bounce
> buffer pools, and make allocation and mapping paths select a pool whose
> state matches the requested DMA attributes.
>
> Add a decrypted flag to io_tlb_mem, initialize it for the default and
> restricted pools, and propagate DMA_ATTR_CC_SHARED into swiotlb pool
> allocation. Reject swiotlb alloc/map requests when the selected pool does
> not match the required encrypted/decrypted state.
>
> Also return DMA addresses with the matching phys_to_dma_{encrypted,
> unencrypted} helper so the DMA address encoding stays consistent with the
> chosen pool.
>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> ---
> include/linux/dma-direct.h | 10 ++++
> include/linux/swiotlb.h | 8 ++-
> kernel/dma/direct.c | 14 +++--
> kernel/dma/swiotlb.c | 108 +++++++++++++++++++++++++++----------
> 4 files changed, 107 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index c249912456f9..94fad4e7c11e 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -77,6 +77,10 @@ static inline dma_addr_t dma_range_map_max(const struct bus_dma_region *map)
> #ifndef phys_to_dma_unencrypted
> #define phys_to_dma_unencrypted phys_to_dma
> #endif
> +
> +#ifndef phys_to_dma_encrypted
> +#define phys_to_dma_encrypted phys_to_dma
> +#endif
> #else
> static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
> {
> @@ -90,6 +94,12 @@ static inline dma_addr_t phys_to_dma_unencrypted(struct device *dev,
> {
> return dma_addr_unencrypted(__phys_to_dma(dev, paddr));
> }
> +
> +static inline dma_addr_t phys_to_dma_encrypted(struct device *dev,
> + phys_addr_t paddr)
> +{
> + return dma_addr_encrypted(__phys_to_dma(dev, paddr));
> +}
> /*
> * If memory encryption is supported, phys_to_dma will set the memory encryption
> * bit in the DMA address, and dma_to_phys will clear it.
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 3dae0f592063..b3fa3c6e0169 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -81,6 +81,7 @@ struct io_tlb_pool {
> struct list_head node;
> struct rcu_head rcu;
> bool transient;
> + bool unencrypted;
> #endif
> };
>
> @@ -111,6 +112,7 @@ struct io_tlb_mem {
> struct dentry *debugfs;
> bool force_bounce;
> bool for_alloc;
> + bool unencrypted;
> #ifdef CONFIG_SWIOTLB_DYNAMIC
> bool can_grow;
> u64 phys_limit;
> @@ -282,7 +284,8 @@ static inline void swiotlb_sync_single_for_cpu(struct device *dev,
> extern void swiotlb_print_info(void);
>
> #ifdef CONFIG_DMA_RESTRICTED_POOL
> -struct page *swiotlb_alloc(struct device *dev, size_t size);
> +struct page *swiotlb_alloc(struct device *dev, size_t size,
> + unsigned long attrs);
> bool swiotlb_free(struct device *dev, struct page *page, size_t size);
>
> static inline bool is_swiotlb_for_alloc(struct device *dev)
> @@ -290,7 +293,8 @@ static inline bool is_swiotlb_for_alloc(struct device *dev)
> return dev->dma_io_tlb_mem->for_alloc;
> }
> #else
> -static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
> +static inline struct page *swiotlb_alloc(struct device *dev, size_t size,
> + unsigned long attrs)
> {
> return NULL;
> }
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index dc2907439b3d..97ae4fa10521 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -104,9 +104,10 @@ static void __dma_direct_free_pages(struct device *dev, struct page *page,
> dma_free_contiguous(dev, page, size);
> }
>
> -static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size)
> +static struct page *dma_direct_alloc_swiotlb(struct device *dev, size_t size,
> + unsigned long attrs)
> {
> - struct page *page = swiotlb_alloc(dev, size);
> + struct page *page = swiotlb_alloc(dev, size, attrs);
>
> if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
> swiotlb_free(dev, page, size);
> @@ -266,8 +267,12 @@ void *dma_direct_alloc(struct device *dev, size_t size,
> gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> - page = dma_direct_alloc_swiotlb(dev, size);
> + page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (page) {
> + /*
> + * swiotlb allocations comes from pool already marked
> + * decrypted
> + */
> mark_mem_decrypt = false;
> goto setup_page;
> }
> @@ -374,6 +379,7 @@ void dma_direct_free(struct device *dev, size_t size,
> return;
>
> if (swiotlb_find_pool(dev, dma_to_phys(dev, dma_addr)))
> + /* Swiotlb doesn't need a page attribute update on free */
> mark_mem_encrypted = false;
>
> if (is_vmalloc_addr(cpu_addr)) {
> @@ -403,7 +409,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
> gfp, attrs);
>
> if (is_swiotlb_for_alloc(dev)) {
> - page = dma_direct_alloc_swiotlb(dev, size);
> + page = dma_direct_alloc_swiotlb(dev, size, attrs);
> if (!page)
> return NULL;
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index ab4eccbaa076..065663be282c 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -259,10 +259,21 @@ void __init swiotlb_update_mem_attributes(void)
> struct io_tlb_pool *mem = &io_tlb_default_mem.defpool;
> unsigned long bytes;
>
> + /*
> + * if platform support memory encryption, swiotlb buffers are
> + * decrypted by default.
> + */
> + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> + io_tlb_default_mem.unencrypted = true;
> + else
> + io_tlb_default_mem.unencrypted = false;
> +
> if (!mem->nslabs || mem->late_alloc)
> return;
> bytes = PAGE_ALIGN(mem->nslabs << IO_TLB_SHIFT);
> - set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
> +
> + if (io_tlb_default_mem.unencrypted)
> + set_memory_decrypted((unsigned long)mem->vaddr, bytes >> PAGE_SHIFT);
> }
>
> static void swiotlb_init_io_tlb_pool(struct io_tlb_pool *mem, phys_addr_t start,
> @@ -505,8 +516,10 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
> if (!mem->slots)
> goto error_slots;
>
> - set_memory_decrypted((unsigned long)vstart,
> - (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> + if (io_tlb_default_mem.unencrypted)
> + set_memory_decrypted((unsigned long)vstart,
> + (nslabs << IO_TLB_SHIFT) >> PAGE_SHIFT);
> +
> swiotlb_init_io_tlb_pool(mem, virt_to_phys(vstart), nslabs, true,
> nareas);
> add_mem_pool(&io_tlb_default_mem, mem);
> @@ -539,7 +552,9 @@ void __init swiotlb_exit(void)
> tbl_size = PAGE_ALIGN(mem->end - mem->start);
> slots_size = PAGE_ALIGN(array_size(sizeof(*mem->slots), mem->nslabs));
>
> - set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> + if (io_tlb_default_mem.unencrypted)
> + set_memory_encrypted(tbl_vaddr, tbl_size >> PAGE_SHIFT);
> +
> if (mem->late_alloc) {
> area_order = get_order(array_size(sizeof(*mem->areas),
> mem->nareas));
> @@ -563,6 +578,7 @@ void __init swiotlb_exit(void)
> * @gfp: GFP flags for the allocation.
> * @bytes: Size of the buffer.
> * @phys_limit: Maximum allowed physical address of the buffer.
> + * @unencrypted: true to allocate unencrypted memory, false for encrypted memory
> *
> * Allocate pages from the buddy allocator. If successful, make the allocated
> * pages decrypted that they can be used for DMA.
> @@ -570,7 +586,8 @@ void __init swiotlb_exit(void)
> * Return: Decrypted pages, %NULL on allocation failure, or ERR_PTR(-EAGAIN)
> * if the allocated physical address was above @phys_limit.
> */
> -static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
> +static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes,
> + u64 phys_limit, bool unencrypted)
> {
> unsigned int order = get_order(bytes);
> struct page *page;
> @@ -588,13 +605,13 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
> }
>
> vaddr = phys_to_virt(paddr);
> - if (set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
> + if (unencrypted && set_memory_decrypted((unsigned long)vaddr, PFN_UP(bytes)))
> goto error;
> return page;
>
> error:
> /* Intentional leak if pages cannot be encrypted again. */
> - if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> + if (unencrypted && !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> __free_pages(page, order);
> return NULL;
> }
> @@ -604,30 +621,26 @@ static struct page *alloc_dma_pages(gfp_t gfp, size_t bytes, u64 phys_limit)
> * @dev: Device for which a memory pool is allocated.
> * @bytes: Size of the buffer.
> * @phys_limit: Maximum allowed physical address of the buffer.
> + * @attrs: DMA attributes for the allocation.
> * @gfp: GFP flags for the allocation.
> *
> * Return: Allocated pages, or %NULL on allocation failure.
> */
> static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> - u64 phys_limit, gfp_t gfp)
> + u64 phys_limit, unsigned long attrs, gfp_t gfp)
> {
> struct page *page;
> - unsigned long attrs = 0;
>
> /*
> * Allocate from the atomic pools if memory is encrypted and
> * the allocation is atomic, because decrypting may block.
> */
> - if (!gfpflags_allow_blocking(gfp) && dev && force_dma_unencrypted(dev)) {
> + if (!gfpflags_allow_blocking(gfp) && (attrs & DMA_ATTR_CC_SHARED)) {
> void *vaddr;
>
> if (!IS_ENABLED(CONFIG_DMA_COHERENT_POOL))
> return NULL;
>
> - /* swiotlb considered decrypted by default */
> - if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> - attrs = DMA_ATTR_CC_SHARED;
> -
> return dma_alloc_from_pool(dev, bytes, &vaddr, gfp,
> attrs, dma_coherent_ok);
> }
> @@ -638,7 +651,8 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> else if (phys_limit <= DMA_BIT_MASK(32))
> gfp |= __GFP_DMA32;
>
> - while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit))) {
> + while (IS_ERR(page = alloc_dma_pages(gfp, bytes, phys_limit,
> + !!(attrs & DMA_ATTR_CC_SHARED)))) {
> if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
> phys_limit < DMA_BIT_MASK(64) &&
> !(gfp & (__GFP_DMA32 | __GFP_DMA)))
> @@ -657,15 +671,18 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
> * swiotlb_free_tlb() - free a dynamically allocated IO TLB buffer
> * @vaddr: Virtual address of the buffer.
> * @bytes: Size of the buffer.
> + * @unencrypted: true if @vaddr was allocated decrypted and must be
> + * re-encrypted before being freed
> */
> -static void swiotlb_free_tlb(void *vaddr, size_t bytes)
> +static void swiotlb_free_tlb(void *vaddr, size_t bytes, bool unencrypted)
> {
> if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
> dma_free_from_pool(NULL, vaddr, bytes))
> return;
>
> /* Intentional leak if pages cannot be encrypted again. */
> - if (!set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> + if (!unencrypted ||
> + !set_memory_encrypted((unsigned long)vaddr, PFN_UP(bytes)))
> __free_pages(virt_to_page(vaddr), get_order(bytes));
> }
>
> @@ -676,6 +693,7 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
> * @nslabs: Desired (maximum) number of slabs.
> * @nareas: Number of areas.
> * @phys_limit: Maximum DMA buffer physical address.
> + * @attrs: DMA attributes for the allocation.
> * @gfp: GFP flags for the allocations.
> *
> * Allocate and initialize a new IO TLB memory pool. The actual number of
> @@ -686,7 +704,8 @@ static void swiotlb_free_tlb(void *vaddr, size_t bytes)
> */
> static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> unsigned long minslabs, unsigned long nslabs,
> - unsigned int nareas, u64 phys_limit, gfp_t gfp)
> + unsigned int nareas, u64 phys_limit, unsigned long attrs,
> + gfp_t gfp)
> {
> struct io_tlb_pool *pool;
> unsigned int slot_order;
> @@ -704,9 +723,10 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> if (!pool)
> goto error;
> pool->areas = (void *)pool + sizeof(*pool);
> + pool->unencrypted = !!(attrs & DMA_ATTR_CC_SHARED);
>
> tlb_size = nslabs << IO_TLB_SHIFT;
> - while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, gfp))) {
> + while (!(tlb = swiotlb_alloc_tlb(dev, tlb_size, phys_limit, attrs, gfp))) {
> if (nslabs <= minslabs)
> goto error_tlb;
> nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
> @@ -724,7 +744,8 @@ static struct io_tlb_pool *swiotlb_alloc_pool(struct device *dev,
> return pool;
>
> error_slots:
> - swiotlb_free_tlb(page_address(tlb), tlb_size);
> + swiotlb_free_tlb(page_address(tlb), tlb_size,
> + !!(attrs & DMA_ATTR_CC_SHARED));
> error_tlb:
> kfree(pool);
> error:
> @@ -742,7 +763,9 @@ static void swiotlb_dyn_alloc(struct work_struct *work)
> struct io_tlb_pool *pool;
>
> pool = swiotlb_alloc_pool(NULL, IO_TLB_MIN_SLABS, default_nslabs,
> - default_nareas, mem->phys_limit, GFP_KERNEL);
> + default_nareas, mem->phys_limit,
> + mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
> + GFP_KERNEL);
> if (!pool) {
> pr_warn_ratelimited("Failed to allocate new pool");
> return;
> @@ -762,7 +785,7 @@ static void swiotlb_dyn_free(struct rcu_head *rcu)
> size_t tlb_size = pool->end - pool->start;
>
> free_pages((unsigned long)pool->slots, get_order(slots_size));
> - swiotlb_free_tlb(pool->vaddr, tlb_size);
> + swiotlb_free_tlb(pool->vaddr, tlb_size, pool->unencrypted);
> kfree(pool);
> }
>
> @@ -1232,6 +1255,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> nslabs = nr_slots(alloc_size);
> phys_limit = min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
> pool = swiotlb_alloc_pool(dev, nslabs, nslabs, 1, phys_limit,
> + mem->unencrypted ? DMA_ATTR_CC_SHARED : 0,
> GFP_NOWAIT);
> if (!pool)
> return -1;
> @@ -1394,6 +1418,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> enum dma_data_direction dir, unsigned long attrs)
> {
> struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> + bool require_decrypted = false;
> unsigned int offset;
> struct io_tlb_pool *pool;
> unsigned int i;
> @@ -1411,6 +1436,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
>
> + /*
> + * if we are trying to swiotlb map a decrypted paddr or the paddr is encrypted
> + * but the device is forcing decryption, use decrypted io_tlb_mem
> + */
> + if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
> + require_decrypted = true;
> +
> + if (require_decrypted != mem->unencrypted)
> + return (phys_addr_t)DMA_MAPPING_ERROR;
> +
> /*
> * The default swiotlb memory pool is allocated with PAGE_SIZE
> * alignment. If a mapping is requested with larger alignment,
> @@ -1608,8 +1643,14 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
> if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
> return DMA_MAPPING_ERROR;
>
> - /* Ensure that the address returned is DMA'ble */
> - dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
> + /*
> + * Use the allocated io_tlb_mem encryption type to determine dma addr.
> + */
> + if (dev->dma_io_tlb_mem->unencrypted)
> + dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
> + else
> + dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
> +
> if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
> __swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
> attrs | DMA_ATTR_SKIP_CPU_SYNC,
> @@ -1773,7 +1814,8 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
>
> #ifdef CONFIG_DMA_RESTRICTED_POOL
>
> -struct page *swiotlb_alloc(struct device *dev, size_t size)
> +struct page *swiotlb_alloc(struct device *dev, size_t size,
> + unsigned long attrs)
> {
> struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> struct io_tlb_pool *pool;
> @@ -1784,6 +1826,9 @@ struct page *swiotlb_alloc(struct device *dev, size_t size)
> if (!mem)
> return NULL;
>
> + if (mem->unencrypted != !!(attrs & DMA_ATTR_CC_SHARED))
> + return NULL;
> +
> align = (1 << (get_order(size) + PAGE_SHIFT)) - 1;
> index = swiotlb_find_slots(dev, 0, size, align, &pool);
> if (index == -1)
> @@ -1853,9 +1898,18 @@ static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> kfree(mem);
> return -ENOMEM;
> }
> + /*
> + * if platform supports memory encryption,
> + * restricted mem pool is decrypted by default
> + */
> + if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> + mem->unencrypted = true;
> + set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> + rmem->size >> PAGE_SHIFT);
> + } else {
> + mem->unencrypted = false;
> + }
This breaks pKVM, as it doesn’t set CC_ATTR_MEM_ENCRYPT, so all virtio
traffic now fails.
Also, by design, some drivers are unaware of bouncing, so I believe the
pool should have a way to control its own property (encrypted or
decrypted), and that should take priority over whatever attributes come
from the allocation.
That brings us back to the same question: is it better to return the
memory along with its state, or to pass in the requested state?
I think in other cases it’s fine for the device/DMA-API to dictate the
attrs, but not in the restricted-dma case, where the firmware just knows
better.
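As a rough illustration of the two options, the user-space sketch below
compares a strict check, modeled on the one the quoted patch adds to
swiotlb_alloc(), with a pool-first variant in which a restricted pool
dictates its own state and reports it back to the caller. All names are
hypothetical and MODEL_ATTR_CC_SHARED is only a stand-in for
DMA_ATTR_CC_SHARED; this is not a proposed implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag; not the kernel's DMA_ATTR_CC_SHARED bit value. */
#define MODEL_ATTR_CC_SHARED (1UL << 0)

/* Hypothetical model of the two io_tlb_mem fields discussed here. */
struct model_tlb_mem {
	bool for_alloc;    /* restricted-DMA pool described by firmware */
	bool unencrypted;  /* state the pool memory is actually in */
};

/* Strict variant: reject when attrs and pool state disagree. */
static bool model_allowed_strict(const struct model_tlb_mem *mem,
				 unsigned long attrs)
{
	assert(mem != NULL);
	return mem->unencrypted == !!(attrs & MODEL_ATTR_CC_SHARED);
}

/*
 * Pool-first variant: a restricted pool ignores the requested state
 * and returns its own state through @unencrypted. Other pools keep
 * the strict check.
 */
static bool model_allowed_pool_first(const struct model_tlb_mem *mem,
				     unsigned long attrs, bool *unencrypted)
{
	assert(mem != NULL && unencrypted != NULL);

	if (!mem->for_alloc &&
	    mem->unencrypted != !!(attrs & MODEL_ATTR_CC_SHARED))
		return false;

	*unencrypted = mem->unencrypted;
	return true;
}
```

With a restricted pool that is already decrypted and a caller that
passes no attributes, the strict variant refuses the allocation while
the pool-first variant succeeds and tells the caller the memory is
unencrypted.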
Thanks,
Mostafa
>
> - set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> - rmem->size >> PAGE_SHIFT);
> swiotlb_init_io_tlb_pool(pool, rmem->base, nslabs,
> false, nareas);
> mem->force_bounce = true;
> --
> 2.43.0
>
* Re: [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX
From: Muchun Song @ 2026-05-14 2:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Muchun Song, David Hildenbrand, Oscar Salvador, Michael Ellerman,
Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Nicholas Piggin, Christophe Leroy, Ackerley Tng,
Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
linuxppc-dev, linux-kernel
In-Reply-To: <20260513104640.b0f02b844c57f92bc954878e@linux-foundation.org>
> On May 14, 2026, at 01:46, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 13 May 2026 21:04:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>
>> In this series, HVO is redefined as Hugepage Vmemmap Optimization: a
>> general vmemmap optimization model for large hugepage-backed mappings,
>> rather than a HugeTLB-only implementation detail.
>>
>> The existing code grew around the original HugeTLB-specific HVO path,
>> while device DAX developed similar but separate vmemmap optimization
>> handling. As a result, the current implementation carries duplicated
>> logic, boot-time special cases, and subsystem-specific interfaces around
>> what is fundamentally the same sparse-vmemmap optimization.
>>
>> This series generalizes that optimization into a common framework used
>> by both HugeTLB and device DAX.
>>
>> The first few patches include some minor bug fixes found during AI-aided
>> review of the current code. These fixes are not the main goal of the
>> series, but the later refactoring and unification work depends on them,
>> so they are included here as preparatory changes.
>>
>> The series then reworks the relevant early boot and sparse
>> initialization paths, introduces a generic section-based sparse-vmemmap
>> optimization infrastructure, switches HugeTLB and device DAX over to the
>> shared implementation, and removes the old special-case code.
>>
>> ...
>>
>> 46 files changed, 743 insertions(+), 1812 deletions(-)
>
> Gulp.
Sorry for such a big series that refactors so many things.
>
> I think the first 15ish patches (little fixes and cleanups and
> refactorings) are ready to go in immediately?
Overall, I believe the first 12 patches (excluding the bugfixes)
have already been reviewed by Mike and are theoretically ready
to be merged.
>
> Perhaps you could prepare such things as a separate series. Or tell me
> which ones are suitable and I'll fudge up a [0/N]?
I'd prefer to group the first 19 patches into a single series (titled
"Refactor bootmem gigantic hugepage allocation"). Therefore, I
suggest waiting for feedback from other maintainers/reviewers;
once patches 13-19 pass review, I will send them out as a standalone
series.
Thanks,
Muchun
* Re: [PATCH v2 00/69] mm: Generalize HVO for HugeTLB and device DAX
From: Muchun Song @ 2026-05-14 2:37 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, David Hildenbrand, Michael Ellerman,
Madhavan Srinivasan, Lorenzo Stoakes, Liam R . Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Nicholas Piggin, Christophe Leroy, Ackerley Tng,
Frank van der Linden, aneesh.kumar, joao.m.martins, linux-mm,
linuxppc-dev, linux-kernel
In-Reply-To: <agTCbglgY23ydrDx@localhost.localdomain>
> On May 14, 2026, at 02:26, Oscar Salvador <osalvador@suse.de> wrote:
>
> On Wed, May 13, 2026 at 10:46:40AM -0700, Andrew Morton wrote:
>> On Wed, 13 May 2026 21:04:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
>>
>>> In this series, HVO is redefined as Hugepage Vmemmap Optimization: a
>>> general vmemmap optimization model for large hugepage-backed mappings,
>>> rather than a HugeTLB-only implementation detail.
>>>
>>> The existing code grew around the original HugeTLB-specific HVO path,
>>> while device DAX developed similar but separate vmemmap optimization
>>> handling. As a result, the current implementation carries duplicated
>>> logic, boot-time special cases, and subsystem-specific interfaces around
>>> what is fundamentally the same sparse-vmemmap optimization.
>>>
>>> This series generalizes that optimization into a common framework used
>>> by both HugeTLB and device DAX.
>>>
>>> The first few patches include some minor bug fixes found during AI-aided
>>> review of the current code. These fixes are not the main goal of the
>>> series, but the later refactoring and unification work depends on them,
>>> so they are included here as preparatory changes.
>>>
>>> The series then reworks the relevant early boot and sparse
>>> initialization paths, introduces a generic section-based sparse-vmemmap
>>> optimization infrastructure, switches HugeTLB and device DAX over to the
>>> shared implementation, and removes the old special-case code.
>>>
>>> ...
>>>
>>> 46 files changed, 743 insertions(+), 1812 deletions(-)
>>
>> Gulp.
>>
>> I think the first 15ish patches (little fixes and cleanups and
>> refactorings) are ready to go in immediately?
>
> I plan to have a (partial ) look at this tomorrow/Friday, but splitting
Thanks.
> this series in fixes-that-can-go-straight-away and the feature itself would make more
> sense and help ease the review.
Yes.
> Head tends to spin a bit when the patchset grows beyond certain number of patches :-D.
>
> Would that be possible Munchun?
I'd prefer to group the first 19 patches into a single series (titled
"Refactor bootmem gigantic hugepage allocation"), so I'll wait for your
review. :)
Once patches 13-19 pass review, I will send them out as a standalone
series.
Thanks,
Muchun
>
>
>
> --
> Oscar Salvador
> SUSE Labs
* Re: [PATCH v2 0/5] KVM: PPC: Handle CPU compatibility mode for nested guests
From: Ritesh Harjani @ 2026-05-14 3:19 UTC (permalink / raw)
To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
Cc: Amit Machhiwal, Vaibhav Jain, Paolo Bonzini, Nicholas Piggin,
Michael Ellerman, Christophe Leroy (CS GROUP), Jonathan Corbet,
Shuah Khan, kvm, linux-kernel, linux-doc
In-Reply-To: <20260513100755.83215-1-amachhiw@linux.ibm.com>
Hi Amit,
Amit Machhiwal <amachhiw@linux.ibm.com> writes:
> On POWER systems, newer processor generations can operate in compatibility
> modes corresponding to earlier generations (e.g., a Power11 system running
> in Power10 compatibility mode). In such cases, the effective CPU level
> exposed to guests differs from the physical processor generation.
>
> This creates a problem for nested virtualization. When booting a nested KVM
> guest (L2) inside a host KVM guest (L1) running in a compatibility mode,
> userspace (e.g., QEMU) may derive the CPU model from the raw hardware PVR
> and attempt to configure the nested guest accordingly. However, the L1
> partition is constrained by the compatibility level negotiated with the
> hypervisor (L0), and requests exceeding that level are rejected, leading to
> guest boot failures such as:
>
> KVM-NESTEDv2: couldn't set guest wide elements
>
> This series addresses the issue in two steps:
>
> 1. Detect and reject invalid compatibility requests early in KVM to avoid
> late failures.
>
> 2. Provide a mechanism for userspace to query the effective CPU
> compatibility modes supported by the host, so it can select an
> appropriate CPU model for nested guests.
>
Do we really need a uapi change for this? Tools like QEMU can read the
host's device tree info, can't they?
> To achieve this, the series introduces a new KVM capability and ioctl
> (KVM_CAP_PPC_COMPAT_CAPS / KVM_PPC_GET_COMPAT_CAPS) that expose the
> compatibility modes supported by the host.
>
> The implementation supports both:
>
> - PowerVM (nested API v2), where compatibility information is obtained
> via the H_GUEST_GET_CAPABILITIES hypercall.
> - PowerNV (nested API v1), where compatibility is derived from the device
> tree ("cpu-version") representing the effective processor compatibility
> level.
There you go: for PowerNV, if this info is provided in the device
tree, then QEMU could just as well read it from there, no?
... yup, kvmppc_read_int_dt() can do that, I guess.
So my request is: can we look into whether there is a possible
alternative to this? Maybe we already have a mechanism that QEMU could
use to get this info.
btw - I haven't read the full patch series, but after reading the
cover letter I felt we should at least explain there why a uapi change
is really needed and why the existing alternatives don't work for us.
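For reference, a minimal sketch of what reading a single-cell device
tree property from userspace could look like. Device tree cells are
32-bit big-endian values; the helper names below are made up, and where
exactly "cpu-version" lives under /proc/device-tree is an assumption
left to the caller, who passes the full path:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one big-endian 32-bit device tree cell. */
static uint32_t dt_cell_to_u32(const unsigned char *cell)
{
	assert(cell != NULL);
	return ((uint32_t)cell[0] << 24) | ((uint32_t)cell[1] << 16) |
	       ((uint32_t)cell[2] << 8) | (uint32_t)cell[3];
}

/*
 * Read a single-cell property file, e.g. a "cpu-version" file somewhere
 * under /proc/device-tree (the exact path is an assumption).
 * Returns 0 on success and -1 if the file is missing or too short.
 */
static int dt_read_u32(const char *path, uint32_t *val)
{
	unsigned char cell[4];
	size_t n;
	FILE *f;

	assert(path != NULL && val != NULL);

	f = fopen(path, "rb");
	if (!f)
		return -1;
	n = fread(cell, 1, sizeof(cell), f);
	fclose(f);
	if (n != sizeof(cell))
		return -1;

	*val = dt_cell_to_u32(cell);
	return 0;
}
```

This only covers the case where the information is in the device tree;
it says nothing about the PowerVM case, where the cover letter obtains
it through a hypercall.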
-ritesh
>
> This allows userspace (e.g., QEMU) to select a CPU model consistent with
> the host compatibility mode, avoiding mismatches and enabling successful
> nested guest boot.
>