* [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
@ 2025-06-12 10:50 Pankaj Raghav
2025-06-12 10:50 ` [PATCH 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav
` (6 more replies)
0 siblings, 7 replies; 19+ messages in thread
From: Pankaj Raghav @ 2025-06-12 10:50 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with
ZERO_PAGE is limited to PAGE_SIZE.
This concern was raised during the review of adding Large Block Size support
to XFS[1][2].
This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.
Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...
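To get a rough sense of the bvec savings, here is a user-space sketch of the bvec count arithmetic (the constants are illustrative x86_64 values, not taken from kernel headers):

```c
#include <assert.h>

/* Illustrative sizes, assuming x86_64 defaults. */
#define PAGE_SIZE_BYTES 4096UL
#define PMD_SIZE_BYTES  (2UL * 1024 * 1024)   /* one 2 MiB PMD page */

/* Number of bvecs needed to describe 'len' bytes of zeroes when each
 * bvec can carry at most one zero page of 'zp' bytes. */
static unsigned long bvecs_needed(unsigned long len, unsigned long zp)
{
	return (len + zp - 1) / zp;
}
```

Zeroing 128 MiB, for example, takes 32768 PAGE_SIZE bvecs but only 64 PMD-sized ones.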
We already have huge_zero_folio, which is allocated on demand and
deallocated by the shrinker once there are no users of it left.
But to use huge_zero_folio, we need to pass an mm struct, and
put_folio needs to be called in the destructor. This makes sense for
systems that have memory constraints, but for bigger servers it does not
matter as long as the PMD size is reasonable (as on x86).
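The on-demand lifetime described above can be modelled in a few lines of user-space C (a simplified model with made-up names, not the kernel implementation): each user takes a reference, and the shrinker may only free the folio once the last reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>

static int  zero_refs;       /* users currently holding the folio */
static bool zero_allocated;  /* folio currently exists */

/* Model of mm_get_huge_zero_folio(): allocate on first use, take a ref. */
static void model_get(void)
{
	if (!zero_allocated)
		zero_allocated = true;
	zero_refs++;
}

/* Model of mm_put_huge_zero_folio(): drop the caller's reference. */
static void model_put(void)
{
	zero_refs--;
}

/* Model of the shrinker: frees the folio only when nobody uses it. */
static bool model_shrink(void)
{
	if (zero_allocated && zero_refs == 0) {
		zero_allocated = false;
		return true;
	}
	return false;
}
```

With STATIC_PMD_ZERO_PAGE the get/put/shrink machinery disappears entirely, which is the point of the series.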
Add a config option STATIC_PMD_ZERO_PAGE that always places the
huge_zero_folio in .bss, so it is never freed. huge_zero_folio reuses
this static PMD page when the config option is enabled.
I have converted blkdev_issue_zero_pages() as an example as part of
this series.
I will send patches to individual subsystems using the huge_zero_folio
once this gets upstreamed.
Looking forward to some feedback.
[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
Changes since RFC:
- Added the config option based on the feedback from David.
- Encode more info in the header to avoid dead code (Dave Hansen's
  feedback)
- The static part of huge_zero_folio in memory.c and the dynamic part
stays in huge_memory.c
- Split the patches to make it easy for review.
Pankaj Raghav (5):
mm: move huge_zero_page declaration from huge_mm.h to mm.h
huge_memory: add huge_zero_page_shrinker_(init|exit) function
mm: add static PMD zero page
mm: add mm_get_static_huge_zero_folio() routine
block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 8 +++++
arch/x86/kernel/head_64.S | 8 +++++
block/blk-lib.c | 17 +++++----
include/linux/huge_mm.h | 31 ----------------
include/linux/mm.h | 64 ++++++++++++++++++++++++++++++++++
mm/Kconfig | 13 +++++++
mm/huge_memory.c | 62 ++++++++++++++++++++++++--------
mm/memory.c | 19 ++++++++++
9 files changed, 170 insertions(+), 53 deletions(-)
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
--
2.49.0
^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h
2025-06-12 10:50 [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
@ 2025-06-12 10:50 ` Pankaj Raghav
2025-06-12 10:50 ` [PATCH 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function Pankaj Raghav
` (5 subsequent siblings)
6 siblings, 0 replies; 19+ messages in thread
From: Pankaj Raghav @ 2025-06-12 10:50 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
Move the declarations associated with huge_zero_page from huge_mm.h to
mm.h. This patch is in preparation for adding the static PMD zero page.
No functional changes.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
include/linux/huge_mm.h | 31 -------------------------------
include/linux/mm.h | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 34 insertions(+), 31 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..3e887374892c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -478,22 +478,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
-extern struct folio *huge_zero_folio;
-extern unsigned long huge_zero_pfn;
-
-static inline bool is_huge_zero_folio(const struct folio *folio)
-{
- return READ_ONCE(huge_zero_folio) == folio;
-}
-
-static inline bool is_huge_zero_pmd(pmd_t pmd)
-{
- return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
-}
-
-struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
-void mm_put_huge_zero_folio(struct mm_struct *mm);
-
static inline bool thp_migration_supported(void)
{
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
@@ -631,21 +615,6 @@ static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
return 0;
}
-static inline bool is_huge_zero_folio(const struct folio *folio)
-{
- return false;
-}
-
-static inline bool is_huge_zero_pmd(pmd_t pmd)
-{
- return false;
-}
-
-static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
-{
- return;
-}
-
static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0ef2ba0c667a..c8fbeaacf896 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4018,6 +4018,40 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct folio *huge_zero_folio;
+extern unsigned long huge_zero_pfn;
+
+static inline bool is_huge_zero_folio(const struct folio *folio)
+{
+ return READ_ONCE(huge_zero_folio) == folio;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
+}
+
+struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
+void mm_put_huge_zero_folio(struct mm_struct *mm);
+
+#else
+static inline bool is_huge_zero_folio(const struct folio *folio)
+{
+ return false;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return false;
+}
+
+static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
+{
+ return;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
#if MAX_NUMNODES > 1
void __init setup_nr_node_ids(void);
#else
--
2.49.0
* [PATCH 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function
2025-06-12 10:50 [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
2025-06-12 10:50 ` [PATCH 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav
@ 2025-06-12 10:50 ` Pankaj Raghav
2025-06-12 10:50 ` [PATCH 3/5] mm: add static PMD zero page Pankaj Raghav
` (4 subsequent siblings)
6 siblings, 0 replies; 19+ messages in thread
From: Pankaj Raghav @ 2025-06-12 10:50 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
Add huge_zero_page_shrinker_init() and huge_zero_page_shrinker_exit().
As the shrinker will not be needed when the static PMD zero page is
enabled, these two functions can become no-ops.
This is a preparation patch for static PMD zero page. No functional
changes.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
mm/huge_memory.c | 38 +++++++++++++++++++++++++++-----------
1 file changed, 27 insertions(+), 11 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..101b67ab2eb6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -289,6 +289,24 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
}
static struct shrinker *huge_zero_page_shrinker;
+static int huge_zero_page_shrinker_init(void)
+{
+ huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
+ if (!huge_zero_page_shrinker)
+ return -ENOMEM;
+
+ huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
+ huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
+ shrinker_register(huge_zero_page_shrinker);
+ return 0;
+}
+
+static void huge_zero_page_shrinker_exit(void)
+{
+ shrinker_free(huge_zero_page_shrinker);
+ return;
+}
+
#ifdef CONFIG_SYSFS
static ssize_t enabled_show(struct kobject *kobj,
@@ -850,33 +868,31 @@ static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
static int __init thp_shrinker_init(void)
{
- huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
- if (!huge_zero_page_shrinker)
- return -ENOMEM;
+ int ret = 0;
deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
SHRINKER_MEMCG_AWARE |
SHRINKER_NONSLAB,
"thp-deferred_split");
- if (!deferred_split_shrinker) {
- shrinker_free(huge_zero_page_shrinker);
+ if (!deferred_split_shrinker)
return -ENOMEM;
- }
-
- huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
- huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
- shrinker_register(huge_zero_page_shrinker);
deferred_split_shrinker->count_objects = deferred_split_count;
deferred_split_shrinker->scan_objects = deferred_split_scan;
shrinker_register(deferred_split_shrinker);
+ ret = huge_zero_page_shrinker_init();
+ if (ret) {
+ shrinker_free(deferred_split_shrinker);
+ return ret;
+ }
+
return 0;
}
static void __init thp_shrinker_exit(void)
{
- shrinker_free(huge_zero_page_shrinker);
+ huge_zero_page_shrinker_exit();
shrinker_free(deferred_split_shrinker);
}
--
2.49.0
* [PATCH 3/5] mm: add static PMD zero page
2025-06-12 10:50 [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
2025-06-12 10:50 ` [PATCH 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav
2025-06-12 10:50 ` [PATCH 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function Pankaj Raghav
@ 2025-06-12 10:50 ` Pankaj Raghav
2025-06-24 8:51 ` kernel test robot
2025-06-12 10:50 ` [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine Pankaj Raghav
` (3 subsequent siblings)
6 siblings, 1 reply; 19+ messages in thread
From: Pankaj Raghav @ 2025-06-12 10:50 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with
ZERO_PAGE is limited to PAGE_SIZE.
This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.
This concern was raised during the review of adding LBS support to
XFS[1][2].
Usually huge_zero_folio is allocated on demand, and it is deallocated
by the shrinker once there are no users of it left.
Add a config option STATIC_PMD_ZERO_PAGE that always places the
huge_zero_folio in .bss, so it is never freed. This makes it possible to
use the huge_zero_folio without passing any mm struct and without
calling put_folio in the destructor.
As STATIC_PMD_ZERO_PAGE does not depend on THP, declare huge_zero_folio
and huge_zero_pfn outside of the THP ifdef.
For now it can only be enabled on x86_64, and it is an optional config.
We could expand it to more architectures in the future.
[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
Questions:
- Can we call __split_huge_zero_page_pmd() on static PMD page?
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 8 ++++++++
arch/x86/kernel/head_64.S | 8 ++++++++
include/linux/mm.h | 16 +++++++++++++++-
mm/Kconfig | 13 +++++++++++++
mm/huge_memory.c | 24 ++++++++++++++++++++----
mm/memory.c | 19 +++++++++++++++++++
7 files changed, 84 insertions(+), 5 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 340e5468980e..c3a9d136ec0a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
select ARCH_WANTS_THP_SWAP if X86_64
+ select ARCH_HAS_STATIC_PMD_ZERO_PAGE if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select BUILDTIME_TABLE_SORT
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 774430c3abff..7013a7d26da5 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -47,6 +47,14 @@ void ptdump_walk_user_pgd_level_checkwx(void);
#define debug_checkwx_user() do { } while (0)
#endif
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+/*
+ * PMD_ZERO_PAGE is a global shared PMD page that is always zero.
+ */
+extern unsigned long empty_pmd_zero_page[(PMD_SIZE) / sizeof(unsigned long)]
+ __visible;
+#endif
+
/*
* ZERO_PAGE is a global shared page that is always zero: used
* for zero-mapped memory areas etc..
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 3e9b3a3bd039..86aaa53fd619 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -714,6 +714,14 @@ EXPORT_SYMBOL(phys_base)
#include "../xen/xen-head.S"
__PAGE_ALIGNED_BSS
+
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+SYM_DATA_START_PAGE_ALIGNED(empty_pmd_zero_page)
+ .skip PMD_SIZE
+SYM_DATA_END(empty_pmd_zero_page)
+EXPORT_SYMBOL(empty_pmd_zero_page)
+#endif
+
SYM_DATA_START_PAGE_ALIGNED(empty_zero_page)
.skip PAGE_SIZE
SYM_DATA_END(empty_zero_page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8fbeaacf896..b20d60d68b3c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4018,10 +4018,10 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline bool is_huge_zero_folio(const struct folio *folio)
{
return READ_ONCE(huge_zero_folio) == folio;
@@ -4032,9 +4032,23 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
}
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+static inline struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
+{
+ return READ_ONCE(huge_zero_folio);
+}
+
+static inline void mm_put_huge_zero_folio(struct mm_struct *mm)
+{
+ return;
+}
+
+#else
struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
void mm_put_huge_zero_folio(struct mm_struct *mm);
+#endif /* CONFIG_STATIC_PMD_ZERO_PAGE */
+
#else
static inline bool is_huge_zero_folio(const struct folio *folio)
{
diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..fd1c51995029 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
config MM_ID
def_bool n
+config ARCH_HAS_STATIC_PMD_ZERO_PAGE
+ def_bool n
+
+config STATIC_PMD_ZERO_PAGE
+ bool "Allocate a PMD page for zeroing"
+ depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
+ help
+ Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
+ on demand and deallocated when not in use. This option will
+	  allocate a PMD sized zero page in .bss and huge_zero_folio will
+	  use it instead of allocating it dynamically.
+ Not suitable for memory constrained systems.
+
menuconfig TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 101b67ab2eb6..c12ca7134e88 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -75,9 +75,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc);
static bool split_underused_thp = true;
-static atomic_t huge_zero_refcount;
-struct folio *huge_zero_folio __read_mostly;
-unsigned long huge_zero_pfn __read_mostly = ~0UL;
unsigned long huge_anon_orders_always __read_mostly;
unsigned long huge_anon_orders_madvise __read_mostly;
unsigned long huge_anon_orders_inherit __read_mostly;
@@ -208,6 +205,23 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
return orders;
}
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+static int huge_zero_page_shrinker_init(void)
+{
+ return 0;
+}
+
+static void huge_zero_page_shrinker_exit(void)
+{
+ return;
+}
+#else
+
+static struct shrinker *huge_zero_page_shrinker;
+static atomic_t huge_zero_refcount;
+struct folio *huge_zero_folio __read_mostly;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
+
static bool get_huge_zero_page(void)
{
struct folio *zero_folio;
@@ -288,7 +302,6 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
return 0;
}
-static struct shrinker *huge_zero_page_shrinker;
static int huge_zero_page_shrinker_init(void)
{
huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
@@ -307,6 +320,7 @@ static void huge_zero_page_shrinker_exit(void)
return;
}
+#endif
#ifdef CONFIG_SYSFS
static ssize_t enabled_show(struct kobject *kobj,
@@ -2843,6 +2857,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pte_t *pte;
int i;
+ // FIXME: can this be called with static zero page?
+ VM_BUG_ON(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE));
/*
* Leave pmd empty until pte is filled note that it is fine to delay
* notification until mmu_notifier_invalidate_range_end() as we are
diff --git a/mm/memory.c b/mm/memory.c
index 8eba595056fe..77721f5ae043 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -159,6 +159,25 @@ static int __init init_zero_pfn(void)
}
early_initcall(init_zero_pfn);
+#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
+struct folio *huge_zero_folio __read_mostly;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
+
+static int __init init_pmd_zero_pfn(void)
+{
+ huge_zero_folio = virt_to_folio(empty_pmd_zero_page);
+ huge_zero_pfn = page_to_pfn(virt_to_page(empty_pmd_zero_page));
+
+ __folio_set_head(huge_zero_folio);
+ prep_compound_head((struct page *)huge_zero_folio, PMD_ORDER);
+ /* Ensure zero folio won't have large_rmappable flag set. */
+ folio_clear_large_rmappable(huge_zero_folio);
+
+ return 0;
+}
+early_initcall(init_pmd_zero_pfn);
+#endif
+
void mm_trace_rss_stat(struct mm_struct *mm, int member)
{
trace_rss_stat(mm, member);
--
2.49.0
* [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine
2025-06-12 10:50 [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
` (2 preceding siblings ...)
2025-06-12 10:50 ` [PATCH 3/5] mm: add static PMD zero page Pankaj Raghav
@ 2025-06-12 10:50 ` Pankaj Raghav
2025-06-12 14:09 ` Dave Hansen
2025-06-12 10:51 ` [PATCH 5/5] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav
` (2 subsequent siblings)
6 siblings, 1 reply; 19+ messages in thread
From: Pankaj Raghav @ 2025-06-12 10:50 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
Add an mm_get_static_huge_zero_folio() routine so that huge_zero_folio
can be used without needing to pass any mm struct. It returns the
ZERO_PAGE folio if CONFIG_STATIC_PMD_ZERO_PAGE is disabled.
This routine can be called even when THP is disabled.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
include/linux/mm.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b20d60d68b3c..c8805480ff21 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4021,6 +4021,22 @@ static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
+/*
+ * mm_get_static_huge_zero_folio - Get a PMD sized zero folio
+ *
+ * This function will return a PMD sized zero folio if CONFIG_STATIC_PMD_ZERO_PAGE
+ * is enabled. Otherwise, a ZERO_PAGE folio is returned.
+ *
+ * Deduce the size of the folio with folio_size instead of assuming the
+ * folio size.
+ */
+static inline struct folio *mm_get_static_huge_zero_folio(void)
+{
+ if(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
+ return READ_ONCE(huge_zero_folio);
+ return page_folio(ZERO_PAGE(0));
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline bool is_huge_zero_folio(const struct folio *folio)
{
--
2.49.0
* [PATCH 5/5] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
2025-06-12 10:50 [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
` (3 preceding siblings ...)
2025-06-12 10:50 ` [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine Pankaj Raghav
@ 2025-06-12 10:51 ` Pankaj Raghav
2025-06-12 13:50 ` [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Dave Hansen
2025-06-16 5:40 ` Christoph Hellwig
6 siblings, 0 replies; 19+ messages in thread
From: Pankaj Raghav @ 2025-06-12 10:51 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
Use mm_get_static_huge_zero_folio() in __blkdev_issue_zero_pages().
On systems with CONFIG_STATIC_PMD_ZERO_PAGE enabled, we end up sending
larger bvecs instead of multiple small ones.
I noticed a 4% increase in performance on a commercial NVMe SSD that
does not support OP_WRITE_ZEROES. The device's MDTS was 128K. The
performance gains might be bigger on devices with a larger MDTS.
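For reference, the length arithmetic of the patched loop can be exercised in a small user-space sketch (sector shift and folio size are assumed illustrative values here, not kernel state):

```c
#include <assert.h>

#define SECTOR_SHIFT_SK 9                    /* assumed 512-byte sectors */
#define ZERO_FOLIO_SIZE (2UL * 1024 * 1024)  /* assumed 2 MiB zero folio */

/* Count the bio_add_folio() calls needed to cover nr_sects, mirroring
 * the min_t(len)/shift arithmetic of the patched loop. */
static unsigned long zero_segments(unsigned long nr_sects)
{
	unsigned long segs = 0;

	while (nr_sects) {
		unsigned long bytes = nr_sects << SECTOR_SHIFT_SK;
		unsigned long len = bytes < ZERO_FOLIO_SIZE ?
					bytes : ZERO_FOLIO_SIZE;

		nr_sects -= len >> SECTOR_SHIFT_SK;
		segs++;
	}
	return segs;
}
```

With a 2 MiB folio, 4096 sectors fit in one segment; with the old PAGE_SIZE zero page the same range would need 512 of them.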
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
block/blk-lib.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7..4ee219637a3f 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -196,6 +196,10 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned int flags)
{
+ struct folio *zero_folio;
+
+ zero_folio = mm_get_static_huge_zero_folio();
+
while (nr_sects) {
unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
struct bio *bio;
@@ -208,15 +212,14 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
break;
do {
- unsigned int len, added;
+ unsigned int len;
- len = min_t(sector_t,
- PAGE_SIZE, nr_sects << SECTOR_SHIFT);
- added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
- if (added < len)
+ len = min_t(sector_t, folio_size(zero_folio),
+ nr_sects << SECTOR_SHIFT);
+ if (!bio_add_folio(bio, zero_folio, len, 0))
break;
- nr_sects -= added >> SECTOR_SHIFT;
- sector += added >> SECTOR_SHIFT;
+ nr_sects -= len >> SECTOR_SHIFT;
+ sector += len >> SECTOR_SHIFT;
} while (nr_sects);
*biop = bio_chain_and_submit(*biop, bio);
--
2.49.0
* Re: [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
2025-06-12 10:50 [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
` (4 preceding siblings ...)
2025-06-12 10:51 ` [PATCH 5/5] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav
@ 2025-06-12 13:50 ` Dave Hansen
2025-06-12 20:36 ` Pankaj Raghav (Samsung)
2025-06-16 5:40 ` Christoph Hellwig
6 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2025-06-12 13:50 UTC (permalink / raw)
To: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Mike Rapoport,
Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch
On 6/12/25 03:50, Pankaj Raghav wrote:
> But to use huge_zero_folio, we need to pass a mm struct and the
> put_folio needs to be called in the destructor. This makes sense for
> systems that have memory constraints but for bigger servers, it does not
> matter if the PMD size is reasonable (like in x86).
So, what's the problem with calling a destructor?
In your last patch, surely bio_add_folio() can put the page/folio when
it's done. Is the real problem that you don't want to call zero page
specific code at bio teardown?
* Re: [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine
2025-06-12 10:50 ` [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine Pankaj Raghav
@ 2025-06-12 14:09 ` Dave Hansen
2025-06-12 20:54 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2025-06-12 14:09 UTC (permalink / raw)
To: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Mike Rapoport,
Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch
On 6/12/25 03:50, Pankaj Raghav wrote:
> +/*
> + * mm_get_static_huge_zero_folio - Get a PMD sized zero folio
Isn't that a rather inaccurate function name and comment?
The third line of the function literally returns a non-PMD-sized zero folio.
> + * This function will return a PMD sized zero folio if CONFIG_STATIC_PMD_ZERO_PAGE
> + * is enabled. Otherwise, a ZERO_PAGE folio is returned.
> + *
> + * Deduce the size of the folio with folio_size instead of assuming the
> + * folio size.
> + */
> +static inline struct folio *mm_get_static_huge_zero_folio(void)
> +{
> + if(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
> + return READ_ONCE(huge_zero_folio);
> + return page_folio(ZERO_PAGE(0));
> +}
This doesn't tell us very much about when I should use:
mm_get_static_huge_zero_folio()
vs.
mm_get_huge_zero_folio(mm)
vs.
page_folio(ZERO_PAGE(0))
What's with the "mm_" in the name? Usually "mm" means "mm_struct" not
Memory Management. It's really weird to prefix something that doesn't
take an "mm_struct" with "mm_"
Isn't the "get_" also a bad idea since mm_get_huge_zero_folio() does its
own refcounting but this interface does not?
Shouldn't this be something more along the lines of:
/*
* pick_zero_folio() - Pick and return the largest available zero folio
*
* mm_get_huge_zero_folio() is preferred over this function. It is more
* flexible and can provide a larger zero page under wider
* circumstances.
*
* Only use this when there is no mm available.
*
* ... then other comments
*/
static inline struct folio *pick_zero_folio(void)
{
if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
return READ_ONCE(huge_zero_folio);
return page_folio(ZERO_PAGE(0));
}
Or, maybe even name it _just_: zero_folio()
* Re: [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
2025-06-12 13:50 ` [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Dave Hansen
@ 2025-06-12 20:36 ` Pankaj Raghav (Samsung)
2025-06-12 21:46 ` Dave Hansen
0 siblings, 1 reply; 19+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-06-12 20:36 UTC (permalink / raw)
To: Dave Hansen
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Mike Rapoport,
Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe, linux-kernel,
linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, hch
On Thu, Jun 12, 2025 at 06:50:07AM -0700, Dave Hansen wrote:
> On 6/12/25 03:50, Pankaj Raghav wrote:
> > But to use huge_zero_folio, we need to pass a mm struct and the
> > put_folio needs to be called in the destructor. This makes sense for
> > systems that have memory constraints but for bigger servers, it does not
> > matter if the PMD size is reasonable (like in x86).
>
> So, what's the problem with calling a destructor?
>
> In your last patch, surely bio_add_folio() can put the page/folio when
> it's done. Is the real problem that you don't want to call zero page
> specific code at bio teardown?
Yeah, it feels like a lot of code on the caller's side just to use a
zero page. It would be nice to have a call similar to ZERO_PAGE() in
these subsystems, where we are guaranteed to get a huge zero page.
Apart from that, these are the problems with using
mm_get_huge_zero_folio() at the moment:
- We might end up allocating a 512MB PMD on ARM systems with a 64k base
  page size, which is undesirable. With the patch series posted, we only
  enable the static huge page for sane architectures and page sizes.
- In the current implementation we always call mm_put_huge_zero_folio()
  in __mmput()[1]. I am not sure this model will work for all
  subsystems. For example, bio completions can be async, i.e., we might
  need a reference to the zero page even if the process is no longer
  alive.
I will try to include these motivations in the cover letter next time.
Thanks
[1] 6fcb52a56ff6 ("thp: reduce usage of huge zero page's atomic counter")
--
Pankaj
* Re: [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine
2025-06-12 14:09 ` Dave Hansen
@ 2025-06-12 20:54 ` Pankaj Raghav (Samsung)
2025-06-16 9:14 ` David Hildenbrand
0 siblings, 1 reply; 19+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-06-12 20:54 UTC (permalink / raw)
To: Dave Hansen
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Mike Rapoport,
Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe, linux-kernel,
linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, hch
On Thu, Jun 12, 2025 at 07:09:34AM -0700, Dave Hansen wrote:
> On 6/12/25 03:50, Pankaj Raghav wrote:
> > +/*
> > + * mm_get_static_huge_zero_folio - Get a PMD sized zero folio
>
> Isn't that a rather inaccurate function name and comment?
I agree. I also felt it was not a good name for the function.
>
> The third line of the function literally returns a non-PMD-sized zero folio.
>
> > + * This function will return a PMD sized zero folio if CONFIG_STATIC_PMD_ZERO_PAGE
> > + * is enabled. Otherwise, a ZERO_PAGE folio is returned.
> > + *
> > + * Deduce the size of the folio with folio_size instead of assuming the
> > + * folio size.
> > + */
> > +static inline struct folio *mm_get_static_huge_zero_folio(void)
> > +{
> > + if(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
> > + return READ_ONCE(huge_zero_folio);
> > + return page_folio(ZERO_PAGE(0));
> > +}
>
> This doesn't tell us very much about when I should use:
>
> mm_get_static_huge_zero_folio()
> vs.
> mm_get_huge_zero_folio(mm)
> vs.
> page_folio(ZERO_PAGE(0))
>
> What's with the "mm_" in the name? Usually "mm" means "mm_struct" not
> Memory Management. It's really weird to prefix something that doesn't
> take an "mm_struct" with "mm_"
Got it. Actually, I was not aware of this one.
>
> Isn't the "get_" also a bad idea since mm_get_huge_zero_folio() does its
> own refcounting but this interface does not?
>
Agree.
> Shouldn't this be something more along the lines of:
>
> /*
> * pick_zero_folio() - Pick and return the largest available zero folio
> *
> * mm_get_huge_zero_folio() is preferred over this function. It is more
> * flexible and can provide a larger zero page under wider
> * circumstances.
> *
> * Only use this when there is no mm available.
> *
> * ... then other comments
> */
> static inline struct folio *pick_zero_folio(void)
> {
> if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
> return READ_ONCE(huge_zero_folio);
> return page_folio(ZERO_PAGE(0));
> }
>
> Or, maybe even name it _just_: zero_folio()
I think zero_folio() sounds like a good and straightforward name. In
most cases it will return a ZERO_PAGE() folio. If
CONFIG_STATIC_PMD_ZERO_PAGE is enabled, then we return a PMD page.
Thanks for all your comments Dave.
--
Pankaj
* Re: [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
2025-06-12 20:36 ` Pankaj Raghav (Samsung)
@ 2025-06-12 21:46 ` Dave Hansen
2025-06-13 8:58 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2025-06-12 21:46 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Mike Rapoport,
Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe, linux-kernel,
linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, hch
On 6/12/25 13:36, Pankaj Raghav (Samsung) wrote:
> On Thu, Jun 12, 2025 at 06:50:07AM -0700, Dave Hansen wrote:
>> On 6/12/25 03:50, Pankaj Raghav wrote:
>>> But to use huge_zero_folio, we need to pass a mm struct and the
>>> put_folio needs to be called in the destructor. This makes sense for
>>> systems that have memory constraints but for bigger servers, it does not
>>> matter if the PMD size is reasonable (like in x86).
>>
>> So, what's the problem with calling a destructor?
>>
>> In your last patch, surely bio_add_folio() can put the page/folio when
>> it's done. Is the real problem that you don't want to call zero page
>> specific code at bio teardown?
>
> Yeah, it feels like a lot of code on the caller's side just to use a zero page.
> It would be nice to have a call similar to ZERO_PAGE() in these
> subsystems where we have a guarantee of getting a huge zero page.
>
> Apart from that, the following problems arise if we use
> mm_get_huge_zero_folio() at the moment:
>
> - We might end up allocating a 512MB PMD page on ARM systems with a 64k base
> page size, which is undesirable. With the patch series posted, we will only
> enable the static huge page for sane architectures and page sizes.
Does *anybody* want the 512MB huge zero page? Maybe it should be an
opt-in at runtime or something.
> > - In the current implementation we always call mm_put_huge_zero_folio()
> > in __mmput()[1]. I am not sure if this model will work for all subsystems. For
> > example, bio completions can be async, i.e., we might need a reference
> > to the zero page even after the process is no longer alive.
The mm is a nice, convenient place to stick a refcount, but there are other
ways to keep an efficient refcount around. For instance, you could just
bump a per-cpu refcount and then have the shrinker sum up all the
refcounts to see if there are any outstanding on the system as a whole.
I understand that the current refcounts are tied to an mm, but you could
either replace the mm-specific ones or add something in parallel for
when there's no mm.
* Re: [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
2025-06-12 21:46 ` Dave Hansen
@ 2025-06-13 8:58 ` Pankaj Raghav (Samsung)
2025-06-16 9:12 ` David Hildenbrand
0 siblings, 1 reply; 19+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-06-13 8:58 UTC (permalink / raw)
To: Dave Hansen
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Mike Rapoport,
Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe, linux-kernel,
linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, hch
On Thu, Jun 12, 2025 at 02:46:34PM -0700, Dave Hansen wrote:
> On 6/12/25 13:36, Pankaj Raghav (Samsung) wrote:
> > On Thu, Jun 12, 2025 at 06:50:07AM -0700, Dave Hansen wrote:
> >> On 6/12/25 03:50, Pankaj Raghav wrote:
> >>> But to use huge_zero_folio, we need to pass a mm struct and the
> >>> put_folio needs to be called in the destructor. This makes sense for
> >>> systems that have memory constraints but for bigger servers, it does not
> >>> matter if the PMD size is reasonable (like in x86).
> >>
> >> So, what's the problem with calling a destructor?
> >>
> >> In your last patch, surely bio_add_folio() can put the page/folio when
> >> it's done. Is the real problem that you don't want to call zero page
> >> specific code at bio teardown?
> >
> > Yeah, it feels like a lot of code on the caller's side just to use a zero page.
> > It would be nice to have a call similar to ZERO_PAGE() in these
> > subsystems where we have a guarantee of getting a huge zero page.
> >
> > Apart from that, the following problems arise if we use
> > mm_get_huge_zero_folio() at the moment:
> >
> > - We might end up allocating a 512MB PMD page on ARM systems with a 64k base
> > page size, which is undesirable. With the patch series posted, we will only
> > enable the static huge page for sane architectures and page sizes.
>
> Does *anybody* want the 512MB huge zero page? Maybe it should be an
> opt-in at runtime or something.
>
Yeah, I think that needs to be fixed. David also pointed this out in one
of his earlier reviews[1].
> > - In the current implementation we always call mm_put_huge_zero_folio()
> > in __mmput()[1]. I am not sure if this model will work for all subsystems. For
> > example, bio completions can be async, i.e., we might need a reference
> > to the zero page even after the process is no longer alive.
>
> The mm is a nice, convenient place to stick a refcount, but there are other
> ways to keep an efficient refcount around. For instance, you could just
> bump a per-cpu refcount and then have the shrinker sum up all the
> refcounts to see if there are any outstanding on the system as a whole.
>
> I understand that the current refcounts are tied to an mm, but you could
> either replace the mm-specific ones or add something in parallel for
> when there's no mm.
But the whole idea of allocating a static PMD page for sane
architectures like x86 started with the intent of avoiding the refcounts and
shrinker.
This was the initial feedback I got[2]:
I mean, the whole thing about dynamically allocating/freeing it was for
memory-constrained systems. For large systems, we just don't care.
[1] https://lore.kernel.org/linux-mm/1e571419-9709-4898-9349-3d2eef0f8709@redhat.com/
[2] https://lore.kernel.org/linux-mm/cb52312d-348b-49d5-b0d7-0613fb38a558@redhat.com/
--
Pankaj
* Re: [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
2025-06-12 10:50 [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
` (5 preceding siblings ...)
2025-06-12 13:50 ` [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Dave Hansen
@ 2025-06-16 5:40 ` Christoph Hellwig
2025-06-16 9:00 ` Pankaj Raghav
6 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2025-06-16 5:40 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe, linux-kernel,
linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch
Just curious: why doesn't this series get rid of the iomap zero_page,
which would be really low hanging fruit?
* Re: [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
2025-06-16 5:40 ` Christoph Hellwig
@ 2025-06-16 9:00 ` Pankaj Raghav
0 siblings, 0 replies; 19+ messages in thread
From: Pankaj Raghav @ 2025-06-16 9:00 UTC (permalink / raw)
To: Christoph Hellwig, Pankaj Raghav
Cc: Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, David Hildenbrand, Lorenzo Stoakes,
Andrew Morton, Liam R . Howlett, Jens Axboe, linux-kernel,
linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev
On 6/16/25 07:40, Christoph Hellwig wrote:
> Just curious: why doesn't this series get rid of the iomap zero_page,
> which would be really low hanging fruit?
>
I did the conversion in my first RFC series [1]. But the implementation and API for the zero
page were still under heavy discussion. So I decided to focus on that instead and leave the
conversion for later, as I mentioned in the cover letter. I included blkdev_issue_zero_pages() as it
is a more straightforward conversion.
--
Pankaj
[1] https://lore.kernel.org/linux-mm/20250516101054.676046-4-p.raghav@samsung.com/
* Re: [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
2025-06-13 8:58 ` Pankaj Raghav (Samsung)
@ 2025-06-16 9:12 ` David Hildenbrand
2025-06-16 10:49 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2025-06-16 9:12 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), Dave Hansen
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Mike Rapoport,
Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, Lorenzo Stoakes, Andrew Morton,
Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm, willy, x86,
linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
hch
On 13.06.25 10:58, Pankaj Raghav (Samsung) wrote:
> On Thu, Jun 12, 2025 at 02:46:34PM -0700, Dave Hansen wrote:
>> On 6/12/25 13:36, Pankaj Raghav (Samsung) wrote:
>>> On Thu, Jun 12, 2025 at 06:50:07AM -0700, Dave Hansen wrote:
>>>> On 6/12/25 03:50, Pankaj Raghav wrote:
>>>>> But to use huge_zero_folio, we need to pass a mm struct and the
>>>>> put_folio needs to be called in the destructor. This makes sense for
>>>>> systems that have memory constraints but for bigger servers, it does not
>>>>> matter if the PMD size is reasonable (like in x86).
>>>>
>>>> So, what's the problem with calling a destructor?
>>>>
>>>> In your last patch, surely bio_add_folio() can put the page/folio when
>>>> it's done. Is the real problem that you don't want to call zero page
>>>> specific code at bio teardown?
>>>
>>> Yeah, it feels like a lot of code on the caller's side just to use a zero page.
>>> It would be nice to have a call similar to ZERO_PAGE() in these
>>> subsystems where we have a guarantee of getting a huge zero page.
>>>
>>> Apart from that, the following problems arise if we use
>>> mm_get_huge_zero_folio() at the moment:
>>>
>>> - We might end up allocating a 512MB PMD page on ARM systems with a 64k base
>>> page size, which is undesirable. With the patch series posted, we will only
>>> enable the static huge page for sane architectures and page sizes.
>>
>> Does *anybody* want the 512MB huge zero page? Maybe it should be an
>> opt-in at runtime or something.
>>
> Yeah, I think that needs to be fixed. David also pointed this out in one
> of his earlier reviews[1].
>
>>> - In the current implementation we always call mm_put_huge_zero_folio()
>>> in __mmput()[1]. I am not sure if this model will work for all subsystems. For
>>> example, bio completions can be async, i.e., we might need a reference
>>> to the zero page even after the process is no longer alive.
>>
>> The mm is a nice, convenient place to stick a refcount, but there are other
>> ways to keep an efficient refcount around. For instance, you could just
>> bump a per-cpu refcount and then have the shrinker sum up all the
>> refcounts to see if there are any outstanding on the system as a whole.
>>
>> I understand that the current refcounts are tied to an mm, but you could
>> either replace the mm-specific ones or add something in parallel for
>> when there's no mm.
>
> But the whole idea of allocating a static PMD page for sane
> architectures like x86 started with the intent of avoiding the refcounts and
> shrinker.
>
> This was the initial feedback I got[2]:
>
> I mean, the whole thing about dynamically allocating/freeing it was for
> memory-constrained systems. For large systems, we just don't care.
For non-mm usage we can just use the folio refcount. The per-mm
refcounts are all combined into a single folio refcount. The way the
global variable is managed based on per-mm refcounts is the weird thing.
In some corner cases we might end up having multiple instances of huge
zero folios right now. Just imagine:
1) Allocate huge zero folio during read fault
2) vmsplice() it
3) Unmap the huge zero folio
4) Shrinker runs and frees it
5) Repeat with 1)
As long as the folio is vmspliced(), it will not actually get freed ...
I would hope that we could remove the shrinker completely, and simply
never free the huge zero folio once allocated. Or at least, only free it
once it is actually no longer used.
--
Cheers,
David / dhildenb
* Re: [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine
2025-06-12 20:54 ` Pankaj Raghav (Samsung)
@ 2025-06-16 9:14 ` David Hildenbrand
2025-06-16 10:41 ` Pankaj Raghav (Samsung)
0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2025-06-16 9:14 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), Dave Hansen
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Mike Rapoport,
Michal Hocko, Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, Lorenzo Stoakes, Andrew Morton,
Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm, willy, x86,
linux-block, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
hch
On 12.06.25 22:54, Pankaj Raghav (Samsung) wrote:
> On Thu, Jun 12, 2025 at 07:09:34AM -0700, Dave Hansen wrote:
>> On 6/12/25 03:50, Pankaj Raghav wrote:
>>> +/*
>>> + * mm_get_static_huge_zero_folio - Get a PMD sized zero folio
>>
>> Isn't that a rather inaccurate function name and comment?
> I agree. I also felt it was not a good name for the function.
>
>>
>> The third line of the function literally returns a non-PMD-sized zero folio.
>>
>>> + * This function will return a PMD sized zero folio if CONFIG_STATIC_PMD_ZERO_PAGE
>>> + * is enabled. Otherwise, a ZERO_PAGE folio is returned.
>>> + *
>>> + * Deduce the size of the folio with folio_size instead of assuming the
>>> + * folio size.
>>> + */
>>> +static inline struct folio *mm_get_static_huge_zero_folio(void)
>>> +{
>>> + if(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
>>> + return READ_ONCE(huge_zero_folio);
>>> + return page_folio(ZERO_PAGE(0));
>>> +}
>>
>> This doesn't tell us very much about when I should use:
>>
>> mm_get_static_huge_zero_folio()
>> vs.
>> mm_get_huge_zero_folio(mm)
>> vs.
>> page_folio(ZERO_PAGE(0))
>>
>> What's with the "mm_" in the name? Usually "mm" means "mm_struct" not
>> Memory Management. It's really weird to prefix something that doesn't
>> take an "mm_struct" with "mm_"
>
> Got it. Actually, I was not aware of this one.
>
>>
>> Isn't the "get_" also a bad idea since mm_get_huge_zero_folio() does its
>> own refcounting but this interface does not?
>>
>
> Agree.
>
>> Shouldn't this be something more along the lines of:
>>
>> /*
>> * pick_zero_folio() - Pick and return the largest available zero folio
>> *
>> * mm_get_huge_zero_folio() is preferred over this function. It is more
>> * flexible and can provide a larger zero page under wider
>> * circumstances.
>> *
>> * Only use this when there is no mm available.
>> *
>> * ... then other comments
>> */
>> static inline struct folio *pick_zero_folio(void)
>> {
>> if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
>> return READ_ONCE(huge_zero_folio);
>> return page_folio(ZERO_PAGE(0));
>> }
>>
>> Or, maybe even name it _just_: zero_folio()
>
> I think zero_folio() sounds like a good and straightforward name. In
> most cases it will return a ZERO_PAGE() folio. If
> CONFIG_STATIC_PMD_ZERO_PAGE is enabled, then we return a PMD page.
"zero_folio" would be confusing I'm afraid.
At least with current "is_zero_folio" etc.
"largest_zero_folio" or sth. like that might make it clearer that the
size we are getting back might actually differ.
--
Cheers,
David / dhildenb
* Re: [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine
2025-06-16 9:14 ` David Hildenbrand
@ 2025-06-16 10:41 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 19+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-06-16 10:41 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dave Hansen, Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts,
Mike Rapoport, Michal Hocko, Thomas Gleixner, Nico Pache,
Dev Jain, Baolin Wang, Borislav Petkov, Ingo Molnar,
H . Peter Anvin, Vlastimil Babka, Zi Yan, Dave Hansen,
Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Jens Axboe,
linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, hch
On Mon, Jun 16, 2025 at 11:14:07AM +0200, David Hildenbrand wrote:
> On 12.06.25 22:54, Pankaj Raghav (Samsung) wrote:
> > On Thu, Jun 12, 2025 at 07:09:34AM -0700, Dave Hansen wrote:
> > > On 6/12/25 03:50, Pankaj Raghav wrote:
> > > > +/*
> > > > + * mm_get_static_huge_zero_folio - Get a PMD sized zero folio
> > >
> > > Isn't that a rather inaccurate function name and comment?
> > I agree. I also felt it was not a good name for the function.
> >
> > >
> > > The third line of the function literally returns a non-PMD-sized zero folio.
> > >
> > > > + * This function will return a PMD sized zero folio if CONFIG_STATIC_PMD_ZERO_PAGE
> > > > + * is enabled. Otherwise, a ZERO_PAGE folio is returned.
> > > > + *
> > > > + * Deduce the size of the folio with folio_size instead of assuming the
> > > > + * folio size.
> > > > + */
> > > > +static inline struct folio *mm_get_static_huge_zero_folio(void)
> > > > +{
> > > > + if(IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
> > > > + return READ_ONCE(huge_zero_folio);
> > > > + return page_folio(ZERO_PAGE(0));
> > > > +}
> > >
> > > This doesn't tell us very much about when I should use:
> > >
> > > mm_get_static_huge_zero_folio()
> > > vs.
> > > mm_get_huge_zero_folio(mm)
> > > vs.
> > > page_folio(ZERO_PAGE(0))
> > >
> > > What's with the "mm_" in the name? Usually "mm" means "mm_struct" not
> > > Memory Management. It's really weird to prefix something that doesn't
> > > take an "mm_struct" with "mm_"
> >
> > Got it. Actually, I was not aware of this one.
> >
> > >
> > > Isn't the "get_" also a bad idea since mm_get_huge_zero_folio() does its
> > > own refcounting but this interface does not?
> > >
> >
> > Agree.
> >
> > > Shouldn't this be something more along the lines of:
> > >
> > > /*
> > > * pick_zero_folio() - Pick and return the largest available zero folio
> > > *
> > > * mm_get_huge_zero_folio() is preferred over this function. It is more
> > > * flexible and can provide a larger zero page under wider
> > > * circumstances.
> > > *
> > > * Only use this when there is no mm available.
> > > *
> > > * ... then other comments
> > > */
> > > static inline struct folio *pick_zero_folio(void)
> > > {
> > > if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
> > > return READ_ONCE(huge_zero_folio);
> > > return page_folio(ZERO_PAGE(0));
> > > }
> > >
> > > Or, maybe even name it _just_: zero_folio()
> >
> > I think zero_folio() sounds like a good and straightforward name. In
> > most cases it will return a ZERO_PAGE() folio. If
> > CONFIG_STATIC_PMD_ZERO_PAGE is enabled, then we return a PMD page.
>
> "zero_folio" would be confusing I'm afraid.
>
> At least with current "is_zero_folio" etc.
>
> "largest_zero_folio" or sth. like that might make it clearer that the size
> we are getting back might actually differ.
>
That makes sense. I can change that in the next revision.
--
Pankaj
* Re: [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option
2025-06-16 9:12 ` David Hildenbrand
@ 2025-06-16 10:49 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 19+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-06-16 10:49 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dave Hansen, Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts,
Mike Rapoport, Michal Hocko, Thomas Gleixner, Nico Pache,
Dev Jain, Baolin Wang, Borislav Petkov, Ingo Molnar,
H . Peter Anvin, Vlastimil Babka, Zi Yan, Dave Hansen,
Lorenzo Stoakes, Andrew Morton, Liam R . Howlett, Jens Axboe,
linux-kernel, linux-mm, willy, x86, linux-block, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, hch
> > >
> > > The mm is a nice, convenient place to stick a refcount, but there are other
> > > ways to keep an efficient refcount around. For instance, you could just
> > > bump a per-cpu refcount and then have the shrinker sum up all the
> > > refcounts to see if there are any outstanding on the system as a whole.
> > >
> > > I understand that the current refcounts are tied to an mm, but you could
> > > either replace the mm-specific ones or add something in parallel for
> > > when there's no mm.
> >
> > But the whole idea of allocating a static PMD page for sane
> > architectures like x86 started with the intent of avoiding the refcounts and
> > shrinker.
> >
> > This was the initial feedback I got[2]:
> >
> > I mean, the whole thing about dynamically allocating/freeing it was for
> > memory-constrained systems. For large systems, we just don't care.
>
> For non-mm usage we can just use the folio refcount. The per-mm refcounts
> are all combined into a single folio refcount. The way the global variable
> is managed based on per-mm refcounts is the weird thing.
>
> In some corner cases we might end up having multiple instances of huge zero
> folios right now. Just imagine:
>
> 1) Allocate huge zero folio during read fault
> 2) vmsplice() it
> 3) Unmap the huge zero folio
> 4) Shrinker runs and frees it
> 5) Repeat with 1)
>
> As long as the folio is vmspliced(), it will not actually get freed ...
>
> I would hope that we could remove the shrinker completely, and simply never
> free the huge zero folio once allocated. Or at least, only free it once it
> is actually no longer used.
>
Thanks for the explanation, David.
But I am still a bit confused about how to proceed with these patches.
So IIUC, our eventual goal is to get rid of the shrinker.
But do we still want to add a static PMD page in the .bss, or do we take
an alternate approach here?
--
Pankaj
* Re: [PATCH 3/5] mm: add static PMD zero page
2025-06-12 10:50 ` [PATCH 3/5] mm: add static PMD zero page Pankaj Raghav
@ 2025-06-24 8:51 ` kernel test robot
0 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-06-24 8:51 UTC (permalink / raw)
To: Pankaj Raghav
Cc: oe-lkp, lkp, David Hildenbrand, linux-kernel, linux-mm,
Suren Baghdasaryan, Ryan Roberts, Mike Rapoport, Michal Hocko,
Thomas Gleixner, Nico Pache, Dev Jain, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Vlastimil Babka,
Zi Yan, Dave Hansen, Lorenzo Stoakes, Andrew Morton,
Liam R . Howlett, Jens Axboe, willy, x86, linux-block,
linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev, kernel, hch,
Pankaj Raghav, oliver.sang
Hello,
kernel test robot noticed "WARNING:at_mm/gup.c:#try_grab_folio" on:
commit: 8e628a9d6cc5c377ae06b7821f8280cd6ff2a20f ("[PATCH 3/5] mm: add static PMD zero page")
url: https://github.com/intel-lab-lkp/linux/commits/Pankaj-Raghav/mm-move-huge_zero_page-declaration-from-huge_mm-h-to-mm-h/20250612-185248
patch link: https://lore.kernel.org/all/20250612105100.59144-4-p.raghav@samsung.com/
patch subject: [PATCH 3/5] mm: add static PMD zero page
in testcase: trinity
version: trinity-x86_64-ba2360ed-1_20241228
with following parameters:
runtime: 300s
group: group-03
nr_groups: 5
config: x86_64-randconfig-077-20250618
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202506201441.2f96266-lkp@intel.com
[ 379.105772][ T4274] ------------[ cut here ]------------
[ 379.107617][ T4274] WARNING: CPU: 0 PID: 4274 at mm/gup.c:148 try_grab_folio (mm/gup.c:148 (discriminator 12))
[ 379.109660][ T4274] Modules linked in:
[ 379.111018][ T4274] CPU: 0 UID: 65534 PID: 4274 Comm: trinity-c3 Not tainted 6.16.0-rc1-00003-g8e628a9d6cc5 #1 PREEMPT(voluntary)
[ 379.113741][ T4274] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 379.116285][ T4274] RIP: 0010:try_grab_folio (mm/gup.c:148 (discriminator 12))
[ 379.117678][ T4274] Code: 00 48 01 1d 6f 95 3f 0b 48 c7 c7 38 95 55 8f be 08 00 00 00 e8 76 98 0f 00 48 01 1d 47 08 ac 0d e9 e4 fe ff ff e8 c5 2c cd ff <0f> 0b b8 f4 ff ff ff e9 d5 fe ff ff 44 89 f1 80 e1 07 80 c1 03 38
All code
========
0: 00 48 01 add %cl,0x1(%rax)
3: 1d 6f 95 3f 0b sbb $0xb3f956f,%eax
8: 48 c7 c7 38 95 55 8f mov $0xffffffff8f559538,%rdi
f: be 08 00 00 00 mov $0x8,%esi
14: e8 76 98 0f 00 call 0xf988f
19: 48 01 1d 47 08 ac 0d add %rbx,0xdac0847(%rip) # 0xdac0867
20: e9 e4 fe ff ff jmp 0xffffffffffffff09
25: e8 c5 2c cd ff call 0xffffffffffcd2cef
2a:* 0f 0b ud2 <-- trapping instruction
2c: b8 f4 ff ff ff mov $0xfffffff4,%eax
31: e9 d5 fe ff ff jmp 0xffffffffffffff0b
36: 44 89 f1 mov %r14d,%ecx
39: 80 e1 07 and $0x7,%cl
3c: 80 c1 03 add $0x3,%cl
3f: 38 .byte 0x38
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: b8 f4 ff ff ff mov $0xfffffff4,%eax
7: e9 d5 fe ff ff jmp 0xfffffffffffffee1
c: 44 89 f1 mov %r14d,%ecx
f: 80 e1 07 and $0x7,%cl
12: 80 c1 03 add $0x3,%cl
15: 38 .byte 0x38
[ 379.122288][ T4274] RSP: 0018:ffffc90003eafc00 EFLAGS: 00010246
[ 379.123803][ T4274] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 379.125678][ T4274] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 379.127640][ T4274] RBP: 0000000000210008 R08: 0000000000000000 R09: 0000000000000000
[ 379.129505][ T4274] R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
[ 379.131448][ T4274] R13: ffffea0000398000 R14: ffffea0000398034 R15: ffffea0000398000
[ 379.133373][ T4274] FS: 00007f8feed44740(0000) GS:0000000000000000(0000) knlGS:0000000000000000
[ 379.135522][ T4274] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 379.137151][ T4274] CR2: 000000000000006e CR3: 0000000157d77000 CR4: 00000000000406f0
[ 379.139063][ T4274] Call Trace:
[ 379.139969][ T4274] <TASK>
[ 379.140739][ T4274] follow_huge_pmd (mm/gup.c:767)
[ 379.141902][ T4274] __get_user_pages (mm/gup.c:993)
[ 379.143221][ T4274] populate_vma_page_range (mm/gup.c:1926 (discriminator 1))
[ 379.144519][ T4274] __mm_populate (mm/gup.c:2029)
[ 379.145559][ T4274] vm_mmap_pgoff (include/linux/mm.h:? mm/util.c:584)
[ 379.146769][ T4274] ? entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 379.148226][ T4274] do_syscall_64 (arch/x86/entry/syscall_64.c:?)
[ 379.149357][ T4274] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:473 (discriminator 3))
[ 379.150866][ T4274] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
[ 379.152205][ T4274] RIP: 0033:0x7f8feee48719
[ 379.153354][ T4274] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
All code
========
0: 08 89 e8 5b 5d c3 or %cl,-0x3ca2a418(%rcx)
6: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
d: 00 00 00
10: 90 nop
11: 48 89 f8 mov %rdi,%rax
14: 48 89 f7 mov %rsi,%rdi
17: 48 89 d6 mov %rdx,%rsi
1a: 48 89 ca mov %rcx,%rdx
1d: 4d 89 c2 mov %r8,%r10
20: 4d 89 c8 mov %r9,%r8
23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 ret
33: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06f1
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 ret
9: 48 8b 0d b7 06 0d 00 mov 0xd06b7(%rip),%rcx # 0xd06c7
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
[ 379.157899][ T4274] RSP: 002b:00007ffc477ec658 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
[ 379.159864][ T4274] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f8feee48719
[ 379.161627][ T4274] RDX: 0000000000000004 RSI: 0000000000200000 RDI: 0000000000000000
[ 379.163409][ T4274] RBP: 00007f8fed769058 R08: ffffffffffffffff R09: 0000000000000000
[ 379.165255][ T4274] R10: 0000000004008862 R11: 0000000000000246 R12: 0000000000000009
[ 379.167147][ T4274] R13: 00007f8feed446c0 R14: 00007f8fed769058 R15: 00007f8fed769000
[ 379.169043][ T4274] </TASK>
[ 379.169891][ T4274] irq event stamp: 771243
[ 379.170971][ T4274] hardirqs last enabled at (771255): __console_unlock (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:119 arch/x86/include/asm/irqflags.h:159 kernel/printk/printk.c:344 kernel/printk/printk.c:2885)
[ 379.173320][ T4274] hardirqs last disabled at (771278): __console_unlock (kernel/printk/printk.c:342 (discriminator 9) kernel/printk/printk.c:2885 (discriminator 9))
[ 379.175642][ T4274] softirqs last enabled at (771272): handle_softirqs (arch/x86/include/asm/preempt.h:27 kernel/softirq.c:426 kernel/softirq.c:607)
[ 379.177979][ T4274] softirqs last disabled at (771263): __irq_exit_rcu (arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:682)
[ 379.180133][ T4274] ---[ end trace 0000000000000000 ]---
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250620/202506201441.2f96266-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Thread overview: 19+ messages
2025-06-12 10:50 [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
2025-06-12 10:50 ` [PATCH 1/5] mm: move huge_zero_page declaration from huge_mm.h to mm.h Pankaj Raghav
2025-06-12 10:50 ` [PATCH 2/5] huge_memory: add huge_zero_page_shrinker_(init|exit) function Pankaj Raghav
2025-06-12 10:50 ` [PATCH 3/5] mm: add static PMD zero page Pankaj Raghav
2025-06-24 8:51 ` kernel test robot
2025-06-12 10:50 ` [PATCH 4/5] mm: add mm_get_static_huge_zero_folio() routine Pankaj Raghav
2025-06-12 14:09 ` Dave Hansen
2025-06-12 20:54 ` Pankaj Raghav (Samsung)
2025-06-16 9:14 ` David Hildenbrand
2025-06-16 10:41 ` Pankaj Raghav (Samsung)
2025-06-12 10:51 ` [PATCH 5/5] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav
2025-06-12 13:50 ` [PATCH 0/5] add STATIC_PMD_ZERO_PAGE config option Dave Hansen
2025-06-12 20:36 ` Pankaj Raghav (Samsung)
2025-06-12 21:46 ` Dave Hansen
2025-06-13 8:58 ` Pankaj Raghav (Samsung)
2025-06-16 9:12 ` David Hildenbrand
2025-06-16 10:49 ` Pankaj Raghav (Samsung)
2025-06-16 5:40 ` Christoph Hellwig
2025-06-16 9:00 ` Pankaj Raghav