* [RFC 0/3] add STATIC_PMD_ZERO_PAGE config option
@ 2025-05-27 5:04 Pankaj Raghav
2025-05-27 5:04 ` [RFC 1/3] mm: move huge_zero_folio from huge_memory.c to memory.c Pankaj Raghav
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Pankaj Raghav @ 2025-05-27 5:04 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
Dev Jain, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, linux-block, willy, x86, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
is limited to PAGE_SIZE.
This concern was raised during the review of adding Large Block Size support
to XFS[1][2].
This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
a larger zero page as part of a single bvec.
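As a rough illustration (a sketch, not code from this series; it assumes a
PMD-sized zero folio is available to the caller):

	/* Today: zeroing 2M of data needs 512 single-page bvecs. */
	static void zero_fill_bio_today(struct bio *bio, unsigned int bytes)
	{
		while (bytes) {
			unsigned int len = min_t(unsigned int, bytes, PAGE_SIZE);

			if (bio_add_page(bio, ZERO_PAGE(0), len, 0) != len)
				break;	/* bio is full */
			bytes -= len;
		}
	}

	/* With a PMD-sized zero folio, the same range fits in one multipage bvec. */
	static void zero_fill_bio_huge(struct bio *bio, unsigned int bytes,
				       struct folio *zero_folio)
	{
		unsigned int len = min_t(unsigned int, bytes,
					 folio_size(zero_folio));

		if (!bio_add_folio(bio, zero_folio, len, 0))
			return;	/* bio is full */
	}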
Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...
We already have huge_zero_folio, which is allocated on demand and
deallocated by the shrinker once there are no users of it left.
But to use huge_zero_folio, we need to pass an mm struct, and a put
needs to be called in the mm's destructor. This makes sense for
memory-constrained systems, but for bigger servers it does not matter
as long as the PMD size is reasonable (as on x86).
Add a config option STATIC_PMD_ZERO_PAGE that always allocates the
huge_zero_folio and never frees it. This makes it possible to use the
huge_zero_folio without passing any mm struct and without a put call in
a destructor.
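A caller can then do something like the following (minimal sketch; this is
essentially what patch 3 does, with a ZERO_PAGE fallback for configurations
where the option is disabled):

	struct folio *zero_folio = mm_get_huge_zero_folio(NULL);

	if (!zero_folio)
		zero_folio = page_folio(ZERO_PAGE(0));
	/* ... add zero_folio to the bio; no put is needed afterwards ... */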
I have converted blkdev_issue_zero_pages() as an example in this
series.
I will send patches to individual subsystems using the huge_zero_folio
once this gets upstreamed.
Looking forward to some feedback.
[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
Changes since v1:
- Added the config option based on the feedback from David.
- Removed iomap patches so that I don't clutter this series with too
many subsystems.
Pankaj Raghav (3):
mm: move huge_zero_folio from huge_memory.c to memory.c
mm: add STATIC_PMD_ZERO_PAGE config option
block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
arch/x86/Kconfig | 1 +
block/blk-lib.c | 16 ++++--
include/linux/huge_mm.h | 16 ------
include/linux/mm.h | 16 ++++++
mm/Kconfig | 12 ++++
mm/huge_memory.c | 105 +---------------------------------
mm/memory.c | 121 ++++++++++++++++++++++++++++++++++++++++
7 files changed, 164 insertions(+), 123 deletions(-)
base-commit: f1f6aceb82a55f87d04e2896ac3782162e7859bd
--
2.47.2
* [RFC 1/3] mm: move huge_zero_folio from huge_memory.c to memory.c
2025-05-27 5:04 [RFC 0/3] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
@ 2025-05-27 5:04 ` Pankaj Raghav
2025-05-27 5:04 ` [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
2025-05-27 5:04 ` [RFC 3/3] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav
2 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav @ 2025-05-27 5:04 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
Dev Jain, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, linux-block, willy, x86, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
The huge_zero_folio was initially placed in huge_memory.c as most of its
users were in that file. But it does not depend on THP, so it can just as
well live in memory.c.
As huge_zero_folio is going to be exposed to more users outside of mm,
let's move it to memory.c.
This is a prep patch to add CONFIG_STATIC_PMD_ZERO_PAGE. No functional
changes.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
include/linux/huge_mm.h | 16 ------
include/linux/mm.h | 16 ++++++
mm/huge_memory.c | 105 +---------------------------------------
mm/memory.c | 99 +++++++++++++++++++++++++++++++++++++
4 files changed, 117 insertions(+), 119 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..d48973a6bd0f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -478,22 +478,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
-extern struct folio *huge_zero_folio;
-extern unsigned long huge_zero_pfn;
-
-static inline bool is_huge_zero_folio(const struct folio *folio)
-{
- return READ_ONCE(huge_zero_folio) == folio;
-}
-
-static inline bool is_huge_zero_pmd(pmd_t pmd)
-{
- return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
-}
-
-struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
-void mm_put_huge_zero_folio(struct mm_struct *mm);
-
static inline bool thp_migration_supported(void)
{
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cd2e513189d6..58d150dfc2da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -69,6 +69,22 @@ static inline void totalram_pages_add(long count)
extern void * high_memory;
+extern struct folio *huge_zero_folio;
+extern unsigned long huge_zero_pfn;
+
+static inline bool is_huge_zero_folio(const struct folio *folio)
+{
+ return READ_ONCE(huge_zero_folio) == folio;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
+}
+
+struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
+void mm_put_huge_zero_folio(struct mm_struct *mm);
+
#ifdef CONFIG_SYSCTL
extern int sysctl_legacy_va_layout;
#else
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..c6e203abb2de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -75,9 +75,6 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc);
static bool split_underused_thp = true;
-static atomic_t huge_zero_refcount;
-struct folio *huge_zero_folio __read_mostly;
-unsigned long huge_zero_pfn __read_mostly = ~0UL;
unsigned long huge_anon_orders_always __read_mostly;
unsigned long huge_anon_orders_madvise __read_mostly;
unsigned long huge_anon_orders_inherit __read_mostly;
@@ -208,88 +205,6 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
return orders;
}
-static bool get_huge_zero_page(void)
-{
- struct folio *zero_folio;
-retry:
- if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
- return true;
-
- zero_folio = folio_alloc((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
- HPAGE_PMD_ORDER);
- if (!zero_folio) {
- count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
- return false;
- }
- /* Ensure zero folio won't have large_rmappable flag set. */
- folio_clear_large_rmappable(zero_folio);
- preempt_disable();
- if (cmpxchg(&huge_zero_folio, NULL, zero_folio)) {
- preempt_enable();
- folio_put(zero_folio);
- goto retry;
- }
- WRITE_ONCE(huge_zero_pfn, folio_pfn(zero_folio));
-
- /* We take additional reference here. It will be put back by shrinker */
- atomic_set(&huge_zero_refcount, 2);
- preempt_enable();
- count_vm_event(THP_ZERO_PAGE_ALLOC);
- return true;
-}
-
-static void put_huge_zero_page(void)
-{
- /*
- * Counter should never go to zero here. Only shrinker can put
- * last reference.
- */
- BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
-}
-
-struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
-{
- if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
- return READ_ONCE(huge_zero_folio);
-
- if (!get_huge_zero_page())
- return NULL;
-
- if (test_and_set_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
- put_huge_zero_page();
-
- return READ_ONCE(huge_zero_folio);
-}
-
-void mm_put_huge_zero_folio(struct mm_struct *mm)
-{
- if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
- put_huge_zero_page();
-}
-
-static unsigned long shrink_huge_zero_page_count(struct shrinker *shrink,
- struct shrink_control *sc)
-{
- /* we can free zero page only if last reference remains */
- return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
-}
-
-static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
- struct shrink_control *sc)
-{
- if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
- struct folio *zero_folio = xchg(&huge_zero_folio, NULL);
- BUG_ON(zero_folio == NULL);
- WRITE_ONCE(huge_zero_pfn, ~0UL);
- folio_put(zero_folio);
- return HPAGE_PMD_NR;
- }
-
- return 0;
-}
-
-static struct shrinker *huge_zero_page_shrinker;
-
#ifdef CONFIG_SYSFS
static ssize_t enabled_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
@@ -850,22 +765,12 @@ static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
static int __init thp_shrinker_init(void)
{
- huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
- if (!huge_zero_page_shrinker)
- return -ENOMEM;
-
deferred_split_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
SHRINKER_MEMCG_AWARE |
SHRINKER_NONSLAB,
"thp-deferred_split");
- if (!deferred_split_shrinker) {
- shrinker_free(huge_zero_page_shrinker);
+ if (!deferred_split_shrinker)
return -ENOMEM;
- }
-
- huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
- huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
- shrinker_register(huge_zero_page_shrinker);
deferred_split_shrinker->count_objects = deferred_split_count;
deferred_split_shrinker->scan_objects = deferred_split_scan;
@@ -874,12 +779,6 @@ static int __init thp_shrinker_init(void)
return 0;
}
-static void __init thp_shrinker_exit(void)
-{
- shrinker_free(huge_zero_page_shrinker);
- shrinker_free(deferred_split_shrinker);
-}
-
static int __init hugepage_init(void)
{
int err;
@@ -923,7 +822,7 @@ static int __init hugepage_init(void)
return 0;
err_khugepaged:
- thp_shrinker_exit();
+ shrinker_free(deferred_split_shrinker);
err_shrinker:
khugepaged_destroy();
err_slab:
diff --git a/mm/memory.c b/mm/memory.c
index 5cb48f262ab0..11edc4d66e74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -159,6 +159,105 @@ static int __init init_zero_pfn(void)
}
early_initcall(init_zero_pfn);
+static atomic_t huge_zero_refcount;
+struct folio *huge_zero_folio __read_mostly;
+unsigned long huge_zero_pfn __read_mostly = ~0UL;
+static struct shrinker *huge_zero_page_shrinker;
+
+static bool get_huge_zero_page(void)
+{
+ struct folio *zero_folio;
+retry:
+ if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+ return true;
+
+ zero_folio = folio_alloc((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+ HPAGE_PMD_ORDER);
+ if (!zero_folio) {
+ count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
+ return false;
+ }
+ /* Ensure zero folio won't have large_rmappable flag set. */
+ folio_clear_large_rmappable(zero_folio);
+ preempt_disable();
+ if (cmpxchg(&huge_zero_folio, NULL, zero_folio)) {
+ preempt_enable();
+ folio_put(zero_folio);
+ goto retry;
+ }
+ WRITE_ONCE(huge_zero_pfn, folio_pfn(zero_folio));
+
+ /* We take additional reference here. It will be put back by shrinker */
+ atomic_set(&huge_zero_refcount, 2);
+ preempt_enable();
+ count_vm_event(THP_ZERO_PAGE_ALLOC);
+ return true;
+}
+
+static void put_huge_zero_page(void)
+{
+ /*
+ * Counter should never go to zero here. Only shrinker can put
+ * last reference.
+ */
+ BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
+}
+
+struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
+{
+ if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+ return READ_ONCE(huge_zero_folio);
+
+ if (!get_huge_zero_page())
+ return NULL;
+
+ if (test_and_set_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+ put_huge_zero_page();
+
+ return READ_ONCE(huge_zero_folio);
+}
+
+void mm_put_huge_zero_folio(struct mm_struct *mm)
+{
+ if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+ put_huge_zero_page();
+}
+
+static unsigned long shrink_huge_zero_page_count(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ /* we can free zero page only if last reference remains */
+ return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+}
+
+static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+ struct folio *zero_folio = xchg(&huge_zero_folio, NULL);
+ BUG_ON(zero_folio == NULL);
+ WRITE_ONCE(huge_zero_pfn, ~0UL);
+ folio_put(zero_folio);
+ return HPAGE_PMD_NR;
+ }
+
+ return 0;
+}
+
+static int __init init_huge_zero_page(void)
+{
+ huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
+ if (!huge_zero_page_shrinker)
+ return -ENOMEM;
+
+ huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
+ huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
+ shrinker_register(huge_zero_page_shrinker);
+
+ return 0;
+}
+early_initcall(init_huge_zero_page);
+
void mm_trace_rss_stat(struct mm_struct *mm, int member)
{
trace_rss_stat(mm, member);
--
2.47.2
* [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-05-27 5:04 [RFC 0/3] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
2025-05-27 5:04 ` [RFC 1/3] mm: move huge_zero_folio from huge_memory.c to memory.c Pankaj Raghav
@ 2025-05-27 5:04 ` Pankaj Raghav
2025-05-27 16:37 ` Dave Hansen
2025-06-02 5:03 ` Christoph Hellwig
2025-05-27 5:04 ` [RFC 3/3] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav
2 siblings, 2 replies; 15+ messages in thread
From: Pankaj Raghav @ 2025-05-27 5:04 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
Dev Jain, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, linux-block, willy, x86, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
is limited to PAGE_SIZE.
This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
a larger zero page as part of a single bvec.
This concern was raised during the review of adding LBS support to
XFS[1][2].
Usually huge_zero_folio is allocated on demand, and it is deallocated by
the shrinker once there are no users of it left.
Add a config option STATIC_PMD_ZERO_PAGE that always allocates the
huge_zero_folio and never frees it. This makes it possible to use the
huge_zero_folio without passing any mm struct and without a put call in
a destructor.
We can enable it by default on x86_64, where the PMD size is 2M. That is
a good compromise between memory usage and efficiency.
As a THP-sized zero page might be wasteful on architectures with bigger
page sizes, let's not enable it for them.
[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
arch/x86/Kconfig | 1 +
mm/Kconfig | 12 ++++++++++++
mm/memory.c | 30 ++++++++++++++++++++++++++----
3 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 055204dc211d..96f99b4f96ea 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
select ARCH_WANTS_THP_SWAP if X86_64
+ select ARCH_WANTS_STATIC_PMD_ZERO_PAGE if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
select CLKEVT_I8253
diff --git a/mm/Kconfig b/mm/Kconfig
index bd08e151fa1b..8f50f5c3f7a7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,18 @@ config ARCH_WANTS_THP_SWAP
config MM_ID
def_bool n
+config ARCH_WANTS_STATIC_PMD_ZERO_PAGE
+ bool
+
+config STATIC_PMD_ZERO_PAGE
+ def_bool y
+ depends on ARCH_WANTS_STATIC_PMD_ZERO_PAGE
+ help
+ Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
+ on demand and deallocated when not in use. This option will always
+ allocate huge_zero_folio for zeroing and it is never deallocated.
+ Not suitable for memory constrained systems.
+
menuconfig TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
diff --git a/mm/memory.c b/mm/memory.c
index 11edc4d66e74..ab8c16d04307 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -203,9 +203,17 @@ static void put_huge_zero_page(void)
BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
}
+/*
+ * If STATIC_PMD_ZERO_PAGE is enabled, @mm can be NULL, i.e, the huge_zero_folio
+ * is not associated with any mm_struct.
+*/
struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
{
- if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+ if (!IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE) && !mm)
+ return NULL;
+
+ if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE) ||
+ test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
return READ_ONCE(huge_zero_folio);
if (!get_huge_zero_page())
@@ -219,6 +227,9 @@ struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
void mm_put_huge_zero_folio(struct mm_struct *mm)
{
+ if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
+ return;
+
if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
put_huge_zero_page();
}
@@ -246,15 +257,26 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
static int __init init_huge_zero_page(void)
{
+ int ret = 0;
+
+ if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE)) {
+ if (!get_huge_zero_page())
+ ret = -ENOMEM;
+ goto out;
+ }
+
huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
- if (!huge_zero_page_shrinker)
- return -ENOMEM;
+ if (!huge_zero_page_shrinker) {
+ ret = -ENOMEM;
+ goto out;
+ }
huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
shrinker_register(huge_zero_page_shrinker);
- return 0;
+out:
+ return ret;
}
early_initcall(init_huge_zero_page);
--
2.47.2
* [RFC 3/3] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
2025-05-27 5:04 [RFC 0/3] add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
2025-05-27 5:04 ` [RFC 1/3] mm: move huge_zero_folio from huge_memory.c to memory.c Pankaj Raghav
2025-05-27 5:04 ` [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
@ 2025-05-27 5:04 ` Pankaj Raghav
2025-06-02 5:05 ` Christoph Hellwig
2 siblings, 1 reply; 15+ messages in thread
From: Pankaj Raghav @ 2025-05-27 5:04 UTC (permalink / raw)
To: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
Dev Jain, Liam R . Howlett, Jens Axboe
Cc: linux-kernel, linux-mm, linux-block, willy, x86, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch, Pankaj Raghav
Use mm_huge_zero_folio in __blkdev_issue_zero_pages(). Fall back to
ZERO_PAGE if mm_huge_zero_folio is not available.
On systems that allocate the huge_zero_folio, we will end up sending larger
bvecs instead of multiple small ones.
Noticed a 4% increase in performance on a commercial NVMe SSD which does
not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
gains might be bigger if the device supports bigger MDTS.
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
block/blk-lib.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/block/blk-lib.c b/block/blk-lib.c
index 4c9f20a689f7..0fd55e028170 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -4,6 +4,7 @@
*/
#include <linux/kernel.h>
#include <linux/module.h>
+#include <linux/mm.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/scatterlist.h>
@@ -196,6 +197,12 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
sector_t sector, sector_t nr_sects, gfp_t gfp_mask,
struct bio **biop, unsigned int flags)
{
+ struct folio *zero_folio;
+
+ zero_folio = mm_get_huge_zero_folio(NULL);
+ if (!zero_folio)
+ zero_folio = page_folio(ZERO_PAGE(0));
+
while (nr_sects) {
unsigned int nr_vecs = __blkdev_sectors_to_bio_pages(nr_sects);
struct bio *bio;
@@ -208,11 +215,12 @@ static void __blkdev_issue_zero_pages(struct block_device *bdev,
break;
do {
- unsigned int len, added;
+ unsigned int len, added = 0;
- len = min_t(sector_t,
- PAGE_SIZE, nr_sects << SECTOR_SHIFT);
- added = bio_add_page(bio, ZERO_PAGE(0), len, 0);
+ len = min_t(sector_t, folio_size(zero_folio),
+ nr_sects << SECTOR_SHIFT);
+ if (bio_add_folio(bio, zero_folio, len, 0))
+ added = len;
if (added < len)
break;
nr_sects -= added >> SECTOR_SHIFT;
--
2.47.2
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-05-27 5:04 ` [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
@ 2025-05-27 16:37 ` Dave Hansen
2025-05-27 18:00 ` Pankaj Raghav (Samsung)
2025-06-02 5:03 ` Christoph Hellwig
1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2025-05-27 16:37 UTC (permalink / raw)
To: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka,
Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
Thomas Gleixner, Nico Pache, Dev Jain, Liam R . Howlett,
Jens Axboe
Cc: linux-kernel, linux-mm, linux-block, willy, x86, linux-fsdevel,
Darrick J . Wong, mcgrof, gost.dev, kernel, hch
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 055204dc211d..96f99b4f96ea 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -152,6 +152,7 @@ config X86
> select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
> select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
> select ARCH_WANTS_THP_SWAP if X86_64
> + select ARCH_WANTS_STATIC_PMD_ZERO_PAGE if X86_64
I don't think this should be the default. There are lots of little
x86_64 VMs sitting around and 2MB might be significant to them.
> +config ARCH_WANTS_STATIC_PMD_ZERO_PAGE
> + bool
> +
> +config STATIC_PMD_ZERO_PAGE
> + def_bool y
> + depends on ARCH_WANTS_STATIC_PMD_ZERO_PAGE
> + help
> + Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
> + on demand and deallocated when not in use. This option will always
> + allocate huge_zero_folio for zeroing and it is never deallocated.
> + Not suitable for memory constrained systems.
"Static" seems like a weird term to use for this. I was really expecting
to see a 2MB object that gets allocated in .bss or something rather than
a dynamically allocated page that's just never freed.
> menuconfig TRANSPARENT_HUGEPAGE
> bool "Transparent Hugepage Support"
> depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
> diff --git a/mm/memory.c b/mm/memory.c
> index 11edc4d66e74..ab8c16d04307 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -203,9 +203,17 @@ static void put_huge_zero_page(void)
> BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
> }
>
> +/*
> + * If STATIC_PMD_ZERO_PAGE is enabled, @mm can be NULL, i.e, the huge_zero_folio
> + * is not associated with any mm_struct.
> +*/
I get that callers have to handle failure. But isn't this pretty nasty
for mm==NULL callers to be *guaranteed* to fail? They'll generate code
for the success case that will never even run.
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-05-27 16:37 ` Dave Hansen
@ 2025-05-27 18:00 ` Pankaj Raghav (Samsung)
2025-05-27 18:30 ` Dave Hansen
2025-05-28 20:36 ` David Hildenbrand
0 siblings, 2 replies; 15+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-27 18:00 UTC (permalink / raw)
To: Dave Hansen
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka,
Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
Thomas Gleixner, Nico Pache, Dev Jain, Liam R . Howlett,
Jens Axboe, linux-kernel, linux-mm, linux-block, willy, x86,
linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev, hch
On Tue, May 27, 2025 at 09:37:50AM -0700, Dave Hansen wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 055204dc211d..96f99b4f96ea 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -152,6 +152,7 @@ config X86
> > select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
> > select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
> > select ARCH_WANTS_THP_SWAP if X86_64
> > + select ARCH_WANTS_STATIC_PMD_ZERO_PAGE if X86_64
>
> I don't think this should be the default. There are lots of little
> x86_64 VMs sitting around and 2MB might be significant to them.
This is the feedback I wanted. I will make it optional.
> > +config ARCH_WANTS_STATIC_PMD_ZERO_PAGE
> > + bool
> > +
> > +config STATIC_PMD_ZERO_PAGE
> > + def_bool y
> > + depends on ARCH_WANTS_STATIC_PMD_ZERO_PAGE
> > + help
> > + Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
> > + on demand and deallocated when not in use. This option will always
> > + allocate huge_zero_folio for zeroing and it is never deallocated.
> > + Not suitable for memory constrained systems.
>
> "Static" seems like a weird term to use for this. I was really expecting
> to see a 2MB object that gets allocated in .bss or something rather than
> a dynamically allocated page that's just never freed.
My first proposal was along those lines[1] (sorry, I messed up the
version numbering while sending the patches). David Hildenbrand suggested
leveraging the infrastructure we already have in huge_memory.c.
>
> > menuconfig TRANSPARENT_HUGEPAGE
> > bool "Transparent Hugepage Support"
> > depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 11edc4d66e74..ab8c16d04307 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -203,9 +203,17 @@ static void put_huge_zero_page(void)
> > BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
> > }
> >
> > +/*
> > + * If STATIC_PMD_ZERO_PAGE is enabled, @mm can be NULL, i.e, the huge_zero_folio
> > + * is not associated with any mm_struct.
> > +*/
>
> I get that callers have to handle failure. But isn't this pretty nasty
> for mm==NULL callers to be *guaranteed* to fail? They'll generate code
> for the success case that will never even run.
>
The idea was to still have dynamic allocation possible even if this
config was disabled.
You are right that if this config is disabled, the callers with NULL mm
struct are guaranteed to fail, but we are not generating extra code
because there are still users who want dynamic allocation.
Do you think it is better to have the code inside an #ifdef instead
of using the IS_ENABLED primitive?
[1] https://lore.kernel.org/linux-fsdevel/20250516101054.676046-2-p.raghav@samsung.com/
--
Pankaj
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-05-27 18:00 ` Pankaj Raghav (Samsung)
@ 2025-05-27 18:30 ` Dave Hansen
2025-05-27 19:28 ` Pankaj Raghav (Samsung)
2025-05-28 20:36 ` David Hildenbrand
1 sibling, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2025-05-27 18:30 UTC (permalink / raw)
To: Pankaj Raghav (Samsung)
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka,
Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
Thomas Gleixner, Nico Pache, Dev Jain, Liam R . Howlett,
Jens Axboe, linux-kernel, linux-mm, linux-block, willy, x86,
linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev, hch
On 5/27/25 11:00, Pankaj Raghav (Samsung) wrote:
>> I get that callers have to handle failure. But isn't this pretty nasty
>> for mm==NULL callers to be *guaranteed* to fail? They'll generate code
>> for the success case that will never even run.
>>
> The idea was to still have dynamic allocation possible even if this
> config was disabled.
I don't really understand what you are trying to say here.
> You are right that if this config is disabled, the callers with NULL mm
> struct are guaranteed to fail, but we are not generating extra code
> because there are still users who want dynamic allocation.
I'm pretty sure you're making the compiler generate unnecessary code.
Think of this:
if (mm_get_huge_zero_folio(mm))
foo();
else
bar();
With the static zero page, foo() is always called. But bar() is dead
code. The compiler doesn't know that, so it will generate both sides of
the if().
If you can get the CONFIG_... option checks into the header, the
compiler can figure it out and not even generate the call to bar().
> Do you think it is better to have the code inside an #ifdef instead
> of using the IS_ENABLED primitive?
It has nothing to do with an #ifdef versus IS_ENABLED(). It has to do
with the compiler having visibility into how mm_get_huge_zero_folio()
works enough to optimize its callers.
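As an illustration, something along these lines would give it that
visibility (hypothetical sketch; the __-prefixed out-of-line helper is
made up here):

	/* include/linux/mm.h */
	struct folio *__mm_get_huge_zero_folio(struct mm_struct *mm);

	static inline struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
	{
		/* Folds to a plain load of huge_zero_folio when the option is on. */
		if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
			return READ_ONCE(huge_zero_folio);
		return __mm_get_huge_zero_folio(mm);
	}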
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-05-27 18:30 ` Dave Hansen
@ 2025-05-27 19:28 ` Pankaj Raghav (Samsung)
0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav (Samsung) @ 2025-05-27 19:28 UTC (permalink / raw)
To: Dave Hansen
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka,
Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko,
David Hildenbrand, Lorenzo Stoakes, Andrew Morton,
Thomas Gleixner, Nico Pache, Dev Jain, Liam R . Howlett,
Jens Axboe, linux-kernel, linux-mm, linux-block, willy, x86,
linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev, hch
> > You are right that if this config is disabled, the callers with NULL mm
> > struct are guaranteed to fail, but we are not generating extra code
> > because there are still users who want dynamic allocation.
>
> I'm pretty sure you're making the compiler generate unnecessary code.
> Think of this:
>
> if (mm_get_huge_zero_folio(mm))
> foo();
> else
> bar();
>
> With the static zero page, foo() is always called. But bar() is dead
> code. The compiler doesn't know that, so it will generate both sides of
> the if().
>
Ahh, yeah you are right. I was thinking about the callee and not the
caller.
> If you can get the CONFIG_... option checks into the header, the
> compiler can figure it out and not even generate the call to bar().
Got it. I will keep this in mind before sending the next version.
> > Do you think it is better to have the code with inside an #ifdef instead
> > of using the IS_ENABLED primitive?
> It has nothing to do with an #ifdef versus IS_ENABLED(). It has to do
> with the compiler having visibility into how mm_get_huge_zero_folio()
> works enough to optimize its callers.
I think something like this should give some visibility to the compiler:
struct folio *huge_zero_folio __read_mostly;
...
#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
struct folio *mm_get_huge_zero_folio(...)
{
	return READ_ONCE(huge_zero_folio);
}
#else
struct folio *mm_get_huge_zero_folio(...)
{
	<old-code>
}
#endif
But I am not sure if the compiler can assume that the static
huge_zero_folio variable will be non-NULL. It will be interesting to
check that in the output.
--
Pankaj
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-05-27 18:00 ` Pankaj Raghav (Samsung)
2025-05-27 18:30 ` Dave Hansen
@ 2025-05-28 20:36 ` David Hildenbrand
1 sibling, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-05-28 20:36 UTC (permalink / raw)
To: Pankaj Raghav (Samsung), Dave Hansen
Cc: Pankaj Raghav, Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka,
Baolin Wang, Borislav Petkov, Ingo Molnar, H . Peter Anvin,
Zi Yan, Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm, linux-block,
willy, x86, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev,
hch
On 27.05.25 20:00, Pankaj Raghav (Samsung) wrote:
> On Tue, May 27, 2025 at 09:37:50AM -0700, Dave Hansen wrote:
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 055204dc211d..96f99b4f96ea 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -152,6 +152,7 @@ config X86
>>> select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
>>> select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
>>> select ARCH_WANTS_THP_SWAP if X86_64
>>> + select ARCH_WANTS_STATIC_PMD_ZERO_PAGE if X86_64
>>
>> I don't think this should be the default. There are lots of little
>> x86_64 VMs sitting around and 2MB might be significant to them.
>
> This is the feedback I wanted. I will make it optional.
>
>>> +config ARCH_WANTS_STATIC_PMD_ZERO_PAGE
>>> + bool
>>> +
>>> +config STATIC_PMD_ZERO_PAGE
>>> + def_bool y
>>> + depends on ARCH_WANTS_STATIC_PMD_ZERO_PAGE
>>> + help
>>> + Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
>>> + on demand and deallocated when not in use. This option will always
>>> + allocate huge_zero_folio for zeroing and it is never deallocated.
>>> + Not suitable for memory constrained systems.
>>
>> "Static" seems like a weird term to use for this. I was really expecting
>> to see a 2MB object that gets allocated in .bss or something rather than
>> a dynamically allocated page that's just never freed.
>
> My first proposal was along those lines[1] (sorry, I messed up the
> version numbering while sending the patches). David Hildenbrand suggested
> leveraging the infrastructure we already have in huge_memory.c.
Sorry, maybe I was not 100% clear.
We could either
a) Allocate it statically in bss and reuse it for huge_memory purposes
(static vs. dynamic is a good fit)
b) Allocate it during early boot and never free it.
Assuming we allocate it from memblock, it's almost static ... :)
I would not allocate it at runtime later when requested. Then, "static"
is really a suboptimal fit.
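For a), the rough shape would be something like this (hypothetical sketch,
names illustrative; assumes PMD_SIZE is a compile-time constant, as on
x86-64, and leaves out setting up the backing struct pages as a single
high-order folio):

	/* .bss is zeroed at boot, so the contents are already all zeroes. */
	static char huge_zero_page_data[PMD_SIZE] __aligned(PMD_SIZE);

	static int __init huge_zero_folio_init(void)
	{
		huge_zero_folio = virt_to_folio(huge_zero_page_data);
		huge_zero_pfn = folio_pfn(huge_zero_folio);
		return 0;
	}
	early_initcall(huge_zero_folio_init);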
--
Cheers,
David / dhildenb
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-05-27 5:04 ` [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option Pankaj Raghav
2025-05-27 16:37 ` Dave Hansen
@ 2025-06-02 5:03 ` Christoph Hellwig
2025-06-02 14:49 ` Pankaj Raghav
1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2025-06-02 5:03 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
Dev Jain, Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm,
linux-block, willy, x86, linux-fsdevel, Darrick J . Wong, mcgrof,
gost.dev, kernel, hch
Should this say FOLIO instead of PAGE in the config option to match
the symbol protected by it?
* Re: [RFC 3/3] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
2025-05-27 5:04 ` [RFC 3/3] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages() Pankaj Raghav
@ 2025-06-02 5:05 ` Christoph Hellwig
2025-06-02 15:34 ` Pankaj Raghav
0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2025-06-02 5:05 UTC (permalink / raw)
To: Pankaj Raghav
Cc: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
Dev Jain, Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm,
linux-block, willy, x86, linux-fsdevel, Darrick J . Wong, mcgrof,
gost.dev, kernel, hch
On Tue, May 27, 2025 at 07:04:52AM +0200, Pankaj Raghav wrote:
> Noticed a 4% increase in performance on a commercial NVMe SSD which does
> not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
> gains might be bigger if the device supports bigger MDTS.
Impressive gain on the one hand - on the other hand, what is the macro
workload that does a lot of zeroing on an SSD? Because avoiding that
should yield even better results while reducing wear...
> + unsigned int len, added = 0;
>
> + len = min_t(sector_t, folio_size(zero_folio),
> + nr_sects << SECTOR_SHIFT);
> + if (bio_add_folio(bio, zero_folio, len, 0))
> + added = len;
> if (added < len)
> break;
> nr_sects -= added >> SECTOR_SHIFT;
Unless I'm missing something the added variable can go away now, and
the code using it can simply use len.
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-06-02 5:03 ` Christoph Hellwig
@ 2025-06-02 14:49 ` Pankaj Raghav
2025-06-02 20:28 ` David Hildenbrand
0 siblings, 1 reply; 15+ messages in thread
From: Pankaj Raghav @ 2025-06-02 14:49 UTC (permalink / raw)
To: Christoph Hellwig, Pankaj Raghav
Cc: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
Dev Jain, Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm,
linux-block, willy, x86, linux-fsdevel, Darrick J . Wong, mcgrof,
gost.dev
On 6/2/25 07:03, Christoph Hellwig wrote:
> Should this say FOLIO instead of PAGE in the config option to match
> the symbol protected by it?
>
I am still discussing with David how the final implementation should look. But I will
change _PAGE to _FOLIO, as that is what we would like to expose in the end.
--
Pankaj
* Re: [RFC 3/3] block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
2025-06-02 5:05 ` Christoph Hellwig
@ 2025-06-02 15:34 ` Pankaj Raghav
0 siblings, 0 replies; 15+ messages in thread
From: Pankaj Raghav @ 2025-06-02 15:34 UTC (permalink / raw)
To: Christoph Hellwig, Pankaj Raghav
Cc: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, David Hildenbrand,
Lorenzo Stoakes, Andrew Morton, Thomas Gleixner, Nico Pache,
Dev Jain, Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm,
linux-block, willy, x86, linux-fsdevel, Darrick J . Wong, mcgrof,
gost.dev
On 6/2/25 07:05, Christoph Hellwig wrote:
> On Tue, May 27, 2025 at 07:04:52AM +0200, Pankaj Raghav wrote:
>> Noticed a 4% increase in performance on a commercial NVMe SSD which does
>> not support OP_WRITE_ZEROES. The device's MDTS was 128K. The performance
>> gains might be bigger if the device supports bigger MDTS.
>
> Impressive gain on the one hand - on the other hand, what is the macro
> workload that does a lot of zeroing on an SSD? Because avoiding that
> should yield even better results while reducing wear...
>
Absolutely. I think it is better to use either WRITE_ZEROES or DISCARD. But I wanted
to have some measurable workload to show the benefits of using a huge page to zero out.
Interestingly, I have seen many client SSDs not implementing WRITE_ZEROES.
>> + unsigned int len, added = 0;
>>
>> + len = min_t(sector_t, folio_size(zero_folio),
>> + nr_sects << SECTOR_SHIFT);
>> + if (bio_add_folio(bio, zero_folio, len, 0))
>> + added = len;
>> if (added < len)
>> break;
>> nr_sects -= added >> SECTOR_SHIFT;
>
> Unless I'm missing something the added variable can go away now, and
> the code using it can simply use len.
>
Yes. This should do it.
if (!bio_add_folio(bio, zero_folio, len, 0))
break;
nr_sects -= len >> SECTOR_SHIFT;
--
Pankaj
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-06-02 14:49 ` Pankaj Raghav
@ 2025-06-02 20:28 ` David Hildenbrand
2025-06-02 20:32 ` David Hildenbrand
0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-06-02 20:28 UTC (permalink / raw)
To: Pankaj Raghav, Christoph Hellwig, Pankaj Raghav
Cc: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm, linux-block,
willy, x86, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev
On 02.06.25 16:49, Pankaj Raghav wrote:
> On 6/2/25 07:03, Christoph Hellwig wrote:
>> Should this say FOLIO instead of PAGE in the config option to match
>> the symbol protected by it?
>>
> I am still discussing with David how the final implementation should look. But I will
> change _PAGE to _FOLIO, as that is what we would like to expose in the end.
It's a huge page, represented internally as a folio. No strong opinion,
as ...
MMF_HUGE_ZERO_PAGE vs. mm_get_huge_zero_folio vs. get_huge_zero_page vs
... :)
--
Cheers,
David / dhildenb
* Re: [RFC 2/3] mm: add STATIC_PMD_ZERO_PAGE config option
2025-06-02 20:28 ` David Hildenbrand
@ 2025-06-02 20:32 ` David Hildenbrand
0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-06-02 20:32 UTC (permalink / raw)
To: Pankaj Raghav, Christoph Hellwig, Pankaj Raghav
Cc: Suren Baghdasaryan, Ryan Roberts, Vlastimil Babka, Baolin Wang,
Borislav Petkov, Ingo Molnar, H . Peter Anvin, Zi Yan,
Mike Rapoport, Dave Hansen, Michal Hocko, Lorenzo Stoakes,
Andrew Morton, Thomas Gleixner, Nico Pache, Dev Jain,
Liam R . Howlett, Jens Axboe, linux-kernel, linux-mm, linux-block,
willy, x86, linux-fsdevel, Darrick J . Wong, mcgrof, gost.dev
On 02.06.25 22:28, David Hildenbrand wrote:
> On 02.06.25 16:49, Pankaj Raghav wrote:
>> On 6/2/25 07:03, Christoph Hellwig wrote:
>>> Should this say FOLIO instead of PAGE in the config option to match
>>> the symbol protected by it?
>>>
>> I am still discussing with David how the final implementation should look. But I will
>> change _PAGE to _FOLIO, as that is what we would like to expose in the end.
>
> It's a huge page, represented internally as a folio. No strong opinion,
> as ...
>
> MMF_HUGE_ZERO_PAGE vs. mm_get_huge_zero_folio vs. get_huge_zero_page vs
> ... :)
Just to add, the existing one is exposed (configurable) to the user through
/sys/kernel/mm/transparent_hugepage/use_zero_page
combined with the implicit size through
/sys/kernel/mm/transparent_hugepage/hpage_pmd_size
--
Cheers,
David / dhildenb