* [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
[not found] <20250821200701.1329277-1-david@redhat.com>
@ 2025-08-21 20:06 ` David Hildenbrand
2025-08-21 20:23 ` Zi Yan
2025-08-22 17:07 ` SeongJae Park
2025-08-21 20:06 ` [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() David Hildenbrand
` (25 subsequent siblings)
26 siblings, 2 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
Let's reject them early, which in turn makes folio_alloc_gigantic() reject
them properly.
To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
and calculate MAX_FOLIO_NR_PAGES based on that.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 6 ++++--
mm/page_alloc.c | 5 ++++-
2 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00c8a54127d37..77737cbf2216a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio)
/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER)
+#define MAX_FOLIO_ORDER PUD_ORDER
#else
-#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES
+#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
#endif
+#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
+
/*
* compound_nr() returns the number of pages in this potentially compound
* page. compound_nr() can be called on a tail page, and is defined to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca9e6b9633f79..1e6ae4c395b30 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
int alloc_contig_range_noprof(unsigned long start, unsigned long end,
acr_flags_t alloc_flags, gfp_t gfp_mask)
{
+ const unsigned int order = ilog2(end - start);
unsigned long outer_start, outer_end;
int ret = 0;
@@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
PB_ISOLATE_MODE_CMA_ALLOC :
PB_ISOLATE_MODE_OTHER;
+ if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER))
+ return -EINVAL;
+
gfp_mask = current_gfp_context(gfp_mask);
if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask))
return -EINVAL;
@@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
free_contig_range(end, outer_end - end);
} else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
struct page *head = pfn_to_page(start);
- int order = ilog2(end - start);
check_new_pages(head, order);
prep_new_page(head, order, gfp_mask, 0);
--
2.50.1
* [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages()
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
Let's reject unreasonable folio sizes early, where we can still fail.
We'll add sanity checks to prepare_compound_head/prepare_compound_page
next.
Is there a way to configure a system such that unreasonable folio sizes
would be possible? Such a configuration would already be rather
questionable. If so, we'd probably want to bail out even earlier, where we
could avoid the WARN and instead report a proper error message that
indicates what went wrong.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/memremap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/memremap.c b/mm/memremap.c
index b0ce0d8254bd8..a2d4bb88f64b6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -275,6 +275,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
return ERR_PTR(-EINVAL);
+ if (WARN_ONCE(pgmap->vmemmap_shift > MAX_FOLIO_ORDER,
+ "requested folio size unsupported\n"))
+ return ERR_PTR(-EINVAL);
switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
--
2.50.1
* [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
Let's check that no hstate that corresponds to an unreasonable folio size
is registered by an architecture. If we were to succeed registering, we
could later try allocating an unsupported gigantic folio size.
Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER
is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have
to use a BUILD_BUG_ON_INVALID() to make it compile.
No existing kernel configuration should be able to trigger this check:
either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or
gigantic folios will not exceed a memory section (the case on sh).
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/hugetlb.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 514fab5a20ef8..d12a9d5146af4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
__NR_HPAGEFLAGS);
+ BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
if (!hugepages_supported()) {
if (hugetlb_max_hstate || default_hstate_max_huge_pages)
@@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
}
BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
+ WARN_ON(order > MAX_FOLIO_ORDER);
h = &hstates[hugetlb_max_hstate++];
__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
h->order = order;
--
2.50.1
* [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page()
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
Grepping for "prep_compound_page" leaves one clueless as to how devdax gets
its compound pages initialized.
Let's add a comment that might help finding this open-coded
prep_compound_page() initialization more easily.
Further, let's be less smart about the ordering of initialization and just
perform the prep_compound_head() call after all tail pages were
initialized: just like prep_compound_page() does.
No need for a lengthy comment then: again, just like prep_compound_page().
Note that prep_compound_head() already initializes stuff in page[2]
that successive tail page initialization will overwrite: _deferred_list,
and on 32bit _entire_mapcount and
_pincount. Very likely 32bit does not apply, and likely nobody ever ends
up testing whether the _deferred_list is empty.
So it shouldn't be a fix at this point, but certainly something to clean
up.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/mm_init.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5c21b3af216b2..708466c5b2cc9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head,
unsigned long pfn, end_pfn = head_pfn + nr_pages;
unsigned int order = pgmap->vmemmap_shift;
+ /*
+ * This is an open-coded prep_compound_page() whereby we avoid
+ * walking pages twice by initializing them in the same go.
+ */
__SetPageHead(head);
for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
struct page *page = pfn_to_page(pfn);
@@ -1098,15 +1102,8 @@ static void __ref memmap_init_compound(struct page *head,
__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
prep_compound_tail(head, pfn - head_pfn);
set_page_count(page, 0);
-
- /*
- * The first tail page stores important compound page info.
- * Call prep_compound_head() after the first tail page has
- * been initialized, to not have the data overwritten.
- */
- if (pfn == head_pfn + 1)
- prep_compound_head(head, order);
}
+ prep_compound_head(head, order);
}
void __ref memmap_init_zone_device(struct zone *zone,
--
2.50.1
* [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
All pages were already initialized and set to PageReserved() with a
refcount of 1 by MM init code.
In fact, by using __init_single_page(), we will be setting the refcount to
1 just to freeze it again immediately afterwards.
So drop the __init_single_page() and use __ClearPageReserved() instead.
Adjust the comments to highlight that we are dealing with an open-coded
prep_compound_page() variant.
Further, as we can now safely iterate over all pages in a folio, let's
avoid the page-pfn dance and just iterate the pages directly.
Note that the current code was likely problematic, but we never ran into
it: prep_compound_tail() would have been called with an offset that might
exceed a memory section, and prep_compound_tail() would have simply
added that offset to the page pointer -- which would not have done the
right thing on sparsemem without vmemmap.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/hugetlb.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d12a9d5146af4..ae82a845b14ad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3235,17 +3235,14 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
unsigned long start_page_number,
unsigned long end_page_number)
{
- enum zone_type zone = zone_idx(folio_zone(folio));
- int nid = folio_nid(folio);
- unsigned long head_pfn = folio_pfn(folio);
- unsigned long pfn, end_pfn = head_pfn + end_page_number;
+ struct page *head_page = folio_page(folio, 0);
+ struct page *page = folio_page(folio, start_page_number);
+ unsigned long i;
int ret;
- for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
- struct page *page = pfn_to_page(pfn);
-
- __init_single_page(page, pfn, zone, nid);
- prep_compound_tail((struct page *)folio, pfn - head_pfn);
+ for (i = start_page_number; i < end_page_number; i++, page++) {
+ __ClearPageReserved(page);
+ prep_compound_tail(head_page, i);
ret = page_ref_freeze(page, 1);
VM_BUG_ON(!ret);
}
@@ -3257,12 +3254,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
{
int ret;
- /* Prepare folio head */
+ /*
+ * This is an open-coded prep_compound_page() whereby we avoid
+ * walking pages twice by preparing+freezing them in the same go.
+ */
__folio_clear_reserved(folio);
__folio_set_head(folio);
ret = folio_ref_freeze(folio, 1);
VM_BUG_ON(!ret);
- /* Initialize the necessary tail struct pages */
hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
prep_compound_head((struct page *)folio, huge_page_order(h));
}
--
2.50.1
* [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order()
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
Let's sanity-check in folio_set_order() whether we would be trying to
create a folio with an order that would make it exceed MAX_FOLIO_ORDER.
This will enable the check whenever a folio/compound page is initialized
through prepare_compound_head() / prepare_compound_page().
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/internal.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc030..946ce97036d67 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
{
if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
return;
+ VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER);
folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
#ifdef NR_PAGES_IN_LARGE_FOLIO
--
2.50.1
* [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
Let's limit the maximum folio size in problematic kernel config where
the memmap is allocated per memory section (SPARSEMEM without
SPARSEMEM_VMEMMAP) to a single memory section.
Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE
but not SPARSEMEM_VMEMMAP: sh.
Fortunately, the biggest hugetlb size sh supports is 64 MiB
(HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
(SECTION_SIZE_BITS == 26), so its use case is not degraded.
As folios and memory sections are naturally aligned to their power-of-2
size in memory, a single folio can consequently no longer span multiple
memory sections on these problematic kernel configs.
nth_page() is no longer required when operating within a single compound
page / folio.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77737cbf2216a..48a985e17ef4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
return folio_large_nr_pages(folio);
}
-/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-#define MAX_FOLIO_ORDER PUD_ORDER
-#else
+#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
+/*
+ * We don't expect any folios that exceed buddy sizes (and consequently
+ * memory sections).
+ */
#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
+#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Only pages within a single memory section are guaranteed to be
+ * contiguous. By limiting folios to a single memory section, all folio
+ * pages are guaranteed to be contiguous.
+ */
+#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
+#else
+/*
+ * There is no real limit on the folio size. We limit them to the maximum we
+ * currently expect.
+ */
+#define MAX_FOLIO_ORDER PUD_ORDER
#endif
#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
--
2.50.1
* [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx()
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
Now that a single folio/compound page can no longer span memory sections
in problematic kernel configurations, we can stop using nth_page().
While at it, turn both macros into static inline functions and add
kernel doc for folio_page_idx().
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 16 ++++++++++++++--
include/linux/page-flags.h | 5 ++++-
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 48a985e17ef4e..ef360b72cb05c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
-#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio))
#else
#define nth_page(page,n) ((page) + (n))
-#define folio_page_idx(folio, p) ((p) - &(folio)->page)
#endif
/* to align the pointer to the (next) page boundary */
@@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes;
/* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
#define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
+/**
+ * folio_page_idx - Return the number of a page in a folio.
+ * @folio: The folio.
+ * @page: The folio page.
+ *
+ * This function expects that the page is actually part of the folio.
+ * The returned number is relative to the start of the folio.
+ */
+static inline unsigned long folio_page_idx(const struct folio *folio,
+ const struct page *page)
+{
+ return page - &folio->page;
+}
+
static inline struct folio *lru_to_folio(struct list_head *head)
{
return list_entry((head)->prev, struct folio, lru);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d53a86e68c89b..080ad10c0defc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page)
* check that the page number lies within @folio; the caller is presumed
* to have a reference to the page.
*/
-#define folio_page(folio, n) nth_page(&(folio)->page, n)
+static inline struct page *folio_page(struct folio *folio, unsigned long nr)
+{
+ return &folio->page + nr;
+}
static __always_inline int PageTail(const struct page *page)
{
--
2.50.1
* [PATCH RFC 14/35] mm/mm/percpu-km: drop nth_page() usage within single allocation
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
We're allocating a higher-order page from the buddy. For these pages
(that are guaranteed to not exceed a single memory section) there is no
need to use nth_page().
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/percpu-km.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index fe31aa19db81a..4efa74a495cb6 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -69,7 +69,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
}
for (i = 0; i < nr_pages; i++)
- pcpu_set_page_chunk(nth_page(pages, i), chunk);
+ pcpu_set_page_chunk(pages + i, chunk);
chunk->data = pages;
chunk->base_addr = page_address(pages);
--
2.50.1
* [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison()
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
nth_page() is not really required here anymore, so let's remove it.
While at it, clean up and simplify the code a bit.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
fs/hugetlbfs/inode.c | 25 ++++++++-----------------
1 file changed, 8 insertions(+), 17 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 34d496a2b7de6..dc981509a7717 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -198,31 +198,22 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
static size_t adjust_range_hwpoison(struct folio *folio, size_t offset,
size_t bytes)
{
- struct page *page;
- size_t n = 0;
- size_t res = 0;
+ struct page *page = folio_page(folio, offset / PAGE_SIZE);
+ size_t n, safe_bytes;
- /* First page to start the loop. */
- page = folio_page(folio, offset / PAGE_SIZE);
offset %= PAGE_SIZE;
- while (1) {
+ for (safe_bytes = 0; safe_bytes < bytes; safe_bytes += n) {
+
if (is_raw_hwpoison_page_in_hugepage(page))
break;
/* Safe to read n bytes without touching HWPOISON subpage. */
- n = min(bytes, (size_t)PAGE_SIZE - offset);
- res += n;
- bytes -= n;
- if (!bytes || !n)
- break;
- offset += n;
- if (offset == PAGE_SIZE) {
- page = nth_page(page, 1);
- offset = 0;
- }
+ n = min(bytes - safe_bytes, (size_t)PAGE_SIZE - offset);
+ offset = 0;
+ page++;
}
- return res;
+ return safe_bytes;
}
/*
--
2.50.1
* [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
It's no longer required to use nth_page() within a folio, so let's just
drop the nth_page() in folio_walk_start().
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/pagewalk.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c6753d370ff4e..9e4225e5fcf5c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -1004,7 +1004,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
found:
if (expose_page)
/* Note: Offset from the mapped page, not the folio start. */
- fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT);
+ fw->page = page + ((addr & (entry_size - 1)) >> PAGE_SHIFT);
else
fw->page = NULL;
fw->ptl = ptl;
--
2.50.1
* [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages
[not found] <20250821200701.1329277-1-david@redhat.com>
` (10 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage David Hildenbrand
` (14 subsequent siblings)
26 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
nth_page() is no longer required when iterating over pages within a
single folio, so let's just drop it when recording subpages.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/gup.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index b2a78f0291273..f017ff6d7d61a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -491,9 +491,9 @@ static int record_subpages(struct page *page, unsigned long sz,
struct page *start_page;
int nr;
- start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT);
+ start_page = page + ((addr & (sz - 1)) >> PAGE_SHIFT);
for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
- pages[nr] = nth_page(start_page, nr);
+ pages[nr] = start_page + nr;
return nr;
}
@@ -1512,7 +1512,7 @@ static long __get_user_pages(struct mm_struct *mm,
}
for (j = 0; j < page_increm; j++) {
- subpage = nth_page(page, j);
+ subpage = page + j;
pages[i + j] = subpage;
flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
flush_dcache_page(subpage);
--
2.50.1
* [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
[not found] <20250821200701.1329277-1-david@redhat.com>
` (11 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
[not found] ` <b5b08ad3-d8cd-45ff-9767-7cf1b22b5e03@gmail.com>
2025-08-21 20:06 ` [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio David Hildenbrand
` (13 subsequent siblings)
26 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Jens Axboe, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
We always provide a single dst page; it's unclear why the io_copy_cache
complexity is required.
So let's simplify and get rid of "struct io_copy_cache", simply working on
the single page.
... which immediately allows us to drop one nth_page() usage,
because it's really just a single page.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
io_uring/zcrx.c | 32 +++++++-------------------------
1 file changed, 7 insertions(+), 25 deletions(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index e5ff49f3425e0..f29b2a4867516 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -954,29 +954,18 @@ static struct net_iov *io_zcrx_alloc_fallback(struct io_zcrx_area *area)
return niov;
}
-struct io_copy_cache {
- struct page *page;
- unsigned long offset;
- size_t size;
-};
-
-static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
+static ssize_t io_copy_page(struct page *dst_page, struct page *src_page,
unsigned int src_offset, size_t len)
{
- size_t copied = 0;
+ size_t dst_offset = 0;
- len = min(len, cc->size);
+ len = min(len, PAGE_SIZE);
while (len) {
void *src_addr, *dst_addr;
- struct page *dst_page = cc->page;
- unsigned dst_offset = cc->offset;
size_t n = len;
- if (folio_test_partial_kmap(page_folio(dst_page)) ||
- folio_test_partial_kmap(page_folio(src_page))) {
- dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE);
- dst_offset = offset_in_page(dst_offset);
+ if (folio_test_partial_kmap(page_folio(src_page))) {
src_page = nth_page(src_page, src_offset / PAGE_SIZE);
src_offset = offset_in_page(src_offset);
n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset);
@@ -991,12 +980,10 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
kunmap_local(src_addr);
kunmap_local(dst_addr);
- cc->size -= n;
- cc->offset += n;
+ dst_offset += n;
len -= n;
- copied += n;
}
- return copied;
+ return dst_offset;
}
static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
@@ -1011,7 +998,6 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
return -EFAULT;
while (len) {
- struct io_copy_cache cc;
struct net_iov *niov;
size_t n;
@@ -1021,11 +1007,7 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
break;
}
- cc.page = io_zcrx_iov_page(niov);
- cc.offset = 0;
- cc.size = PAGE_SIZE;
-
- n = io_copy_page(&cc, src_page, src_offset, len);
+ n = io_copy_page(io_zcrx_iov_page(niov), src_page, src_offset, len);
if (!io_zcrx_queue_cqe(req, niov, ifq, 0, n)) {
io_zcrx_return_niov(niov);
--
2.50.1
* [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio
[not found] <20250821200701.1329277-1-david@redhat.com>
` (12 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() David Hildenbrand
` (12 subsequent siblings)
26 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Jens Axboe, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
Within a folio/compound page, nth_page() is no longer required.
Given that we call folio_test_partial_kmap()+kmap_local_page(), the code
would already be problematic if the src pages were to span multiple folios.
So let's just assume that all src pages belong to a single
folio/compound page and can be iterated ordinarily.
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
io_uring/zcrx.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index f29b2a4867516..107b2a1b31c1c 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -966,7 +966,7 @@ static ssize_t io_copy_page(struct page *dst_page, struct page *src_page,
size_t n = len;
if (folio_test_partial_kmap(page_folio(src_page))) {
- src_page = nth_page(src_page, src_offset / PAGE_SIZE);
+ src_page += src_offset / PAGE_SIZE;
src_offset = offset_in_page(src_offset);
n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset);
n = min(n, len);
--
2.50.1
* [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages()
[not found] <20250821200701.1329277-1-david@redhat.com>
` (13 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges David Hildenbrand
` (11 subsequent siblings)
26 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Thomas Bogendoerfer, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
Let's make it clearer that we are operating within a single folio by
providing both the folio and the page.
This implies that for flush_dcache_folio() we'll now avoid one more
page->folio lookup, and that we can safely drop the "nth_page" usage.
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/mips/include/asm/cacheflush.h | 11 +++++++----
arch/mips/mm/cache.c | 8 ++++----
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h
index 1f14132b3fc98..8a2de28936e07 100644
--- a/arch/mips/include/asm/cacheflush.h
+++ b/arch/mips/include/asm/cacheflush.h
@@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm);
extern void (*flush_cache_range)(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn);
-extern void __flush_dcache_pages(struct page *page, unsigned int nr);
+extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr);
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
static inline void flush_dcache_folio(struct folio *folio)
{
if (cpu_has_dc_aliases)
- __flush_dcache_pages(&folio->page, folio_nr_pages(folio));
+ __flush_dcache_folio_pages(folio, folio_page(folio, 0),
+ folio_nr_pages(folio));
else if (!cpu_has_ic_fills_f_dc)
folio_set_dcache_dirty(folio);
}
@@ -64,10 +65,12 @@ static inline void flush_dcache_folio(struct folio *folio)
static inline void flush_dcache_page(struct page *page)
{
+ struct folio *folio = page_folio(page);
+
if (cpu_has_dc_aliases)
- __flush_dcache_pages(page, 1);
+ __flush_dcache_folio_pages(folio, page, folio_nr_pages(folio));
else if (!cpu_has_ic_fills_f_dc)
- folio_set_dcache_dirty(page_folio(page));
+ folio_set_dcache_dirty(folio);
}
#define flush_dcache_mmap_lock(mapping) do { } while (0)
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index bf9a37c60e9f0..e3b4224c9a406 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -99,9 +99,9 @@ SYSCALL_DEFINE3(cacheflush, unsigned long, addr, unsigned long, bytes,
return 0;
}
-void __flush_dcache_pages(struct page *page, unsigned int nr)
+void __flush_dcache_folio_pages(struct folio *folio, struct page *page,
+ unsigned int nr)
{
- struct folio *folio = page_folio(page);
struct address_space *mapping = folio_flush_mapping(folio);
unsigned long addr;
unsigned int i;
@@ -117,12 +117,12 @@ void __flush_dcache_pages(struct page *page, unsigned int nr)
* get faulted into the tlb (and thus flushed) anyways.
*/
for (i = 0; i < nr; i++) {
- addr = (unsigned long)kmap_local_page(nth_page(page, i));
+ addr = (unsigned long)kmap_local_page(page + i);
flush_data_cache_page(addr);
kunmap_local((void *)addr);
}
}
-EXPORT_SYMBOL(__flush_dcache_pages);
+EXPORT_SYMBOL(__flush_dcache_folio_pages);
void __flush_anon_page(struct page *page, unsigned long vmaddr)
{
--
2.50.1
* [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
[not found] <20250821200701.1329277-1-david@redhat.com>
` (14 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
[not found] ` <aK2QZnzS1ErHK5tP@raptor>
2025-08-21 20:06 ` [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() David Hildenbrand
` (10 subsequent siblings)
26 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
Let's disallow handing out PFN ranges with non-contiguous pages, so we
can remove the nth_page() usage in __cma_alloc(), and so callers don't
have to worry about that either when blindly iterating pages.
This is really only a problem in configs with SPARSEMEM but without
SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some
cases.
Will this cause harm? Probably not, because it's mostly 32bit that does
not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could
look into allocating the memmap for the memory sections spanned by a
single CMA region in one go from memblock.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 6 ++++++
mm/cma.c | 36 +++++++++++++++++++++++-------------
mm/util.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 62 insertions(+), 13 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360b72cb05c..f59ad1f9fc792 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+bool page_range_contiguous(const struct page *page, unsigned long nr_pages);
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
#else
#define nth_page(page,n) ((page) + (n))
+static inline bool page_range_contiguous(const struct page *page,
+ unsigned long nr_pages)
+{
+ return true;
+}
#endif
/* to align the pointer to the (next) page boundary */
diff --git a/mm/cma.c b/mm/cma.c
index 2ffa4befb99ab..1119fa2830008 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
unsigned long count, unsigned int align,
struct page **pagep, gfp_t gfp)
{
- unsigned long mask, offset;
- unsigned long pfn = -1;
- unsigned long start = 0;
unsigned long bitmap_maxno, bitmap_no, bitmap_count;
+ unsigned long start, pfn, mask, offset;
int ret = -EBUSY;
struct page *page = NULL;
@@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
if (bitmap_count > bitmap_maxno)
goto out;
- for (;;) {
+ for (start = 0; ; start = bitmap_no + mask + 1) {
spin_lock_irq(&cma->lock);
/*
* If the request is larger than the available number
@@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
spin_unlock_irq(&cma->lock);
break;
}
+
+ pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
+ page = pfn_to_page(pfn);
+
+ /*
+ * Do not hand out page ranges that are not contiguous, so
+ * callers can just iterate the pages without having to worry
+ * about these corner cases.
+ */
+ if (!page_range_contiguous(page, count)) {
+ spin_unlock_irq(&cma->lock);
+ pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]\n",
+ __func__, cma->name, pfn, pfn + count - 1);
+ continue;
+ }
+
bitmap_set(cmr->bitmap, bitmap_no, bitmap_count);
cma->available_count -= count;
/*
@@ -821,29 +835,25 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
*/
spin_unlock_irq(&cma->lock);
- pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
mutex_lock(&cma->alloc_mutex);
ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp);
mutex_unlock(&cma->alloc_mutex);
- if (ret == 0) {
- page = pfn_to_page(pfn);
+ if (!ret)
break;
- }
cma_clear_bitmap(cma, cmr, pfn, count);
if (ret != -EBUSY)
break;
pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n",
- __func__, pfn, pfn_to_page(pfn));
+ __func__, pfn, page);
trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn),
count, align);
- /* try again with a bit different memory target */
- start = bitmap_no + mask + 1;
}
out:
- *pagep = page;
+ if (!ret)
+ *pagep = page;
return ret;
}
@@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count,
*/
if (page) {
for (i = 0; i < count; i++)
- page_kasan_tag_reset(nth_page(page, i));
+ page_kasan_tag_reset(page + i);
}
if (ret && !(gfp & __GFP_NOWARN)) {
diff --git a/mm/util.c b/mm/util.c
index d235b74f7aff7..0bf349b19b652 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
{
return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0);
}
+
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/**
+ * page_range_contiguous - test whether the page range is contiguous
+ * @page: the start of the page range.
+ * @nr_pages: the number of pages in the range.
+ *
+ * Test whether the page range is contiguous, such that they can be iterated
+ * naively, corresponding to iterating a contiguous PFN range.
+ *
+ * This function should primarily only be used for debug checks, or when
+ * working with page ranges that are not naturally contiguous (e.g., pages
+ * within a folio are).
+ *
+ * Returns true if contiguous, otherwise false.
+ */
+bool page_range_contiguous(const struct page *page, unsigned long nr_pages)
+{
+ const unsigned long start_pfn = page_to_pfn(page);
+ const unsigned long end_pfn = start_pfn + nr_pages;
+ unsigned long pfn;
+
+ /*
+ * The memmap is allocated per memory section. We need to check
+ * each involved memory section once.
+ */
+ for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION);
+ pfn < end_pfn; pfn += PAGES_PER_SECTION)
+ if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn)))
+ return false;
+ return true;
+}
+#endif
#endif /* CONFIG_MMU */
--
2.50.1
* [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap()
[not found] <20250821200701.1329277-1-david@redhat.com>
` (15 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry David Hildenbrand
` (9 subsequent siblings)
26 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Marek Szyprowski, Robin Murphy,
Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
dma_common_contiguous_remap() is used to remap an "allocated contiguous
region". Within a single allocation, there is no need to use nth_page()
anymore.
Neither the buddy, nor hugetlb, nor CMA will hand out problematic page
ranges.
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
kernel/dma/remap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c
index 9e2afad1c6152..b7c1c0c92d0c8 100644
--- a/kernel/dma/remap.c
+++ b/kernel/dma/remap.c
@@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size,
if (!pages)
return NULL;
for (i = 0; i < count; i++)
- pages[i] = nth_page(page, i);
+ pages[i] = page++;
vaddr = vmap(pages, count, VM_DMA_COHERENT, prot);
kvfree(pages);
--
2.50.1
* [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry
[not found] <20250821200701.1329277-1-david@redhat.com>
` (16 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() David Hildenbrand
` (8 subsequent siblings)
26 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
The expectation is that there is currently no user that would pass in
non-contiguous page ranges: no allocator, not even VMA, will hand these
out.
The only problematic part would be if someone provided a range
obtained directly from memblock, or manually merged problematic ranges.
If we find such cases, we should fix them to create separate
SG entries.
Let's check in sg_set_page() that this is really the case. No need to
check in sg_set_folio(), as pages in a folio are guaranteed to be
contiguous.
We can now drop the nth_page() usage in sg_page_iter_page().
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/scatterlist.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6f8a4965f9b98..8196949dfc82c 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -6,6 +6,7 @@
#include <linux/types.h>
#include <linux/bug.h>
#include <linux/mm.h>
+#include <linux/mm_inline.h>
#include <asm/io.h>
struct scatterlist {
@@ -158,6 +159,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
static inline void sg_set_page(struct scatterlist *sg, struct page *page,
unsigned int len, unsigned int offset)
{
+ VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE));
sg_assign_page(sg, page);
sg->offset = offset;
sg->length = len;
@@ -600,7 +602,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter,
*/
static inline struct page *sg_page_iter_page(struct sg_page_iter *piter)
{
- return nth_page(sg_page(piter->sg), piter->sg_pgoffset);
+ return sg_page(piter->sg) + piter->sg_pgoffset;
}
/**
--
2.50.1
* [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
[not found] <20250821200701.1329277-1-david@redhat.com>
` (17 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 33/35] kfence: drop nth_page() usage David Hildenbrand
` (7 subsequent siblings)
26 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
There is the concern that unpin_user_page_range_dirty_lock() might do
some weird merging of PFN ranges -- either now or in the future -- such
that the PFN range is contiguous but the page range might not be.
Let's sanity-check for that and drop the nth_page() usage.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/gup.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index f017ff6d7d61a..0a669a766204b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -237,7 +237,7 @@ void folio_add_pin(struct folio *folio)
static inline struct folio *gup_folio_range_next(struct page *start,
unsigned long npages, unsigned long i, unsigned int *ntails)
{
- struct page *next = nth_page(start, i);
+ struct page *next = start + i;
struct folio *folio = page_folio(next);
unsigned int nr = 1;
@@ -342,6 +342,9 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
* "gup-pinned page range" refers to a range of pages that has had one of the
* pin_user_pages() variants called on that page.
*
+ * The page range must be truly contiguous: the page range corresponds
+ * to a contiguous PFN range and all pages can be iterated naturally.
+ *
* For the page ranges defined by [page .. page+npages], make that range (or
* its head pages, if a compound page) dirty, if @make_dirty is true, and if the
* page range was previously listed as clean.
@@ -359,6 +362,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
struct folio *folio;
unsigned int nr;
+ VM_WARN_ON_ONCE(!page_range_contiguous(page, npages));
+
for (i = 0; i < npages; i += nr) {
folio = gup_folio_range_next(page, npages, i, &nr);
if (make_dirty && !folio_test_dirty(folio)) {
--
2.50.1
* [PATCH RFC 33/35] kfence: drop nth_page() usage
[not found] <20250821200701.1329277-1-david@redhat.com>
` (18 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
2025-08-21 20:32 ` David Hildenbrand
2025-08-21 20:07 ` [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page() David Hildenbrand
` (6 subsequent siblings)
26 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Alexander Potapenko, Marco Elver,
Dmitry Vyukov, Andrew Morton, Brendan Jackman, Christoph Lameter,
Dennis Zhou, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marek Szyprowski, Michal Hocko,
Mike Rapoport, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
We want to get rid of nth_page(), and kfence init code is the last user.
Unfortunately, we might actually walk a PFN range where the pages are
not contiguous, because we might be allocating an area from memblock
that could span memory sections in problematic kernel configs (SPARSEMEM
without SPARSEMEM_VMEMMAP).
We could check whether the page range is contiguous
using page_range_contiguous() and fail kfence init, or make kfence
incompatible with these problematic kernel configs.
Let's keep it simple and just use pfn_to_page(), iterating the PFNs
directly.
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/kfence/core.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 0ed3be100963a..793507c77f9e8 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -594,15 +594,15 @@ static void rcu_guarded_free(struct rcu_head *h)
*/
static unsigned long kfence_init_pool(void)
{
- unsigned long addr;
- struct page *pages;
+ unsigned long addr, pfn, start_pfn, end_pfn;
int i;
if (!arch_kfence_init_pool())
return (unsigned long)__kfence_pool;
addr = (unsigned long)__kfence_pool;
- pages = virt_to_page(__kfence_pool);
+ start_pfn = PHYS_PFN(virt_to_phys(__kfence_pool));
+ end_pfn = start_pfn + KFENCE_POOL_SIZE / PAGE_SIZE;
/*
* Set up object pages: they must have PGTY_slab set to avoid freeing
@@ -612,12 +612,13 @@ static unsigned long kfence_init_pool(void)
* fast-path in SLUB, and therefore need to ensure kfree() correctly
* enters __slab_free() slow-path.
*/
- for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
- struct slab *slab = page_slab(nth_page(pages, i));
+ for (i = 0, pfn = start_pfn; pfn != end_pfn; i++, pfn++) {
+ struct slab *slab;
if (!i || (i % 2))
continue;
+ slab = page_slab(pfn_to_page(pfn));
__folio_set_slab(slab_folio(slab));
#ifdef CONFIG_MEMCG
slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts |
@@ -664,11 +665,13 @@ static unsigned long kfence_init_pool(void)
return 0;
reset_slab:
- for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
- struct slab *slab = page_slab(nth_page(pages, i));
+ for (i = 0, pfn = start_pfn; pfn != end_pfn; i++, pfn++) {
+ struct slab *slab;
if (!i || (i % 2))
continue;
+
+ slab = page_slab(pfn_to_page(pfn));
#ifdef CONFIG_MEMCG
slab->obj_exts = 0;
#endif
--
2.50.1
* [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page()
[not found] <20250821200701.1329277-1-david@redhat.com>
` (19 preceding siblings ...)
2025-08-21 20:06 ` [PATCH RFC 33/35] kfence: drop nth_page() usage David Hildenbrand
@ 2025-08-21 20:07 ` David Hildenbrand
2025-08-21 20:07 ` [PATCH RFC 35/35] mm: remove nth_page() David Hildenbrand
` (5 subsequent siblings)
26 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:07 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
Ever since commit 858c708d9efb ("block: move the bi_size update out of
__bio_try_merge_page"), page_is_mergeable() no longer exists, and the
logic in bvec_try_merge_page() is now a simple page pointer
comparison.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/bvec.h | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 0a80e1f9aa201..3fc0efa0825b1 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,11 +22,8 @@ struct page;
* @bv_len: Number of bytes in the address range.
* @bv_offset: Start of the address range relative to the start of @bv_page.
*
- * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
- *
- * nth_page(@bv_page, n) == @bv_page + n
- *
- * This holds because page_is_mergeable() checks the above property.
+ * All pages within a bio_vec starting from @bv_page are contiguous and
+ * can simply be iterated (see bvec_advance()).
*/
struct bio_vec {
struct page *bv_page;
--
2.50.1
* [PATCH RFC 35/35] mm: remove nth_page()
[not found] <20250821200701.1329277-1-david@redhat.com>
` (20 preceding siblings ...)
2025-08-21 20:07 ` [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page() David Hildenbrand
@ 2025-08-21 20:07 ` David Hildenbrand
[not found] ` <20250821200701.1329277-2-david@redhat.com>
` (4 subsequent siblings)
26 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:07 UTC (permalink / raw)
To: linux-kernel
Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
Now that all users are gone, let's remove it.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 2 --
tools/testing/scatterlist/linux/mm.h | 1 -
2 files changed, 3 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f59ad1f9fc792..3ded0db8322f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -210,9 +210,7 @@ extern unsigned long sysctl_admin_reserve_kbytes;
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
bool page_range_contiguous(const struct page *page, unsigned long nr_pages);
-#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
#else
-#define nth_page(page,n) ((page) + (n))
static inline bool page_range_contiguous(const struct page *page,
unsigned long nr_pages)
{
diff --git a/tools/testing/scatterlist/linux/mm.h b/tools/testing/scatterlist/linux/mm.h
index 5bd9e6e806254..121ae78d6e885 100644
--- a/tools/testing/scatterlist/linux/mm.h
+++ b/tools/testing/scatterlist/linux/mm.h
@@ -51,7 +51,6 @@ static inline unsigned long page_to_phys(struct page *page)
#define page_to_pfn(page) ((unsigned long)(page) / PAGE_SIZE)
#define pfn_to_page(pfn) (void *)((pfn) * PAGE_SIZE)
-#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
#define __min(t1, t2, min1, min2, x, y) ({ \
t1 min1 = (x); \
--
2.50.1
* Re: [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable
[not found] ` <20250821200701.1329277-2-david@redhat.com>
@ 2025-08-21 20:20 ` Zi Yan
0 siblings, 0 replies; 61+ messages in thread
From: Zi Yan @ 2025-08-21 20:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
David S. Miller, Andreas Larsson, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
> In an ideal world, we wouldn't have to deal with SPARSEMEM without
> SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
> considered too costly and consequently not supported.
>
> However, if an architecture does support SPARSEMEM with
> SPARSEMEM_VMEMMAP, let's forbid the user from disabling VMEMMAP: just
> like we already do for arm64, s390 and x86.
>
> So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without
> SPARSEMEM_VMEMMAP.
>
> This implies that the option to not use SPARSEMEM_VMEMMAP will now be
> gone for loongarch, powerpc, riscv and sparc. All architectures only
> enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really
> be a big downside to using the VMEMMAP (quite the contrary).
>
> This is a preparation for not supporting
>
> (1) folio sizes that exceed a single memory section
> (2) CMA allocations of non-contiguous page ranges
>
> in SPARSEMEM without SPARSEMEM_VMEMMAP configs, where we
> want to limit the possible impact as much as possible (e.g., gigantic hugetlb
> page allocations suddenly failing).
Sounds like a good idea.
>
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: WANG Xuerui <kernel@xen0n.name>
> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Albert Ou <aou@eecs.berkeley.edu>
> Cc: Alexandre Ghiti <alex@ghiti.fr>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Andreas Larsson <andreas@gaisler.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/Kconfig | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
Acked-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
2025-08-21 20:06 ` [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() David Hildenbrand
@ 2025-08-21 20:23 ` Zi Yan
2025-08-22 17:07 ` SeongJae Park
1 sibling, 0 replies; 61+ messages in thread
From: Zi Yan @ 2025-08-21 20:23 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
> Let's reject them early, which in turn makes folio_alloc_gigantic() reject
> them properly.
>
> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
> and calculate MAX_FOLIO_NR_PAGES based on that.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> include/linux/mm.h | 6 ++++--
> mm/page_alloc.c | 5 ++++-
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
[not found] ` <20250821200701.1329277-32-david@redhat.com>
@ 2025-08-21 20:24 ` Linus Torvalds
2025-08-21 20:29 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Linus Torvalds @ 2025-08-21 20:24 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
On Thu, 21 Aug 2025 at 16:08, David Hildenbrand <david@redhat.com> wrote:
>
> - page = nth_page(page, offset >> PAGE_SHIFT);
> + page += offset / PAGE_SIZE;
Please keep the " >> PAGE_SHIFT" form.
Is "offset" unsigned? Yes it is. But I had to look at the source code
to make sure, because it wasn't locally obvious from the patch. And
I'd rather we keep a pattern that is "safe", in that it doesn't
generate strange code if the value might be a 's64' (eg loff_t) on
32-bit architectures.
Because doing a 64-bit shift on x86-32 is like three cycles. Doing a
64-bit signed division by a simple constant is something like ten
strange instructions even if the end result is only 32-bit.
And again - not the case *here*, but just a general "let's keep to one
pattern", and the shift pattern is simply the better choice.
Linus
* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
2025-08-21 20:24 ` [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry Linus Torvalds
@ 2025-08-21 20:29 ` David Hildenbrand
2025-08-21 20:36 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:29 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
On 21.08.25 22:24, Linus Torvalds wrote:
> On Thu, 21 Aug 2025 at 16:08, David Hildenbrand <david@redhat.com> wrote:
>>
>> - page = nth_page(page, offset >> PAGE_SHIFT);
>> + page += offset / PAGE_SIZE;
>
> Please keep the " >> PAGE_SHIFT" form.
No strong opinion.
I was primarily doing it to get rid of (in other cases) the parentheses.
Like in patch #29
- /* Assumption: contiguous pages can be accessed as "page + i" */
- page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
+ page = sg_page(sg) + *offset / PAGE_SIZE;
>
> Is "offset" unsigned? Yes it is, But I had to look at the source code
> to make sure, because it wasn't locally obvious from the patch. And
> I'd rather we keep a pattern that is "safe", in that it doesn't
> generate strange code if the value might be a 's64' (eg loff_t) on
> 32-bit architectures.
>
> Because doing a 64-bit shift on x86-32 is like three cycles. Doing a
> 64-bit signed division by a simple constant is something like ten
> strange instructions even if the end result is only 32-bit.
I would have thought that the compiler is smart enough to optimize that?
PAGE_SIZE is a constant.
>
> And again - not the case *here*, but just a general "let's keep to one
> pattern", and the shift pattern is simply the better choice.
It's a wild mixture, but I can keep doing what we already do in these cases.
--
Cheers
David / dhildenb
* Re: [PATCH RFC 33/35] kfence: drop nth_page() usage
2025-08-21 20:06 ` [PATCH RFC 33/35] kfence: drop nth_page() usage David Hildenbrand
@ 2025-08-21 20:32 ` David Hildenbrand
2025-08-21 21:45 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:32 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Potapenko, Marco Elver, Dmitry Vyukov, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
On 21.08.25 22:06, David Hildenbrand wrote:
> We want to get rid of nth_page(), and kfence init code is the last user.
>
> Unfortunately, we might actually walk a PFN range where the pages are
> not contiguous, because we might be allocating an area from memblock
> that could span memory sections in problematic kernel configs (SPARSEMEM
> without SPARSEMEM_VMEMMAP).
>
> We could check whether the page range is contiguous
> using page_range_contiguous() and fail kfence init, or make kfence
> incompatible with these problematic kernel configs.
>
> Let's keep it simple and just use pfn_to_page(), iterating PFNs.
>
Fortunately this series is RFC due to lack of detailed testing :P
Something gives me a NULL-pointer dereference here (maybe the virt_to_phys()).
Will look into that tomorrow.
--
Cheers
David / dhildenb
* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
2025-08-21 20:29 ` David Hildenbrand
@ 2025-08-21 20:36 ` Linus Torvalds
2025-08-21 20:37 ` David Hildenbrand
2025-08-21 20:40 ` Linus Torvalds
2 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2025-08-21 20:36 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
Oh, and your reply was an invalid email and ended up in my spam-box:
From: David Hildenbrand <david@redhat.com>
but you apparently didn't use the redhat mail system, so the DKIM signing fails
dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE)
header.from=redhat.com
and it gets marked as spam.
I think you may have gone through smtp.kernel.org, but then you need
to use your kernel.org email address to get the DKIM right.
Linus
* Re: [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order()
2025-08-21 20:06 ` [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() David Hildenbrand
@ 2025-08-21 20:36 ` Zi Yan
0 siblings, 0 replies; 61+ messages in thread
From: Zi Yan @ 2025-08-21 20:36 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
> Let's sanity-check in folio_set_order() whether we would be trying to
> create a folio with an order that would make it exceed MAX_FOLIO_ORDER.
>
> This will enable the check whenever a folio/compound page is initialized
> through prepare_compound_head() / prepare_compound_page().
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/internal.h | 1 +
> 1 file changed, 1 insertion(+)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
2025-08-21 20:29 ` David Hildenbrand
2025-08-21 20:36 ` Linus Torvalds
@ 2025-08-21 20:37 ` David Hildenbrand
2025-08-21 20:40 ` Linus Torvalds
2 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
On 21.08.25 22:29, David Hildenbrand wrote:
> On 21.08.25 22:24, Linus Torvalds wrote:
>> On Thu, 21 Aug 2025 at 16:08, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> - page = nth_page(page, offset >> PAGE_SHIFT);
>>> + page += offset / PAGE_SIZE;
>>
>> Please keep the " >> PAGE_SHIFT" form.
>
> No strong opinion.
>
> I was primarily doing it to get rid of (in other cases) the parentheses.
>
> Like in patch #29
>
> - /* Assumption: contiguous pages can be accessed as "page + i" */
> - page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
> + page = sg_page(sg) + *offset / PAGE_SIZE;
>
>>
>> Is "offset" unsigned? Yes it is, But I had to look at the source code
>> to make sure, because it wasn't locally obvious from the patch. And
>> I'd rather we keep a pattern that is "safe", in that it doesn't
>> generate strange code if the value might be a 's64' (eg loff_t) on
>> 32-bit architectures.
>>
>> Because doing a 64-bit shift on x86-32 is like three cycles. Doing a
>> 64-bit signed division by a simple constant is something like ten
>> strange instructions even if the end result is only 32-bit.
>
> I would have thought that the compiler is smart enough to optimize that?
> PAGE_SIZE is a constant.
It's late, I get your point: the compiler can't optimize it if it's a
signed value ...
--
Cheers
David / dhildenb
* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
2025-08-21 20:29 ` David Hildenbrand
2025-08-21 20:36 ` Linus Torvalds
2025-08-21 20:37 ` David Hildenbrand
@ 2025-08-21 20:40 ` Linus Torvalds
2 siblings, 0 replies; 61+ messages in thread
From: Linus Torvalds @ 2025-08-21 20:40 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
On Thu, Aug 21, 2025 at 4:29 PM David Hildenbrand <david@redhat.com> wrote:
> > Because doing a 64-bit shift on x86-32 is like three cycles. Doing a
> > 64-bit signed division by a simple constant is something like ten
> > strange instructions even if the end result is only 32-bit.
>
> I would have thought that the compiler is smart enough to optimize that?
> PAGE_SIZE is a constant.
Oh, the compiler optimizes things. But dividing a 64-bit signed value
with a constant is still quite complicated.
It doesn't generate a 'div' instruction, but it generates something like this:
movl %ebx, %edx
sarl $31, %edx
movl %edx, %eax
xorl %edx, %edx
andl $4095, %eax
addl %ecx, %eax
adcl %ebx, %edx
and that's certainly a lot faster than an actual 64-bit divide would be.
An unsigned divide - or a shift - results in just
shrdl $12, %ecx, %eax
which is still not the fastest instruction (I think shrld gets split
into two uops), but it's certainly simpler and easier to read.
Linus
* Re: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
2025-08-21 20:06 ` [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs David Hildenbrand
@ 2025-08-21 20:46 ` Zi Yan
2025-08-21 20:49 ` David Hildenbrand
2025-08-24 13:24 ` Mike Rapoport
1 sibling, 1 reply; 61+ messages in thread
From: Zi Yan @ 2025-08-21 20:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
> Let's limit the maximum folio size in problematic kernel configs where
> the memmap is allocated per memory section (SPARSEMEM without
> SPARSEMEM_VMEMMAP) to a single memory section.
>
> Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE
> but not SPARSEMEM_VMEMMAP: sh.
>
> Fortunately, the biggest hugetlb size sh supports is 64 MiB
> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
>
> As folios and memory sections are naturally aligned to their power-of-two
> size in memory, a single folio consequently can no longer span multiple memory
> sections on these problematic kernel configs.
>
> nth_page() is no longer required when operating within a single compound
> page / folio.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> include/linux/mm.h | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77737cbf2216a..48a985e17ef4e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
> return folio_large_nr_pages(folio);
> }
>
> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -#define MAX_FOLIO_ORDER PUD_ORDER
> -#else
> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
> +/*
> + * We don't expect any folios that exceed buddy sizes (and consequently
> + * memory sections).
> + */
> #define MAX_FOLIO_ORDER MAX_PAGE_ORDER
> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> +/*
> + * Only pages within a single memory section are guaranteed to be
> + * contiguous. By limiting folios to a single memory section, all folio
> + * pages are guaranteed to be contiguous.
> + */
> +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
> +#else
> +/*
> + * There is no real limit on the folio size. We limit them to the maximum we
> + * currently expect.
Keeping the comment about hugetlbfs would be helpful here, since the other
folios are still limited by the buddy allocator's MAX_ORDER.
> + */
> +#define MAX_FOLIO_ORDER PUD_ORDER
> #endif
>
> #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
> --
> 2.50.1
Otherwise, Reviewed-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
2025-08-21 20:46 ` Zi Yan
@ 2025-08-21 20:49 ` David Hildenbrand
2025-08-21 20:50 ` Zi Yan
0 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:49 UTC (permalink / raw)
To: Zi Yan
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86
On 21.08.25 22:46, Zi Yan wrote:
> On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
>
>> Let's limit the maximum folio size in problematic kernel config where
>> the memmap is allocated per memory section (SPARSEMEM without
>> SPARSEMEM_VMEMMAP) to a single memory section.
>>
>> Currently, only a single architectures supports ARCH_HAS_GIGANTIC_PAGE
>> but not SPARSEMEM_VMEMMAP: sh.
>>
>> Fortunately, the biggest hugetlb size sh supports is 64 MiB
>> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
>> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
>>
>> As folios and memory sections are naturally aligned to their order-2 size
>> in memory, consequently a single folio can no longer span multiple memory
>> sections on these problematic kernel configs.
>>
>> nth_page() is no longer required when operating within a single compound
>> page / folio.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> include/linux/mm.h | 22 ++++++++++++++++++----
>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 77737cbf2216a..48a985e17ef4e 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
>> return folio_large_nr_pages(folio);
>> }
>>
>> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>> -#define MAX_FOLIO_ORDER PUD_ORDER
>> -#else
>> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
>> +/*
>> + * We don't expect any folios that exceed buddy sizes (and consequently
>> + * memory sections).
>> + */
>> #define MAX_FOLIO_ORDER MAX_PAGE_ORDER
>> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>> +/*
>> + * Only pages within a single memory section are guaranteed to be
>> + * contiguous. By limiting folios to a single memory section, all folio
>> + * pages are guaranteed to be contiguous.
>> + */
>> +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
>> +#else
>> +/*
>> + * There is no real limit on the folio size. We limit them to the maximum we
>> + * currently expect.
>
> The comment about hugetlbfs is helpful here, since the other folios are still
> limited by buddy allocator’s MAX_ORDER.
Yeah, but the old comment was wrong (there is DAX).
I can add here "currently expect (e.g., hugetlbfs, dax)."
--
Cheers
David / dhildenb
* Re: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
2025-08-21 20:49 ` David Hildenbrand
@ 2025-08-21 20:50 ` Zi Yan
0 siblings, 0 replies; 61+ messages in thread
From: Zi Yan @ 2025-08-21 20:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86
On 21 Aug 2025, at 16:49, David Hildenbrand wrote:
> On 21.08.25 22:46, Zi Yan wrote:
>> On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
>>
>>> Let's limit the maximum folio size in problematic kernel config where
>>> the memmap is allocated per memory section (SPARSEMEM without
>>> SPARSEMEM_VMEMMAP) to a single memory section.
>>>
>>> Currently, only a single architectures supports ARCH_HAS_GIGANTIC_PAGE
>>> but not SPARSEMEM_VMEMMAP: sh.
>>>
>>> Fortunately, the biggest hugetlb size sh supports is 64 MiB
>>> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
>>> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
>>>
>>> As folios and memory sections are naturally aligned to their order-2 size
>>> in memory, consequently a single folio can no longer span multiple memory
>>> sections on these problematic kernel configs.
>>>
>>> nth_page() is no longer required when operating within a single compound
>>> page / folio.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>> include/linux/mm.h | 22 ++++++++++++++++++----
>>> 1 file changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 77737cbf2216a..48a985e17ef4e 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
>>> return folio_large_nr_pages(folio);
>>> }
>>>
>>> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>>> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>> -#define MAX_FOLIO_ORDER PUD_ORDER
>>> -#else
>>> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
>>> +/*
>>> + * We don't expect any folios that exceed buddy sizes (and consequently
>>> + * memory sections).
>>> + */
>>> #define MAX_FOLIO_ORDER MAX_PAGE_ORDER
>>> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>>> +/*
>>> + * Only pages within a single memory section are guaranteed to be
>>> + * contiguous. By limiting folios to a single memory section, all folio
>>> + * pages are guaranteed to be contiguous.
>>> + */
>>> +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
>>> +#else
>>> +/*
>>> + * There is no real limit on the folio size. We limit them to the maximum we
>>> + * currently expect.
>>
>> The comment about hugetlbfs is helpful here, since the other folios are still
>> limited by buddy allocator’s MAX_ORDER.
>
> Yeah, but the old comment was wrong (there is DAX).
>
> I can add here "currently expect (e.g., hugetlbfs, dax)."
Sounds good.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx()
2025-08-21 20:06 ` [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() David Hildenbrand
@ 2025-08-21 20:55 ` Zi Yan
2025-08-21 21:00 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Zi Yan @ 2025-08-21 20:55 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86
On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
> Now that a single folio/compound page can no longer span memory sections
> in problematic kernel configurations, we can stop using nth_page().
>
> While at it, turn both macros into static inline functions and add
> kernel doc for folio_page_idx().
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> include/linux/mm.h | 16 ++++++++++++++--
> include/linux/page-flags.h | 5 ++++-
> 2 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 48a985e17ef4e..ef360b72cb05c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
>
> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
> -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio))
> #else
> #define nth_page(page,n) ((page) + (n))
> -#define folio_page_idx(folio, p) ((p) - &(folio)->page)
> #endif
>
> /* to align the pointer to the (next) page boundary */
> @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes;
> /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
> #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
>
> +/**
> + * folio_page_idx - Return the number of a page in a folio.
> + * @folio: The folio.
> + * @page: The folio page.
> + *
> + * This function expects that the page is actually part of the folio.
> + * The returned number is relative to the start of the folio.
> + */
> +static inline unsigned long folio_page_idx(const struct folio *folio,
> + const struct page *page)
> +{
> + return page - &folio->page;
> +}
> +
> static inline struct folio *lru_to_folio(struct list_head *head)
> {
> return list_entry((head)->prev, struct folio, lru);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index d53a86e68c89b..080ad10c0defc 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page)
> * check that the page number lies within @folio; the caller is presumed
> * to have a reference to the page.
> */
> -#define folio_page(folio, n) nth_page(&(folio)->page, n)
> +static inline struct page *folio_page(struct folio *folio, unsigned long nr)
> +{
> + return &folio->page + nr;
> +}
Maybe s/nr/n/ or s/nr/nth/, since it returns the nth page within a folio.
Since you have added kernel doc for folio_page_idx(), it does not hurt
to have something similar for folio_page(). :)
+/**
+ * folio_page - Return the nth page in a folio.
+ * @folio: The folio.
+ * @n: Page index within the folio.
+ *
+ * This function expects that n does not exceed folio_nr_pages(folio).
+ * The returned page is relative to the first page of the folio.
+ */
>
> static __always_inline int PageTail(const struct page *page)
> {
> --
> 2.50.1
Otherwise, Reviewed-by: Zi Yan <ziy@nvidia.com>
Best Regards,
Yan, Zi
* Re: [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx()
2025-08-21 20:55 ` Zi Yan
@ 2025-08-21 21:00 ` David Hildenbrand
0 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 21:00 UTC (permalink / raw)
To: Zi Yan
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86
On 21.08.25 22:55, Zi Yan wrote:
> On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
>
>> Now that a single folio/compound page can no longer span memory sections
>> in problematic kernel configurations, we can stop using nth_page().
>>
>> While at it, turn both macros into static inline functions and add
>> kernel doc for folio_page_idx().
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> include/linux/mm.h | 16 ++++++++++++++--
>> include/linux/page-flags.h | 5 ++++-
>> 2 files changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 48a985e17ef4e..ef360b72cb05c 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
>>
>> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>> #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
>> -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio))
>> #else
>> #define nth_page(page,n) ((page) + (n))
>> -#define folio_page_idx(folio, p) ((p) - &(folio)->page)
>> #endif
>>
>> /* to align the pointer to the (next) page boundary */
>> @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes;
>> /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
>> #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
>>
>> +/**
>> + * folio_page_idx - Return the number of a page in a folio.
>> + * @folio: The folio.
>> + * @page: The folio page.
>> + *
>> + * This function expects that the page is actually part of the folio.
>> + * The returned number is relative to the start of the folio.
>> + */
>> +static inline unsigned long folio_page_idx(const struct folio *folio,
>> + const struct page *page)
>> +{
>> + return page - &folio->page;
>> +}
>> +
>> static inline struct folio *lru_to_folio(struct list_head *head)
>> {
>> return list_entry((head)->prev, struct folio, lru);
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index d53a86e68c89b..080ad10c0defc 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page)
>> * check that the page number lies within @folio; the caller is presumed
>> * to have a reference to the page.
>> */
>> -#define folio_page(folio, n) nth_page(&(folio)->page, n)
>> +static inline struct page *folio_page(struct folio *folio, unsigned long nr)
>> +{
>> + return &folio->page + nr;
>> +}
>
> Maybe s/nr/n/ or s/nr/nth/, since it returns the nth page within a folio.
Yeah, it's even called "n" in the kernel docs ...
>
> Since you have added kernel doc for folio_page_idx(), it does not hurt
> to have something similar for folio_page(). :)
... which we already have! (see above the macro) :)
Thanks!
--
Cheers
David / dhildenb
* Re: [PATCH RFC 33/35] kfence: drop nth_page() usage
2025-08-21 20:32 ` David Hildenbrand
@ 2025-08-21 21:45 ` David Hildenbrand
0 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-21 21:45 UTC (permalink / raw)
To: linux-kernel
Cc: Alexander Potapenko, Marco Elver, Dmitry Vyukov, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
On 21.08.25 22:32, David Hildenbrand wrote:
> On 21.08.25 22:06, David Hildenbrand wrote:
>> We want to get rid of nth_page(), and kfence init code is the last user.
>>
>> Unfortunately, we might actually walk a PFN range where the pages are
>> not contiguous, because we might be allocating an area from memblock
>> that could span memory sections in problematic kernel configs (SPARSEMEM
>> without SPARSEMEM_VMEMMAP).
>>
>> We could check whether the page range is contiguous
>> using page_range_contiguous() and fail kfence init, or make kfence
>> incompatible with these problematic kernel configs.
>>
>> Let's keep it simple and just use pfn_to_page() while iterating PFNs.
>>
>
> Fortunately this series is RFC due to lack of detailed testing :P
>
> Something gives me a NULL-pointer dereference here (maybe the virt_to_phys()).
>
> Will look into that tomorrow.
Okay, easy: relying on i but not updating it /me facepalm
--
Cheers
David / dhildenb
* Re: [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry
[not found] ` <20250821200701.1329277-25-david@redhat.com>
@ 2025-08-22 1:59 ` Damien Le Moal
2025-08-22 6:18 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Damien Le Moal @ 2025-08-22 1:59 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: Niklas Cassel, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
On 8/22/25 05:06, David Hildenbrand wrote:
> It's no longer required to use nth_page() when iterating pages within a
> single SG entry, so let's drop the nth_page() usage.
>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Niklas Cassel <cassel@kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> drivers/ata/libata-sff.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 7fc407255eb46..9f5d0f9f6d686 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
> offset = qc->cursg->offset + qc->cursg_ofs;
>
> /* get the current page and offset */
> - page = nth_page(page, (offset >> PAGE_SHIFT));
> + page += offset / PAGE_SHIFT;
Shouldn't this be "offset >> PAGE_SHIFT" ?
> offset %= PAGE_SIZE;
>
> /* don't overrun current sg */
> @@ -631,7 +631,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
> unsigned int split_len = PAGE_SIZE - offset;
>
> ata_pio_xfer(qc, page, offset, split_len);
> - ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len);
> + ata_pio_xfer(qc, page + 1, 0, count - split_len);
> } else {
> ata_pio_xfer(qc, page, offset, count);
> }
> @@ -751,7 +751,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
> offset = sg->offset + qc->cursg_ofs;
>
> /* get the current page and offset */
> - page = nth_page(page, (offset >> PAGE_SHIFT));
> + page += offset / PAGE_SIZE;
Same here, though this seems correct too.
> offset %= PAGE_SIZE;
>
> /* don't overrun current sg */
--
Damien Le Moal
Western Digital Research
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-21 20:06 ` [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() David Hildenbrand
@ 2025-08-22 4:09 ` Mika Penttilä
2025-08-22 6:24 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Mika Penttilä @ 2025-08-22 4:09 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
On 8/21/25 23:06, David Hildenbrand wrote:
> All pages were already initialized and set to PageReserved() with a
> refcount of 1 by MM init code.
Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
initialize struct pages?
> In fact, by using __init_single_page(), we will be setting the refcount to
> 1 just to freeze it again immediately afterwards.
>
> So drop the __init_single_page() and use __ClearPageReserved() instead.
> Adjust the comments to highlight that we are dealing with an open-coded
> prep_compound_page() variant.
>
> Further, as we can now safely iterate over all pages in a folio, let's
> avoid the page-pfn dance and just iterate the pages directly.
>
> Note that the current code was likely problematic, but we never ran into
> it: prep_compound_tail() would have been called with an offset that might
> exceed a memory section, and prep_compound_tail() would have simply
> added that offset to the page pointer -- which would not have done the
> right thing on sparsemem without vmemmap.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/hugetlb.c | 21 ++++++++++-----------
> 1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d12a9d5146af4..ae82a845b14ad 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3235,17 +3235,14 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
> unsigned long start_page_number,
> unsigned long end_page_number)
> {
> - enum zone_type zone = zone_idx(folio_zone(folio));
> - int nid = folio_nid(folio);
> - unsigned long head_pfn = folio_pfn(folio);
> - unsigned long pfn, end_pfn = head_pfn + end_page_number;
> + struct page *head_page = folio_page(folio, 0);
> + struct page *page = folio_page(folio, start_page_number);
> + unsigned long i;
> int ret;
>
> - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
> - struct page *page = pfn_to_page(pfn);
> -
> - __init_single_page(page, pfn, zone, nid);
> - prep_compound_tail((struct page *)folio, pfn - head_pfn);
> + for (i = start_page_number; i < end_page_number; i++, page++) {
> + __ClearPageReserved(page);
> + prep_compound_tail(head_page, i);
> ret = page_ref_freeze(page, 1);
> VM_BUG_ON(!ret);
> }
> @@ -3257,12 +3254,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
> {
> int ret;
>
> - /* Prepare folio head */
> + /*
> + * This is an open-coded prep_compound_page() whereby we avoid
> + * walking pages twice by preparing+freezing them in the same go.
> + */
> __folio_clear_reserved(folio);
> __folio_set_head(folio);
> ret = folio_ref_freeze(folio, 1);
> VM_BUG_ON(!ret);
> - /* Initialize the necessary tail struct pages */
> hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
> prep_compound_head((struct page *)folio, huge_page_order(h));
> }
--Mika
* Re: [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry
2025-08-22 1:59 ` [PATCH RFC 24/35] ata: libata-eh: drop " Damien Le Moal
@ 2025-08-22 6:18 ` David Hildenbrand
0 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-22 6:18 UTC (permalink / raw)
To: Damien Le Moal, linux-kernel
Cc: Niklas Cassel, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
On 22.08.25 03:59, Damien Le Moal wrote:
> On 8/22/25 05:06, David Hildenbrand wrote:
>> It's no longer required to use nth_page() when iterating pages within a
>> single SG entry, so let's drop the nth_page() usage.
>>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Niklas Cassel <cassel@kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> drivers/ata/libata-sff.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
>> index 7fc407255eb46..9f5d0f9f6d686 100644
>> --- a/drivers/ata/libata-sff.c
>> +++ b/drivers/ata/libata-sff.c
>> @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>> offset = qc->cursg->offset + qc->cursg_ofs;
>>
>> /* get the current page and offset */
>> - page = nth_page(page, (offset >> PAGE_SHIFT));
>> + page += offset / PAGE_SHIFT;
>
> Shouldn't this be "offset >> PAGE_SHIFT" ?
Thanks for taking a look!
Yeah, I already reverted back to "offset >> PAGE_SHIFT" after Linus
mentioned in another mail in this thread that ">> PAGE_SHIFT" is
generally preferred because the compiler cannot optimize as much if
offset would be a signed variable.
So the next version will have the shift again.
--
Cheers
David / dhildenb
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-22 4:09 ` Mika Penttilä
@ 2025-08-22 6:24 ` David Hildenbrand
2025-08-23 8:59 ` Mike Rapoport
0 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-22 6:24 UTC (permalink / raw)
To: Mika Penttilä, linux-kernel
Cc: Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
On 22.08.25 06:09, Mika Penttilä wrote:
>
> On 8/21/25 23:06, David Hildenbrand wrote:
>
>> All pages were already initialized and set to PageReserved() with a
>> refcount of 1 by MM init code.
>
> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> initialize struct pages?
Excellent point, I did not know about that one.
Spotting that we don't do the same for the head page made me assume that
it's just a misuse of __init_single_page().
But the nasty thing is that we use memblock_reserved_mark_noinit() to
only mark the tail pages ...
Let me revert back to __init_single_page() and add a big fat comment why
this is required.
Thanks!
--
Cheers
David / dhildenb
* Re: [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
[not found] ` <b5b08ad3-d8cd-45ff-9767-7cf1b22b5e03@gmail.com>
@ 2025-08-22 13:59 ` David Hildenbrand
0 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-22 13:59 UTC (permalink / raw)
To: Pavel Begunkov, linux-kernel
Cc: Jens Axboe, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Johannes Weiner,
John Hubbard, kasan-dev, kvm, Liam R. Howlett, Linus Torvalds,
linux-arm-kernel, linux-arm-kernel, linux-crypto, linux-ide,
linux-kselftest, linux-mips, linux-mmc, linux-mm, linux-riscv,
linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
On 22.08.25 13:32, Pavel Begunkov wrote:
> On 8/21/25 21:06, David Hildenbrand wrote:
>> We always provide a single dst page, it's unclear why the io_copy_cache
>> complexity is required.
>
> Because it'll need to be pulled outside the loop to reuse the page for
> multiple copies, i.e. packing multiple fragments of the same skb into
> it. Not finished, and currently it's wasting memory.
Okay, so what you're saying is that there will be follow-up work that
will actually make this structure useful.
>
> Why not do as below? Pages there never cross boundaries of their folios.
> Do you want it to be taken into the io_uring tree?
This should better all go through the MM tree where we actually
guarantee contiguous pages within a folio. (see the cover letter)
>
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index e5ff49f3425e..18c12f4b56b6 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
>
> if (folio_test_partial_kmap(page_folio(dst_page)) ||
> folio_test_partial_kmap(page_folio(src_page))) {
> - dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE);
> + dst_page += dst_offset / PAGE_SIZE;
> dst_offset = offset_in_page(dst_offset);
> - src_page = nth_page(src_page, src_offset / PAGE_SIZE);
> + src_page += src_offset / PAGE_SIZE;
Yeah, I can do that in the next version given that you have plans on
extending that code soon.
--
Cheers
David / dhildenb
* Re: [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
[not found] ` <20250821200701.1329277-3-david@redhat.com>
@ 2025-08-22 15:10 ` Mike Rapoport
0 siblings, 0 replies; 61+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:10 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Catalin Marinas, Will Deacon, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On Thu, Aug 21, 2025 at 10:06:28PM +0200, David Hildenbrand wrote:
> Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE
> is selected.
>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> arch/arm64/Kconfig | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e9bbfacc35a64..b1d1f2ff2493b 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1570,7 +1570,6 @@ source "kernel/Kconfig.hz"
> config ARCH_SPARSEMEM_ENABLE
> def_bool y
> select SPARSEMEM_VMEMMAP_ENABLE
> - select SPARSEMEM_VMEMMAP
>
> config HW_PERF_EVENTS
> def_bool y
> --
> 2.50.1
>
--
Sincerely yours,
Mike.
* Re: [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config
[not found] ` <20250821200701.1329277-6-david@redhat.com>
@ 2025-08-22 15:13 ` Mike Rapoport
0 siblings, 0 replies; 61+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:13 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Jason A. Donenfeld, Shuah Khan, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On Thu, Aug 21, 2025 at 10:06:31PM +0200, David Hildenbrand wrote:
> It's no longer user-selectable (and the default was already "y"), so
> let's just drop it.
and it should not matter for wireguard selftest anyway
>
> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> tools/testing/selftests/wireguard/qemu/kernel.config | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config
> index 0a5381717e9f4..1149289f4b30f 100644
> --- a/tools/testing/selftests/wireguard/qemu/kernel.config
> +++ b/tools/testing/selftests/wireguard/qemu/kernel.config
> @@ -48,7 +48,6 @@ CONFIG_JUMP_LABEL=y
> CONFIG_FUTEX=y
> CONFIG_SHMEM=y
> CONFIG_SLUB=y
> -CONFIG_SPARSEMEM_VMEMMAP=y
> CONFIG_SMP=y
> CONFIG_SCHED_SMT=y
> CONFIG_SCHED_MC=y
> --
> 2.50.1
>
--
Sincerely yours,
Mike.
* Re: [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page()
2025-08-21 20:06 ` [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() David Hildenbrand
@ 2025-08-22 15:27 ` Mike Rapoport
2025-08-22 18:09 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:27 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Muchun Song, netdev,
Oscar Salvador, Peter Xu, Robin Murphy, Suren Baghdasaryan,
Tejun Heo, virtualization, Vlastimil Babka, wireguard, x86,
Zi Yan
On Thu, Aug 21, 2025 at 10:06:35PM +0200, David Hildenbrand wrote:
> Grepping for "prep_compound_page" leaves on clueless how devdax gets its
> compound pages initialized.
>
> Let's add a comment that might help finding this open-coded
> prep_compound_page() initialization more easily.
>
> Further, let's be less smart about the ordering of initialization and just
> perform the prep_compound_head() call after all tail pages were
> initialized: just like prep_compound_page() does.
>
> No need for a lengthy comment then: again, just like prep_compound_page().
>
> Note that prep_compound_head() already initializes stuff in page[2]
> that successive tail page initialization
> will overwrite: _deferred_list, and on 32bit _entire_mapcount and
> _pincount. Very likely 32bit does not apply, and likely nobody ever ends
> up testing whether the _deferred_list is empty.
>
> So it shouldn't be a fix at this point, but certainly something to clean
> up.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/mm_init.c | 13 +++++--------
> 1 file changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 5c21b3af216b2..708466c5b2cc9 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head,
> unsigned long pfn, end_pfn = head_pfn + nr_pages;
> unsigned int order = pgmap->vmemmap_shift;
>
> + /*
> + * This is an open-coded prep_compound_page() whereby we avoid
> + * walking pages twice by initializing them in the same go.
> + */
While on it, can you also mention that prep_compound_page() is not used to
properly set page zone link?
With this
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> __SetPageHead(head);
> for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
> struct page *page = pfn_to_page(pfn);
> @@ -1098,15 +1102,8 @@ static void __ref memmap_init_compound(struct page *head,
> __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
> prep_compound_tail(head, pfn - head_pfn);
> set_page_count(page, 0);
> -
> - /*
> - * The first tail page stores important compound page info.
> - * Call prep_compound_head() after the first tail page has
> - * been initialized, to not have the data overwritten.
> - */
> - if (pfn == head_pfn + 1)
> - prep_compound_head(head, order);
> }
> + prep_compound_head(head, order);
> }
>
> void __ref memmap_init_zone_device(struct zone *zone,
> --
> 2.50.1
>
--
Sincerely yours,
Mike.
* Re: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
2025-08-21 20:06 ` [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() David Hildenbrand
2025-08-21 20:23 ` Zi Yan
@ 2025-08-22 17:07 ` SeongJae Park
1 sibling, 0 replies; 61+ messages in thread
From: SeongJae Park @ 2025-08-22 17:07 UTC (permalink / raw)
To: David Hildenbrand
Cc: SeongJae Park, linux-kernel, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
On Thu, 21 Aug 2025 22:06:32 +0200 David Hildenbrand <david@redhat.com> wrote:
> Let's reject them early,
I like early failures. :)
> which in turn makes folio_alloc_gigantic() reject
> them properly.
>
> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
> and calculate MAX_FOLIO_NR_PAGES based on that.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
* Re: [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages()
2025-08-21 20:06 ` [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() David Hildenbrand
@ 2025-08-22 17:09 ` SeongJae Park
0 siblings, 0 replies; 61+ messages in thread
From: SeongJae Park @ 2025-08-22 17:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: SeongJae Park, linux-kernel, Alexander Potapenko, Andrew Morton,
Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
virtualization, Vlastimil Babka, wireguard, x86, Zi Yan
On Thu, 21 Aug 2025 22:06:33 +0200 David Hildenbrand <david@redhat.com> wrote:
> Let's reject unreasonable folio sizes early, where we can still fail.
> We'll add sanity checks to prep_compound_head()/prep_compound_page()
> next.
>
> Is there a way to configure a system such that unreasonable folio sizes
> would be possible? It would already be rather questionable.
>
> If so, we'd probably want to bail out earlier, where we can avoid a
> WARN and just report a proper error message that indicates where
> something went wrong such that we messed up.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: SeongJae Park <sj@kernel.org>
Thanks,
SJ
[...]
* Re: [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page()
2025-08-22 15:27 ` Mike Rapoport
@ 2025-08-22 18:09 ` David Hildenbrand
0 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-22 18:09 UTC (permalink / raw)
To: Mike Rapoport
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Muchun Song, netdev,
Oscar Salvador, Peter Xu, Robin Murphy, Suren Baghdasaryan,
Tejun Heo, virtualization, Vlastimil Babka, wireguard, x86,
Zi Yan
On 22.08.25 17:27, Mike Rapoport wrote:
> On Thu, Aug 21, 2025 at 10:06:35PM +0200, David Hildenbrand wrote:
>> Grepping for "prep_compound_page" leaves one clueless as to how devdax gets its
>> compound pages initialized.
>>
>> Let's add a comment that might help finding this open-coded
>> prep_compound_page() initialization more easily.
>>
>> Further, let's be less smart about the ordering of initialization and just
>> perform the prep_compound_head() call after all tail pages were
>> initialized: just like prep_compound_page() does.
>>
>> No need for a lengthy comment then: again, just like prep_compound_page().
>>
>> Note that prep_compound_head() already does initialize stuff in page[2]
>> through prep_compound_head() that successive tail page initialization
>> will overwrite: _deferred_list, and on 32bit _entire_mapcount and
>> _pincount. Very likely 32bit does not apply, and likely nobody ever ends
>> up testing whether the _deferred_list is empty.
>>
>> So it shouldn't be a fix at this point, but certainly something to clean
>> up.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> mm/mm_init.c | 13 +++++--------
>> 1 file changed, 5 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>> index 5c21b3af216b2..708466c5b2cc9 100644
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head,
>> unsigned long pfn, end_pfn = head_pfn + nr_pages;
>> unsigned int order = pgmap->vmemmap_shift;
>>
>> + /*
>> + * This is an open-coded prep_compound_page() whereby we avoid
>> + * walking pages twice by initializing them in the same go.
>> + */
>
> While at it, can you also mention that prep_compound_page() is not used to
> properly set the page zone link?
Sure, thanks!
--
Cheers
David / dhildenb
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-22 6:24 ` David Hildenbrand
@ 2025-08-23 8:59 ` Mike Rapoport
2025-08-25 12:48 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Mike Rapoport @ 2025-08-23 8:59 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
> On 22.08.25 06:09, Mika Penttilä wrote:
> >
> > On 8/21/25 23:06, David Hildenbrand wrote:
> >
> > > All pages were already initialized and set to PageReserved() with a
> > > refcount of 1 by MM init code.
> >
> > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> > initialize struct pages?
>
> Excellent point, I did not know about that one.
>
> Spotting that we don't do the same for the head page made me assume that
> it's just a misuse of __init_single_page().
>
> But the nasty thing is that we use memblock_reserved_mark_noinit() to only
> mark the tail pages ...
And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
disabled struct pages are initialized regardless of
memblock_reserved_mark_noinit().
I think this patch should go in before your updates:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 753f99b4c718..1c51788339a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3230,6 +3230,22 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
return 1;
}
+/*
+ * Tail pages in a huge folio allocated from memblock are marked as 'noinit',
+ * which means that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled their
+ * struct page won't be initialized
+ */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static void __init hugetlb_init_tail_page(struct page *page, unsigned long pfn,
+ enum zone_type zone, int nid)
+{
+ __init_single_page(page, pfn, zone, nid);
+}
+#else
+static inline void hugetlb_init_tail_page(struct page *page, unsigned long pfn,
+ enum zone_type zone, int nid) {}
+#endif
+
/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
unsigned long start_page_number,
@@ -3244,7 +3260,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
struct page *page = pfn_to_page(pfn);
- __init_single_page(page, pfn, zone, nid);
+ hugetlb_init_tail_page(page, pfn, zone, nid);
prep_compound_tail((struct page *)folio, pfn - head_pfn);
ret = page_ref_freeze(page, 1);
VM_BUG_ON(!ret);
> Let me revert back to __init_single_page() and add a big fat comment why
> this is required.
>
> Thanks!
--
Sincerely yours,
Mike.
* Re: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
2025-08-21 20:06 ` [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs David Hildenbrand
2025-08-21 20:46 ` Zi Yan
@ 2025-08-24 13:24 ` Mike Rapoport
1 sibling, 0 replies; 61+ messages in thread
From: Mike Rapoport @ 2025-08-24 13:24 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Muchun Song, netdev,
Oscar Salvador, Peter Xu, Robin Murphy, Suren Baghdasaryan,
Tejun Heo, virtualization, Vlastimil Babka, wireguard, x86,
Zi Yan
On Thu, Aug 21, 2025 at 10:06:38PM +0200, David Hildenbrand wrote:
> Let's limit the maximum folio size in problematic kernel config where
> the memmap is allocated per memory section (SPARSEMEM without
> SPARSEMEM_VMEMMAP) to a single memory section.
>
> Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE
> but not SPARSEMEM_VMEMMAP: sh.
>
> Fortunately, the biggest hugetlb size sh supports is 64 MiB
> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
>
> As folios and memory sections are naturally aligned to their power-of-2 size
> in memory, consequently a single folio can no longer span multiple memory
> sections on these problematic kernel configs.
>
> nth_page() is no longer required when operating within a single compound
> page / folio.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> include/linux/mm.h | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77737cbf2216a..48a985e17ef4e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
> return folio_large_nr_pages(folio);
> }
>
> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -#define MAX_FOLIO_ORDER PUD_ORDER
> -#else
> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
> +/*
> + * We don't expect any folios that exceed buddy sizes (and consequently
> + * memory sections).
> + */
> #define MAX_FOLIO_ORDER MAX_PAGE_ORDER
> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> +/*
> + * Only pages within a single memory section are guaranteed to be
> + * contiguous. By limiting folios to a single memory section, all folio
> + * pages are guaranteed to be contiguous.
> + */
> +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT
> +#else
> +/*
> + * There is no real limit on the folio size. We limit them to the maximum we
> + * currently expect.
> + */
> +#define MAX_FOLIO_ORDER PUD_ORDER
> #endif
>
> #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
> --
> 2.50.1
>
--
Sincerely yours,
Mike.
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-23 8:59 ` Mike Rapoport
@ 2025-08-25 12:48 ` David Hildenbrand
2025-08-25 14:32 ` Mike Rapoport
0 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-25 12:48 UTC (permalink / raw)
To: Mike Rapoport
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On 23.08.25 10:59, Mike Rapoport wrote:
> On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
>> On 22.08.25 06:09, Mika Penttilä wrote:
>>>
>>> On 8/21/25 23:06, David Hildenbrand wrote:
>>>
>>>> All pages were already initialized and set to PageReserved() with a
>>>> refcount of 1 by MM init code.
>>>
>>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
>>> initialize struct pages?
>>
>> Excellent point, I did not know about that one.
>>
>> Spotting that we don't do the same for the head page made me assume that
>> it's just a misuse of __init_single_page().
>>
>> But the nasty thing is that we use memblock_reserved_mark_noinit() to only
>> mark the tail pages ...
>
> And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> disabled struct pages are initialized regardless of
> memblock_reserved_mark_noinit().
>
> I think this patch should go in before your updates:
Shouldn't we fix this in memblock code?
Hacking around that in the memblock_reserved_mark_noinit() user sounds
wrong -- and nothing in the doc of memblock_reserved_mark_noinit()
spells that behavior out.
--
Cheers
David / dhildenb
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-25 12:48 ` David Hildenbrand
@ 2025-08-25 14:32 ` Mike Rapoport
2025-08-25 14:38 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Mike Rapoport @ 2025-08-25 14:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
> On 23.08.25 10:59, Mike Rapoport wrote:
> > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
> > > On 22.08.25 06:09, Mika Penttilä wrote:
> > > >
> > > > On 8/21/25 23:06, David Hildenbrand wrote:
> > > >
> > > > > All pages were already initialized and set to PageReserved() with a
> > > > > refcount of 1 by MM init code.
> > > >
> > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> > > > initialize struct pages?
> > >
> > > Excellent point, I did not know about that one.
> > >
> > > Spotting that we don't do the same for the head page made me assume that
> > > it's just a misuse of __init_single_page().
> > >
> > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only
> > > mark the tail pages ...
> >
> > And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> > disabled struct pages are initialized regardless of
> > memblock_reserved_mark_noinit().
> >
> > I think this patch should go in before your updates:
>
> Shouldn't we fix this in memblock code?
>
> Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
> -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
> behavior out.
We can surely update the docs, but unfortunately I don't see how to avoid
hacking around it in hugetlb.
Since it's used to optimise HVO even further to the point hugetlb open
codes memmap initialization, I think it's fair that it should deal with all
possible configurations.
> --
> Cheers
>
> David / dhildenb
>
>
--
Sincerely yours,
Mike.
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-25 14:32 ` Mike Rapoport
@ 2025-08-25 14:38 ` David Hildenbrand
2025-08-25 14:59 ` Mike Rapoport
0 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-25 14:38 UTC (permalink / raw)
To: Mike Rapoport
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On 25.08.25 16:32, Mike Rapoport wrote:
> On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
>> On 23.08.25 10:59, Mike Rapoport wrote:
>>> On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
>>>> On 22.08.25 06:09, Mika Penttilä wrote:
>>>>>
>>>>> On 8/21/25 23:06, David Hildenbrand wrote:
>>>>>
>>>>>> All pages were already initialized and set to PageReserved() with a
>>>>>> refcount of 1 by MM init code.
>>>>>
>>>>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
>>>>> initialize struct pages?
>>>>
>>>> Excellent point, I did not know about that one.
>>>>
>>>> Spotting that we don't do the same for the head page made me assume that
>>>> it's just a misuse of __init_single_page().
>>>>
>>>> But the nasty thing is that we use memblock_reserved_mark_noinit() to only
>>>> mark the tail pages ...
>>>
>>> And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
>>> disabled struct pages are initialized regardless of
>>> memblock_reserved_mark_noinit().
>>>
>>> I think this patch should go in before your updates:
>>
>> Shouldn't we fix this in memblock code?
>>
>> Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
>> -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
>> behavior out.
>
> We can surely update the docs, but unfortunately I don't see how to avoid
> hacking around it in hugetlb.
> Since it's used to optimise HVO even further to the point hugetlb open
> codes memmap initialization, I think it's fair that it should deal with all
> possible configurations.
Remind me, why can't we support memblock_reserved_mark_noinit() when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled?
--
Cheers
David / dhildenb
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-25 14:38 ` David Hildenbrand
@ 2025-08-25 14:59 ` Mike Rapoport
2025-08-25 15:42 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Mike Rapoport @ 2025-08-25 14:59 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote:
> On 25.08.25 16:32, Mike Rapoport wrote:
> > On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
> > > On 23.08.25 10:59, Mike Rapoport wrote:
> > > > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
> > > > > On 22.08.25 06:09, Mika Penttilä wrote:
> > > > > >
> > > > > > On 8/21/25 23:06, David Hildenbrand wrote:
> > > > > >
> > > > > > > All pages were already initialized and set to PageReserved() with a
> > > > > > > refcount of 1 by MM init code.
> > > > > >
> > > > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> > > > > > initialize struct pages?
> > > > >
> > > > > Excellent point, I did not know about that one.
> > > > >
> > > > > Spotting that we don't do the same for the head page made me assume that
> > > > > it's just a misuse of __init_single_page().
> > > > >
> > > > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only
> > > > > mark the tail pages ...
> > > >
> > > > And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> > > > disabled struct pages are initialized regardless of
> > > > memblock_reserved_mark_noinit().
> > > >
> > > > I think this patch should go in before your updates:
> > >
> > > Shouldn't we fix this in memblock code?
> > >
> > > Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
> > > -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
> > > behavior out.
> >
> > We can surely update the docs, but unfortunately I don't see how to avoid
> > hacking around it in hugetlb.
> > Since it's used to optimise HVO even further to the point hugetlb open
> > codes memmap initialization, I think it's fair that it should deal with all
> > possible configurations.
>
> Remind me, why can't we support memblock_reserved_mark_noinit() when
> CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled?
When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled we initialize the entire
memmap early (setup_arch()->free_area_init()), and we may have a bunch of
memblock_reserved_mark_noinit() afterwards
> --
> Cheers
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-25 14:59 ` Mike Rapoport
@ 2025-08-25 15:42 ` David Hildenbrand
2025-08-25 16:17 ` Mike Rapoport
0 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-25 15:42 UTC (permalink / raw)
To: Mike Rapoport
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On 25.08.25 16:59, Mike Rapoport wrote:
> On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote:
>> On 25.08.25 16:32, Mike Rapoport wrote:
>>> On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
>>>> On 23.08.25 10:59, Mike Rapoport wrote:
>>>>> On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
>>>>>> On 22.08.25 06:09, Mika Penttilä wrote:
>>>>>>>
>>>>>>> On 8/21/25 23:06, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> All pages were already initialized and set to PageReserved() with a
>>>>>>>> refcount of 1 by MM init code.
>>>>>>>
>>>>>>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
>>>>>>> initialize struct pages?
>>>>>>
>>>>>> Excellent point, I did not know about that one.
>>>>>>
>>>>>> Spotting that we don't do the same for the head page made me assume that
>>>>>> it's just a misuse of __init_single_page().
>>>>>>
>>>>>> But the nasty thing is that we use memblock_reserved_mark_noinit() to only
>>>>>> mark the tail pages ...
>>>>>
>>>>> And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
>>>>> disabled struct pages are initialized regardless of
>>>>> memblock_reserved_mark_noinit().
>>>>>
>>>>> I think this patch should go in before your updates:
>>>>
>>>> Shouldn't we fix this in memblock code?
>>>>
>>>> Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
>>>> -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
>>>> behavior out.
>>>
>>> We can surely update the docs, but unfortunately I don't see how to avoid
>>> hacking around it in hugetlb.
>>> Since it's used to optimise HVO even further to the point hugetlb open
>>> codes memmap initialization, I think it's fair that it should deal with all
>>> possible configurations.
>>
>> Remind me, why can't we support memblock_reserved_mark_noinit() when
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled?
>
> When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled we initialize the entire
> memmap early (setup_arch()->free_area_init()), and we may have a bunch of
> memblock_reserved_mark_noinit() afterwards
Oh, you mean that we get effective memblock modifications after already
initializing the memmap.
That sounds ... interesting :)
So yeah, we have to document this for memblock_reserved_mark_noinit().
Is it also a problem for kexec_handover?
We should do something like:
diff --git a/mm/memblock.c b/mm/memblock.c
index 154f1d73b61f2..ed4c563d72c32 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
/**
* memblock_reserved_mark_noinit - Mark a reserved memory region with flag
- * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
- * for this region.
+ * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding
+ * to this region not getting initialized, because the caller will take
+ * care of it.
* @base: the base phys addr of the region
* @size: the size of the region
*
- * struct pages will not be initialized for reserved memory regions marked with
- * %MEMBLOCK_RSRV_NOINIT.
+ * "struct pages" will not be initialized for reserved memory regions marked
+ * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization
+ * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely
+ * that this function is not effective.
*
* Return: 0 on success, -errno on failure.
*/
Optimizing the hugetlb code could be done, but I am not sure how high
the priority is (nobody complained so far about the double init).
--
Cheers
David / dhildenb
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-25 15:42 ` David Hildenbrand
@ 2025-08-25 16:17 ` Mike Rapoport
2025-08-25 16:23 ` David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Mike Rapoport @ 2025-08-25 16:17 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On Mon, Aug 25, 2025 at 05:42:33PM +0200, David Hildenbrand wrote:
> On 25.08.25 16:59, Mike Rapoport wrote:
> > On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote:
> > > On 25.08.25 16:32, Mike Rapoport wrote:
> > > > On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
> > > > > On 23.08.25 10:59, Mike Rapoport wrote:
> > > > > > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
> > > > > > > On 22.08.25 06:09, Mika Penttilä wrote:
> > > > > > > >
> > > > > > > > On 8/21/25 23:06, David Hildenbrand wrote:
> > > > > > > >
> > > > > > > > > All pages were already initialized and set to PageReserved() with a
> > > > > > > > > refcount of 1 by MM init code.
> > > > > > > >
> > > > > > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> > > > > > > > initialize struct pages?
> > > > > > >
> > > > > > > Excellent point, I did not know about that one.
> > > > > > >
> > > > > > > Spotting that we don't do the same for the head page made me assume that
> > > > > > > it's just a misuse of __init_single_page().
> > > > > > >
> > > > > > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only
> > > > > > > mark the tail pages ...
> > > > > >
> > > > > > And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> > > > > > disabled struct pages are initialized regardless of
> > > > > > memblock_reserved_mark_noinit().
> > > > > >
> > > > > > I think this patch should go in before your updates:
> > > > >
> > > > > Shouldn't we fix this in memblock code?
> > > > >
> > > > > Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
> > > > > -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
> > > > > behavior out.
> > > >
> > > > We can surely update the docs, but unfortunately I don't see how to avoid
> > > > hacking around it in hugetlb.
> > > > Since it's used to optimise HVO even further to the point hugetlb open
> > > > codes memmap initialization, I think it's fair that it should deal with all
> > > > possible configurations.
> > >
> > > Remind me, why can't we support memblock_reserved_mark_noinit() when
> > > CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled?
> >
> > When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled we initialize the entire
> > memmap early (setup_arch()->free_area_init()), and we may have a bunch of
> > memblock_reserved_mark_noinit() afterwards
>
> Oh, you mean that we get effective memblock modifications after already
> initializing the memmap.
>
> That sounds ... interesting :)
It's the memmap, not the free lists. Without deferred init, memblock is active
for a while after the memmap is initialized and before the memory goes to the
free lists.
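
To make the ordering concrete, this is roughly the sequence being described
(a simplified sketch, not taken from any particular architecture; exact call
chains vary):

/*
 * Boot ordering without CONFIG_DEFERRED_STRUCT_PAGE_INIT (rough sketch):
 *
 *   setup_arch()
 *     -> ... -> free_area_init()      // entire memmap initialized here
 *
 *   // memblock is still active after this point: reservations and
 *   // memblock_reserved_mark_noinit() calls can still happen, but the
 *   // struct pages were already initialized above, so the flag has no
 *   // effect on their initialization
 *
 *   mm_core_init()
 *     -> memblock_free_all()          // memory released to the buddy
 */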
> So yeah, we have to document this for memblock_reserved_mark_noinit().
>
> Is it also a problem for kexec_handover?
With KHO it's also interesting, but it does not support deferred struct
page init for now :)
> We should do something like:
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f2..ed4c563d72c32 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
> /**
> * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
> - * for this region.
> + * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding
> + * to this region not getting initialized, because the caller will take
> + * care of it.
> * @base: the base phys addr of the region
> * @size: the size of the region
> *
> - * struct pages will not be initialized for reserved memory regions marked with
> - * %MEMBLOCK_RSRV_NOINIT.
> + * "struct pages" will not be initialized for reserved memory regions marked
> + * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization
> + * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely
> + * that this function is not effective.
> *
> * Return: 0 on success, -errno on failure.
> */
I have a different version :)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b96746376e17..d20d091c6343 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn;
* via a driver, and never indicated in the firmware-provided memory map as
* system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
* kernel resource tree.
- * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
- * not initialized (only for reserved regions).
+ * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages don't have
+ * PG_Reserved set and are completely not initialized when
+ * %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled (only for reserved regions).
* @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
* either explictitly with memblock_reserve_kern() or via memblock
* allocation APIs. All memblock allocations set this flag.
diff --git a/mm/memblock.c b/mm/memblock.c
index 154f1d73b61f..02de5ffb085b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1091,13 +1091,15 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
/**
* memblock_reserved_mark_noinit - Mark a reserved memory region with flag
- * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
- * for this region.
+ * MEMBLOCK_RSRV_NOINIT
+ *
* @base: the base phys addr of the region
* @size: the size of the region
*
- * struct pages will not be initialized for reserved memory regions marked with
- * %MEMBLOCK_RSRV_NOINIT.
+ * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will
+ * not have %PG_Reserved flag set.
+ * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flags also
+ * completly bypasses the initialization of struct pages for this region.
*
* Return: 0 on success, -errno on failure.
*/
> Optimizing the hugetlb code could be done, but I am not sure how high
> the priority is (nobody complained so far about the double init).
>
> --
> Cheers
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
2025-08-25 16:17 ` Mike Rapoport
@ 2025-08-25 16:23 ` David Hildenbrand
2025-08-25 16:58 ` update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()) Mike Rapoport
0 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-25 16:23 UTC (permalink / raw)
To: Mike Rapoport
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
>
>> We should do something like:
>>
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 154f1d73b61f2..ed4c563d72c32 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>> /**
>> * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
>> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
>> - * for this region.
>> + * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding
>> + * to this region not getting initialized, because the caller will take
>> + * care of it.
>> * @base: the base phys addr of the region
>> * @size: the size of the region
>> *
>> - * struct pages will not be initialized for reserved memory regions marked with
>> - * %MEMBLOCK_RSRV_NOINIT.
>> + * "struct pages" will not be initialized for reserved memory regions marked
>> + * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization
>> + * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely
>> + * that this function is not effective.
>> *
>> * Return: 0 on success, -errno on failure.
>> */
>
> I have a different version :)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index b96746376e17..d20d091c6343 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn;
> * via a driver, and never indicated in the firmware-provided memory map as
> * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
> * kernel resource tree.
> - * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
> - * not initialized (only for reserved regions).
> + * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages don't have
> + * PG_Reserved set and are completely not initialized when
> + * %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled (only for reserved regions).
> * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
> * either explictitly with memblock_reserve_kern() or via memblock
> * allocation APIs. All memblock allocations set this flag.
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f..02de5ffb085b 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1091,13 +1091,15 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>
> /**
> * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
> - * for this region.
> + * MEMBLOCK_RSRV_NOINIT
> + *
> * @base: the base phys addr of the region
> * @size: the size of the region
> *
> - * struct pages will not be initialized for reserved memory regions marked with
> - * %MEMBLOCK_RSRV_NOINIT.
> + * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will
> + * not have %PG_Reserved flag set.
> + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flags also
> + * completly bypasses the initialization of struct pages for this region.
s/completly/completely/
I don't quite understand the interaction with PG_Reserved and why
anybody using this function should care.
So maybe you can rephrase in a way that is easier to digest, and rather
focuses on what callers of this function are supposed to do vs. have the
liberty of not doing?
--
Cheers
David / dhildenb
* update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap())
2025-08-25 16:23 ` David Hildenbrand
@ 2025-08-25 16:58 ` Mike Rapoport
2025-08-25 18:32 ` update kernel-doc for MEMBLOCK_RSRV_NOINIT David Hildenbrand
0 siblings, 1 reply; 61+ messages in thread
From: Mike Rapoport @ 2025-08-25 16:58 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On Mon, Aug 25, 2025 at 06:23:48PM +0200, David Hildenbrand wrote:
>
> I don't quite understand the interaction with PG_Reserved and why anybody
> using this function should care.
>
> So maybe you can rephrase in a way that is easier to digest, and rather
> focuses on what callers of this function are supposed to do vs. have the
> liberty of not doing?
How about
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b96746376e17..fcda8481de9a 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn;
* via a driver, and never indicated in the firmware-provided memory map as
* system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
* kernel resource tree.
- * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
- * not initialized (only for reserved regions).
+ * @MEMBLOCK_RSRV_NOINIT: reserved memory region for which struct pages are not
+ * fully initialized. Users of this flag are responsible to properly initialize
+ * struct pages of this region
* @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
* either explictitly with memblock_reserve_kern() or via memblock
* allocation APIs. All memblock allocations set this flag.
diff --git a/mm/memblock.c b/mm/memblock.c
index 154f1d73b61f..46b411fb3630 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1091,13 +1091,20 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
/**
* memblock_reserved_mark_noinit - Mark a reserved memory region with flag
- * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
- * for this region.
+ * MEMBLOCK_RSRV_NOINIT
+ *
* @base: the base phys addr of the region
* @size: the size of the region
*
- * struct pages will not be initialized for reserved memory regions marked with
- * %MEMBLOCK_RSRV_NOINIT.
+ * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will
+ * not be fully initialized to allow the caller optimize their initialization.
+ *
+ * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flag
+ * completely bypasses the initialization of struct pages for such region.
+ *
+ * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, struct pages in this
+ * region will be initialized with default values but won't be marked as
+ * reserved.
*
* Return: 0 on success, -errno on failure.
*/
> --
> Cheers
>
> David / dhildenb
>
--
Sincerely yours,
Mike.
* Re: update kernel-doc for MEMBLOCK_RSRV_NOINIT
2025-08-25 16:58 ` update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()) Mike Rapoport
@ 2025-08-25 18:32 ` David Hildenbrand
0 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-25 18:32 UTC (permalink / raw)
To: Mike Rapoport
Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
Vlastimil Babka, wireguard, x86, Zi Yan
On 25.08.25 18:58, Mike Rapoport wrote:
> On Mon, Aug 25, 2025 at 06:23:48PM +0200, David Hildenbrand wrote:
>>
>> I don't quite understand the interaction with PG_Reserved and why anybody
>> using this function should care.
>>
>> So maybe you can rephrase in a way that is easier to digest, and rather
>> focuses on what callers of this function are supposed to do vs. have the
>> liberty of not doing?
>
> How about
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index b96746376e17..fcda8481de9a 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn;
> * via a driver, and never indicated in the firmware-provided memory map as
> * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
> * kernel resource tree.
> - * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
> - * not initialized (only for reserved regions).
> + * @MEMBLOCK_RSRV_NOINIT: reserved memory region for which struct pages are not
> + * fully initialized. Users of this flag are responsible to properly initialize
> + * struct pages of this region
> * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
> * either explictitly with memblock_reserve_kern() or via memblock
> * allocation APIs. All memblock allocations set this flag.
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f..46b411fb3630 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1091,13 +1091,20 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>
> /**
> * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
> - * for this region.
> + * MEMBLOCK_RSRV_NOINIT
> + *
> * @base: the base phys addr of the region
> * @size: the size of the region
> *
> - * struct pages will not be initialized for reserved memory regions marked with
> - * %MEMBLOCK_RSRV_NOINIT.
> + * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will
> + * not be fully initialized to allow the caller optimize their initialization.
> + *
> + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flag
> + * completely bypasses the initialization of struct pages for such region.
> + *
> + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, struct pages in this
> + * region will be initialized with default values but won't be marked as
> + * reserved.
Sounds good.
I am surprised regarding "reserved", but I guess that's because we don't
end up calling "reserve_bootmem_region()" on these regions in
memmap_init_reserved_pages().
--
Cheers
David / dhildenb
* Re: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
[not found] ` <aK2QZnzS1ErHK5tP@raptor>
@ 2025-08-26 11:04 ` David Hildenbrand
[not found] ` <aK2wlGYvCaFQXzBm@raptor>
0 siblings, 1 reply; 61+ messages in thread
From: David Hildenbrand @ 2025-08-26 11:04 UTC (permalink / raw)
To: Alexandru Elisei
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
>>
>> pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n",
>> - __func__, pfn, pfn_to_page(pfn));
>> + __func__, pfn, page);
>>
>> trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn),
>
> Nitpick: I think you already have the page here.
Indeed, forgot to clean that up as well.
>
>> count, align);
>> - /* try again with a bit different memory target */
>> - start = bitmap_no + mask + 1;
>> }
>> out:
>> - *pagep = page;
>> + if (!ret)
>> + *pagep = page;
>> return ret;
>> }
>>
>> @@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count,
>> */
>> if (page) {
>> for (i = 0; i < count; i++)
>> - page_kasan_tag_reset(nth_page(page, i));
>> + page_kasan_tag_reset(page + i);
>
> Had a look at it, not very familiar with CMA, but the changes look equivalent to
> what was before. Not sure that's worth a Reviewed-by tag, but here it in case
> you want to add it:
>
> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Thanks!
>
> Just so I can better understand the problem being fixed, I guess you can have
> two consecutive pfns with non-consecutive associated struct page if you have two
> adjacent memory sections spanning the same physical memory region, is that
> correct?
Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not
guaranteed that
pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1
when we cross memory section boundaries.
It can be the case for early boot memory if we allocated consecutive
areas from memblock when allocating the memmap (struct pages) per memory
section, but it's not guaranteed.
So we rule out that case.
--
Cheers
David / dhildenb
* Re: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
[not found] ` <aK2wlGYvCaFQXzBm@raptor>
@ 2025-08-26 13:08 ` David Hildenbrand
0 siblings, 0 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-26 13:08 UTC (permalink / raw)
To: Alexandru Elisei
Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
netdev, Oscar Salvador, Peter Xu, Robin Murphy,
Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
wireguard, x86, Zi Yan
On 26.08.25 15:03, Alexandru Elisei wrote:
> Hi David,
>
> On Tue, Aug 26, 2025 at 01:04:33PM +0200, David Hildenbrand wrote:
> ..
>>> Just so I can better understand the problem being fixed, I guess you can have
>>> two consecutive pfns with non-consecutive associated struct page if you have two
>>> adjacent memory sections spanning the same physical memory region, is that
>>> correct?
>>
>> Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not
>> guaranteed that
>>
>> pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1
>>
>> when we cross memory section boundaries.
>>
>> It can be the case for early boot memory if we allocated consecutive areas
>> from memblock when allocating the memmap (struct pages) per memory section,
>> but it's not guaranteed.
>
> Thank you for the explanation, but I'm a bit confused by the last paragraph. I
> think what you're saying is that we can also have the reverse problem, where
> consecutive struct page * represent non-consecutive pfns, because memmap
> allocations happened to return consecutive virtual addresses, is that right?
Exactly, that's something we have to deal with elsewhere [1]. For this
code, it's not a problem because we always allocate a contiguous PFN range.
>
> If that's correct, I don't think that's the case for CMA, which deals out
> contiguous physical memory. Or were you just trying to explain the other side of
> the problem, and I'm just overthinking it?
The latter :)
[1] https://lkml.kernel.org/r/20250814064714.56485-2-lizhe.67@bytedance.com
--
Cheers
David / dhildenb
Thread overview: 61+ messages
[not found] <20250821200701.1329277-1-david@redhat.com>
2025-08-21 20:06 ` [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() David Hildenbrand
2025-08-21 20:23 ` Zi Yan
2025-08-22 17:07 ` SeongJae Park
2025-08-21 20:06 ` [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() David Hildenbrand
2025-08-22 17:09 ` SeongJae Park
2025-08-21 20:06 ` [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() David Hildenbrand
2025-08-22 15:27 ` Mike Rapoport
2025-08-22 18:09 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() David Hildenbrand
2025-08-22 4:09 ` Mika Penttilä
2025-08-22 6:24 ` David Hildenbrand
2025-08-23 8:59 ` Mike Rapoport
2025-08-25 12:48 ` David Hildenbrand
2025-08-25 14:32 ` Mike Rapoport
2025-08-25 14:38 ` David Hildenbrand
2025-08-25 14:59 ` Mike Rapoport
2025-08-25 15:42 ` David Hildenbrand
2025-08-25 16:17 ` Mike Rapoport
2025-08-25 16:23 ` David Hildenbrand
2025-08-25 16:58 ` update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()) Mike Rapoport
2025-08-25 18:32 ` update kernel-doc for MEMBLOCK_RSRV_NOINIT David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() David Hildenbrand
2025-08-21 20:36 ` Zi Yan
2025-08-21 20:06 ` [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs David Hildenbrand
2025-08-21 20:46 ` Zi Yan
2025-08-21 20:49 ` David Hildenbrand
2025-08-21 20:50 ` Zi Yan
2025-08-24 13:24 ` Mike Rapoport
2025-08-21 20:06 ` [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() David Hildenbrand
2025-08-21 20:55 ` Zi Yan
2025-08-21 21:00 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 14/35] mm/mm/percpu-km: drop nth_page() usage within single allocation David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage David Hildenbrand
[not found] ` <b5b08ad3-d8cd-45ff-9767-7cf1b22b5e03@gmail.com>
2025-08-22 13:59 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges David Hildenbrand
[not found] ` <aK2QZnzS1ErHK5tP@raptor>
2025-08-26 11:04 ` David Hildenbrand
[not found] ` <aK2wlGYvCaFQXzBm@raptor>
2025-08-26 13:08 ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 23/35] scatterlist: disallow non-contigous page ranges in a single SG entry David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 33/35] kfence: drop nth_page() usage David Hildenbrand
2025-08-21 20:32 ` David Hildenbrand
2025-08-21 21:45 ` David Hildenbrand
2025-08-21 20:07 ` [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page() David Hildenbrand
2025-08-21 20:07 ` [PATCH RFC 35/35] mm: remove nth_page() David Hildenbrand
[not found] ` <20250821200701.1329277-2-david@redhat.com>
2025-08-21 20:20 ` [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable Zi Yan
[not found] ` <20250821200701.1329277-32-david@redhat.com>
2025-08-21 20:24 ` [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry Linus Torvalds
2025-08-21 20:29 ` David Hildenbrand
2025-08-21 20:36 ` Linus Torvalds
2025-08-21 20:37 ` David Hildenbrand
2025-08-21 20:40 ` Linus Torvalds
[not found] ` <20250821200701.1329277-25-david@redhat.com>
2025-08-22 1:59 ` [PATCH RFC 24/35] ata: libata-eh: drop " Damien Le Moal
2025-08-22 6:18 ` David Hildenbrand
[not found] ` <20250821200701.1329277-3-david@redhat.com>
2025-08-22 15:10 ` [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" Mike Rapoport
[not found] ` <20250821200701.1329277-6-david@redhat.com>
2025-08-22 15:13 ` [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config Mike Rapoport