* [PATCH RFC 00/35] mm: remove nth_page()
@ 2025-08-21 20:06 David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable David Hildenbrand
                   ` (36 more replies)
  0 siblings, 37 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Andrew Morton, Linus Torvalds, Jason Gunthorpe,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jens Axboe, Marek Szyprowski,
	Robin Murphy, John Hubbard, Peter Xu, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, Brendan Jackman, Johannes Weiner,
	Zi Yan, Dennis Zhou, Tejun Heo, Christoph Lameter, Muchun Song,
	Oscar Salvador, x86, linux-arm-kernel, linux-mips, linux-s390,
	linux-crypto, linux-ide, intel-gfx, dri-devel, linux-mmc,
	linux-arm-kernel, linux-scsi, kvm, virtualization, linux-mm,
	io-uring, iommu, kasan-dev, wireguard, netdev, linux-kselftest,
	linux-riscv, Albert Ou, Alexander Gordeev, Alexandre Ghiti,
	Alex Dubov, Alex Williamson, Andreas Larsson, Borislav Petkov,
	Brett Creeley, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Damien Le Moal, Dave Hansen, David Airlie,
	David S. Miller, Doug Gilbert, Heiko Carstens, Herbert Xu,
	Huacai Chen, Ingo Molnar, James E.J. Bottomley, Jani Nikula,
	Jason A. Donenfeld, Jason Gunthorpe, Jesper Nilsson,
	Joonas Lahtinen, Kevin Tian, Lars Persson, Madhavan Srinivasan,
	Martin K. Petersen, Maxim Levitsky, Michael Ellerman,
	Nicholas Piggin, Niklas Cassel, Palmer Dabbelt, Paul Walmsley,
	Rodrigo Vivi, Shameer Kolothum, Shuah Khan, Simona Vetter,
	Sven Schnelle, Thomas Bogendoerfer, Thomas Gleixner,
	Tvrtko Ursulin, Ulf Hansson, Vasily Gorbik, WANG Xuerui,
	Will Deacon, Yishai Hadas

This is based on mm-unstable and was cross-compiled heavily.

I should probably have dropped the RFC label already, but I first want
to hear whether I missed some corner case (SG entries?), and I need to
do at least a bit more testing.

I will only CC non-MM folks on the cover letter and the respective patch
to not flood too many inboxes (the lists receive all patches).

---

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
that on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can. For example, on x86-64 a memory section covers
128 MiB (SECTION_SIZE_BITS == 27), so a 1 GiB hugetlb folio spans eight
sections.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to then go from (pfn + n)->page, which is rather nasty.
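
For reference, this is how nth_page() is currently defined in
include/linux/mm.h (also visible in the diff context of patch #13):

#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
#else
#define nth_page(page,n) ((page) + (n))
#endif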

Likely, many people have no idea when nth_page() is required and when
it might be dropped.

We refer to such problematic PFN ranges as "non-contiguous pages".
If we only deal with "contiguous pages", there is no need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page()
within CMA allocations (which can span memory sections), and in one
corner case (kfence) when processing memblock allocations (which can
also span memory sections).
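
Conceptually, the added sanity checks boil down to a contiguity test
along the following lines. This is only a minimal sketch: the helper
name pfn_range_contiguous() is made up here; judging by the diffstat,
the real checks land in mm/util.c and the respective callers.

static inline bool pfn_range_contiguous(unsigned long pfn, unsigned long nr)
{
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
	/* "struct page"s are only guaranteed contiguous within a section. */
	return (pfn >> PFN_SECTION_SHIFT) ==
	       ((pfn + nr - 1) >> PFN_SECTION_SHIFT);
#else
	/* With a vmemmap (or FLATMEM), the whole memmap is contiguous. */
	return true;
#endif
}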

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch #6 -> #12  : disallow folios to have non-contiguous pages
Patch #13 -> #20 : remove nth_page() usage within folios (see the
                   folio_page() example below)
Patch #21        : disallow CMA allocations of non-contiguous pages
Patch #22 -> #31 : sanity-check + remove nth_page() usage within SG entry
Patch #32        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch #33        : remove nth_page() in kfence
Patch #34        : adjust stale comment regarding nth_page
Patch #35        : mm: remove nth_page()
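
The folio_page() example referenced above (quoted from patch #13) shows
the end result: once folios are guaranteed to be contiguous, the
nth_page() indirection becomes plain pointer arithmetic.

Before:

#define folio_page(folio, n)	nth_page(&(folio)->page, n)

After:

static inline struct page *folio_page(struct folio *folio, unsigned long nr)
{
	return &folio->page + nr;
}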

A lot of this is inspired by the discussion at [1] between Linus, Jason
and me, so kudos to them.

[1] https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: x86@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-mips@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-crypto@vger.kernel.org
Cc: linux-ide@vger.kernel.org
Cc: intel-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-arm-kernel@axis.com
Cc: linux-scsi@vger.kernel.org
Cc: kvm@vger.kernel.org
Cc: virtualization@lists.linux.dev
Cc: linux-mm@kvack.org
Cc: io-uring@vger.kernel.org
Cc: iommu@lists.linux.dev
Cc: kasan-dev@googlegroups.com
Cc: wireguard@lists.zx2c4.com
Cc: netdev@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
Cc: linux-riscv@lists.infradead.org

David Hildenbrand (35):
  mm: stop making SPARSEMEM_VMEMMAP user-selectable
  arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu
    kernel config
  mm/page_alloc: reject unreasonable folio/compound page sizes in
    alloc_contig_range_noprof()
  mm/memremap: reject unreasonable folio/compound page sizes in
    memremap_pages()
  mm/hugetlb: check for unreasonable folio sizes when registering hstate
  mm/mm_init: make memmap_init_compound() look more like
    prep_compound_page()
  mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  mm: sanity-check maximum folio size in folio_set_order()
  mm: limit folio/compound page sizes in problematic kernel configs
  mm: simplify folio_page() and folio_page_idx()
  mm/percpu-km: drop nth_page() usage within single allocation
  fs: hugetlbfs: remove nth_page() usage within folio in
    adjust_range_hwpoison()
  mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
  mm/gup: drop nth_page() usage within folio when recording subpages
  io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
  io_uring/zcrx: remove nth_page() usage within folio
  mips: mm: convert __flush_dcache_pages() to
    __flush_dcache_folio_pages()
  mm/cma: refuse handing out non-contiguous page ranges
  dma-remap: drop nth_page() in dma_common_contiguous_remap()
  scatterlist: disallow non-contiguous page ranges in a single SG entry
  ata: libata-eh: drop nth_page() usage within SG entry
  drm/i915/gem: drop nth_page() usage within SG entry
  mspro_block: drop nth_page() usage within SG entry
  memstick: drop nth_page() usage within SG entry
  mmc: drop nth_page() usage within SG entry
  scsi: core: drop nth_page() usage within SG entry
  vfio/pci: drop nth_page() usage within SG entry
  crypto: remove nth_page() usage within SG entry
  mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
  kfence: drop nth_page() usage
  block: update comment of "struct bio_vec" regarding nth_page()
  mm: remove nth_page()

 arch/arm64/Kconfig                            |  1 -
 arch/mips/include/asm/cacheflush.h            | 11 +++--
 arch/mips/mm/cache.c                          |  8 ++--
 arch/s390/Kconfig                             |  1 -
 arch/x86/Kconfig                              |  1 -
 crypto/ahash.c                                |  4 +-
 crypto/scompress.c                            |  8 ++--
 drivers/ata/libata-sff.c                      |  6 +--
 drivers/gpu/drm/i915/gem/i915_gem_pages.c     |  2 +-
 drivers/memstick/core/mspro_block.c           |  3 +-
 drivers/memstick/host/jmb38x_ms.c             |  3 +-
 drivers/memstick/host/tifm_ms.c               |  3 +-
 drivers/mmc/host/tifm_sd.c                    |  4 +-
 drivers/mmc/host/usdhi6rol0.c                 |  4 +-
 drivers/scsi/scsi_lib.c                       |  3 +-
 drivers/scsi/sg.c                             |  3 +-
 drivers/vfio/pci/pds/lm.c                     |  3 +-
 drivers/vfio/pci/virtio/migrate.c             |  3 +-
 fs/hugetlbfs/inode.c                          | 25 ++++------
 include/crypto/scatterwalk.h                  |  4 +-
 include/linux/bvec.h                          |  7 +--
 include/linux/mm.h                            | 48 +++++++++++++++----
 include/linux/page-flags.h                    |  5 +-
 include/linux/scatterlist.h                   |  4 +-
 io_uring/zcrx.c                               | 34 ++++---------
 kernel/dma/remap.c                            |  2 +-
 mm/Kconfig                                    |  3 +-
 mm/cma.c                                      | 36 +++++++++-----
 mm/gup.c                                      | 13 +++--
 mm/hugetlb.c                                  | 23 ++++-----
 mm/internal.h                                 |  1 +
 mm/kfence/core.c                              | 17 ++++---
 mm/memremap.c                                 |  3 ++
 mm/mm_init.c                                  | 13 ++---
 mm/page_alloc.c                               |  5 +-
 mm/pagewalk.c                                 |  2 +-
 mm/percpu-km.c                                |  2 +-
 mm/util.c                                     | 33 +++++++++++++
 tools/testing/scatterlist/linux/mm.h          |  1 -
 .../selftests/wireguard/qemu/kernel.config    |  1 -
 40 files changed, 203 insertions(+), 150 deletions(-)


base-commit: c0e3b3f33ba7b767368de4afabaf7c1ddfdc3872
-- 
2.50.1



* [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:20   ` Zi Yan
                     ` (2 more replies)
  2025-08-21 20:06 ` [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" David Hildenbrand
                   ` (35 subsequent siblings)
  36 siblings, 3 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	David S. Miller, Andreas Larsson, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular on 32bit, SPARSEMEM_VMEMMAP is
considered too costly and is consequently not supported.

However, if an architecture does support SPARSEMEM with
SPARSEMEM_VMEMMAP, let's forbid the user from disabling VMEMMAP, just
like we already do for arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM
without SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be
gone for loongarch, powerpc, riscv and sparc. All these architectures
only enable SPARSEMEM_VMEMMAP with 64bit support, so there should not
really be a big downside to using the vmemmap (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section
(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to
limit the possible impact as much as possible (e.g., gigantic hugetlb
page allocations suddenly failing).

Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/Kconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 4108bcd967848..330d0e698ef96 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -439,9 +439,8 @@ config SPARSEMEM_VMEMMAP_ENABLE
 	bool
 
 config SPARSEMEM_VMEMMAP
-	bool "Sparse Memory virtual memmap"
+	def_bool y
 	depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE
-	default y
 	help
 	  SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise
 	  pfn_to_page and page_to_pfn operations.  This is the most
-- 
2.50.1



* [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22 15:10   ` Mike Rapoport
  2025-08-21 20:06 ` [PATCH RFC 03/35] s390/Kconfig: " David Hildenbrand
                   ` (34 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Catalin Marinas, Will Deacon,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE
is selected.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/arm64/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e9bbfacc35a64..b1d1f2ff2493b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1570,7 +1570,6 @@ source "kernel/Kconfig.hz"
 config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	select SPARSEMEM_VMEMMAP_ENABLE
-	select SPARSEMEM_VMEMMAP
 
 config HW_PERF_EVENTS
 	def_bool y
-- 
2.50.1



* [PATCH RFC 03/35] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22 15:11   ` Mike Rapoport
  2025-08-21 20:06 ` [PATCH RFC 04/35] x86/Kconfig: " David Hildenbrand
                   ` (33 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE
is selected.

Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index bf680c26a33cf..145ca23c2fff6 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -710,7 +710,6 @@ menu "Memory setup"
 config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	select SPARSEMEM_VMEMMAP_ENABLE
-	select SPARSEMEM_VMEMMAP
 
 config ARCH_SPARSEMEM_DEFAULT
 	def_bool y
-- 
2.50.1



* [PATCH RFC 04/35] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 03/35] s390/Kconfig: " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22 15:11   ` Mike Rapoport
  2025-08-21 20:06 ` [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config David Hildenbrand
                   ` (32 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE
is selected.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 58d890fe2100e..e431d1c06fecd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1552,7 +1552,6 @@ config ARCH_SPARSEMEM_ENABLE
 	def_bool y
 	select SPARSEMEM_STATIC if X86_32
 	select SPARSEMEM_VMEMMAP_ENABLE if X86_64
-	select SPARSEMEM_VMEMMAP if X86_64
 
 config ARCH_SPARSEMEM_DEFAULT
 	def_bool X86_64 || (NUMA && X86_32)
-- 
2.50.1



* [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 04/35] x86/Kconfig: " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22 15:13   ` Mike Rapoport
  2025-08-21 20:06 ` [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() David Hildenbrand
                   ` (31 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Jason A. Donenfeld, Shuah Khan,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

It's no longer user-selectable (and the default was already "y"), so
let's just drop it.

Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 tools/testing/selftests/wireguard/qemu/kernel.config | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config
index 0a5381717e9f4..1149289f4b30f 100644
--- a/tools/testing/selftests/wireguard/qemu/kernel.config
+++ b/tools/testing/selftests/wireguard/qemu/kernel.config
@@ -48,7 +48,6 @@ CONFIG_JUMP_LABEL=y
 CONFIG_FUTEX=y
 CONFIG_SHMEM=y
 CONFIG_SLUB=y
-CONFIG_SPARSEMEM_VMEMMAP=y
 CONFIG_SMP=y
 CONFIG_SCHED_SMT=y
 CONFIG_SCHED_MC=y
-- 
2.50.1



* [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (4 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:23   ` Zi Yan
  2025-08-22 17:07   ` SeongJae Park
  2025-08-21 20:06 ` [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() David Hildenbrand
                   ` (30 subsequent siblings)
  36 siblings, 2 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Let's reject them early, which in turn makes folio_alloc_gigantic() reject
them properly.

To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
and calculate MAX_FOLIO_NR_PAGES based on that.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 6 ++++--
 mm/page_alloc.c    | 5 ++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00c8a54127d37..77737cbf2216a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio)
 
 /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-#define MAX_FOLIO_NR_PAGES	(1UL << PUD_ORDER)
+#define MAX_FOLIO_ORDER		PUD_ORDER
 #else
-#define MAX_FOLIO_NR_PAGES	MAX_ORDER_NR_PAGES
+#define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
 #endif
 
+#define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
+
 /*
  * compound_nr() returns the number of pages in this potentially compound
  * page.  compound_nr() can be called on a tail page, and is defined to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca9e6b9633f79..1e6ae4c395b30 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
 int alloc_contig_range_noprof(unsigned long start, unsigned long end,
 			      acr_flags_t alloc_flags, gfp_t gfp_mask)
 {
+	const unsigned int order = ilog2(end - start);
 	unsigned long outer_start, outer_end;
 	int ret = 0;
 
@@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
 					    PB_ISOLATE_MODE_CMA_ALLOC :
 					    PB_ISOLATE_MODE_OTHER;
 
+	if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER))
+		return -EINVAL;
+
 	gfp_mask = current_gfp_context(gfp_mask);
 	if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask))
 		return -EINVAL;
@@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
 			free_contig_range(end, outer_end - end);
 	} else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
 		struct page *head = pfn_to_page(start);
-		int order = ilog2(end - start);
 
 		check_new_pages(head, order);
 		prep_new_page(head, order, gfp_mask, 0);
-- 
2.50.1



* [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (5 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22 17:09   ` SeongJae Park
  2025-08-21 20:06 ` [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate David Hildenbrand
                   ` (29 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Let's reject unreasonable folio sizes early, where we can still fail.
We'll add sanity checks to prep_compound_head()/prep_compound_page()
next.

Is there a way to configure a system such that unreasonable folio sizes
would be possible? That would already be rather questionable.

If so, we'd probably want to bail out even earlier, where we can avoid
a WARN and instead report a proper error message that indicates where
something went wrong.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memremap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/memremap.c b/mm/memremap.c
index b0ce0d8254bd8..a2d4bb88f64b6 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -275,6 +275,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 
 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
 		return ERR_PTR(-EINVAL);
+	if (WARN_ONCE(pgmap->vmemmap_shift > MAX_FOLIO_ORDER,
+		      "requested folio size unsupported\n"))
+		return ERR_PTR(-EINVAL);
 
 	switch (pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
-- 
2.50.1



* [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (6 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() David Hildenbrand
                   ` (28 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Let's check that no hstate corresponding to an unreasonable folio size
is registered by an architecture. If we were to succeed in registering
one, we could later try allocating an unsupported gigantic folio size.

Further, let's add a BUILD_BUG_ON() to check that HUGETLB_PAGE_ORDER is
sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have
to use BUILD_BUG_ON_INVALID() to make it compile.

No existing kernel configuration should be able to trigger this check:
either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured, or
gigantic folios will not exceed a memory section (the case on sh).

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/hugetlb.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 514fab5a20ef8..d12a9d5146af4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
 
 	BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
 			__NR_HPAGEFLAGS);
+	BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
 
 	if (!hugepages_supported()) {
 		if (hugetlb_max_hstate || default_hstate_max_huge_pages)
@@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	}
 	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
 	BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
+	WARN_ON(order > MAX_FOLIO_ORDER);
 	h = &hstates[hugetlb_max_hstate++];
 	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
 	h->order = order;
-- 
2.50.1



* [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (7 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22 15:27   ` Mike Rapoport
  2025-08-21 20:06 ` [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() David Hildenbrand
                   ` (27 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Grepping for "prep_compound_page" leaves one clueless about how devdax
gets its compound pages initialized.

Let's add a comment that might help finding this open-coded
prep_compound_page() initialization more easily.

Further, let's be less smart about the ordering of initialization and
just perform the prep_compound_head() call after all tail pages have
been initialized: just like prep_compound_page() does.

No need for a lengthy comment then: again, just like prep_compound_page().

Note that prep_compound_head() already initializes fields in page[2]
that successive tail page initialization will overwrite:
_deferred_list, and on 32bit _entire_mapcount and _pincount. Very
likely 32bit does not apply, and likely nobody ever ends up testing
whether the _deferred_list is empty.

So it shouldn't need a fix at this point, but it is certainly something
to clean up.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/mm_init.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5c21b3af216b2..708466c5b2cc9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head,
 	unsigned long pfn, end_pfn = head_pfn + nr_pages;
 	unsigned int order = pgmap->vmemmap_shift;
 
+	/*
+	 * This is an open-coded prep_compound_page() whereby we avoid
+	 * walking pages twice by initializing them in the same go.
+	 */
 	__SetPageHead(head);
 	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
@@ -1098,15 +1102,8 @@ static void __ref memmap_init_compound(struct page *head,
 		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
 		prep_compound_tail(head, pfn - head_pfn);
 		set_page_count(page, 0);
-
-		/*
-		 * The first tail page stores important compound page info.
-		 * Call prep_compound_head() after the first tail page has
-		 * been initialized, to not have the data overwritten.
-		 */
-		if (pfn == head_pfn + 1)
-			prep_compound_head(head, order);
 	}
+	prep_compound_head(head, order);
 }
 
 void __ref memmap_init_zone_device(struct zone *zone,
-- 
2.50.1



* [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (8 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22  4:09   ` Mika Penttilä
  2025-08-21 20:06 ` [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() David Hildenbrand
                   ` (26 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

All pages were already initialized and set to PageReserved() with a
refcount of 1 by MM init code.

In fact, by using __init_single_page(), we would be setting the
refcount to 1 just to freeze it immediately afterwards.

So drop the __init_single_page() and use __ClearPageReserved() instead.
Adjust the comments to highlight that we are dealing with an open-coded
prep_compound_page() variant.

Further, as we can now safely iterate over all pages in a folio, let's
avoid the page-pfn dance and just iterate the pages directly.

Note that the current code was likely problematic, but we never ran into
it: prep_compound_tail() would have been called with an offset that might
exceed a memory section, and prep_compound_tail() would have simply
added that offset to the page pointer -- which would not have done the
right thing on sparsemem without vmemmap.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/hugetlb.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d12a9d5146af4..ae82a845b14ad 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3235,17 +3235,14 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 					unsigned long start_page_number,
 					unsigned long end_page_number)
 {
-	enum zone_type zone = zone_idx(folio_zone(folio));
-	int nid = folio_nid(folio);
-	unsigned long head_pfn = folio_pfn(folio);
-	unsigned long pfn, end_pfn = head_pfn + end_page_number;
+	struct page *head_page = folio_page(folio, 0);
+	struct page *page = folio_page(folio, start_page_number);
+	unsigned long i;
 	int ret;
 
-	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
-		struct page *page = pfn_to_page(pfn);
-
-		__init_single_page(page, pfn, zone, nid);
-		prep_compound_tail((struct page *)folio, pfn - head_pfn);
+	for (i = start_page_number; i < end_page_number; i++, page++) {
+		__ClearPageReserved(page);
+		prep_compound_tail(head_page, i);
 		ret = page_ref_freeze(page, 1);
 		VM_BUG_ON(!ret);
 	}
@@ -3257,12 +3254,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
 {
 	int ret;
 
-	/* Prepare folio head */
+	/*
+	 * This is an open-coded prep_compound_page() whereby we avoid
+	 * walking pages twice by preparing+freezing them in the same go.
+	 */
 	__folio_clear_reserved(folio);
 	__folio_set_head(folio);
 	ret = folio_ref_freeze(folio, 1);
 	VM_BUG_ON(!ret);
-	/* Initialize the necessary tail struct pages */
 	hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
 	prep_compound_head((struct page *)folio, huge_page_order(h));
 }
-- 
2.50.1



* [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (9 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:36   ` Zi Yan
  2025-08-21 20:06 ` [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs David Hildenbrand
                   ` (25 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Let's sanity-check in folio_set_order() whether we would be trying to
create a folio with an order that would make it exceed MAX_FOLIO_ORDER.

This will enable the check whenever a folio/compound page is initialized
through prep_compound_head() / prep_compound_page().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/internal.h b/mm/internal.h
index 45b725c3dc030..946ce97036d67 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 {
 	if (WARN_ON_ONCE(!order || !folio_test_large(folio)))
 		return;
+	VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER);
 
 	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
 #ifdef NR_PAGES_IN_LARGE_FOLIO
-- 
2.50.1



* [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (10 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:46   ` Zi Yan
  2025-08-24 13:24   ` Mike Rapoport
  2025-08-21 20:06 ` [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() David Hildenbrand
                   ` (24 subsequent siblings)
  36 siblings, 2 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Let's limit the maximum folio size in problematic kernel configs where
the memmap is allocated per memory section (SPARSEMEM without
SPARSEMEM_VMEMMAP) to a single memory section.

Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE
but not SPARSEMEM_VMEMMAP: sh.

Fortunately, the biggest hugetlb size sh supports is 64 MiB
(HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
(SECTION_SIZE_BITS == 26), so their use case is not degraded.

As folios and memory sections are naturally aligned to their
power-of-two size in memory, a single folio consequently can no longer
span multiple memory sections on these problematic kernel configs.

nth_page() is no longer required when operating within a single compound
page / folio.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77737cbf2216a..48a985e17ef4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
 	return folio_large_nr_pages(folio);
 }
 
-/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-#define MAX_FOLIO_ORDER		PUD_ORDER
-#else
+#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
+/*
+ * We don't expect any folios that exceed buddy sizes (and consequently
+ * memory sections).
+ */
 #define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
+#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/*
+ * Only pages within a single memory section are guaranteed to be
+ * contiguous. By limiting folios to a single memory section, all folio
+ * pages are guaranteed to be contiguous.
+ */
+#define MAX_FOLIO_ORDER		PFN_SECTION_SHIFT
+#else
+/*
+ * There is no real limit on the folio size. We limit them to the maximum we
+ * currently expect.
+ */
+#define MAX_FOLIO_ORDER		PUD_ORDER
 #endif
 
 #define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
-- 
2.50.1



* [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (11 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:55   ` Zi Yan
  2025-08-21 20:06 ` [PATCH RFC 14/35] mm/percpu-km: drop nth_page() usage within single allocation David Hildenbrand
                   ` (23 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Now that a single folio/compound page can no longer span memory sections
in problematic kernel configurations, we can stop using nth_page().

While at it, turn both macros into static inline functions and add
kernel doc for folio_page_idx().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h         | 16 ++++++++++++++--
 include/linux/page-flags.h |  5 ++++-
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 48a985e17ef4e..ef360b72cb05c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
-#define folio_page_idx(folio, p)	(page_to_pfn(p) - folio_pfn(folio))
 #else
 #define nth_page(page,n) ((page) + (n))
-#define folio_page_idx(folio, p)	((p) - &(folio)->page)
 #endif
 
 /* to align the pointer to the (next) page boundary */
@@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes;
 /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
 #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
 
+/**
+ * folio_page_idx - Return the number of a page in a folio.
+ * @folio: The folio.
+ * @page: The folio page.
+ *
+ * This function expects that the page is actually part of the folio.
+ * The returned number is relative to the start of the folio.
+ */
+static inline unsigned long folio_page_idx(const struct folio *folio,
+		const struct page *page)
+{
+	return page - &folio->page;
+}
+
 static inline struct folio *lru_to_folio(struct list_head *head)
 {
 	return list_entry((head)->prev, struct folio, lru);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d53a86e68c89b..080ad10c0defc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page)
  * check that the page number lies within @folio; the caller is presumed
  * to have a reference to the page.
  */
-#define folio_page(folio, n)	nth_page(&(folio)->page, n)
+static inline struct page *folio_page(struct folio *folio, unsigned long nr)
+{
+	return &folio->page + nr;
+}
 
 static __always_inline int PageTail(const struct page *page)
 {
-- 
2.50.1



* [PATCH RFC 14/35] mm/percpu-km: drop nth_page() usage within single allocation
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (12 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() David Hildenbrand
                   ` (22 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

We're allocating a higher-order page from the buddy. For these pages
(that are guaranteed to not exceed a single memory section) there is no
need to use nth_page().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/percpu-km.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index fe31aa19db81a..4efa74a495cb6 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -69,7 +69,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
 	}
 
 	for (i = 0; i < nr_pages; i++)
-		pcpu_set_page_chunk(nth_page(pages, i), chunk);
+		pcpu_set_page_chunk(pages + i, chunk);
 
 	chunk->data = pages;
 	chunk->base_addr = page_address(pages);
-- 
2.50.1



* [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (13 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 14/35] mm/percpu-km: drop nth_page() usage within single allocation David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() David Hildenbrand
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

nth_page() is not really required here anymore, so let's remove it.
While at it, clean up and simplify the code a bit.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/hugetlbfs/inode.c | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 34d496a2b7de6..dc981509a7717 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -198,31 +198,22 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 static size_t adjust_range_hwpoison(struct folio *folio, size_t offset,
 		size_t bytes)
 {
-	struct page *page;
-	size_t n = 0;
-	size_t res = 0;
+	struct page *page = folio_page(folio, offset / PAGE_SIZE);
+	size_t n, safe_bytes;
 
-	/* First page to start the loop. */
-	page = folio_page(folio, offset / PAGE_SIZE);
 	offset %= PAGE_SIZE;
-	while (1) {
+	for (safe_bytes = 0; safe_bytes < bytes; safe_bytes += n) {
+
 		if (is_raw_hwpoison_page_in_hugepage(page))
 			break;
 
 		/* Safe to read n bytes without touching HWPOISON subpage. */
-		n = min(bytes, (size_t)PAGE_SIZE - offset);
-		res += n;
-		bytes -= n;
-		if (!bytes || !n)
-			break;
-		offset += n;
-		if (offset == PAGE_SIZE) {
-			page = nth_page(page, 1);
-			offset = 0;
-		}
+		n = min(bytes - safe_bytes, (size_t)PAGE_SIZE - offset);
+		offset = 0;
+		page++;
 	}
 
-	return res;
+	return safe_bytes;
 }
 
 /*
-- 
2.50.1



* [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (14 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages David Hildenbrand
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

It's no longer required to use nth_page() within a folio, so let's just
drop the nth_page() in folio_walk_start().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/pagewalk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index c6753d370ff4e..9e4225e5fcf5c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -1004,7 +1004,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 found:
 	if (expose_page)
 		/* Note: Offset from the mapped page, not the folio start. */
-		fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT);
+		fw->page = page + ((addr & (entry_size - 1)) >> PAGE_SHIFT);
 	else
 		fw->page = NULL;
 	fw->ptl = ptl;
-- 
2.50.1



* [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (15 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage David Hildenbrand
                   ` (19 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

nth_page() is no longer required when iterating over pages within a
single folio, so let's just drop it when recording subpages.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b2a78f0291273..f017ff6d7d61a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -491,9 +491,9 @@ static int record_subpages(struct page *page, unsigned long sz,
 	struct page *start_page;
 	int nr;
 
-	start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT);
+	start_page = page + ((addr & (sz - 1)) >> PAGE_SHIFT);
 	for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
-		pages[nr] = nth_page(start_page, nr);
+		pages[nr] = start_page + nr;
 
 	return nr;
 }
@@ -1512,7 +1512,7 @@ static long __get_user_pages(struct mm_struct *mm,
 			}
 
 			for (j = 0; j < page_increm; j++) {
-				subpage = nth_page(page, j);
+				subpage = page + j;
 				pages[i + j] = subpage;
 				flush_anon_page(vma, subpage, start + j * PAGE_SIZE);
 				flush_dcache_page(subpage);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (16 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22 11:32   ` Pavel Begunkov
  2025-08-21 20:06 ` [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio David Hildenbrand
                   ` (18 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Jens Axboe, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

We always provide a single dst page; it's unclear why the io_copy_cache
complexity is required.

So let's simplify and get rid of "struct io_copy_cache", simply working on
the single page.

... which immediately allows us to drop one "nth_page" usage, because
it's really just a single page.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 io_uring/zcrx.c | 32 +++++++-------------------------
 1 file changed, 7 insertions(+), 25 deletions(-)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index e5ff49f3425e0..f29b2a4867516 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -954,29 +954,18 @@ static struct net_iov *io_zcrx_alloc_fallback(struct io_zcrx_area *area)
 	return niov;
 }
 
-struct io_copy_cache {
-	struct page		*page;
-	unsigned long		offset;
-	size_t			size;
-};
-
-static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
+static ssize_t io_copy_page(struct page *dst_page, struct page *src_page,
 			    unsigned int src_offset, size_t len)
 {
-	size_t copied = 0;
+	size_t dst_offset = 0;
 
-	len = min(len, cc->size);
+	len = min(len, PAGE_SIZE);
 
 	while (len) {
 		void *src_addr, *dst_addr;
-		struct page *dst_page = cc->page;
-		unsigned dst_offset = cc->offset;
 		size_t n = len;
 
-		if (folio_test_partial_kmap(page_folio(dst_page)) ||
-		    folio_test_partial_kmap(page_folio(src_page))) {
-			dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE);
-			dst_offset = offset_in_page(dst_offset);
+		if (folio_test_partial_kmap(page_folio(src_page))) {
 			src_page = nth_page(src_page, src_offset / PAGE_SIZE);
 			src_offset = offset_in_page(src_offset);
 			n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset);
@@ -991,12 +980,10 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
 		kunmap_local(src_addr);
 		kunmap_local(dst_addr);
 
-		cc->size -= n;
-		cc->offset += n;
+		dst_offset += n;
 		len -= n;
-		copied += n;
 	}
-	return copied;
+	return dst_offset;
 }
 
 static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
@@ -1011,7 +998,6 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 		return -EFAULT;
 
 	while (len) {
-		struct io_copy_cache cc;
 		struct net_iov *niov;
 		size_t n;
 
@@ -1021,11 +1007,7 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 			break;
 		}
 
-		cc.page = io_zcrx_iov_page(niov);
-		cc.offset = 0;
-		cc.size = PAGE_SIZE;
-
-		n = io_copy_page(&cc, src_page, src_offset, len);
+		n = io_copy_page(io_zcrx_iov_page(niov), src_page, src_offset, len);
 
 		if (!io_zcrx_queue_cqe(req, niov, ifq, 0, n)) {
 			io_zcrx_return_niov(niov);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (17 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() David Hildenbrand
                   ` (17 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Jens Axboe, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

Within a folio/compound page, nth_page() is no longer required.
Given that we call folio_test_partial_kmap()+kmap_local_page(), the code
would already be problematic if the src pages were to span multiple folios.

So let's just assume that all src pages belong to a single
folio/compound page and can be iterated ordinarily.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 io_uring/zcrx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index f29b2a4867516..107b2a1b31c1c 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -966,7 +966,7 @@ static ssize_t io_copy_page(struct page *dst_page, struct page *src_page,
 		size_t n = len;
 
 		if (folio_test_partial_kmap(page_folio(src_page))) {
-			src_page = nth_page(src_page, src_offset / PAGE_SIZE);
+			src_page += src_offset / PAGE_SIZE;
 			src_offset = offset_in_page(src_offset);
 			n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset);
 			n = min(n, len);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (18 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges David Hildenbrand
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Thomas Bogendoerfer, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Let's make it clearer that we are operating within a single folio by
providing both the folio and the page.

This means that flush_dcache_folio() now avoids one page->folio lookup,
and that we can safely drop the nth_page() usage.
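
A minimal usage sketch of the new helper, mirroring the two callers
below (any page of "folio" may be passed, with a matching count):

	/* flush all pages of a folio */
	__flush_dcache_folio_pages(folio, folio_page(folio, 0),
				   folio_nr_pages(folio));

	/* flush a single page */
	__flush_dcache_folio_pages(page_folio(page), page, 1);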

Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/mips/include/asm/cacheflush.h | 11 +++++++----
 arch/mips/mm/cache.c               |  8 ++++----
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h
index 1f14132b3fc98..8a2de28936e07 100644
--- a/arch/mips/include/asm/cacheflush.h
+++ b/arch/mips/include/asm/cacheflush.h
@@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm);
 extern void (*flush_cache_range)(struct vm_area_struct *vma,
 	unsigned long start, unsigned long end);
 extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn);
-extern void __flush_dcache_pages(struct page *page, unsigned int nr);
+extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr);
 
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 static inline void flush_dcache_folio(struct folio *folio)
 {
 	if (cpu_has_dc_aliases)
-		__flush_dcache_pages(&folio->page, folio_nr_pages(folio));
+		__flush_dcache_folio_pages(folio, folio_page(folio, 0),
+					   folio_nr_pages(folio));
 	else if (!cpu_has_ic_fills_f_dc)
 		folio_set_dcache_dirty(folio);
 }
@@ -64,10 +65,12 @@ static inline void flush_dcache_folio(struct folio *folio)
 
 static inline void flush_dcache_page(struct page *page)
 {
+	struct folio *folio = page_folio(page);
+
 	if (cpu_has_dc_aliases)
-		__flush_dcache_pages(page, 1);
+		__flush_dcache_folio_pages(folio, page, 1);
 	else if (!cpu_has_ic_fills_f_dc)
-		folio_set_dcache_dirty(page_folio(page));
+		folio_set_dcache_dirty(folio);
 }
 
 #define flush_dcache_mmap_lock(mapping)		do { } while (0)
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index bf9a37c60e9f0..e3b4224c9a406 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -99,9 +99,9 @@ SYSCALL_DEFINE3(cacheflush, unsigned long, addr, unsigned long, bytes,
 	return 0;
 }
 
-void __flush_dcache_pages(struct page *page, unsigned int nr)
+void __flush_dcache_folio_pages(struct folio *folio, struct page *page,
+		unsigned int nr)
 {
-	struct folio *folio = page_folio(page);
 	struct address_space *mapping = folio_flush_mapping(folio);
 	unsigned long addr;
 	unsigned int i;
@@ -117,12 +117,12 @@ void __flush_dcache_pages(struct page *page, unsigned int nr)
 	 * get faulted into the tlb (and thus flushed) anyways.
 	 */
 	for (i = 0; i < nr; i++) {
-		addr = (unsigned long)kmap_local_page(nth_page(page, i));
+		addr = (unsigned long)kmap_local_page(page + i);
 		flush_data_cache_page(addr);
 		kunmap_local((void *)addr);
 	}
 }
-EXPORT_SYMBOL(__flush_dcache_pages);
+EXPORT_SYMBOL(__flush_dcache_folio_pages);
 
 void __flush_anon_page(struct page *page, unsigned long vmaddr)
 {
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (19 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-26 10:45   ` Alexandru Elisei
  2025-08-21 20:06 ` [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() David Hildenbrand
                   ` (15 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Let's disallow handing out PFN ranges with non-contiguous pages, so we
can remove the nth_page() usage in __cma_alloc(), and so callers don't
have to worry about this corner case when blindly iterating pages.

This is really only a problem in configs with SPARSEMEM but without
SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some
cases.

Will this cause harm? Probably not, because it's mostly 32-bit systems
that do not support SPARSEMEM_VMEMMAP. If this ever becomes a problem,
we could look into allocating the memmap for the memory sections spanned
by a single CMA region in one go from memblock.
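
With that in place, callers can iterate a CMA allocation naively; a
minimal sketch against the existing cma_alloc() interface:

	struct page *page = cma_alloc(cma, count, align, false);
	unsigned long i;

	if (page) {
		/* "page + i" is guaranteed to be valid for the whole range */
		for (i = 0; i < count; i++)
			clear_highpage(page + i);
	}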

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  6 ++++++
 mm/cma.c           | 36 +++++++++++++++++++++++-------------
 mm/util.c          | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360b72cb05c..f59ad1f9fc792 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+bool page_range_contiguous(const struct page *page, unsigned long nr_pages);
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 #else
 #define nth_page(page,n) ((page) + (n))
+static inline bool page_range_contiguous(const struct page *page,
+		unsigned long nr_pages)
+{
+	return true;
+}
 #endif
 
 /* to align the pointer to the (next) page boundary */
diff --git a/mm/cma.c b/mm/cma.c
index 2ffa4befb99ab..1119fa2830008 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
 				unsigned long count, unsigned int align,
 				struct page **pagep, gfp_t gfp)
 {
-	unsigned long mask, offset;
-	unsigned long pfn = -1;
-	unsigned long start = 0;
 	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
+	unsigned long start, pfn, mask, offset;
 	int ret = -EBUSY;
 	struct page *page = NULL;
 
@@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
 	if (bitmap_count > bitmap_maxno)
 		goto out;
 
-	for (;;) {
+	for (start = 0; ; start = bitmap_no + mask + 1) {
 		spin_lock_irq(&cma->lock);
 		/*
 		 * If the request is larger than the available number
@@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
 			spin_unlock_irq(&cma->lock);
 			break;
 		}
+
+		pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
+		page = pfn_to_page(pfn);
+
+		/*
+		 * Do not hand out page ranges that are not contiguous, so
+		 * callers can just iterate the pages without having to worry
+		 * about these corner cases.
+		 */
+		if (!page_range_contiguous(page, count)) {
+			spin_unlock_irq(&cma->lock);
+			pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]\n",
+					    __func__, cma->name, pfn, pfn + count - 1);
+			continue;
+		}
+
 		bitmap_set(cmr->bitmap, bitmap_no, bitmap_count);
 		cma->available_count -= count;
 		/*
@@ -821,29 +835,25 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
 		 */
 		spin_unlock_irq(&cma->lock);
 
-		pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
 		mutex_lock(&cma->alloc_mutex);
 		ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp);
 		mutex_unlock(&cma->alloc_mutex);
-		if (ret == 0) {
-			page = pfn_to_page(pfn);
+		if (!ret)
 			break;
-		}
 
 		cma_clear_bitmap(cma, cmr, pfn, count);
 		if (ret != -EBUSY)
 			break;
 
 		pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n",
-			 __func__, pfn, pfn_to_page(pfn));
+			 __func__, pfn, page);
 
 		trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn),
 					   count, align);
-		/* try again with a bit different memory target */
-		start = bitmap_no + mask + 1;
 	}
 out:
-	*pagep = page;
+	if (!ret)
+		*pagep = page;
 	return ret;
 }
 
@@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count,
 	 */
 	if (page) {
 		for (i = 0; i < count; i++)
-			page_kasan_tag_reset(nth_page(page, i));
+			page_kasan_tag_reset(page + i);
 	}
 
 	if (ret && !(gfp & __GFP_NOWARN)) {
diff --git a/mm/util.c b/mm/util.c
index d235b74f7aff7..0bf349b19b652 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
 {
 	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0);
 }
+
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+/**
+ * page_range_contiguous - test whether the page range is contiguous
+ * @page: the start of the page range.
+ * @nr_pages: the number of pages in the range.
+ *
+ * Test whether the page range is contiguous, such that they can be iterated
+ * naively, corresponding to iterating a contiguous PFN range.
+ *
+ * This function should primarily only be used for debug checks, or when
+ * working with page ranges that are not naturally contiguous (unlike, e.g.,
+ * pages within a folio, which always are).
+ *
+ * Returns true if contiguous, otherwise false.
+ */
+bool page_range_contiguous(const struct page *page, unsigned long nr_pages)
+{
+	const unsigned long start_pfn = page_to_pfn(page);
+	const unsigned long end_pfn = start_pfn + nr_pages;
+	unsigned long pfn;
+
+	/*
+	 * The memmap is allocated per memory section. We need to check
+	 * each involved memory section once.
+	 */
+	for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION);
+	     pfn < end_pfn; pfn += PAGES_PER_SECTION)
+		if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn)))
+			return false;
+	return true;
+}
+#endif
 #endif /* CONFIG_MMU */
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (20 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22  8:15   ` Marek Szyprowski
  2025-08-21 20:06 ` [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry David Hildenbrand
                   ` (14 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Marek Szyprowski, Robin Murphy,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

dma_common_contiguous_remap() is used to remap an "allocated contiguous
region". Within a single allocation, there is no need to use nth_page()
anymore.

Neither the buddy, nor hugetlb, nor CMA will hand out problematic page
ranges.

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 kernel/dma/remap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c
index 9e2afad1c6152..b7c1c0c92d0c8 100644
--- a/kernel/dma/remap.c
+++ b/kernel/dma/remap.c
@@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size,
 	if (!pages)
 		return NULL;
 	for (i = 0; i < count; i++)
-		pages[i] = nth_page(page, i);
+		pages[i] = page++;
 	vaddr = vmap(pages, count, VM_DMA_COHERENT, prot);
 	kvfree(pages);
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (21 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22  8:15   ` Marek Szyprowski
  2025-08-21 20:06 ` [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within " David Hildenbrand
                   ` (13 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

The expectation is that there is currently no user that would pass in
non-contiguous page ranges: no allocator, not even VMA, will hand these
out.

The only problematic part would be if someone were to provide a range
obtained directly from memblock, or manually merge problematic ranges.
If we find such cases, we should fix them to create separate
SG entries.

Let's check in sg_set_page() that this is really the case. No need to
check in sg_set_folio(), as pages in a folio are guaranteed to be
contiguous.

We can now drop the nth_page() usage in sg_page_iter_page().
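
If such a case ever shows up, the fix would be to split the range into
one SG entry per contiguous page range; a rough sketch, where
contig_span() is a hypothetical helper returning the number of pages up
to the next memmap discontiguity:

	while (nr_pages) {
		unsigned long span = contig_span(page, nr_pages);

		sg_set_page(sg, page, span * PAGE_SIZE, 0);
		sg = sg_next(sg);
		page = pfn_to_page(page_to_pfn(page) + span);
		nr_pages -= span;
	}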

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/scatterlist.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 6f8a4965f9b98..8196949dfc82c 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -6,6 +6,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/mm.h>
+#include <linux/mm_inline.h>
 #include <asm/io.h>
 
 struct scatterlist {
@@ -158,6 +159,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
 static inline void sg_set_page(struct scatterlist *sg, struct page *page,
 			       unsigned int len, unsigned int offset)
 {
+	VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE));
 	sg_assign_page(sg, page);
 	sg->offset = offset;
 	sg->length = len;
@@ -600,7 +602,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter,
  */
 static inline struct page *sg_page_iter_page(struct sg_page_iter *piter)
 {
-	return nth_page(sg_page(piter->sg), piter->sg_pgoffset);
+	return sg_page(piter->sg) + piter->sg_pgoffset;
 }
 
 /**
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (22 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22  1:59   ` Damien Le Moal
  2025-08-21 20:06 ` [PATCH RFC 25/35] drm/i915/gem: " David Hildenbrand
                   ` (12 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Damien Le Moal, Niklas Cassel,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.
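
For reference, translating a byte offset within an SG entry into a page
pointer plus in-page offset is plain arithmetic (sketch):

	page += offset / PAGE_SIZE;	/* same as offset >> PAGE_SHIFT */
	offset %= PAGE_SIZE;		/* same as offset_in_page(offset) */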

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Niklas Cassel <cassel@kernel.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/ata/libata-sff.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
index 7fc407255eb46..9f5d0f9f6d686 100644
--- a/drivers/ata/libata-sff.c
+++ b/drivers/ata/libata-sff.c
@@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
 	offset = qc->cursg->offset + qc->cursg_ofs;
 
 	/* get the current page and offset */
-	page = nth_page(page, (offset >> PAGE_SHIFT));
+	page += offset / PAGE_SIZE;
 	offset %= PAGE_SIZE;
 
 	/* don't overrun current sg */
@@ -631,7 +631,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
 		unsigned int split_len = PAGE_SIZE - offset;
 
 		ata_pio_xfer(qc, page, offset, split_len);
-		ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len);
+		ata_pio_xfer(qc, page + 1, 0, count - split_len);
 	} else {
 		ata_pio_xfer(qc, page, offset, count);
 	}
@@ -751,7 +751,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
 	offset = sg->offset + qc->cursg_ofs;
 
 	/* get the current page and offset */
-	page = nth_page(page, (offset >> PAGE_SHIFT));
+	page += offset / PAGE_SIZE;
 	offset %= PAGE_SIZE;
 
 	/* don't overrun current sg */
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 25/35] drm/i915/gem: drop nth_page() usage within SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (23 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 26/35] mspro_block: " David Hildenbrand
                   ` (11 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, David Airlie, Simona Vetter, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.

Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index c16a57160b262..031d7acc16142 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -779,7 +779,7 @@ __i915_gem_object_get_page(struct drm_i915_gem_object *obj, pgoff_t n)
 	GEM_BUG_ON(!i915_gem_object_has_struct_page(obj));
 
 	sg = i915_gem_object_get_sg(obj, n, &offset);
-	return nth_page(sg_page(sg), offset);
+	return sg_page(sg) + offset;
 }
 
 /* Like i915_gem_object_get_page(), but mark the returned page dirty */
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 26/35] mspro_block: drop nth_page() usage within SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (24 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 25/35] drm/i915/gem: " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 27/35] memstick: " David Hildenbrand
                   ` (10 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Maxim Levitsky, Alex Dubov, Ulf Hansson,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.

Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/memstick/core/mspro_block.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/memstick/core/mspro_block.c b/drivers/memstick/core/mspro_block.c
index c9853d887d282..985cfca3f6944 100644
--- a/drivers/memstick/core/mspro_block.c
+++ b/drivers/memstick/core/mspro_block.c
@@ -560,8 +560,7 @@ static int h_mspro_block_transfer_data(struct memstick_dev *card,
 		t_offset += msb->current_page * msb->page_size;
 
 		sg_set_page(&t_sg,
-			    nth_page(sg_page(&(msb->req_sg[msb->current_seg])),
-				     t_offset >> PAGE_SHIFT),
+			    sg_page(&(msb->req_sg[msb->current_seg])) + t_offset / PAGE_SIZE,
 			    msb->page_size, offset_in_page(t_offset));
 
 		memstick_init_req_sg(*mrq, msb->data_dir == READ
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 27/35] memstick: drop nth_page() usage within SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (25 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 26/35] mspro_block: " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 28/35] mmc: " David Hildenbrand
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Maxim Levitsky, Alex Dubov, Ulf Hansson,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.

Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/memstick/host/jmb38x_ms.c | 3 +--
 drivers/memstick/host/tifm_ms.c   | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/memstick/host/jmb38x_ms.c b/drivers/memstick/host/jmb38x_ms.c
index cddddb3a5a27f..c5e71d39ffd51 100644
--- a/drivers/memstick/host/jmb38x_ms.c
+++ b/drivers/memstick/host/jmb38x_ms.c
@@ -317,8 +317,7 @@ static int jmb38x_ms_transfer_data(struct jmb38x_ms_host *host)
 		unsigned int p_off;
 
 		if (host->req->long_data) {
-			pg = nth_page(sg_page(&host->req->sg),
-				      off >> PAGE_SHIFT);
+			pg = sg_page(&host->req->sg) + off / PAGE_SIZE;
 			p_off = offset_in_page(off);
 			p_cnt = PAGE_SIZE - p_off;
 			p_cnt = min(p_cnt, length);
diff --git a/drivers/memstick/host/tifm_ms.c b/drivers/memstick/host/tifm_ms.c
index db7f3a088fb09..0d64184ca10a9 100644
--- a/drivers/memstick/host/tifm_ms.c
+++ b/drivers/memstick/host/tifm_ms.c
@@ -201,8 +201,7 @@ static unsigned int tifm_ms_transfer_data(struct tifm_ms *host)
 		unsigned int p_off;
 
 		if (host->req->long_data) {
-			pg = nth_page(sg_page(&host->req->sg),
-				      off >> PAGE_SHIFT);
+			pg = sg_page(&host->req->sg) + off / PAGE_SIZE;
 			p_off = offset_in_page(off);
 			p_cnt = PAGE_SIZE - p_off;
 			p_cnt = min(p_cnt, length);
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 28/35] mmc: drop nth_page() usage within SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (26 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 27/35] memstick: " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 29/35] scsi: core: " David Hildenbrand
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alex Dubov, Ulf Hansson, Jesper Nilsson,
	Lars Persson, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.

Cc: Alex Dubov <oakad@yahoo.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Lars Persson <lars.persson@axis.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/mmc/host/tifm_sd.c    | 4 ++--
 drivers/mmc/host/usdhi6rol0.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/mmc/host/tifm_sd.c b/drivers/mmc/host/tifm_sd.c
index ac636efd911d3..f1ede2b39b505 100644
--- a/drivers/mmc/host/tifm_sd.c
+++ b/drivers/mmc/host/tifm_sd.c
@@ -191,7 +191,7 @@ static void tifm_sd_transfer_data(struct tifm_sd *host)
 		}
 		off = sg[host->sg_pos].offset + host->block_pos;
 
-		pg = nth_page(sg_page(&sg[host->sg_pos]), off >> PAGE_SHIFT);
+		pg = sg_page(&sg[host->sg_pos]) + off / PAGE_SIZE;
 		p_off = offset_in_page(off);
 		p_cnt = PAGE_SIZE - p_off;
 		p_cnt = min(p_cnt, cnt);
@@ -240,7 +240,7 @@ static void tifm_sd_bounce_block(struct tifm_sd *host, struct mmc_data *r_data)
 		}
 		off = sg[host->sg_pos].offset + host->block_pos;
 
-		pg = nth_page(sg_page(&sg[host->sg_pos]), off >> PAGE_SHIFT);
+		pg = sg_page(&sg[host->sg_pos]) + off / PAGE_SIZE;
 		p_off = offset_in_page(off);
 		p_cnt = PAGE_SIZE - p_off;
 		p_cnt = min(p_cnt, cnt);
diff --git a/drivers/mmc/host/usdhi6rol0.c b/drivers/mmc/host/usdhi6rol0.c
index 85b49c07918b3..3bccf800339ba 100644
--- a/drivers/mmc/host/usdhi6rol0.c
+++ b/drivers/mmc/host/usdhi6rol0.c
@@ -323,7 +323,7 @@ static void usdhi6_blk_bounce(struct usdhi6_host *host,
 
 	host->head_pg.page	= host->pg.page;
 	host->head_pg.mapped	= host->pg.mapped;
-	host->pg.page		= nth_page(host->pg.page, 1);
+	host->pg.page		= host->pg.page + 1;
 	host->pg.mapped		= kmap(host->pg.page);
 
 	host->blk_page = host->bounce_buf;
@@ -503,7 +503,7 @@ static void usdhi6_sg_advance(struct usdhi6_host *host)
 	/* We cannot get here after crossing a page border */
 
 	/* Next page in the same SG */
-	host->pg.page = nth_page(sg_page(host->sg), host->page_idx);
+	host->pg.page = sg_page(host->sg) + host->page_idx;
 	host->pg.mapped = kmap(host->pg.page);
 	host->blk_page = host->pg.mapped;
 
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 29/35] scsi: core: drop nth_page() usage within SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (27 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 28/35] mmc: " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-22 18:01   ` Bart Van Assche
  2025-08-21 20:06 ` [PATCH RFC 30/35] vfio/pci: " David Hildenbrand
                   ` (7 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, James E.J. Bottomley, Martin K. Petersen,
	Doug Gilbert, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.

Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/scsi/scsi_lib.c | 3 +--
 drivers/scsi/sg.c       | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0c65ecfedfbd6..f523f85828b89 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -3148,8 +3148,7 @@ void *scsi_kmap_atomic_sg(struct scatterlist *sgl, int sg_count,
 	/* Offset starting from the beginning of first page in this sg-entry */
 	*offset = *offset - len_complete + sg->offset;
 
-	/* Assumption: contiguous pages can be accessed as "page + i" */
-	page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
+	page = sg_page(sg) + *offset / PAGE_SIZE;
 	*offset &= ~PAGE_MASK;
 
 	/* Bytes in this sg-entry from *offset to the end of the page */
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index 3c02a5f7b5f39..2c653f2b21133 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -1235,8 +1235,7 @@ sg_vma_fault(struct vm_fault *vmf)
 		len = vma->vm_end - sa;
 		len = (len < length) ? len : length;
 		if (offset < len) {
-			struct page *page = nth_page(rsv_schp->pages[k],
-						     offset >> PAGE_SHIFT);
+			struct page *page = rsv_schp->pages[k] + offset / PAGE_SIZE;
 			get_page(page);	/* increment page count */
 			vmf->page = page;
 			return 0; /* success */
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 30/35] vfio/pci: drop nth_page() usage within SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (28 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 29/35] scsi: core: " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 31/35] crypto: remove " David Hildenbrand
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Brett Creeley, Jason Gunthorpe, Yishai Hadas,
	Shameer Kolothum, Kevin Tian, Alex Williamson,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.

Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Yishai Hadas <yishaih@nvidia.com>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/pci/pds/lm.c         | 3 +--
 drivers/vfio/pci/virtio/migrate.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c
index f2673d395236a..4d70c833fa32e 100644
--- a/drivers/vfio/pci/pds/lm.c
+++ b/drivers/vfio/pci/pds/lm.c
@@ -151,8 +151,7 @@ static struct page *pds_vfio_get_file_page(struct pds_vfio_lm_file *lm_file,
 			lm_file->last_offset_sg = sg;
 			lm_file->sg_last_entry += i;
 			lm_file->last_offset = cur_offset;
-			return nth_page(sg_page(sg),
-					(offset - cur_offset) / PAGE_SIZE);
+			return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE;
 		}
 		cur_offset += sg->length;
 	}
diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c
index ba92bb4e9af94..7dd0ac866461d 100644
--- a/drivers/vfio/pci/virtio/migrate.c
+++ b/drivers/vfio/pci/virtio/migrate.c
@@ -53,8 +53,7 @@ virtiovf_get_migration_page(struct virtiovf_data_buffer *buf,
 			buf->last_offset_sg = sg;
 			buf->sg_last_entry += i;
 			buf->last_offset = cur_offset;
-			return nth_page(sg_page(sg),
-					(offset - cur_offset) / PAGE_SIZE);
+			return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE;
 		}
 		cur_offset += sg->length;
 	}
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (29 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 30/35] vfio/pci: " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:24   ` Linus Torvalds
  2025-08-21 20:06 ` [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() David Hildenbrand
                   ` (5 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Herbert Xu, David S. Miller,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 crypto/ahash.c               | 4 ++--
 crypto/scompress.c           | 8 ++++----
 include/crypto/scatterwalk.h | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/crypto/ahash.c b/crypto/ahash.c
index a227793d2c5b5..a9f757224a223 100644
--- a/crypto/ahash.c
+++ b/crypto/ahash.c
@@ -88,7 +88,7 @@ static int hash_walk_new_entry(struct crypto_hash_walk *walk)
 
 	sg = walk->sg;
 	walk->offset = sg->offset;
-	walk->pg = nth_page(sg_page(walk->sg), (walk->offset >> PAGE_SHIFT));
+	walk->pg = sg_page(walk->sg) + walk->offset / PAGE_SIZE;
 	walk->offset = offset_in_page(walk->offset);
 	walk->entrylen = sg->length;
 
@@ -226,7 +226,7 @@ int shash_ahash_digest(struct ahash_request *req, struct shash_desc *desc)
 	if (!IS_ENABLED(CONFIG_HIGHMEM))
 		return crypto_shash_digest(desc, data, nbytes, req->result);
 
-	page = nth_page(page, offset >> PAGE_SHIFT);
+	page += offset / PAGE_SIZE;
 	offset = offset_in_page(offset);
 
 	if (nbytes > (unsigned int)PAGE_SIZE - offset)
diff --git a/crypto/scompress.c b/crypto/scompress.c
index c651e7f2197a9..1a7ed8ae65b07 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -198,7 +198,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 		} else
 			return -ENOSYS;
 
-		dpage = nth_page(dpage, doff / PAGE_SIZE);
+		dpage += doff / PAGE_SIZE;
 		doff = offset_in_page(doff);
 
 		n = (dlen - 1) / PAGE_SIZE;
@@ -220,12 +220,12 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 			} else
 				break;
 
-			spage = nth_page(spage, soff / PAGE_SIZE);
+			spage += soff / PAGE_SIZE;
 			soff = offset_in_page(soff);
 
 			n = (slen - 1) / PAGE_SIZE;
 			n += (offset_in_page(slen - 1) + soff) / PAGE_SIZE;
-			if (PageHighMem(nth_page(spage, n)) &&
+			if (PageHighMem(spage + n) &&
 			    size_add(soff, slen) > PAGE_SIZE)
 				break;
 			src = kmap_local_page(spage) + soff;
@@ -270,7 +270,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 			if (dlen <= PAGE_SIZE)
 				break;
 			dlen -= PAGE_SIZE;
-			dpage = nth_page(dpage, 1);
+			dpage++;
 		}
 	}
 
diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index 15ab743f68c8f..cdf8497d19d27 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -159,7 +159,7 @@ static inline void scatterwalk_map(struct scatter_walk *walk)
 	if (IS_ENABLED(CONFIG_HIGHMEM)) {
 		struct page *page;
 
-		page = nth_page(base_page, offset >> PAGE_SHIFT);
+		page = base_page + offset / PAGE_SIZE;
 		offset = offset_in_page(offset);
 		addr = kmap_local_page(page) + offset;
 	} else {
@@ -259,7 +259,7 @@ static inline void scatterwalk_done_dst(struct scatter_walk *walk,
 		end += (offset_in_page(offset) + offset_in_page(nbytes) +
 			PAGE_SIZE - 1) >> PAGE_SHIFT;
 		for (i = start; i < end; i++)
-			flush_dcache_page(nth_page(base_page, i));
+			flush_dcache_page(base_page + i);
 	}
 	scatterwalk_advance(walk, nbytes);
 }
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (30 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 31/35] crypto: remove " David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:06 ` [PATCH RFC 33/35] kfence: drop nth_page() usage David Hildenbrand
                   ` (4 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

There is the concern that unpin_user_page_range_dirty_lock() might do
some weird merging of PFN ranges -- either now or in the future -- such
that the PFN range is contiguous but the page range is not.

Let's sanity-check for that and drop the nth_page() usage.
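
For clarity, the case the new sanity check catches ("a" and "b" are
hypothetical struct page pointers; a sketch, not code from this patch):

	/* possible on SPARSEMEM without SPARSEMEM_VMEMMAP: */
	bool pfn_contig  = page_to_pfn(a) + 1 == page_to_pfn(b);
	bool page_contig = a + 1 == b;

	/* pfn_contig can be true while page_contig is false */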

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/gup.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index f017ff6d7d61a..0a669a766204b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -237,7 +237,7 @@ void folio_add_pin(struct folio *folio)
 static inline struct folio *gup_folio_range_next(struct page *start,
 		unsigned long npages, unsigned long i, unsigned int *ntails)
 {
-	struct page *next = nth_page(start, i);
+	struct page *next = start + i;
 	struct folio *folio = page_folio(next);
 	unsigned int nr = 1;
 
@@ -342,6 +342,9 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
  * "gup-pinned page range" refers to a range of pages that has had one of the
  * pin_user_pages() variants called on that page.
  *
+ * The page range must be truly contiguous: it must correspond to a
+ * contiguous PFN range whose pages can all be iterated naturally.
+ *
  * For the page ranges defined by [page .. page+npages], make that range (or
  * its head pages, if a compound page) dirty, if @make_dirty is true, and if the
  * page range was previously listed as clean.
@@ -359,6 +362,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
 	struct folio *folio;
 	unsigned int nr;
 
+	VM_WARN_ON_ONCE(!page_range_contiguous(page, npages));
+
 	for (i = 0; i < npages; i += nr) {
 		folio = gup_folio_range_next(page, npages, i, &nr);
 		if (make_dirty && !folio_test_dirty(folio)) {
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 33/35] kfence: drop nth_page() usage
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (31 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() David Hildenbrand
@ 2025-08-21 20:06 ` David Hildenbrand
  2025-08-21 20:32   ` David Hildenbrand
  2025-08-21 20:07 ` [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page() David Hildenbrand
                   ` (3 subsequent siblings)
  36 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Marco Elver,
	Dmitry Vyukov, Andrew Morton, Brendan Jackman, Christoph Lameter,
	Dennis Zhou, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marek Szyprowski, Michal Hocko,
	Mike Rapoport, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

We want to get rid of nth_page(), and kfence init code is the last user.

Unfortunately, we might actually walk a PFN range where the pages are
not contiguous, because we might be allocating an area from memblock
that could span memory sections in problematic kernel configs (SPARSEMEM
without SPARSEMEM_VMEMMAP).

We could check whether the page range is contiguous using
page_range_contiguous() and fail kfence init, or make kfence
incompatible with these problematic kernel configs.

Let's keep it simple and just iterate the PFNs, using pfn_to_page().

Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/kfence/core.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index 0ed3be100963a..793507c77f9e8 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -594,15 +594,15 @@ static void rcu_guarded_free(struct rcu_head *h)
  */
 static unsigned long kfence_init_pool(void)
 {
-	unsigned long addr;
-	struct page *pages;
+	unsigned long addr, pfn, start_pfn, end_pfn;
 	int i;
 
 	if (!arch_kfence_init_pool())
 		return (unsigned long)__kfence_pool;
 
 	addr = (unsigned long)__kfence_pool;
-	pages = virt_to_page(__kfence_pool);
+	start_pfn = PHYS_PFN(virt_to_phys(__kfence_pool));
+	end_pfn = start_pfn + KFENCE_POOL_SIZE / PAGE_SIZE;
 
 	/*
 	 * Set up object pages: they must have PGTY_slab set to avoid freeing
@@ -612,12 +612,14 @@ static unsigned long kfence_init_pool(void)
 	 * fast-path in SLUB, and therefore need to ensure kfree() correctly
 	 * enters __slab_free() slow-path.
 	 */
-	for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
-		struct slab *slab = page_slab(nth_page(pages, i));
+	for (pfn = start_pfn; pfn != end_pfn; pfn++) {
+		struct slab *slab;
 
 		if (!i || (i % 2))
 			continue;
 
+		slab = page_slab(pfn_to_page(pfn));
 		__folio_set_slab(slab_folio(slab));
 #ifdef CONFIG_MEMCG
 		slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts |
@@ -664,11 +665,13 @@ static unsigned long kfence_init_pool(void)
 	return 0;
 
 reset_slab:
-	for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) {
-		struct slab *slab = page_slab(nth_page(pages, i));
+	for (pfn = start_pfn; pfn != end_pfn; pfn++) {
+		struct slab *slab;
 
 		if (!i || (i % 2))
 			continue;
+
+		slab = page_slab(pfn_to_page(pfn));
 #ifdef CONFIG_MEMCG
 		slab->obj_exts = 0;
 #endif
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (32 preceding siblings ...)
  2025-08-21 20:06 ` [PATCH RFC 33/35] kfence: drop nth_page() usage David Hildenbrand
@ 2025-08-21 20:07 ` David Hildenbrand
  2025-08-21 20:07 ` [PATCH RFC 35/35] mm: remove nth_page() David Hildenbrand
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Ever since commit 858c708d9efb ("block: move the bi_size update out of
__bio_try_merge_page"), page_is_mergeable() no longer exists, and the
logic in bvec_try_merge_page() is now a simple page pointer
comparison.
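
For illustration, a simplified sketch of that comparison (assuming bv is
the bio_vec being extended and page/off describe the bytes to append;
the actual bvec_try_merge_page() performs additional checks):

        /* The appended bytes must start on the page the bvec ends on. */
        if (bv->bv_page + (bv->bv_offset + bv->bv_len) / PAGE_SIZE !=
            page + off / PAGE_SIZE)
                return false;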

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/bvec.h | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 0a80e1f9aa201..3fc0efa0825b1 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -22,11 +22,8 @@ struct page;
  * @bv_len:    Number of bytes in the address range.
  * @bv_offset: Start of the address range relative to the start of @bv_page.
  *
- * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len:
- *
- *   nth_page(@bv_page, n) == @bv_page + n
- *
- * This holds because page_is_mergeable() checks the above property.
+ * All pages within a bio_vec starting from @bv_page are contiguous and
+ * can simply be iterated (see bvec_advance()).
  */
 struct bio_vec {
 	struct page	*bv_page;
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* [PATCH RFC 35/35] mm: remove nth_page()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (33 preceding siblings ...)
  2025-08-21 20:07 ` [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page() David Hildenbrand
@ 2025-08-21 20:07 ` David Hildenbrand
  2025-08-21 21:37 ` [syzbot ci] " syzbot ci
  2025-08-22 14:30 ` [PATCH RFC 00/35] " Jason Gunthorpe
  36 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: David Hildenbrand, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Now that all users are gone, let's remove it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h                   | 2 --
 tools/testing/scatterlist/linux/mm.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f59ad1f9fc792..3ded0db8322f7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -210,9 +210,7 @@ extern unsigned long sysctl_admin_reserve_kbytes;
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 bool page_range_contiguous(const struct page *page, unsigned long nr_pages);
-#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 #else
-#define nth_page(page,n) ((page) + (n))
 static inline bool page_range_contiguous(const struct page *page,
 		unsigned long nr_pages)
 {
diff --git a/tools/testing/scatterlist/linux/mm.h b/tools/testing/scatterlist/linux/mm.h
index 5bd9e6e806254..121ae78d6e885 100644
--- a/tools/testing/scatterlist/linux/mm.h
+++ b/tools/testing/scatterlist/linux/mm.h
@@ -51,7 +51,6 @@ static inline unsigned long page_to_phys(struct page *page)
 
 #define page_to_pfn(page) ((unsigned long)(page) / PAGE_SIZE)
 #define pfn_to_page(pfn) (void *)((pfn) * PAGE_SIZE)
-#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
 #define __min(t1, t2, min1, min2, x, y) ({              \
 	t1 min1 = (x);                                  \
-- 
2.50.1


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable
  2025-08-21 20:06 ` [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable David Hildenbrand
@ 2025-08-21 20:20   ` Zi Yan
  2025-08-22 15:09   ` Mike Rapoport
  2025-08-22 17:02   ` SeongJae Park
  2 siblings, 0 replies; 90+ messages in thread
From: Zi Yan @ 2025-08-21 20:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	David S. Miller, Andreas Larsson, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86

On 21 Aug 2025, at 16:06, David Hildenbrand wrote:

> In an ideal world, we wouldn't have to deal with SPARSEMEM without
> SPARSEMEM_VMEMMAP, but in particular for 32bit, SPARSEMEM_VMEMMAP is
> considered too costly and consequently not supported.
>
> However, if an architecture does support SPARSEMEM with
> SPARSEMEM_VMEMMAP, let's forbid the user from disabling VMEMMAP: just
> like we already do for arm64, s390 and x86.
>
> So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without
> SPARSEMEM_VMEMMAP.
>
> This implies that the option to not use SPARSEMEM_VMEMMAP will now be
> gone for loongarch, powerpc, riscv and sparc. All architectures only
> enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really
> be a big downside to using the VMEMMAP (quite the contrary).
>
> This is a preparation for not supporting
>
> (1) folio sizes that exceed a single memory section
> (2) CMA allocations of non-contiguous page ranges
>
> in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we
> want to limit the possible impact as much as possible (e.g., gigantic
> hugetlb page allocations suddenly failing).

Sounds like a good idea.

>
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: WANG Xuerui <kernel@xen0n.name>
> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Albert Ou <aou@eecs.berkeley.edu>
> Cc: Alexandre Ghiti <alex@ghiti.fr>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Andreas Larsson <andreas@gaisler.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/Kconfig | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>

Acked-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
  2025-08-21 20:06 ` [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() David Hildenbrand
@ 2025-08-21 20:23   ` Zi Yan
  2025-08-22 17:07   ` SeongJae Park
  1 sibling, 0 replies; 90+ messages in thread
From: Zi Yan @ 2025-08-21 20:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86

On 21 Aug 2025, at 16:06, David Hildenbrand wrote:

> Let's reject them early, which in turn makes folio_alloc_gigantic() reject
> them properly.
>
> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
> and calculate MAX_FOLIO_NR_PAGES based on that.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm.h | 6 ++++--
>  mm/page_alloc.c    | 5 ++++-
>  2 files changed, 8 insertions(+), 3 deletions(-)
>

LGTM. Reviewed-by: Zi Yan <ziy@nvidia.com>


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
  2025-08-21 20:06 ` [PATCH RFC 31/35] crypto: remove " David Hildenbrand
@ 2025-08-21 20:24   ` Linus Torvalds
  2025-08-21 20:29     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Linus Torvalds @ 2025-08-21 20:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On Thu, 21 Aug 2025 at 16:08, David Hildenbrand <david@redhat.com> wrote:
>
> -       page = nth_page(page, offset >> PAGE_SHIFT);
> +       page += offset / PAGE_SIZE;

Please keep the " >> PAGE_SHIFT" form.

Is "offset" unsigned? Yes it is, but I had to look at the source code
to make sure, because it wasn't locally obvious from the patch. And
I'd rather we keep a pattern that is "safe", in that it doesn't
generate strange code if the value might be a 's64' (eg loff_t) on
32-bit architectures.

Because doing a 64-bit shift on x86-32 is like three cycles. Doing a
64-bit signed division by a simple constant is something like ten
strange instructions even if the end result is only 32-bit.

And again - not the case *here*, but just a general "let's keep to one
pattern", and the shift pattern is simply the better choice.

             Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
  2025-08-21 20:24   ` Linus Torvalds
@ 2025-08-21 20:29     ` David Hildenbrand
  2025-08-21 20:36       ` Linus Torvalds
                         ` (2 more replies)
  0 siblings, 3 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On 21.08.25 22:24, Linus Torvalds wrote:
> On Thu, 21 Aug 2025 at 16:08, David Hildenbrand <david@redhat.com> wrote:
>>
>> -       page = nth_page(page, offset >> PAGE_SHIFT);
>> +       page += offset / PAGE_SIZE;
> 
> Please keep the " >> PAGE_SHIFT" form.

No strong opinion.

I was primarily doing it to get rid of (in other cases) the parentheses.

Like in patch #29

-	/* Assumption: contiguous pages can be accessed as "page + i" */
-	page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
+	page = sg_page(sg) + *offset / PAGE_SIZE;

> 
> Is "offset" unsigned? Yes it is, but I had to look at the source code
> to make sure, because it wasn't locally obvious from the patch. And
> I'd rather we keep a pattern that is "safe", in that it doesn't
> generate strange code if the value might be a 's64' (eg loff_t) on
> 32-bit architectures.
> 
> Because doing a 64-bit shift on x86-32 is like three cycles. Doing a
> 64-bit signed division by a simple constant is something like ten
> strange instructions even if the end result is only 32-bit.

I would have thought that the compiler is smart enough to optimize that? 
PAGE_SIZE is a constant.

> 
> And again - not the case *here*, but just a general "let's keep to one
> pattern", and the shift pattern is simply the better choice.

It's a wild mixture, but I can keep doing what we already do in these cases.
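
E.g., keeping the existing shift pattern, the line from patch #29 above
would then read (sketch):

        page = sg_page(sg) + (*offset >> PAGE_SHIFT);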

-- 
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 33/35] kfence: drop nth_page() usage
  2025-08-21 20:06 ` [PATCH RFC 33/35] kfence: drop nth_page() usage David Hildenbrand
@ 2025-08-21 20:32   ` David Hildenbrand
  2025-08-21 21:45     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Potapenko, Marco Elver, Dmitry Vyukov, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 21.08.25 22:06, David Hildenbrand wrote:
> We want to get rid of nth_page(), and kfence init code is the last user.
> 
> Unfortunately, we might actually walk a PFN range where the pages are
> not contiguous, because we might be allocating an area from memblock
> that could span memory sections in problematic kernel configs (SPARSEMEM
> without SPARSEMEM_VMEMMAP).
> 
> We could check whether the page range is contiguous
> using page_range_contiguous() and fail kfence init, or make kfence
> incompatible with these problematic kernel configs.
> 
> Let's keep it simple and just use pfn_to_page(), iterating over PFNs.
> 

Fortunately this series is RFC due to lack of detailed testing :P

Something gives me a NULL-pointer dereference here (maybe the virt_to_phys()).

Will look into that tomorrow.

-- 
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
  2025-08-21 20:29     ` David Hildenbrand
@ 2025-08-21 20:36       ` Linus Torvalds
  2025-08-21 20:37       ` David Hildenbrand
  2025-08-21 20:40       ` Linus Torvalds
  2 siblings, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2025-08-21 20:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

Oh, and your reply was an invalid email and ended up in my spam-box:

  From: David Hildenbrand <david@redhat.com>

but you apparently didn't use the redhat mail system, so the DKIM signing fails

       dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE)
header.from=redhat.com

and it gets marked as spam.

I think you may have gone through smtp.kernel.org, but then you need
to use your kernel.org email address to get the DKIM right.

          Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order()
  2025-08-21 20:06 ` [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() David Hildenbrand
@ 2025-08-21 20:36   ` Zi Yan
  0 siblings, 0 replies; 90+ messages in thread
From: Zi Yan @ 2025-08-21 20:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86

On 21 Aug 2025, at 16:06, David Hildenbrand wrote:

> Let's sanity-check in folio_set_order() whether we would be trying to
> create a folio with an order that would make it exceed MAX_FOLIO_ORDER.
>
> This will enable the check whenever a folio/compound page is initialized
> through prep_compound_head() / prep_compound_page().
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h | 1 +
>  1 file changed, 1 insertion(+)
>

Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
  2025-08-21 20:29     ` David Hildenbrand
  2025-08-21 20:36       ` Linus Torvalds
@ 2025-08-21 20:37       ` David Hildenbrand
  2025-08-21 20:40       ` Linus Torvalds
  2 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On 21.08.25 22:29, David Hildenbrand wrote:
> On 21.08.25 22:24, Linus Torvalds wrote:
>> On Thu, 21 Aug 2025 at 16:08, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> -       page = nth_page(page, offset >> PAGE_SHIFT);
>>> +       page += offset / PAGE_SIZE;
>>
>> Please keep the " >> PAGE_SHIFT" form.
> 
> No strong opinion.
> 
> I was primarily doing it to get rid of (in other cases) the parentheses.
> 
> Like in patch #29
> 
> -	/* Assumption: contiguous pages can be accessed as "page + i" */
> -	page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT));
> +	page = sg_page(sg) + *offset / PAGE_SIZE;
> 
>>
>> Is "offset" unsigned? Yes it is, but I had to look at the source code
>> to make sure, because it wasn't locally obvious from the patch. And
>> I'd rather we keep a pattern that is "safe", in that it doesn't
>> generate strange code if the value might be a 's64' (eg loff_t) on
>> 32-bit architectures.
>>
>> Because doing a 64-bit shift on x86-32 is like three cycles. Doing a
>> 64-bit signed division by a simple constant is something like ten
>> strange instructions even if the end result is only 32-bit.
> 
> I would have thought that the compiler is smart enough to optimize that?
> PAGE_SIZE is a constant.

It's late, I get your point: the compiler can't optimize it that way if 
it's a signed value ...

-- 
Cheers

David / dhildenb

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
  2025-08-21 20:29     ` David Hildenbrand
  2025-08-21 20:36       ` Linus Torvalds
  2025-08-21 20:37       ` David Hildenbrand
@ 2025-08-21 20:40       ` Linus Torvalds
  2 siblings, 0 replies; 90+ messages in thread
From: Linus Torvalds @ 2025-08-21 20:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Herbert Xu, David S. Miller, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On Thu, Aug 21, 2025 at 4:29 PM David Hildenbrand <david@redhat.com> wrote:
> > Because doing a 64-bit shift on x86-32 is like three cycles. Doing a
> > 64-bit signed division by a simple constant is something like ten
> > strange instructions even if the end result is only 32-bit.
>
> I would have thought that the compiler is smart enough to optimize that?
> PAGE_SIZE is a constant.

Oh, the compiler optimizes things. But dividing a 64-bit signed value
with a constant is still quite complicated.

It doesn't generate a 'div' instruction, but it generates something like this:

    movl %ebx, %edx
    sarl $31, %edx
    movl %edx, %eax
    xorl %edx, %edx
    andl $4095, %eax
    addl %ecx, %eax
    adcl %ebx, %edx

and that's certainly a lot faster than an actual 64-bit divide would be.

An unsigned divide - or a shift - results in just

    shrdl $12, %ecx, %eax

which is still not the fastest instruction (I think shrdl gets split
into two uops), but it's certainly simpler and easier to read.

           Linus

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
  2025-08-21 20:06 ` [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs David Hildenbrand
@ 2025-08-21 20:46   ` Zi Yan
  2025-08-21 20:49     ` David Hildenbrand
  2025-08-24 13:24   ` Mike Rapoport
  1 sibling, 1 reply; 90+ messages in thread
From: Zi Yan @ 2025-08-21 20:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86

On 21 Aug 2025, at 16:06, David Hildenbrand wrote:

> Let's limit the maximum folio size in problematic kernel configs where
> the memmap is allocated per memory section (SPARSEMEM without
> SPARSEMEM_VMEMMAP) to a single memory section.
>
> Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE
> but not SPARSEMEM_VMEMMAP: sh.
>
> Fortunately, the biggest hugetlb size sh supports is 64 MiB
> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
>
> As folios and memory sections are naturally aligned to their power-of-2
> size in memory, a single folio can consequently no longer span multiple
> memory sections on these problematic kernel configs.
>
> nth_page() is no longer required when operating within a single compound
> page / folio.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm.h | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77737cbf2216a..48a985e17ef4e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
>  	return folio_large_nr_pages(folio);
>  }
>
> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -#define MAX_FOLIO_ORDER		PUD_ORDER
> -#else
> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
> +/*
> + * We don't expect any folios that exceed buddy sizes (and consequently
> + * memory sections).
> + */
>  #define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> +/*
> + * Only pages within a single memory section are guaranteed to be
> + * contiguous. By limiting folios to a single memory section, all folio
> + * pages are guaranteed to be contiguous.
> + */
> +#define MAX_FOLIO_ORDER		PFN_SECTION_SHIFT
> +#else
> +/*
> + * There is no real limit on the folio size. We limit them to the maximum we
> + * currently expect.

The comment about hugetlbfs is helpful here, since the other folios are still
limited by buddy allocator’s MAX_ORDER.

> + */
> +#define MAX_FOLIO_ORDER		PUD_ORDER
>  #endif
>
>  #define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
> -- 
> 2.50.1

Otherwise, Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
  2025-08-21 20:46   ` Zi Yan
@ 2025-08-21 20:49     ` David Hildenbrand
  2025-08-21 20:50       ` Zi Yan
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 20:49 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86

On 21.08.25 22:46, Zi Yan wrote:
> On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
> 
>> Let's limit the maximum folio size in problematic kernel configs where
>> the memmap is allocated per memory section (SPARSEMEM without
>> SPARSEMEM_VMEMMAP) to a single memory section.
>>
>> Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE
>> but not SPARSEMEM_VMEMMAP: sh.
>>
>> Fortunately, the biggest hugetlb size sh supports is 64 MiB
>> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
>> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
>>
>> As folios and memory sections are naturally aligned to their power-of-2
>> size in memory, a single folio can consequently no longer span multiple
>> memory sections on these problematic kernel configs.
>>
>> nth_page() is no longer required when operating within a single compound
>> page / folio.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/mm.h | 22 ++++++++++++++++++----
>>   1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 77737cbf2216a..48a985e17ef4e 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
>>   	return folio_large_nr_pages(folio);
>>   }
>>
>> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>> -#define MAX_FOLIO_ORDER		PUD_ORDER
>> -#else
>> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
>> +/*
>> + * We don't expect any folios that exceed buddy sizes (and consequently
>> + * memory sections).
>> + */
>>   #define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
>> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>> +/*
>> + * Only pages within a single memory section are guaranteed to be
>> + * contiguous. By limiting folios to a single memory section, all folio
>> + * pages are guaranteed to be contiguous.
>> + */
>> +#define MAX_FOLIO_ORDER		PFN_SECTION_SHIFT
>> +#else
>> +/*
>> + * There is no real limit on the folio size. We limit them to the maximum we
>> + * currently expect.
> 
> The comment about hugetlbfs is helpful here, since the other folios are still
> limited by buddy allocator’s MAX_ORDER.

Yeah, but the old comment was wrong (there is DAX).

I can add here "currently expect (e.g., hugetlbfs, dax)."

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
  2025-08-21 20:49     ` David Hildenbrand
@ 2025-08-21 20:50       ` Zi Yan
  0 siblings, 0 replies; 90+ messages in thread
From: Zi Yan @ 2025-08-21 20:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86

On 21 Aug 2025, at 16:49, David Hildenbrand wrote:

> On 21.08.25 22:46, Zi Yan wrote:
>> On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
>>
>>> Let's limit the maximum folio size in problematic kernel configs where
>>> the memmap is allocated per memory section (SPARSEMEM without
>>> SPARSEMEM_VMEMMAP) to a single memory section.
>>>
>>> Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE
>>> but not SPARSEMEM_VMEMMAP: sh.
>>>
>>> Fortunately, the biggest hugetlb size sh supports is 64 MiB
>>> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
>>> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
>>>
>>> As folios and memory sections are naturally aligned to their power-of-2
>>> size in memory, a single folio can consequently no longer span multiple
>>> memory sections on these problematic kernel configs.
>>>
>>> nth_page() is no longer required when operating within a single compound
>>> page / folio.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   include/linux/mm.h | 22 ++++++++++++++++++----
>>>   1 file changed, 18 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 77737cbf2216a..48a985e17ef4e 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
>>>   	return folio_large_nr_pages(folio);
>>>   }
>>>
>>> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>>> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>> -#define MAX_FOLIO_ORDER		PUD_ORDER
>>> -#else
>>> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
>>> +/*
>>> + * We don't expect any folios that exceed buddy sizes (and consequently
>>> + * memory sections).
>>> + */
>>>   #define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
>>> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>>> +/*
>>> + * Only pages within a single memory section are guaranteed to be
>>> + * contiguous. By limiting folios to a single memory section, all folio
>>> + * pages are guaranteed to be contiguous.
>>> + */
>>> +#define MAX_FOLIO_ORDER		PFN_SECTION_SHIFT
>>> +#else
>>> +/*
>>> + * There is no real limit on the folio size. We limit them to the maximum we
>>> + * currently expect.
>>
>> The comment about hugetlbfs is helpful here, since the other folios are still
>> limited by buddy allocator’s MAX_ORDER.
>
> Yeah, but the old comment was wrong (there is DAX).
>
> I can add here "currently expect (e.g., hugetlbfs, dax)."

Sounds good.

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx()
  2025-08-21 20:06 ` [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() David Hildenbrand
@ 2025-08-21 20:55   ` Zi Yan
  2025-08-21 21:00     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Zi Yan @ 2025-08-21 20:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86

On 21 Aug 2025, at 16:06, David Hildenbrand wrote:

> Now that a single folio/compound page can no longer span memory sections
> in problematic kernel configurations, we can stop using nth_page().
>
> While at it, turn both macros into static inline functions and add
> kernel doc for folio_page_idx().
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm.h         | 16 ++++++++++++++--
>  include/linux/page-flags.h |  5 ++++-
>  2 files changed, 18 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 48a985e17ef4e..ef360b72cb05c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
>
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
> -#define folio_page_idx(folio, p)	(page_to_pfn(p) - folio_pfn(folio))
>  #else
>  #define nth_page(page,n) ((page) + (n))
> -#define folio_page_idx(folio, p)	((p) - &(folio)->page)
>  #endif
>
>  /* to align the pointer to the (next) page boundary */
> @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes;
>  /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
>  #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
>
> +/**
> + * folio_page_idx - Return the number of a page in a folio.
> + * @folio: The folio.
> + * @page: The folio page.
> + *
> + * This function expects that the page is actually part of the folio.
> + * The returned number is relative to the start of the folio.
> + */
> +static inline unsigned long folio_page_idx(const struct folio *folio,
> +		const struct page *page)
> +{
> +	return page - &folio->page;
> +}
> +
>  static inline struct folio *lru_to_folio(struct list_head *head)
>  {
>  	return list_entry((head)->prev, struct folio, lru);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index d53a86e68c89b..080ad10c0defc 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page)
>   * check that the page number lies within @folio; the caller is presumed
>   * to have a reference to the page.
>   */
> -#define folio_page(folio, n)	nth_page(&(folio)->page, n)
> +static inline struct page *folio_page(struct folio *folio, unsigned long nr)
> +{
> +	return &folio->page + nr;
> +}

Maybe s/nr/n/ or s/nr/nth/, since it returns the nth page within a folio.

Since you have added kernel doc for folio_page_idx(), it does not hurt
to have something similar for folio_page(). :)

+/**
+ * folio_page - Return the nth page in a folio.
+ * @folio: The folio.
+ * @n: Page index within the folio.
+ *
+ * This function expects that @n is less than folio_nr_pages(@folio).
+ * The returned page is the @n'th page within the folio.
+ */

>
>  static __always_inline int PageTail(const struct page *page)
>  {
> -- 
> 2.50.1

Otherwise, Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx()
  2025-08-21 20:55   ` Zi Yan
@ 2025-08-21 21:00     ` David Hildenbrand
  0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 21:00 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86

On 21.08.25 22:55, Zi Yan wrote:
> On 21 Aug 2025, at 16:06, David Hildenbrand wrote:
> 
>> Now that a single folio/compound page can no longer span memory sections
>> in problematic kernel configurations, we can stop using nth_page().
>>
>> While at it, turn both macros into static inline functions and add
>> kernel doc for folio_page_idx().
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/mm.h         | 16 ++++++++++++++--
>>   include/linux/page-flags.h |  5 ++++-
>>   2 files changed, 18 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 48a985e17ef4e..ef360b72cb05c 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes;
>>
>>   #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>>   #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
>> -#define folio_page_idx(folio, p)	(page_to_pfn(p) - folio_pfn(folio))
>>   #else
>>   #define nth_page(page,n) ((page) + (n))
>> -#define folio_page_idx(folio, p)	((p) - &(folio)->page)
>>   #endif
>>
>>   /* to align the pointer to the (next) page boundary */
>> @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes;
>>   /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */
>>   #define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)
>>
>> +/**
>> + * folio_page_idx - Return the number of a page in a folio.
>> + * @folio: The folio.
>> + * @page: The folio page.
>> + *
>> + * This function expects that the page is actually part of the folio.
>> + * The returned number is relative to the start of the folio.
>> + */
>> +static inline unsigned long folio_page_idx(const struct folio *folio,
>> +		const struct page *page)
>> +{
>> +	return page - &folio->page;
>> +}
>> +
>>   static inline struct folio *lru_to_folio(struct list_head *head)
>>   {
>>   	return list_entry((head)->prev, struct folio, lru);
>> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>> index d53a86e68c89b..080ad10c0defc 100644
>> --- a/include/linux/page-flags.h
>> +++ b/include/linux/page-flags.h
>> @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page)
>>    * check that the page number lies within @folio; the caller is presumed
>>    * to have a reference to the page.
>>    */
>> -#define folio_page(folio, n)	nth_page(&(folio)->page, n)
>> +static inline struct page *folio_page(struct folio *folio, unsigned long nr)
>> +{
>> +	return &folio->page + nr;
>> +}
> 
> Maybe s/nr/n/ or s/nr/nth/, since it returns the nth page within a folio.

Yeah, it's even called "n" in the kernel docs ...

> 
> Since you have added kernel doc for folio_page_idx(), it does not hurt
> to have something similar for folio_page(). :)

... which we already have! (see above the macro) :)

Thanks!

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* [syzbot ci] Re: mm: remove nth_page()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (34 preceding siblings ...)
  2025-08-21 20:07 ` [PATCH RFC 35/35] mm: remove nth_page() David Hildenbrand
@ 2025-08-21 21:37 ` syzbot ci
  2025-08-22 14:30 ` [PATCH RFC 00/35] " Jason Gunthorpe
  36 siblings, 0 replies; 90+ messages in thread
From: syzbot ci @ 2025-08-21 21:37 UTC (permalink / raw)
  To: agordeev, airlied, akpm, alex.williamson, alex, andreas, aou,
	axboe, borntraeger, bp, brett.creeley, cassel, catalin.marinas,
	chenhuacai, christophe.leroy, cl, dave.hansen, davem, david,
	dennis, dgilbert, dlemoal, dri-devel, dvyukov, elver, glider, gor,
	hannes, hca, herbert, intel-gfx, io-uring, iommu, jackmanb,
	james.bottomley, jani.nikula, jason, jesper.nilsson, jgg, jgg,
	jhubbard, joonas.lahtinen, kasan-dev, kernel, kevin.tian, kvm,
	lars.persson, liam.howlett, linux-arm-kernel, linux-arm-kernel,
	linux-crypto, linux-ide, linux-kernel, linux-kselftest,
	linux-mips, linux-mm, linux-mmc, linux-riscv, linux-s390,
	linux-scsi, lorenzo.stoakes, m.szyprowski, maddy, martin.petersen,
	maximlevitsky, mhocko, mingo, mpe, muchun.song, netdev, npiggin,
	oakad, osalvador, palmer, paul.walmsley, peterx, robin.murphy,
	rodrigo.vivi, rppt, shameerali.kolothum.thodi, shuah, simona,
	surenb, svens, tglx, tj, torvalds, tsbogend, tursulin,
	ulf.hansson, vbabka, virtualization, will, wireguard, x86, ziy
  Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] mm: remove nth_page()
https://lore.kernel.org/all/20250821200701.1329277-1-david@redhat.com
* [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable
* [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
* [PATCH RFC 03/35] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
* [PATCH RFC 04/35] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
* [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config
* [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
* [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages()
* [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate
* [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page()
* [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
* [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order()
* [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
* [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx()
* [PATCH RFC 14/35] mm/percpu-km: drop nth_page() usage within single allocation
* [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison()
* [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start()
* [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages
* [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
* [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio
* [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages()
* [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
* [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap()
* [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry
* [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry
* [PATCH RFC 25/35] drm/i915/gem: drop nth_page() usage within SG entry
* [PATCH RFC 26/35] mspro_block: drop nth_page() usage within SG entry
* [PATCH RFC 27/35] memstick: drop nth_page() usage within SG entry
* [PATCH RFC 28/35] mmc: drop nth_page() usage within SG entry
* [PATCH RFC 29/35] scsi: core: drop nth_page() usage within SG entry
* [PATCH RFC 30/35] vfio/pci: drop nth_page() usage within SG entry
* [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry
* [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock()
* [PATCH RFC 33/35] kfence: drop nth_page() usage
* [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page()
* [PATCH RFC 35/35] mm: remove nth_page()

and found the following issue:
general protection fault in kfence_guarded_alloc

Full report is available here:
https://ci.syzbot.org/series/f6f0aea1-9616-4675-8c80-f9b59ba3211c

***

general protection fault in kfence_guarded_alloc

tree:      net-next
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git
base:      da114122b83149d1f1db0586b1d67947b651aa20
arch:      amd64
compiler:  Debian clang version 20.1.7 (++20250616065708+6146a88f6049-1~exp1~20250616065826.132), Debian LLD 20.1.7
config:    https://ci.syzbot.org/builds/705b7862-eb10-40bd-a4cf-4820b4912466/config

smpboot: CPU0: Intel(R) Xeon(R) CPU @ 2.80GHz (family: 0x6, model: 0x55, stepping: 0x7)
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:kfence_guarded_alloc+0x643/0xc70
Code: 41 c1 e5 18 bf 00 00 00 f5 44 89 ee e8 a6 67 9c ff 45 31 f6 41 81 fd 00 00 00 f5 4c 0f 44 f3 49 8d 7e 08 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00 74 05 e8 f1 cb ff ff 4c 8b 6c 24 18 4d 89 6e 08 49
RSP: 0000:ffffc90000047740 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0004d90080 RCX: 0000000000000000
RDX: ffff88801c2e8000 RSI: 00000000ff000000 RDI: 0000000000000008
RBP: ffffc90000047850 R08: ffffffff99b2201b R09: 1ffffffff3364403
R10: dffffc0000000000 R11: fffffbfff3364404 R12: dffffc0000000000
R13: 00000000ff000000 R14: 0000000000000000 R15: ffff88813fec7068
FS:  0000000000000000(0000) GS:ffff8880b861c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88813ffff000 CR3: 000000000df36000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __kfence_alloc+0x385/0x3b0
 __kmalloc_noprof+0x440/0x4f0
 __alloc_workqueue+0x103/0x1b70
 alloc_workqueue_noprof+0xd4/0x210
 init_mm_internals+0x17/0x140
 kernel_init_freeable+0x307/0x4b0
 kernel_init+0x1d/0x1d0
 ret_from_fork+0x3f9/0x770
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:kfence_guarded_alloc+0x643/0xc70
Code: 41 c1 e5 18 bf 00 00 00 f5 44 89 ee e8 a6 67 9c ff 45 31 f6 41 81 fd 00 00 00 f5 4c 0f 44 f3 49 8d 7e 08 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00 74 05 e8 f1 cb ff ff 4c 8b 6c 24 18 4d 89 6e 08 49
RSP: 0000:ffffc90000047740 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffea0004d90080 RCX: 0000000000000000
RDX: ffff88801c2e8000 RSI: 00000000ff000000 RDI: 0000000000000008
RBP: ffffc90000047850 R08: ffffffff99b2201b R09: 1ffffffff3364403
R10: dffffc0000000000 R11: fffffbfff3364404 R12: dffffc0000000000
R13: 00000000ff000000 R14: 0000000000000000 R15: ffff88813fec7068
FS:  0000000000000000(0000) GS:ffff8880b861c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88813ffff000 CR3: 000000000df36000 CR4: 0000000000350ef0


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 33/35] kfence: drop nth_page() usage
  2025-08-21 20:32   ` David Hildenbrand
@ 2025-08-21 21:45     ` David Hildenbrand
  0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-21 21:45 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Potapenko, Marco Elver, Dmitry Vyukov, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 21.08.25 22:32, David Hildenbrand wrote:
> On 21.08.25 22:06, David Hildenbrand wrote:
>> We want to get rid of nth_page(), and kfence init code is the last user.
>>
>> Unfortunately, we might actually walk a PFN range where the pages are
>> not contiguous, because we might be allocating an area from memblock
>> that could span memory sections in problematic kernel configs (SPARSEMEM
>> without SPARSEMEM_VMEMMAP).
>>
>> We could check whether the page range is contiguous
>> using page_range_contiguous() and fail kfence init, or make kfence
>> incompatible with these problematic kernel configs.
>>
>> Let's keep it simple and just use pfn_to_page(), iterating over PFNs.
>>
> 
> Fortunately this series is RFC due to lack of detailed testing :P
> 
> Something gives me a NULL-pointer dereference here (maybe the virt_to_phys()).
> 
> Will look into that tomorrow.

Okay, easy: relying on i but not updating it /me facepalm
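
For the archive, a minimal sketch of the obvious fix, re-deriving the
index from the PFN (names as in the patch; error paths omitted):

        for (pfn = start_pfn; pfn != end_pfn; pfn++) {
                /* Same skip logic, but with i re-derived from the PFN. */
                unsigned long i = pfn - start_pfn;
                struct slab *slab;

                if (!i || (i % 2))
                        continue;

                slab = page_slab(pfn_to_page(pfn));
                __folio_set_slab(slab_folio(slab));
        }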

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry
  2025-08-21 20:06 ` [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within " David Hildenbrand
@ 2025-08-22  1:59   ` Damien Le Moal
  2025-08-22  6:18     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Damien Le Moal @ 2025-08-22  1:59 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Niklas Cassel, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On 8/22/25 05:06, David Hildenbrand wrote:
> It's no longer required to use nth_page() when iterating pages within a
> single SG entry, so let's drop the nth_page() usage.
> 
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Niklas Cassel <cassel@kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/ata/libata-sff.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
> index 7fc407255eb46..9f5d0f9f6d686 100644
> --- a/drivers/ata/libata-sff.c
> +++ b/drivers/ata/libata-sff.c
> @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>  	offset = qc->cursg->offset + qc->cursg_ofs;
>  
>  	/* get the current page and offset */
> -	page = nth_page(page, (offset >> PAGE_SHIFT));
> +	page += offset / PAGE_SHIFT;

Shouldn't this be "offset >> PAGE_SHIFT" ?

>  	offset %= PAGE_SIZE;
>  
>  	/* don't overrun current sg */
> @@ -631,7 +631,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>  		unsigned int split_len = PAGE_SIZE - offset;
>  
>  		ata_pio_xfer(qc, page, offset, split_len);
> -		ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len);
> +		ata_pio_xfer(qc, page + 1, 0, count - split_len);
>  	} else {
>  		ata_pio_xfer(qc, page, offset, count);
>  	}
> @@ -751,7 +751,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes)
>  	offset = sg->offset + qc->cursg_ofs;
>  
>  	/* get the current page and offset */
> -	page = nth_page(page, (offset >> PAGE_SHIFT));
> +	page += offset / PAGE_SIZE;

Same here, though this seems correct too.

>  	offset %= PAGE_SIZE;
>  
>  	/* don't overrun current sg */


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-21 20:06 ` [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() David Hildenbrand
@ 2025-08-22  4:09   ` Mika Penttilä
  2025-08-22  6:24     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Mika Penttilä @ 2025-08-22  4:09 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan


On 8/21/25 23:06, David Hildenbrand wrote:

> All pages were already initialized and set to PageReserved() with a
> refcount of 1 by MM init code.

Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
initialize struct pages?

> In fact, by using __init_single_page(), we will be setting the refcount to
> 1 just to freeze it again immediately afterwards.
>
> So drop the __init_single_page() and use __ClearPageReserved() instead.
> Adjust the comments to highlight that we are dealing with an open-coded
> prep_compound_page() variant.
>
> Further, as we can now safely iterate over all pages in a folio, let's
> avoid the page-pfn dance and just iterate the pages directly.
>
> Note that the current code was likely problematic, but we never ran into
> it: prep_compound_tail() would have been called with an offset that might
> exceed a memory section, and prep_compound_tail() would have simply
> added that offset to the page pointer -- which would not have done the
> right thing on sparsemem without vmemmap.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/hugetlb.c | 21 ++++++++++-----------
>  1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d12a9d5146af4..ae82a845b14ad 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3235,17 +3235,14 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
>  					unsigned long start_page_number,
>  					unsigned long end_page_number)
>  {
> -	enum zone_type zone = zone_idx(folio_zone(folio));
> -	int nid = folio_nid(folio);
> -	unsigned long head_pfn = folio_pfn(folio);
> -	unsigned long pfn, end_pfn = head_pfn + end_page_number;
> +	struct page *head_page = folio_page(folio, 0);
> +	struct page *page = folio_page(folio, start_page_number);
> +	unsigned long i;
>  	int ret;
>  
> -	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
> -		struct page *page = pfn_to_page(pfn);
> -
> -		__init_single_page(page, pfn, zone, nid);
> -		prep_compound_tail((struct page *)folio, pfn - head_pfn);
> +	for (i = start_page_number; i < end_page_number; i++, page++) {
> +		__ClearPageReserved(page);
> +		prep_compound_tail(head_page, i);
>  		ret = page_ref_freeze(page, 1);
>  		VM_BUG_ON(!ret);
>  	}
> @@ -3257,12 +3254,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
>  {
>  	int ret;
>  
> -	/* Prepare folio head */
> +	/*
> +	 * This is an open-coded prep_compound_page() whereby we avoid
> +	 * walking pages twice by preparing+freezing them in the same go.
> +	 */
>  	__folio_clear_reserved(folio);
>  	__folio_set_head(folio);
>  	ret = folio_ref_freeze(folio, 1);
>  	VM_BUG_ON(!ret);
> -	/* Initialize the necessary tail struct pages */
>  	hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages);
>  	prep_compound_head((struct page *)folio, huge_page_order(h));
>  }

--Mika


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry
  2025-08-22  1:59   ` Damien Le Moal
@ 2025-08-22  6:18     ` David Hildenbrand
  0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-22  6:18 UTC (permalink / raw)
  To: Damien Le Moal, linux-kernel
  Cc: Niklas Cassel, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On 22.08.25 03:59, Damien Le Moal wrote:
> On 8/22/25 05:06, David Hildenbrand wrote:
>> It's no longer required to use nth_page() when iterating pages within a
>> single SG entry, so let's drop the nth_page() usage.
>>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Niklas Cassel <cassel@kernel.org>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   drivers/ata/libata-sff.c | 6 +++---
>>   1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
>> index 7fc407255eb46..9f5d0f9f6d686 100644
>> --- a/drivers/ata/libata-sff.c
>> +++ b/drivers/ata/libata-sff.c
>> @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc)
>>   	offset = qc->cursg->offset + qc->cursg_ofs;
>>   
>>   	/* get the current page and offset */
>> -	page = nth_page(page, (offset >> PAGE_SHIFT));
>> +	page += offset / PAGE_SHIFT;
> 
> Shouldn't this be "offset >> PAGE_SHIFT" ?

Thanks for taking a look!

Yeah, I already reverted to "offset >> PAGE_SHIFT" after Linus
mentioned in another mail in this thread that ">> PAGE_SHIFT" is
generally preferred, because the compiler cannot optimize as much if
offset were a signed variable.

So the next version will have the shift again.
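
For the curious, a minimal example of the difference the signedness
makes (my sketch, not from Linus' mail): C division truncates toward
zero, while an arithmetic right shift rounds toward negative infinity,
so the two only coincide for non-negative values:

	int off = -1;

	off / 4096;	/* 0  -- truncating division */
	off >> 12;	/* -1 -- arithmetic shift    */

With an unsigned (or provably non-negative) offset the compiler can
turn the division into a plain shift; for a possibly-negative signed
value it has to emit extra fix-up instructions instead.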

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-22  4:09   ` Mika Penttilä
@ 2025-08-22  6:24     ` David Hildenbrand
  2025-08-23  8:59       ` Mike Rapoport
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-22  6:24 UTC (permalink / raw)
  To: Mika Penttilä, linux-kernel
  Cc: Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 22.08.25 06:09, Mika Penttilä wrote:
> 
> On 8/21/25 23:06, David Hildenbrand wrote:
> 
>> All pages were already initialized and set to PageReserved() with a
>> refcount of 1 by MM init code.
> 
> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> initialize struct pages?

Excellent point, I did not know about that one.

Spotting that we don't do the same for the head page made me assume that 
it's just a misuse of __init_single_page().

But the nasty thing is that we use memblock_reserved_mark_noinit() to 
only mark the tail pages ...
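
For reference, roughly what the boot-time gigantic page allocation does
today (simplified from memory, so the details are an assumption):

	/* head page keeps its memblock init; only the tails are noinit */
	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
				      huge_page_size(h) - PAGE_SIZE);

So the head page is initialized by MM init code, but the tail pages are
not -- at least with CONFIG_DEFERRED_STRUCT_PAGE_INIT.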

Let me revert back to __init_single_page() and add a big fat comment why 
this is required.

Thanks!

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap()
  2025-08-21 20:06 ` [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() David Hildenbrand
@ 2025-08-22  8:15   ` Marek Szyprowski
  0 siblings, 0 replies; 90+ messages in thread
From: Marek Szyprowski @ 2025-08-22  8:15 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Robin Murphy, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On 21.08.2025 22:06, David Hildenbrand wrote:
> dma_common_contiguous_remap() is used to remap an "allocated contiguous
> region". Within a single allocation, there is no need to use nth_page()
> anymore.
>
> Neither the buddy, nor hugetlb, nor CMA will hand out problematic page
> ranges.
>
> Cc: Marek Szyprowski <m.szyprowski@samsung.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
>   kernel/dma/remap.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c
> index 9e2afad1c6152..b7c1c0c92d0c8 100644
> --- a/kernel/dma/remap.c
> +++ b/kernel/dma/remap.c
> @@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size,
>   	if (!pages)
>   		return NULL;
>   	for (i = 0; i < count; i++)
> -		pages[i] = nth_page(page, i);
> +		pages[i] = page++;
>   	vaddr = vmap(pages, count, VM_DMA_COHERENT, prot);
>   	kvfree(pages);
>   

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry
  2025-08-21 20:06 ` [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry David Hildenbrand
@ 2025-08-22  8:15   ` Marek Szyprowski
  0 siblings, 0 replies; 90+ messages in thread
From: Marek Szyprowski @ 2025-08-22  8:15 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On 21.08.2025 22:06, David Hildenbrand wrote:
> The expectation is that there is currently no user that would pass in
> non-contiguous page ranges: no allocator, not even VMA, will hand these
> out.
>
> The only problematic part would be if someone would provide a range
> obtained directly from memblock, or manually merge problematic ranges.
> If we find such cases, we should fix them to create separate
> SG entries.
>
> Let's check in sg_set_page() that this is really the case. No need to
> check in sg_set_folio(), as pages in a folio are guaranteed to be
> contiguous.
>
> We can now drop the nth_page() usage in sg_page_iter_page().
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
> ---
>   include/linux/scatterlist.h | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index 6f8a4965f9b98..8196949dfc82c 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -6,6 +6,7 @@
>   #include <linux/types.h>
>   #include <linux/bug.h>
>   #include <linux/mm.h>
> +#include <linux/mm_inline.h>
>   #include <asm/io.h>
>   
>   struct scatterlist {
> @@ -158,6 +159,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
>   static inline void sg_set_page(struct scatterlist *sg, struct page *page,
>   			       unsigned int len, unsigned int offset)
>   {
> +	VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE));
>   	sg_assign_page(sg, page);
>   	sg->offset = offset;
>   	sg->length = len;
> @@ -600,7 +602,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter,
>    */
>   static inline struct page *sg_page_iter_page(struct sg_page_iter *piter)
>   {
> -	return nth_page(sg_page(piter->sg), piter->sg_pgoffset);
> +	return sg_page(piter->sg) + piter->sg_pgoffset;
>   }
>   
>   /**
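
As a quick sanity check of the nr_pages computation in that warning
(illustrative numbers, assuming 4 KiB pages): offset = 100 and
len = 8000 cover bytes 100..8099 of the range, i.e. pages 0 and 1, and
indeed ALIGN(8100, PAGE_SIZE) / PAGE_SIZE == 8192 / 4096 == 2.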

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
  2025-08-21 20:06 ` [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage David Hildenbrand
@ 2025-08-22 11:32   ` Pavel Begunkov
  2025-08-22 13:59     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Pavel Begunkov @ 2025-08-22 11:32 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Jens Axboe, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Johannes Weiner,
	John Hubbard, kasan-dev, kvm, Liam R. Howlett, Linus Torvalds,
	linux-arm-kernel, linux-arm-kernel, linux-crypto, linux-ide,
	linux-kselftest, linux-mips, linux-mmc, linux-mm, linux-riscv,
	linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 8/21/25 21:06, David Hildenbrand wrote:
> We always provide a single dst page, it's unclear why the io_copy_cache
> complexity is required.

Because it'll need to be pulled outside the loop to reuse the page for
multiple copies, i.e. packing multiple fragments of the same skb into
it. Not finished, and currently it's wasting memory.

Why not do as below? Pages there never cross boundaries of their folios.

Do you want it to be taken into the io_uring tree?

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index e5ff49f3425e..18c12f4b56b6 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
  
  		if (folio_test_partial_kmap(page_folio(dst_page)) ||
  		    folio_test_partial_kmap(page_folio(src_page))) {
-			dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE);
+			dst_page += dst_offset / PAGE_SIZE;
  			dst_offset = offset_in_page(dst_offset);
-			src_page = nth_page(src_page, src_offset / PAGE_SIZE);
+			src_page += src_offset / PAGE_SIZE;
  			src_offset = offset_in_page(src_offset);
  			n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset);
  			n = min(n, len);
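
(Worked example of that arithmetic, with made-up numbers and 4 KiB
pages: src_offset = 5000 advances src_page by 5000 / PAGE_SIZE == 1
page and leaves offset_in_page(5000) == 904, so the subsequent copy of
at most PAGE_SIZE - 904 bytes stays within a single page on each side.)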

-- 
Pavel Begunkov


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
  2025-08-22 11:32   ` Pavel Begunkov
@ 2025-08-22 13:59     ` David Hildenbrand
  2025-08-27  9:43       ` Pavel Begunkov
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-22 13:59 UTC (permalink / raw)
  To: Pavel Begunkov, linux-kernel
  Cc: Jens Axboe, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Johannes Weiner,
	John Hubbard, kasan-dev, kvm, Liam R. Howlett, Linus Torvalds,
	linux-arm-kernel, linux-arm-kernel, linux-crypto, linux-ide,
	linux-kselftest, linux-mips, linux-mmc, linux-mm, linux-riscv,
	linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 22.08.25 13:32, Pavel Begunkov wrote:
> On 8/21/25 21:06, David Hildenbrand wrote:
>> We always provide a single dst page, it's unclear why the io_copy_cache
>> complexity is required.
> 
> Because it'll need to be pulled outside the loop to reuse the page for
> multiple copies, i.e. packing multiple fragments of the same skb into
> it. Not finished, and currently it's wasting memory.

Okay, so what you're saying is that there will be follow-up work that 
will actually make this structure useful.

> 
> Why not do as below? Pages there never cross boundaries of their folios.
> 
> Do you want it to be taken into the io_uring tree?

This should all rather go through the MM tree, where we actually
guarantee contiguous pages within a folio (see the cover letter).

> 
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index e5ff49f3425e..18c12f4b56b6 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
>    
>    		if (folio_test_partial_kmap(page_folio(dst_page)) ||
>    		    folio_test_partial_kmap(page_folio(src_page))) {
> -			dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE);
> +			dst_page += dst_offset / PAGE_SIZE;
>    			dst_offset = offset_in_page(dst_offset);
> -			src_page = nth_page(src_page, src_offset / PAGE_SIZE);
> +			src_page += src_offset / PAGE_SIZE;

Yeah, I can do that in the next version, given that you have plans to
extend that code soon.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 00/35] mm: remove nth_page()
  2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
                   ` (35 preceding siblings ...)
  2025-08-21 21:37 ` [syzbot ci] " syzbot ci
@ 2025-08-22 14:30 ` Jason Gunthorpe
  36 siblings, 0 replies; 90+ messages in thread
From: Jason Gunthorpe @ 2025-08-22 14:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Andrew Morton, Linus Torvalds, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jens Axboe, Marek Szyprowski,
	Robin Murphy, John Hubbard, Peter Xu, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov, Brendan Jackman, Johannes Weiner,
	Zi Yan, Dennis Zhou, Tejun Heo, Christoph Lameter, Muchun Song,
	Oscar Salvador, x86, linux-arm-kernel, linux-mips, linux-s390,
	linux-crypto, linux-ide, intel-gfx, dri-devel, linux-mmc,
	linux-arm-kernel, linux-scsi, kvm, virtualization, linux-mm,
	io-uring, iommu, kasan-dev, wireguard, netdev, linux-kselftest,
	linux-riscv, Albert Ou, Alexander Gordeev, Alexandre Ghiti,
	Alex Dubov, Alex Williamson, Andreas Larsson, Borislav Petkov,
	Brett Creeley, Catalin Marinas, Christian Borntraeger,
	Christophe Leroy, Damien Le Moal, Dave Hansen, David Airlie,
	David S. Miller, Doug Gilbert, Heiko Carstens, Herbert Xu,
	Huacai Chen, Ingo Molnar, James E.J. Bottomley, Jani Nikula,
	Jason A. Donenfeld, Jesper Nilsson, Joonas Lahtinen, Kevin Tian,
	Lars Persson, Madhavan Srinivasan, Martin K. Petersen,
	Maxim Levitsky, Michael Ellerman, Nicholas Piggin, Niklas Cassel,
	Palmer Dabbelt, Paul Walmsley, Rodrigo Vivi, Shameer Kolothum,
	Shuah Khan, Simona Vetter, Sven Schnelle, Thomas Bogendoerfer,
	Thomas Gleixner, Tvrtko Ursulin, Ulf Hansson, Vasily Gorbik,
	WANG Xuerui, Will Deacon, Yishai Hadas

On Thu, Aug 21, 2025 at 10:06:26PM +0200, David Hildenbrand wrote:
> As discussed recently with Linus, nth_page() is just nasty and we would
> like to remove it.
> 
> To recap, the reason we currently need nth_page() within a folio is because
> on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
> memmap is allocated per memory section.
> 
> While buddy allocations cannot cross memory section boundaries, hugetlb
> and dax folios can.
> 
> So crossing a memory section means that "page++" could do the wrong thing.
> Instead, nth_page() on these problematic configs always goes from
> page->pfn, to the go from (++pfn)->page, which is rather nasty.
> 
> Likely, many people have no idea when nth_page() is required and when
> it might be dropped.
> 
> We refer to such problematic PFN ranges and "non-contiguous pages".
> If we only deal with "contiguous pages", there is not need for nth_page().
>
> Besides that "obvious" folio case, we might end up using nth_page()
> within CMA allocations (again, could span memory sections), and in
> one corner case (kfence) when processing memblock allocations (again,
> could span memory sections).

I browsed the patches and it looks great to me, thanks for doing this

Jason

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable
  2025-08-21 20:06 ` [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable David Hildenbrand
  2025-08-21 20:20   ` Zi Yan
@ 2025-08-22 15:09   ` Mike Rapoport
  2025-08-22 17:02   ` SeongJae Park
  2 siblings, 0 replies; 90+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Huacai Chen, WANG Xuerui, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti,
	David S. Miller, Andreas Larsson, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Thu, Aug 21, 2025 at 10:06:27PM +0200, David Hildenbrand wrote:
> In an ideal world, we wouldn't have to deal with SPARSEMEM without
> SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
> considered too costly and consequently not supported.
> 
> However, if an architecture does support SPARSEMEM with
> SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just
> like we already do for arm64, s390 and x86.
> 
> So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without
> SPARSEMEM_VMEMMAP.
> 
> This implies that the option to not use SPARSEMEM_VMEMMAP will now be
> gone for loongarch, powerpc, riscv and sparc. All architectures only
> enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really
> be a big downside to using the VMEMMAP (quite the contrary).
> 
> This is a preparation for not supporting
> 
> (1) folio sizes that exceed a single memory section
> (2) CMA allocations of non-contiguous page ranges
> 
> in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we
> want to limit possible impact as much as possible (e.g., gigantic hugetlb
> page allocations suddenly failing).
> 
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: WANG Xuerui <kernel@xen0n.name>
> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Albert Ou <aou@eecs.berkeley.edu>
> Cc: Alexandre Ghiti <alex@ghiti.fr>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Andreas Larsson <andreas@gaisler.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  mm/Kconfig | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 4108bcd967848..330d0e698ef96 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -439,9 +439,8 @@ config SPARSEMEM_VMEMMAP_ENABLE
>  	bool
>  
>  config SPARSEMEM_VMEMMAP
> -	bool "Sparse Memory virtual memmap"
> +	def_bool y
>  	depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE
> -	default y
>  	help
>  	  SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise
>  	  pfn_to_page and page_to_pfn operations.  This is the most
> -- 
> 2.50.1
> 
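
(Side note: "def_bool y" is just shorthand for a promptless "bool" with
"default y" -- sketch of the equivalent spelling:

	config SPARSEMEM_VMEMMAP
		bool			# no prompt string anymore
		default y
		depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE

Without a prompt the symbol disappears from menuconfig and is simply
forced on whenever its dependencies are met, which is the point here.)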

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  2025-08-21 20:06 ` [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" David Hildenbrand
@ 2025-08-22 15:10   ` Mike Rapoport
  0 siblings, 0 replies; 90+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Catalin Marinas, Will Deacon, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Thu, Aug 21, 2025 at 10:06:28PM +0200, David Hildenbrand wrote:
> Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE
> is selected.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  arch/arm64/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e9bbfacc35a64..b1d1f2ff2493b 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1570,7 +1570,6 @@ source "kernel/Kconfig.hz"
>  config ARCH_SPARSEMEM_ENABLE
>  	def_bool y
>  	select SPARSEMEM_VMEMMAP_ENABLE
> -	select SPARSEMEM_VMEMMAP
>  
>  config HW_PERF_EVENTS
>  	def_bool y
> -- 
> 2.50.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 03/35] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  2025-08-21 20:06 ` [PATCH RFC 03/35] s390/Kconfig: " David Hildenbrand
@ 2025-08-22 15:11   ` Mike Rapoport
  0 siblings, 0 replies; 90+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Thu, Aug 21, 2025 at 10:06:29PM +0200, David Hildenbrand wrote:
> Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE
> is selected.
> 
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  arch/s390/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index bf680c26a33cf..145ca23c2fff6 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -710,7 +710,6 @@ menu "Memory setup"
>  config ARCH_SPARSEMEM_ENABLE
>  	def_bool y
>  	select SPARSEMEM_VMEMMAP_ENABLE
> -	select SPARSEMEM_VMEMMAP
>  
>  config ARCH_SPARSEMEM_DEFAULT
>  	def_bool y
> -- 
> 2.50.1
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 04/35] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP"
  2025-08-21 20:06 ` [PATCH RFC 04/35] x86/Kconfig: " David Hildenbrand
@ 2025-08-22 15:11   ` Mike Rapoport
  0 siblings, 0 replies; 90+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Muchun Song, netdev,
	Oscar Salvador, Peter Xu, Robin Murphy, Suren Baghdasaryan,
	Tejun Heo, virtualization, Vlastimil Babka, wireguard, x86,
	Zi Yan

On Thu, Aug 21, 2025 at 10:06:30PM +0200, David Hildenbrand wrote:
> Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE
> is selected.
> 
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  arch/x86/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 58d890fe2100e..e431d1c06fecd 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1552,7 +1552,6 @@ config ARCH_SPARSEMEM_ENABLE
>  	def_bool y
>  	select SPARSEMEM_STATIC if X86_32
>  	select SPARSEMEM_VMEMMAP_ENABLE if X86_64
> -	select SPARSEMEM_VMEMMAP if X86_64
>  
>  config ARCH_SPARSEMEM_DEFAULT
>  	def_bool X86_64 || (NUMA && X86_32)
> -- 
> 2.50.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config
  2025-08-21 20:06 ` [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config David Hildenbrand
@ 2025-08-22 15:13   ` Mike Rapoport
  0 siblings, 0 replies; 90+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Jason A. Donenfeld, Shuah Khan, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Thu, Aug 21, 2025 at 10:06:31PM +0200, David Hildenbrand wrote:
> It's no longer user-selectable (and the default was already "y"), so
> let's just drop it.

and it should not matter for the wireguard selftest anyway
> 
> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  tools/testing/selftests/wireguard/qemu/kernel.config | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config
> index 0a5381717e9f4..1149289f4b30f 100644
> --- a/tools/testing/selftests/wireguard/qemu/kernel.config
> +++ b/tools/testing/selftests/wireguard/qemu/kernel.config
> @@ -48,7 +48,6 @@ CONFIG_JUMP_LABEL=y
>  CONFIG_FUTEX=y
>  CONFIG_SHMEM=y
>  CONFIG_SLUB=y
> -CONFIG_SPARSEMEM_VMEMMAP=y
>  CONFIG_SMP=y
>  CONFIG_SCHED_SMT=y
>  CONFIG_SCHED_MC=y
> -- 
> 2.50.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page()
  2025-08-21 20:06 ` [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() David Hildenbrand
@ 2025-08-22 15:27   ` Mike Rapoport
  2025-08-22 18:09     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Rapoport @ 2025-08-22 15:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Muchun Song, netdev,
	Oscar Salvador, Peter Xu, Robin Murphy, Suren Baghdasaryan,
	Tejun Heo, virtualization, Vlastimil Babka, wireguard, x86,
	Zi Yan

On Thu, Aug 21, 2025 at 10:06:35PM +0200, David Hildenbrand wrote:
> Grepping for "prep_compound_page" leaves one clueless how devdax gets its
> compound pages initialized.
> 
> Let's add a comment that might help finding this open-coded
> prep_compound_page() initialization more easily.
> 
> Further, let's be less smart about the ordering of initialization and just
> perform the prep_compound_head() call after all tail pages were
> initialized: just like prep_compound_page() does.
> 
> No need for a lengthy comment then: again, just like prep_compound_page().
> 
> Note that prep_compound_head() already initializes stuff in page[2]
> that successive tail page initialization will overwrite:
> _deferred_list, and on 32bit _entire_mapcount and
> _pincount. Very likely 32bit does not apply, and likely nobody ever ends
> up testing whether the _deferred_list is empty.
> 
> So it shouldn't be a fix at this point, but certainly something to clean
> up.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/mm_init.c | 13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 5c21b3af216b2..708466c5b2cc9 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head,
>  	unsigned long pfn, end_pfn = head_pfn + nr_pages;
>  	unsigned int order = pgmap->vmemmap_shift;
>  
> +	/*
> +	 * This is an open-coded prep_compound_page() whereby we avoid
> +	 * walking pages twice by initializing them in the same go.
> +	 */

While on it, can you also mention that prep_compound_page() is not used to
properly set page zone link?
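
Something along these lines, perhaps (just a sketch):

	/*
	 * This is an open-coded prep_compound_page() whereby we avoid
	 * walking pages twice by initializing them in the same go. We
	 * cannot use prep_compound_page() directly, as it would not set
	 * up the zone/node links that __init_zone_device_page() does.
	 */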

With this

Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

>  	__SetPageHead(head);
>  	for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
> @@ -1098,15 +1102,8 @@ static void __ref memmap_init_compound(struct page *head,
>  		__init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
>  		prep_compound_tail(head, pfn - head_pfn);
>  		set_page_count(page, 0);
> -
> -		/*
> -		 * The first tail page stores important compound page info.
> -		 * Call prep_compound_head() after the first tail page has
> -		 * been initialized, to not have the data overwritten.
> -		 */
> -		if (pfn == head_pfn + 1)
> -			prep_compound_head(head, order);
>  	}
> +	prep_compound_head(head, order);
>  }
>  
>  void __ref memmap_init_zone_device(struct zone *zone,
> -- 
> 2.50.1
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable
  2025-08-21 20:06 ` [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable David Hildenbrand
  2025-08-21 20:20   ` Zi Yan
  2025-08-22 15:09   ` Mike Rapoport
@ 2025-08-22 17:02   ` SeongJae Park
  2 siblings, 0 replies; 90+ messages in thread
From: SeongJae Park @ 2025-08-22 17:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: SeongJae Park, linux-kernel, Huacai Chen, WANG Xuerui,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexandre Ghiti, David S. Miller, Andreas Larsson,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On Thu, 21 Aug 2025 22:06:27 +0200 David Hildenbrand <david@redhat.com> wrote:

> In an ideal world, we wouldn't have to deal with SPARSEMEM without
> SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
> considered too costly and consequently not supported.
> 
> However, if an architecture does support SPARSEMEM with
> SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just
> like we already do for arm64, s390 and x86.
> 
> So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without
> SPARSEMEM_VMEMMAP.
> 
> This implies that the option to not use SPARSEMEM_VMEMMAP will now be
> gone for loongarch, powerpc, riscv and sparc. All architectures only
> enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really
> be a big downside to using the VMEMMAP (quite the contrary).
> 
> This is a preparation for not supporting
> 
> (1) folio sizes that exceed a single memory section
> (2) CMA allocations of non-contiguous page ranges
> 
> in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we
> want to limit possible impact as much as possible (e.g., gigantic hugetlb
> page allocations suddenly failing).
> 
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: WANG Xuerui <kernel@xen0n.name>
> Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Albert Ou <aou@eecs.berkeley.edu>
> Cc: Alexandre Ghiti <alex@ghiti.fr>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Andreas Larsson <andreas@gaisler.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
  2025-08-21 20:06 ` [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() David Hildenbrand
  2025-08-21 20:23   ` Zi Yan
@ 2025-08-22 17:07   ` SeongJae Park
  1 sibling, 0 replies; 90+ messages in thread
From: SeongJae Park @ 2025-08-22 17:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: SeongJae Park, linux-kernel, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On Thu, 21 Aug 2025 22:06:32 +0200 David Hildenbrand <david@redhat.com> wrote:

> Let's reject them early,

I like early failures. :)

> which in turn makes folio_alloc_gigantic() reject
> them properly.
> 
> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
> and calculate MAX_FOLIO_NR_PAGES based on that.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages()
  2025-08-21 20:06 ` [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() David Hildenbrand
@ 2025-08-22 17:09   ` SeongJae Park
  0 siblings, 0 replies; 90+ messages in thread
From: SeongJae Park @ 2025-08-22 17:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: SeongJae Park, linux-kernel, Alexander Potapenko, Andrew Morton,
	Brendan Jackman, Christoph Lameter, Dennis Zhou, Dmitry Vyukov,
	dri-devel, intel-gfx, iommu, io-uring, Jason Gunthorpe,
	Jens Axboe, Johannes Weiner, John Hubbard, kasan-dev, kvm,
	Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Mike Rapoport, Muchun Song, netdev, Oscar Salvador,
	Peter Xu, Robin Murphy, Suren Baghdasaryan, Tejun Heo,
	virtualization, Vlastimil Babka, wireguard, x86, Zi Yan

On Thu, 21 Aug 2025 22:06:33 +0200 David Hildenbrand <david@redhat.com> wrote:

> Let's reject unreasonable folio sizes early, where we can still fail.
> We'll add sanity checks to prep_compound_head()/prep_compound_page()
> next.
> 
> Is there a way to configure a system such that unreasonable folio sizes
> would be possible? It would already be rather questionable.
> 
> If so, we'd probably want to bail out earlier, where we can avoid a
> WARN and just report a proper error message that indicates where
> something went wrong such that we messed up.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: SeongJae Park <sj@kernel.org>


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 29/35] scsi: core: drop nth_page() usage within SG entry
  2025-08-21 20:06 ` [PATCH RFC 29/35] scsi: core: " David Hildenbrand
@ 2025-08-22 18:01   ` Bart Van Assche
  2025-08-22 18:10     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Bart Van Assche @ 2025-08-22 18:01 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: James E.J. Bottomley, Martin K. Petersen, Doug Gilbert,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 8/21/25 1:06 PM, David Hildenbrand wrote:
> It's no longer required to use nth_page() when iterating pages within a
> single SG entry, so let's drop the nth_page() usage.
Usually the SCSI core and the SG I/O driver are updated separately.
Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page()
  2025-08-22 15:27   ` Mike Rapoport
@ 2025-08-22 18:09     ` David Hildenbrand
  0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-22 18:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Muchun Song, netdev,
	Oscar Salvador, Peter Xu, Robin Murphy, Suren Baghdasaryan,
	Tejun Heo, virtualization, Vlastimil Babka, wireguard, x86,
	Zi Yan

On 22.08.25 17:27, Mike Rapoport wrote:
> On Thu, Aug 21, 2025 at 10:06:35PM +0200, David Hildenbrand wrote:
>> Grepping for "prep_compound_page" leaves one clueless how devdax gets its
>> compound pages initialized.
>>
>> Let's add a comment that might help finding this open-coded
>> prep_compound_page() initialization more easily.
>>
>> Further, let's be less smart about the ordering of initialization and just
>> perform the prep_compound_head() call after all tail pages were
>> initialized: just like prep_compound_page() does.
>>
>> No need for a lengthy comment then: again, just like prep_compound_page().
>>
>> Note that prep_compound_head() already initializes stuff in page[2]
>> that successive tail page initialization will overwrite:
>> _deferred_list, and on 32bit _entire_mapcount and
>> _pincount. Very likely 32bit does not apply, and likely nobody ever ends
>> up testing whether the _deferred_list is empty.
>>
>> So it shouldn't be a fix at this point, but certainly something to clean
>> up.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/mm_init.c | 13 +++++--------
>>   1 file changed, 5 insertions(+), 8 deletions(-)
>>
>> diff --git a/mm/mm_init.c b/mm/mm_init.c
>> index 5c21b3af216b2..708466c5b2cc9 100644
>> --- a/mm/mm_init.c
>> +++ b/mm/mm_init.c
>> @@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head,
>>   	unsigned long pfn, end_pfn = head_pfn + nr_pages;
>>   	unsigned int order = pgmap->vmemmap_shift;
>>   
>> +	/*
>> +	 * This is an open-coded prep_compound_page() whereby we avoid
>> +	 * walking pages twice by initializing them in the same go.
>> +	 */
> 
> While on it, can you also mention that prep_compound_page() is not used to
> properly set page zone link?

Sure, thanks!

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 29/35] scsi: core: drop nth_page() usage within SG entry
  2025-08-22 18:01   ` Bart Van Assche
@ 2025-08-22 18:10     ` David Hildenbrand
  0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-22 18:10 UTC (permalink / raw)
  To: Bart Van Assche, linux-kernel
  Cc: James E.J. Bottomley, Martin K. Petersen, Doug Gilbert,
	Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 22.08.25 20:01, Bart Van Assche wrote:
> On 8/21/25 1:06 PM, David Hildenbrand wrote:
>> It's no longer required to use nth_page() when iterating pages within a
>> single SG entry, so let's drop the nth_page() usage.
> Usually the SCSI core and the SG I/O driver are updated separately.
> Anyway:

Thanks, I had it separately but decided to merge per broader subsystem 
before sending. I can split it up in the next version.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-22  6:24     ` David Hildenbrand
@ 2025-08-23  8:59       ` Mike Rapoport
  2025-08-25 12:48         ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Rapoport @ 2025-08-23  8:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
> On 22.08.25 06:09, Mika Penttilä wrote:
> > 
> > On 8/21/25 23:06, David Hildenbrand wrote:
> > 
> > > All pages were already initialized and set to PageReserved() with a
> > > refcount of 1 by MM init code.
> > 
> > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> > initialize struct pages?
> 
> Excellent point, I did not know about that one.
> 
> Spotting that we don't do the same for the head page made me assume that
> it's just a misuse of __init_single_page().
> 
> But the nasty thing is that we use memblock_reserved_mark_noinit() to only
> mark the tail pages ...

And an even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
disabled struct pages are initialized regardless of
memblock_reserved_mark_noinit().

I think this patch should go in before your updates:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 753f99b4c718..1c51788339a5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3230,6 +3230,22 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 	return 1;
 }
 
+/*
+ * Tail pages in a huge folio allocated from memblock are marked as 'noinit',
+ * which means that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled their
+ * struct page won't be initialized.
+ */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+static void __init hugetlb_init_tail_page(struct page *page, unsigned long pfn,
+					enum zone_type zone, int nid)
+{
+	__init_single_page(page, pfn, zone, nid);
+}
+#else
+static inline void hugetlb_init_tail_page(struct page *page, unsigned long pfn,
+					enum zone_type zone, int nid) {}
+#endif
+
 /* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
 static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 					unsigned long start_page_number,
@@ -3244,7 +3260,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
 	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
-		__init_single_page(page, pfn, zone, nid);
+		hugetlb_init_tail_page(page, pfn, zone, nid);
 		prep_compound_tail((struct page *)folio, pfn - head_pfn);
 		ret = page_ref_freeze(page, 1);
 		VM_BUG_ON(!ret);
 
> Let me revert back to __init_single_page() and add a big fat comment why
> this is required.
> 
> Thanks!

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs
  2025-08-21 20:06 ` [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs David Hildenbrand
  2025-08-21 20:46   ` Zi Yan
@ 2025-08-24 13:24   ` Mike Rapoport
  1 sibling, 0 replies; 90+ messages in thread
From: Mike Rapoport @ 2025-08-24 13:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Muchun Song, netdev,
	Oscar Salvador, Peter Xu, Robin Murphy, Suren Baghdasaryan,
	Tejun Heo, virtualization, Vlastimil Babka, wireguard, x86,
	Zi Yan

On Thu, Aug 21, 2025 at 10:06:38PM +0200, David Hildenbrand wrote:
> Let's limit the maximum folio size in problematic kernel configs where
> the memmap is allocated per memory section (SPARSEMEM without
> SPARSEMEM_VMEMMAP) to a single memory section.
> 
> Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE
> but not SPARSEMEM_VMEMMAP: sh.
> 
> Fortunately, the biggest hugetlb size sh supports is 64 MiB
> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB
> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
> 
> As folios and memory sections are naturally aligned to their power-of-2
> size in memory, a single folio can consequently no longer span multiple
> memory sections on these problematic kernel configs.
> 
> nth_page() is no longer required when operating within a single compound
> page / folio.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

> ---
>  include/linux/mm.h | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 77737cbf2216a..48a985e17ef4e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio)
>  	return folio_large_nr_pages(folio);
>  }
>  
> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -#define MAX_FOLIO_ORDER		PUD_ORDER
> -#else
> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE)
> +/*
> + * We don't expect any folios that exceed buddy sizes (and consequently
> + * memory sections).
> + */
>  #define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> +/*
> + * Only pages within a single memory section are guaranteed to be
> + * contiguous. By limiting folios to a single memory section, all folio
> + * pages are guaranteed to be contiguous.
> + */
> +#define MAX_FOLIO_ORDER		PFN_SECTION_SHIFT
> +#else
> +/*
> + * There is no real limit on the folio size. We limit them to the maximum we
> + * currently expect.
> + */
> +#define MAX_FOLIO_ORDER		PUD_ORDER
>  #endif
>  
>  #define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
> -- 
> 2.50.1
> 
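
(Spelling out the sh case from the commit message, assuming 4 KiB
pages: PFN_SECTION_SHIFT == SECTION_SIZE_BITS - PAGE_SHIFT == 26 - 12
== 14, so MAX_FOLIO_NR_PAGES == 2^14 pages == 64 MiB -- exactly the
largest hugetlb size sh supports.)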

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-23  8:59       ` Mike Rapoport
@ 2025-08-25 12:48         ` David Hildenbrand
  2025-08-25 14:32           ` Mike Rapoport
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-25 12:48 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On 23.08.25 10:59, Mike Rapoport wrote:
> On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
>> On 22.08.25 06:09, Mika Penttilä wrote:
>>>
>>> On 8/21/25 23:06, David Hildenbrand wrote:
>>>
>>>> All pages were already initialized and set to PageReserved() with a
>>>> refcount of 1 by MM init code.
>>>
>>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
>>> initialize struct pages?
>>
>> Excellent point, I did not know about that one.
>>
>> Spotting that we don't do the same for the head page made me assume that
>> it's just a misuse of __init_single_page().
>>
>> But the nasty thing is that we use memblock_reserved_mark_noinit() to only
>> mark the tail pages ...
> 
> And an even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> disabled struct pages are initialized regardless of
> memblock_reserved_mark_noinit().
> 
> I think this patch should go in before your updates:

Shouldn't we fix this in memblock code?

Hacking around that in the memblock_reserved_mark_noinit() user sounds 
wrong -- and nothing in the doc of memblock_reserved_mark_noinit() 
spells that behavior out.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-25 12:48         ` David Hildenbrand
@ 2025-08-25 14:32           ` Mike Rapoport
  2025-08-25 14:38             ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Rapoport @ 2025-08-25 14:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
> On 23.08.25 10:59, Mike Rapoport wrote:
> > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
> > > On 22.08.25 06:09, Mika Penttilä wrote:
> > > > 
> > > > On 8/21/25 23:06, David Hildenbrand wrote:
> > > > 
> > > > > All pages were already initialized and set to PageReserved() with a
> > > > > refcount of 1 by MM init code.
> > > > 
> > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> > > > initialize struct pages?
> > > 
> > > Excellent point, I did not know about that one.
> > > 
> > > Spotting that we don't do the same for the head page made me assume that
> > > it's just a misuse of __init_single_page().
> > > 
> > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only
> > > mark the tail pages ...
> > 
> > And an even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> > disabled struct pages are initialized regardless of
> > memblock_reserved_mark_noinit().
> > 
> > I think this patch should go in before your updates:
> 
> Shouldn't we fix this in memblock code?
> 
> Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
> -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
> behavior out.

We can surely update the docs, but unfortunately I don't see how to avoid
hacking around it in hugetlb. 
Since it's used to optimise HVO even further, to the point that hugetlb
open-codes memmap initialization, I think it's fair that it should deal with all
possible configurations.
 
> -- 
> Cheers
> 
> David / dhildenb
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-25 14:32           ` Mike Rapoport
@ 2025-08-25 14:38             ` David Hildenbrand
  2025-08-25 14:59               ` Mike Rapoport
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-25 14:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On 25.08.25 16:32, Mike Rapoport wrote:
> On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
>> On 23.08.25 10:59, Mike Rapoport wrote:
>>> On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
>>>> On 22.08.25 06:09, Mika Penttilä wrote:
>>>>>
>>>>> On 8/21/25 23:06, David Hildenbrand wrote:
>>>>>
>>>>>> All pages were already initialized and set to PageReserved() with a
>>>>>> refcount of 1 by MM init code.
>>>>>
>>>>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
>>>>> initialize struct pages?
>>>>
>>>> Excellent point, I did not know about that one.
>>>>
>>>> Spotting that we don't do the same for the head page made me assume that
>>>> it's just a misuse of __init_single_page().
>>>>
>>>> But the nasty thing is that we use memblock_reserved_mark_noinit() to only
>>>> mark the tail pages ...
>>>
>>> And an even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
>>> disabled struct pages are initialized regardless of
>>> memblock_reserved_mark_noinit().
>>>
>>> I think this patch should go in before your updates:
>>
>> Shouldn't we fix this in memblock code?
>>
>> Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
>> -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
>> behavior out.
> 
> We can surely update the docs, but unfortunately I don't see how to avoid
> hacking around it in hugetlb.
> > Since it's used to optimise HVO even further, to the point that hugetlb
> > open-codes memmap initialization, I think it's fair that it should deal with all
> possible configurations.

Remind me, why can't we support memblock_reserved_mark_noinit() when 
CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled?

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-25 14:38             ` David Hildenbrand
@ 2025-08-25 14:59               ` Mike Rapoport
  2025-08-25 15:42                 ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Rapoport @ 2025-08-25 14:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote:
> On 25.08.25 16:32, Mike Rapoport wrote:
> > On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
> > > On 23.08.25 10:59, Mike Rapoport wrote:
> > > > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
> > > > > On 22.08.25 06:09, Mika Penttilä wrote:
> > > > > > 
> > > > > > On 8/21/25 23:06, David Hildenbrand wrote:
> > > > > > 
> > > > > > > All pages were already initialized and set to PageReserved() with a
> > > > > > > refcount of 1 by MM init code.
> > > > > > 
> > > > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> > > > > > initialize struct pages?
> > > > > 
> > > > > Excellent point, I did not know about that one.
> > > > > 
> > > > > Spotting that we don't do the same for the head page made me assume that
> > > > > it's just a misuse of __init_single_page().
> > > > > 
> > > > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only
> > > > > mark the tail pages ...
> > > > 
> > > > And an even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> > > > disabled struct pages are initialized regardless of
> > > > memblock_reserved_mark_noinit().
> > > > 
> > > > I think this patch should go in before your updates:
> > > 
> > > Shouldn't we fix this in memblock code?
> > > 
> > > Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
> > > -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
> > > behavior out.
> > 
> > We can surely update the docs, but unfortunately I don't see how to avoid
> > hacking around it in hugetlb.
> > Since it's used to optimise HVO even further, to the point that hugetlb
> > open-codes memmap initialization, I think it's fair that it should deal with all
> > possible configurations.
> 
> Remind me, why can't we support memblock_reserved_mark_noinit() when
> CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled?

When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, we initialize the entire
memmap early (setup_arch()->free_area_init()), and we may have a bunch of
memblock_reserved_mark_noinit() calls afterwards.
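
Spelled out as a rough boot-time sequence (a sketch based on the calls
named above):

/*
 * Without CONFIG_DEFERRED_STRUCT_PAGE_INIT (sketch):
 *
 *   setup_arch()
 *     -> free_area_init()            - entire memmap initialized here
 *   ...
 *   memblock_reserved_mark_noinit()  - too late: pages already initialized
 *   ...
 *   memblock_free_all()              - memory handed to the buddy free lists
 */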
 
> -- 
> Cheers
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-25 14:59               ` Mike Rapoport
@ 2025-08-25 15:42                 ` David Hildenbrand
  2025-08-25 16:17                   ` Mike Rapoport
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-25 15:42 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On 25.08.25 16:59, Mike Rapoport wrote:
> On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote:
>> On 25.08.25 16:32, Mike Rapoport wrote:
>>> On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
>>>> On 23.08.25 10:59, Mike Rapoport wrote:
>>>>> On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
>>>>>> On 22.08.25 06:09, Mika Penttilä wrote:
>>>>>>>
>>>>>>> On 8/21/25 23:06, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> All pages were already initialized and set to PageReserved() with a
>>>>>>>> refcount of 1 by MM init code.
>>>>>>>
>>>>>>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
>>>>>>> initialize struct pages?
>>>>>>
>>>>>> Excellent point, I did not know about that one.
>>>>>>
>>>>>> Spotting that we don't do the same for the head page made me assume that
>>>>>> it's just a misuse of __init_single_page().
>>>>>>
>>>>>> But the nasty thing is that we use memblock_reserved_mark_noinit() to only
>>>>>> mark the tail pages ...
>>>>>
>>>>> And an even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
>>>>> disabled struct pages are initialized regardless of
>>>>> memblock_reserved_mark_noinit().
>>>>>
>>>>> I think this patch should go in before your updates:
>>>>
>>>> Shouldn't we fix this in memblock code?
>>>>
>>>> Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
>>>> -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
>>>> behavior out.
>>>
>>> We can surely update the docs, but unfortunately I don't see how to avoid
>>> hacking around it in hugetlb.
>>> Since it's used to optimise HVO even further, to the point that hugetlb
>>> open-codes memmap initialization, I think it's fair that it should deal with all
>>> possible configurations.
>>
>> Remind me, why can't we support memblock_reserved_mark_noinit() when
>> CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled?
> 
> When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, we initialize the entire
> memmap early (setup_arch()->free_area_init()), and we may have a bunch of
> memblock_reserved_mark_noinit() calls afterwards.

Oh, you mean that we get effective memblock modifications after already
initializing the memmap.

That sounds ... interesting :)

So yeah, we have to document this for memblock_reserved_mark_noinit().

Is it also a problem for kexec_handover?

We should do something like:

diff --git a/mm/memblock.c b/mm/memblock.c
index 154f1d73b61f2..ed4c563d72c32 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
  
  /**
   * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
- * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
- * for this region.
+ * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding
+ * to this region not getting initialized, because the caller will take
+ * care of it.
   * @base: the base phys addr of the region
   * @size: the size of the region
   *
- * struct pages will not be initialized for reserved memory regions marked with
- * %MEMBLOCK_RSRV_NOINIT.
+ * "struct pages" will not be initialized for reserved memory regions marked
+ * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization
+ * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely
+ * that this function is not effective.
   *
   * Return: 0 on success, -errno on failure.
   */


Optimizing the hugetlb code could be done, but I am not sure how high
the priority is (nobody complained so far about the double init).

-- 
Cheers

David / dhildenb


^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-25 15:42                 ` David Hildenbrand
@ 2025-08-25 16:17                   ` Mike Rapoport
  2025-08-25 16:23                     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Rapoport @ 2025-08-25 16:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Mon, Aug 25, 2025 at 05:42:33PM +0200, David Hildenbrand wrote:
> On 25.08.25 16:59, Mike Rapoport wrote:
> > On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote:
> > > On 25.08.25 16:32, Mike Rapoport wrote:
> > > > On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote:
> > > > > On 23.08.25 10:59, Mike Rapoport wrote:
> > > > > > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote:
> > > > > > > On 22.08.25 06:09, Mika Penttilä wrote:
> > > > > > > > 
> > > > > > > > On 8/21/25 23:06, David Hildenbrand wrote:
> > > > > > > > 
> > > > > > > > > All pages were already initialized and set to PageReserved() with a
> > > > > > > > > refcount of 1 by MM init code.
> > > > > > > > 
> > > > > > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to
> > > > > > > > initialize struct pages?
> > > > > > > 
> > > > > > > Excellent point, I did not know about that one.
> > > > > > > 
> > > > > > > Spotting that we don't do the same for the head page made me assume that
> > > > > > > it's just a misuse of __init_single_page().
> > > > > > > 
> > > > > > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only
> > > > > > > mark the tail pages ...
> > > > > > 
> > > > > > And an even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is
> > > > > > disabled struct pages are initialized regardless of
> > > > > > memblock_reserved_mark_noinit().
> > > > > > 
> > > > > > I think this patch should go in before your updates:
> > > > > 
> > > > > Shouldn't we fix this in memblock code?
> > > > > 
> > > > > Hacking around that in the memblock_reserved_mark_noinit() user sounds wrong
> > > > > -- and nothing in the doc of memblock_reserved_mark_noinit() spells that
> > > > > behavior out.
> > > > 
> > > > We can surely update the docs, but unfortunately I don't see how to avoid
> > > > hacking around it in hugetlb.
> > > > Since it's used to optimise HVO even further, to the point that hugetlb
> > > > open-codes memmap initialization, I think it's fair that it should deal with all
> > > > possible configurations.
> > > 
> > > Remind me, why can't we support memblock_reserved_mark_noinit() when
> > > CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled?
> > 
> > When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, we initialize the entire
> > memmap early (setup_arch()->free_area_init()), and we may have a bunch of
> > memblock_reserved_mark_noinit() calls afterwards.
> 
> Oh, you mean that we get effective memblock modifications after already
> initializing the memmap.
> 
> That sounds ... interesting :)

It's memmap, not the free lists. Without deferred init, memblock is active
for a while after the memmap is initialized and before the memory goes to
the free lists.
 
> So yeah, we have to document this for memblock_reserved_mark_noinit().
> 
> Is it also a problem for kexec_handover?

With KHO it's also interesting, but it does not support deferred struct
page init for now :)
 
> We should do something like:
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f2..ed4c563d72c32 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>  /**
>   * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
> - * for this region.
> + * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding
> + * to this region not getting initialized, because the caller will take
> + * care of it.
>   * @base: the base phys addr of the region
>   * @size: the size of the region
>   *
> - * struct pages will not be initialized for reserved memory regions marked with
> - * %MEMBLOCK_RSRV_NOINIT.
> + * "struct pages" will not be initialized for reserved memory regions marked
> + * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization
> + * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely
> + * that this function is not effective.
>   *
>   * Return: 0 on success, -errno on failure.
>   */

I have a different version :)
 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b96746376e17..d20d091c6343 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn;
  * via a driver, and never indicated in the firmware-provided memory map as
  * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
  * kernel resource tree.
- * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
- * not initialized (only for reserved regions).
+ * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages don't have
+ * PG_Reserved set and are completely not initialized when
+ * %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled (only for reserved regions).
  * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
  * either explictitly with memblock_reserve_kern() or via memblock
  * allocation APIs. All memblock allocations set this flag.
diff --git a/mm/memblock.c b/mm/memblock.c
index 154f1d73b61f..02de5ffb085b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1091,13 +1091,15 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
 
 /**
  * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
- * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
- * for this region.
+ * MEMBLOCK_RSRV_NOINIT
+ *
  * @base: the base phys addr of the region
  * @size: the size of the region
  *
- * struct pages will not be initialized for reserved memory regions marked with
- * %MEMBLOCK_RSRV_NOINIT.
+ * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will
+ * not have %PG_Reserved flag set.
+ * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flags also
+ * completly bypasses the initialization of struct pages for this region.
  *
  * Return: 0 on success, -errno on failure.
  */
 
> Optimizing the hugetlb code could be done, but I am not sure how high
> the priority is (nobody complained so far about the double init).
> 
> -- 
> Cheers
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()
  2025-08-25 16:17                   ` Mike Rapoport
@ 2025-08-25 16:23                     ` David Hildenbrand
  2025-08-25 16:58                       ` update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()) Mike Rapoport
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-25 16:23 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

>   
>> We should do something like:
>>
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 154f1d73b61f2..ed4c563d72c32 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>>   /**
>>    * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
>> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
>> - * for this region.
>> + * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding
>> + * to this region not getting initialized, because the caller will take
>> + * care of it.
>>    * @base: the base phys addr of the region
>>    * @size: the size of the region
>>    *
>> - * struct pages will not be initialized for reserved memory regions marked with
>> - * %MEMBLOCK_RSRV_NOINIT.
>> + * "struct pages" will not be initialized for reserved memory regions marked
>> + * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization
>> + * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely
>> + * that this function is not effective.
>>    *
>>    * Return: 0 on success, -errno on failure.
>>    */
> 
> I have a different version :)
>   
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index b96746376e17..d20d091c6343 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn;
>    * via a driver, and never indicated in the firmware-provided memory map as
>    * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
>    * kernel resource tree.
> - * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
> - * not initialized (only for reserved regions).
> + * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages don't have
> + * PG_Reserved set and are completely not initialized when
> + * %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled (only for reserved regions).
>    * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
>    * either explictitly with memblock_reserve_kern() or via memblock
>    * allocation APIs. All memblock allocations set this flag.
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f..02de5ffb085b 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1091,13 +1091,15 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>   
>   /**
>    * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
> - * for this region.
> + * MEMBLOCK_RSRV_NOINIT
> + *
>    * @base: the base phys addr of the region
>    * @size: the size of the region
>    *
> - * struct pages will not be initialized for reserved memory regions marked with
> - * %MEMBLOCK_RSRV_NOINIT.
> + * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will
> + * not have %PG_Reserved flag set.
> + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flags also
> + * completly bypasses the initialization of struct pages for this region.

s/completly/completely.

I don't quite understand the interaction with PG_Reserved and why 
anybody using this function should care.

So maybe you can rephrase in a way that is easier to digest, and rather 
focuses on what callers of this function are supposed to do vs. have the 
liberty of not doing?

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap())
  2025-08-25 16:23                     ` David Hildenbrand
@ 2025-08-25 16:58                       ` Mike Rapoport
  2025-08-25 18:32                         ` update kernel-doc for MEMBLOCK_RSRV_NOINIT David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Mike Rapoport @ 2025-08-25 16:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On Mon, Aug 25, 2025 at 06:23:48PM +0200, David Hildenbrand wrote:
> 
> I don't quite understand the interaction with PG_Reserved and why anybody
> using this function should care.
> 
> So maybe you can rephrase in a way that is easier to digest, and rather
> focuses on what callers of this function are supposed to do vs. have the
> liberty of not doing?

How about
 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b96746376e17..fcda8481de9a 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn;
  * via a driver, and never indicated in the firmware-provided memory map as
  * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
  * kernel resource tree.
- * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
- * not initialized (only for reserved regions).
+ * @MEMBLOCK_RSRV_NOINIT: reserved memory region for which struct pages are not
+ * fully initialized. Users of this flag are responsible for properly
+ * initializing struct pages of this region.
  * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
  * either explictitly with memblock_reserve_kern() or via memblock
  * allocation APIs. All memblock allocations set this flag.
diff --git a/mm/memblock.c b/mm/memblock.c
index 154f1d73b61f..46b411fb3630 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1091,13 +1091,20 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
 
 /**
  * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
- * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
- * for this region.
+ * MEMBLOCK_RSRV_NOINIT
+ *
  * @base: the base phys addr of the region
  * @size: the size of the region
  *
- * struct pages will not be initialized for reserved memory regions marked with
- * %MEMBLOCK_RSRV_NOINIT.
+ * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will
+ * not be fully initialized to allow the caller to optimize their initialization.
+ *
+ * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flag
+ * completely bypasses the initialization of struct pages for such a region.
+ *
+ * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, struct pages in this
+ * region will be initialized with default values but won't be marked as
+ * reserved.
  *
  * Return: 0 on success, -errno on failure.
  */
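
A usage sketch under those rules (assuming the caller later open-codes the
memmap initialization itself, as hugetlb's HVO boot path does; size,
zone_idx and nid are taken from context):

	struct page *page;
	unsigned long pfn;
	phys_addr_t base = memblock_phys_alloc(size, PAGE_SIZE);

	memblock_reserved_mark_noinit(base, size);

	/* ... later, initialize the struct pages for the range ourselves: */
	for (pfn = PHYS_PFN(base); pfn < PHYS_PFN(base + size); pfn++) {
		page = pfn_to_page(pfn);
		__init_single_page(page, pfn, zone_idx, nid);
	}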

> -- 
> Cheers
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply related	[flat|nested] 90+ messages in thread

* Re: update kernel-doc for MEMBLOCK_RSRV_NOINIT
  2025-08-25 16:58                       ` update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()) Mike Rapoport
@ 2025-08-25 18:32                         ` David Hildenbrand
  0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-08-25 18:32 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mika Penttilä, linux-kernel, Alexander Potapenko,
	Andrew Morton, Brendan Jackman, Christoph Lameter, Dennis Zhou,
	Dmitry Vyukov, dri-devel, intel-gfx, iommu, io-uring,
	Jason Gunthorpe, Jens Axboe, Johannes Weiner, John Hubbard,
	kasan-dev, kvm, Liam R. Howlett, Linus Torvalds, linux-arm-kernel,
	linux-arm-kernel, linux-crypto, linux-ide, linux-kselftest,
	linux-mips, linux-mmc, linux-mm, linux-riscv, linux-s390,
	linux-scsi, Lorenzo Stoakes, Marco Elver, Marek Szyprowski,
	Michal Hocko, Muchun Song, netdev, Oscar Salvador, Peter Xu,
	Robin Murphy, Suren Baghdasaryan, Tejun Heo, virtualization,
	Vlastimil Babka, wireguard, x86, Zi Yan

On 25.08.25 18:58, Mike Rapoport wrote:
> On Mon, Aug 25, 2025 at 06:23:48PM +0200, David Hildenbrand wrote:
>>
>> I don't quite understand the interaction with PG_Reserved and why anybody
>> using this function should care.
>>
>> So maybe you can rephrase in a way that is easier to digest, and rather
>> focuses on what callers of this function are supposed to do vs. have the
>> liberty of not doing?
> 
> How about
>   
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index b96746376e17..fcda8481de9a 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn;
>    * via a driver, and never indicated in the firmware-provided memory map as
>    * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
>    * kernel resource tree.
> - * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are
> - * not initialized (only for reserved regions).
> + * @MEMBLOCK_RSRV_NOINIT: reserved memory region for which struct pages are not
> + * fully initialized. Users of this flag are responsible for properly
> + * initializing struct pages of this region.
>    * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use,
>    * either explictitly with memblock_reserve_kern() or via memblock
>    * allocation APIs. All memblock allocations set this flag.
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f..46b411fb3630 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1091,13 +1091,20 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size)
>   
>   /**
>    * memblock_reserved_mark_noinit - Mark a reserved memory region with flag
> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized
> - * for this region.
> + * MEMBLOCK_RSRV_NOINIT
> + *
>    * @base: the base phys addr of the region
>    * @size: the size of the region
>    *
> - * struct pages will not be initialized for reserved memory regions marked with
> - * %MEMBLOCK_RSRV_NOINIT.
> + * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will
> + * not be fully initialized to allow the caller to optimize their initialization.
> + *
> + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flag
> + * completely bypasses the initialization of struct pages for such a region.
> + *
> + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, struct pages in this
> + * region will be initialized with default values but won't be marked as
> + * reserved.

Sounds good.

I am surprised regarding "reserved", but I guess that's because we don't 
end up calling "reserve_bootmem_region()" on these regions in 
memmap_init_reserved_pages().
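
Roughly, the loop in question (a simplified sketch of
memmap_init_reserved_pages(), checking the flag directly):

	struct memblock_region *region;

	for_each_reserved_mem_region(region) {
		/* NOINIT regions are skipped, so PG_reserved is never set. */
		if (region->flags & MEMBLOCK_RSRV_NOINIT)
			continue;
		reserve_bootmem_region(region->base,
				       region->base + region->size,
				       memblock_get_region_node(region));
	}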


-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
  2025-08-21 20:06 ` [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges David Hildenbrand
@ 2025-08-26 10:45   ` Alexandru Elisei
  2025-08-26 11:04     ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Alexandru Elisei @ 2025-08-26 10:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

Hi David,

On Thu, Aug 21, 2025 at 10:06:47PM +0200, David Hildenbrand wrote:
> Let's disallow handing out PFN ranges with non-contiguous pages, so we
> can remove the nth_page() usage in __cma_alloc(), and so that callers don't
> have to worry about that either when wanting to blindly iterate pages.
> 
> This is really only a problem in configs with SPARSEMEM but without
> SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some
> cases.
> 
> Will this cause harm? Probably not, because it's mostly 32bit that does
> not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could
> look into allocating the memmap for the memory sections spanned by a
> single CMA region in one go from memblock.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm.h |  6 ++++++
>  mm/cma.c           | 36 +++++++++++++++++++++++-------------
>  mm/util.c          | 33 +++++++++++++++++++++++++++++++++
>  3 files changed, 62 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef360b72cb05c..f59ad1f9fc792 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes;
>  extern unsigned long sysctl_admin_reserve_kbytes;
>  
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> +bool page_range_contiguous(const struct page *page, unsigned long nr_pages);
>  #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
>  #else
>  #define nth_page(page,n) ((page) + (n))
> +static inline bool page_range_contiguous(const struct page *page,
> +		unsigned long nr_pages)
> +{
> +	return true;
> +}
>  #endif
>  
>  /* to align the pointer to the (next) page boundary */
> diff --git a/mm/cma.c b/mm/cma.c
> index 2ffa4befb99ab..1119fa2830008 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
>  				unsigned long count, unsigned int align,
>  				struct page **pagep, gfp_t gfp)
>  {
> -	unsigned long mask, offset;
> -	unsigned long pfn = -1;
> -	unsigned long start = 0;
>  	unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> +	unsigned long start, pfn, mask, offset;
>  	int ret = -EBUSY;
>  	struct page *page = NULL;
>  
> @@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
>  	if (bitmap_count > bitmap_maxno)
>  		goto out;
>  
> -	for (;;) {
> +	for (start = 0; ; start = bitmap_no + mask + 1) {
>  		spin_lock_irq(&cma->lock);
>  		/*
>  		 * If the request is larger than the available number
> @@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
>  			spin_unlock_irq(&cma->lock);
>  			break;
>  		}
> +
> +		pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
> +		page = pfn_to_page(pfn);
> +
> +		/*
> +		 * Do not hand out page ranges that are not contiguous, so
> +		 * callers can just iterate the pages without having to worry
> +		 * about these corner cases.
> +		 */
> +		if (!page_range_contiguous(page, count)) {
> +			spin_unlock_irq(&cma->lock);
> +			pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]",
> +					    __func__, cma->name, pfn, pfn + count - 1);
> +			continue;
> +		}
> +
>  		bitmap_set(cmr->bitmap, bitmap_no, bitmap_count);
>  		cma->available_count -= count;
>  		/*
> @@ -821,29 +835,25 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
>  		 */
>  		spin_unlock_irq(&cma->lock);
>  
> -		pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
>  		mutex_lock(&cma->alloc_mutex);
>  		ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp);
>  		mutex_unlock(&cma->alloc_mutex);
> -		if (ret == 0) {
> -			page = pfn_to_page(pfn);
> +		if (!ret)
>  			break;
> -		}
>  
>  		cma_clear_bitmap(cma, cmr, pfn, count);
>  		if (ret != -EBUSY)
>  			break;
>  
>  		pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n",
> -			 __func__, pfn, pfn_to_page(pfn));
> +			 __func__, pfn, page);
>  
>  		trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn),

Nitpick: I think you already have the page here.

>  					   count, align);
> -		/* try again with a bit different memory target */
> -		start = bitmap_no + mask + 1;
>  	}
>  out:
> -	*pagep = page;
> +	if (!ret)
> +		*pagep = page;
>  	return ret;
>  }
>  
> @@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count,
>  	 */
>  	if (page) {
>  		for (i = 0; i < count; i++)
> -			page_kasan_tag_reset(nth_page(page, i));
> +			page_kasan_tag_reset(page + i);

Had a look at it, not very familiar with CMA, but the changes look equivalent
to what was there before. Not sure that's worth a Reviewed-by tag, but here it
is in case you want to add it:

Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>

Just so I can better understand the problem being fixed, I guess you can have
two consecutive pfns with non-consecutive associated struct page if you have two
adjacent memory sections spanning the same physical memory region, is that
correct?

Thanks,
Alex

>  	}
>  
>  	if (ret && !(gfp & __GFP_NOWARN)) {
> diff --git a/mm/util.c b/mm/util.c
> index d235b74f7aff7..0bf349b19b652 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte,
>  {
>  	return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0);
>  }
> +
> +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
> +/**
> + * page_range_contiguous - test whether the page range is contiguous
> + * @page: the start of the page range.
> + * @nr_pages: the number of pages in the range.
> + *
> + * Test whether the page range is contiguous, such that they can be iterated
> + * naively, corresponding to iterating a contiguous PFN range.
> + *
> + * This function should primarily only be used for debug checks, or when
> + * working with page ranges that are not naturally contiguous (e.g., pages
> + * within a folio are).
> + *
> + * Returns true if contiguous, otherwise false.
> + */
> +bool page_range_contiguous(const struct page *page, unsigned long nr_pages)
> +{
> +	const unsigned long start_pfn = page_to_pfn(page);
> +	const unsigned long end_pfn = start_pfn + nr_pages;
> +	unsigned long pfn;
> +
> +	/*
> +	 * The memmap is allocated per memory section. We need to check
> +	 * each involved memory section once.
> +	 */
> +	for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION);
> +	     pfn < end_pfn; pfn += PAGES_PER_SECTION)
> +		if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn)))
> +			return false;
> +	return true;
> +}
> +#endif
>  #endif /* CONFIG_MMU */
> -- 
> 2.50.1
> 
> 

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
  2025-08-26 10:45   ` Alexandru Elisei
@ 2025-08-26 11:04     ` David Hildenbrand
  2025-08-26 13:03       ` Alexandru Elisei
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-26 11:04 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

>>   
>>   		pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n",
>> -			 __func__, pfn, pfn_to_page(pfn));
>> +			 __func__, pfn, page);
>>   
>>   		trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn),
> 
> Nitpick: I think you already have the page here.

Indeed, forgot to clean that up as well.

> 
>>   					   count, align);
>> -		/* try again with a bit different memory target */
>> -		start = bitmap_no + mask + 1;
>>   	}
>>   out:
>> -	*pagep = page;
>> +	if (!ret)
>> +		*pagep = page;
>>   	return ret;
>>   }
>>   
>> @@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count,
>>   	 */
>>   	if (page) {
>>   		for (i = 0; i < count; i++)
>> -			page_kasan_tag_reset(nth_page(page, i));
>> +			page_kasan_tag_reset(page + i);
> 
> Had a look at it, not very familiar with CMA, but the changes look equivalent
> to what was there before. Not sure that's worth a Reviewed-by tag, but here it
> is in case you want to add it:
> 
> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>

Thanks!

> 
> Just so I can better understand the problem being fixed, I guess you can have
> two consecutive pfns with non-consecutive associated struct page if you have two
> adjacent memory sections spanning the same physical memory region, is that
> correct?

Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not 
guaranteed that

	pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1

when we cross memory section boundaries.

It can be the case for early boot memory if we allocated consecutive 
areas from memblock when allocating the memmap (struct pages) per memory 
section, but it's not guaranteed.

So we rule out that case.
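
In code, the guarantee this provides (sketch; do_something() is a
placeholder):

	/*
	 * Guard a naive iteration with the helper this series adds; without
	 * the check, "page + i" may step out of a section's memmap.
	 */
	if (!page_range_contiguous(page, nr_pages))
		return -EINVAL;	/* or skip the range, as cma_range_alloc() does */

	for (i = 0; i < nr_pages; i++)
		do_something(page + i);	/* equivalent to pfn_to_page(pfn + i) */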

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
  2025-08-26 11:04     ` David Hildenbrand
@ 2025-08-26 13:03       ` Alexandru Elisei
  2025-08-26 13:08         ` David Hildenbrand
  0 siblings, 1 reply; 90+ messages in thread
From: Alexandru Elisei @ 2025-08-26 13:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

Hi David,

On Tue, Aug 26, 2025 at 01:04:33PM +0200, David Hildenbrand wrote:
..
> > Just so I can better understand the problem being fixed, I guess you can have
> > two consecutive pfns with non-consecutive associated struct page if you have two
> > adjacent memory sections spanning the same physical memory region, is that
> > correct?
> 
> Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not
> guaranteed that
> 
> 	pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1
> 
> when we cross memory section boundaries.
> 
> It can be the case for early boot memory if we allocated consecutive areas
> from memblock when allocating the memmap (struct pages) per memory section,
> but it's not guaranteed.

Thank you for the explanation, but I'm a bit confused by the last paragraph. I
think what you're saying is that we can also have the reverse problem, where
consecutive struct page * represent non-consecutive pfns, because memmap
allocations happened to return consecutive virtual addresses, is that right?

If that's correct, I don't think that's the case for CMA, which deals out
contiguous physical memory. Or were you just trying to explain the other side of
the problem, and I'm just overthinking it?

Thanks,
Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
  2025-08-26 13:03       ` Alexandru Elisei
@ 2025-08-26 13:08         ` David Hildenbrand
  2025-08-26 13:11           ` Alexandru Elisei
  0 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-08-26 13:08 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 26.08.25 15:03, Alexandru Elisei wrote:
> Hi David,
> 
> On Tue, Aug 26, 2025 at 01:04:33PM +0200, David Hildenbrand wrote:
> ..
>>> Just so I can better understand the problem being fixed, I guess you can have
>>> two consecutive pfns with non-consecutive associated struct page if you have two
>>> adjacent memory sections spanning the same physical memory region, is that
>>> correct?
>>
>> Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not
>> guaranteed that
>>
>> 	pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1
>>
>> when we cross memory section boundaries.
>>
>> It can be the case for early boot memory if we allocated consecutive areas
>> from memblock when allocating the memmap (struct pages) per memory section,
>> but it's not guaranteed.
> 
> Thank you for the explanation, but I'm a bit confused by the last paragraph. I
> think what you're saying is that we can also have the reverse problem, where
> consecutive struct page * represent non-consecutive pfns, because memmap
> allocations happened to return consecutive virtual addresses, is that right?

Exactly, that's something we have to deal with elsewhere [1]. For this 
code, it's not a problem because we always allocate a contiguous PFN range.

> 
> If that's correct, I don't think that's the case for CMA, which deals out
> contiguous physical memory. Or were you just trying to explain the other side of
> the problem, and I'm just overthinking it?

The latter :)

[1] https://lkml.kernel.org/r/20250814064714.56485-2-lizhe.67@bytedance.com

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges
  2025-08-26 13:08         ` David Hildenbrand
@ 2025-08-26 13:11           ` Alexandru Elisei
  0 siblings, 0 replies; 90+ messages in thread
From: Alexandru Elisei @ 2025-08-26 13:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Jens Axboe,
	Johannes Weiner, John Hubbard, kasan-dev, kvm, Liam R. Howlett,
	Linus Torvalds, linux-arm-kernel, linux-arm-kernel, linux-crypto,
	linux-ide, linux-kselftest, linux-mips, linux-mmc, linux-mm,
	linux-riscv, linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

Hi David,

On Tue, Aug 26, 2025 at 03:08:08PM +0200, David Hildenbrand wrote:
> On 26.08.25 15:03, Alexandru Elisei wrote:
> > Hi David,
> > 
> > On Tue, Aug 26, 2025 at 01:04:33PM +0200, David Hildenbrand wrote:
> > ..
> > > > Just so I can better understand the problem being fixed, I guess you can have
> > > > two consecutive pfns with non-consecutive associated struct page if you have two
> > > > adjacent memory sections spanning the same physical memory region, is that
> > > > correct?
> > > 
> > > Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not
> > > guaranteed that
> > > 
> > > 	pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1
> > > 
> > > when we cross memory section boundaries.
> > > 
> > > It can be the case for early boot memory if we allocated consecutive areas
> > > from memblock when allocating the memmap (struct pages) per memory section,
> > > but it's not guaranteed.
> > 
> > Thank you for the explanation, but I'm a bit confused by the last paragraph. I
> > think what you're saying is that we can also have the reverse problem, where
> > consecutive struct page * represent non-consecutive pfns, because memmap
> > allocations happened to return consecutive virtual addresses, is that right?
> 
> Exactly, that's something we have to deal with elsewhere [1]. For this code,
> it's not a problem because we always allocate a contiguous PFN range.
> 
> > 
> > If that's correct, I don't think that's the case for CMA, which deals out
> > contiguous physical memory. Or were you just trying to explain the other side of
> > the problem, and I'm just overthinking it?
> 
> The latter :)

Ok, sorry for the noise then, and thank you for educating me.

Alex

^ permalink raw reply	[flat|nested] 90+ messages in thread

* Re: [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage
  2025-08-22 13:59     ` David Hildenbrand
@ 2025-08-27  9:43       ` Pavel Begunkov
  0 siblings, 0 replies; 90+ messages in thread
From: Pavel Begunkov @ 2025-08-27  9:43 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: Jens Axboe, Alexander Potapenko, Andrew Morton, Brendan Jackman,
	Christoph Lameter, Dennis Zhou, Dmitry Vyukov, dri-devel,
	intel-gfx, iommu, io-uring, Jason Gunthorpe, Johannes Weiner,
	John Hubbard, kasan-dev, kvm, Liam R. Howlett, Linus Torvalds,
	linux-arm-kernel, linux-arm-kernel, linux-crypto, linux-ide,
	linux-kselftest, linux-mips, linux-mmc, linux-mm, linux-riscv,
	linux-s390, linux-scsi, Lorenzo Stoakes, Marco Elver,
	Marek Szyprowski, Michal Hocko, Mike Rapoport, Muchun Song,
	netdev, Oscar Salvador, Peter Xu, Robin Murphy,
	Suren Baghdasaryan, Tejun Heo, virtualization, Vlastimil Babka,
	wireguard, x86, Zi Yan

On 8/22/25 14:59, David Hildenbrand wrote:
> On 22.08.25 13:32, Pavel Begunkov wrote:
>> On 8/21/25 21:06, David Hildenbrand wrote:
>>> We always provide a single dst page; it's unclear why the io_copy_cache
>>> complexity is required.
>>
>> Because it'll need to be pulled outside the loop to reuse the page for
>> multiple copies, i.e. packing multiple fragments of the same skb into
>> it. Not finished, and currently it's wasting memory.
> 
> Okay, so what you're saying is that there will be follow-up work that will actually make this structure useful.

Exactly

>> Why not do as below? Pages there never cross boundaries of their folios.
> 
> Do you want it to be taken into the io_uring tree?
> 
> This should better all go through the MM tree where we actually guarantee contiguous pages within a folio. (see the cover letter)

Makes sense. No objection, hopefully it won't cause too many conflicts.

>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
>> index e5ff49f3425e..18c12f4b56b6 100644
>> --- a/io_uring/zcrx.c
>> +++ b/io_uring/zcrx.c
>> @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page,
>>            if (folio_test_partial_kmap(page_folio(dst_page)) ||
>>                folio_test_partial_kmap(page_folio(src_page))) {
>> -            dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE);
>> +            dst_page += dst_offset / PAGE_SIZE;
>>                dst_offset = offset_in_page(dst_offset);
>> -            src_page = nth_page(src_page, src_offset / PAGE_SIZE);
>> +            src_page += src_offset / PAGE_SIZE;
> 
> Yeah, I can do that in the next version given that you have plans on extending that code soon.

If we go with this version:

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>

-- 
Pavel Begunkov


^ permalink raw reply	[flat|nested] 90+ messages in thread

end of thread, other threads:[~2025-08-27  9:42 UTC | newest]

Thread overview: 90+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-21 20:06 [PATCH RFC 00/35] mm: remove nth_page() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable David Hildenbrand
2025-08-21 20:20   ` Zi Yan
2025-08-22 15:09   ` Mike Rapoport
2025-08-22 17:02   ` SeongJae Park
2025-08-21 20:06 ` [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" David Hildenbrand
2025-08-22 15:10   ` Mike Rapoport
2025-08-21 20:06 ` [PATCH RFC 03/35] s390/Kconfig: " David Hildenbrand
2025-08-22 15:11   ` Mike Rapoport
2025-08-21 20:06 ` [PATCH RFC 04/35] x86/Kconfig: " David Hildenbrand
2025-08-22 15:11   ` Mike Rapoport
2025-08-21 20:06 ` [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config David Hildenbrand
2025-08-22 15:13   ` Mike Rapoport
2025-08-21 20:06 ` [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() David Hildenbrand
2025-08-21 20:23   ` Zi Yan
2025-08-22 17:07   ` SeongJae Park
2025-08-21 20:06 ` [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() David Hildenbrand
2025-08-22 17:09   ` SeongJae Park
2025-08-21 20:06 ` [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() David Hildenbrand
2025-08-22 15:27   ` Mike Rapoport
2025-08-22 18:09     ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() David Hildenbrand
2025-08-22  4:09   ` Mika Penttilä
2025-08-22  6:24     ` David Hildenbrand
2025-08-23  8:59       ` Mike Rapoport
2025-08-25 12:48         ` David Hildenbrand
2025-08-25 14:32           ` Mike Rapoport
2025-08-25 14:38             ` David Hildenbrand
2025-08-25 14:59               ` Mike Rapoport
2025-08-25 15:42                 ` David Hildenbrand
2025-08-25 16:17                   ` Mike Rapoport
2025-08-25 16:23                     ` David Hildenbrand
2025-08-25 16:58                       ` update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()) Mike Rapoport
2025-08-25 18:32                         ` update kernel-doc for MEMBLOCK_RSRV_NOINIT David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() David Hildenbrand
2025-08-21 20:36   ` Zi Yan
2025-08-21 20:06 ` [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs David Hildenbrand
2025-08-21 20:46   ` Zi Yan
2025-08-21 20:49     ` David Hildenbrand
2025-08-21 20:50       ` Zi Yan
2025-08-24 13:24   ` Mike Rapoport
2025-08-21 20:06 ` [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() David Hildenbrand
2025-08-21 20:55   ` Zi Yan
2025-08-21 21:00     ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 14/35] mm/percpu-km: drop nth_page() usage within single allocation David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage David Hildenbrand
2025-08-22 11:32   ` Pavel Begunkov
2025-08-22 13:59     ` David Hildenbrand
2025-08-27  9:43       ` Pavel Begunkov
2025-08-21 20:06 ` [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges David Hildenbrand
2025-08-26 10:45   ` Alexandru Elisei
2025-08-26 11:04     ` David Hildenbrand
2025-08-26 13:03       ` Alexandru Elisei
2025-08-26 13:08         ` David Hildenbrand
2025-08-26 13:11           ` Alexandru Elisei
2025-08-21 20:06 ` [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() David Hildenbrand
2025-08-22  8:15   ` Marek Szyprowski
2025-08-21 20:06 ` [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry David Hildenbrand
2025-08-22  8:15   ` Marek Szyprowski
2025-08-21 20:06 ` [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within " David Hildenbrand
2025-08-22  1:59   ` Damien Le Moal
2025-08-22  6:18     ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 25/35] drm/i915/gem: " David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 26/35] mspro_block: " David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 27/35] memstick: " David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 28/35] mmc: " David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 29/35] scsi: core: " David Hildenbrand
2025-08-22 18:01   ` Bart Van Assche
2025-08-22 18:10     ` David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 30/35] vfio/pci: " David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 31/35] crypto: remove " David Hildenbrand
2025-08-21 20:24   ` Linus Torvalds
2025-08-21 20:29     ` David Hildenbrand
2025-08-21 20:36       ` Linus Torvalds
2025-08-21 20:37       ` David Hildenbrand
2025-08-21 20:40       ` Linus Torvalds
2025-08-21 20:06 ` [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() David Hildenbrand
2025-08-21 20:06 ` [PATCH RFC 33/35] kfence: drop nth_page() usage David Hildenbrand
2025-08-21 20:32   ` David Hildenbrand
2025-08-21 21:45     ` David Hildenbrand
2025-08-21 20:07 ` [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page() David Hildenbrand
2025-08-21 20:07 ` [PATCH RFC 35/35] mm: remove nth_page() David Hildenbrand
2025-08-21 21:37 ` [syzbot ci] " syzbot ci
2025-08-22 14:30 ` [PATCH RFC 00/35] " Jason Gunthorpe
