linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v1 00/11] mm: replace follow_page() by folio_walk
@ 2024-08-02 15:55 David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 01/11] mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES David Hildenbrand
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Looking into a way of moving the last folio_likely_mapped_shared() call
in add_folio_for_migration() under the PTL, I found myself removing
follow_page(). This paves the way for cleaning up the remaining FOLL_* /
follow_* terminology so that it all simply refers to "GUP".

The new page table walker will look up a mapped folio and return to the
caller with the PTL held, such that the folio cannot get unmapped
concurrently. Callers can then conditionally decide whether they really
want to take a short-term folio reference or whether they can simply
unlock the PTL and be done with it.
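
For illustration, a minimal usage sketch (purely illustrative; the anon
check merely stands in for whatever checks a real caller would perform
while the PTL is held):

    struct folio_walk fw;
    struct folio *folio;

    mmap_read_lock(mm);
    vma = vma_lookup(mm, addr);
    if (vma) {
        folio = folio_walk_start(&fw, vma, addr, 0);
        if (folio) {
            /* PTL is held: the folio cannot get unmapped concurrently. */
            if (folio_test_anon(folio))
                folio_get(folio); /* short-term reference only */
            folio_walk_end(&fw, vma);
        }
    }
    mmap_read_unlock(mm);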

folio_walk is similar to page_vma_mapped_walk(), except that we don't know
the folio we want to walk to and that we are only walking to exactly one
PTE/PMD/PUD.

folio_walk provides access to the pte/pmd/pud (and to the referenced folio
page, because things like KSM need that); however, as part of this series,
no users perform any page table modifications.

We might be able to convert some other walk_page_range() users that really
only walk to one address, such as DAMON with
damon_mkold_ops/damon_young_ops. It might make sense to extend folio_walk
in the future to optionally fault in a folio (if applicable), such that we
can replace some get_user_pages() users that really only want to look up
a single page/folio under the PTL without unconditionally grabbing a folio
reference.

I have plans to extend the approach to a range walker that will try
batching various page table entries (not just folio pages) to be a better
replacement for walk_page_range() -- and users will be able to opt in to
which type of page table entries they want to process -- but that will
require more work and more thought.

KSM seems to work just fine (ksm_functional_tests selftests) and
move_pages seems to work (migration selftest). I tested the leaf
implementation extensively using various hugetlb sizes (64K, 2M, 32M, 1G)
on arm64 using move_pages, and did some more testing on x86-64.
Cross-compiled on a bunch of architectures.

I am not able to test the s390x Secure Execution changes, unfortunately.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>

David Hildenbrand (11):
  mm: provide vm_normal_(page|folio)_pmd() with
    CONFIG_PGTABLE_HAS_HUGE_LEAVES
  mm/pagewalk: introduce folio_walk_start() + folio_walk_end()
  mm/migrate: convert do_pages_stat_array() from follow_page() to
    folio_walk
  mm/migrate: convert add_page_for_migration() from follow_page() to
    folio_walk
  mm/ksm: convert get_mergeable_page() from follow_page() to folio_walk
  mm/ksm: convert scan_get_next_rmap_item() from follow_page() to
    folio_walk
  mm/huge_memory: convert split_huge_pages_pid() from follow_page() to
    folio_walk
  s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk
  s390/mm/fault: convert do_secure_storage_access() from follow_page()
    to folio_walk
  mm: remove follow_page()
  mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk

 Documentation/mm/transhuge.rst |   6 +-
 arch/s390/kernel/uv.c          |  18 ++-
 arch/s390/mm/fault.c           |  16 ++-
 include/linux/mm.h             |   3 -
 include/linux/pagewalk.h       |  58 ++++++++++
 mm/filemap.c                   |   2 +-
 mm/gup.c                       |  24 +---
 mm/huge_memory.c               |  18 +--
 mm/ksm.c                       | 127 +++++++++------------
 mm/memory.c                    |   2 +-
 mm/migrate.c                   | 131 ++++++++++-----------
 mm/nommu.c                     |   6 -
 mm/pagewalk.c                  | 202 +++++++++++++++++++++++++++++++++
 13 files changed, 413 insertions(+), 200 deletions(-)

-- 
2.45.2



* [PATCH v1 01/11] mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 02/11] mm/pagewalk: introduce folio_walk_start() + folio_walk_end() David Hildenbrand
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

We want to make use of vm_normal_page_pmd() in generic page table
walking code where we might walk hugetlb folios that are mapped by PMDs
even without CONFIG_TRANSPARENT_HUGEPAGE.

So let's expose vm_normal_page_pmd() + vm_normal_folio_pmd() with
CONFIG_PGTABLE_HAS_HUGE_LEAVES.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4c8716cb306c..29772beb3275 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -666,7 +666,7 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
 	return NULL;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd)
 {
-- 
2.45.2



* [PATCH v1 02/11] mm/pagewalk: introduce folio_walk_start() + folio_walk_end()
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 01/11] mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-07  9:17   ` Claudio Imbrenda
  2024-08-02 15:55 ` [PATCH v1 03/11] mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walk David Hildenbrand
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

We want to get rid of follow_page(), and have a more reasonable way to
just look up a folio mapped at a certain address, perform some checks
while still under the PTL, and then only conditionally grab a folio
reference if really required.

Further, we might want to get rid of some walk_page_range*() users that
really only want to temporarily look up a single folio at a single address.

So let's add a new page table walker that does exactly that and that,
like GUP, is also able to walk hugetlb VMAs.

Add folio_walk_end() as a macro for now: the compiler is not easy to
please with the pte_unmap()->kunmap_local().

Note that one difference between follow_page() and get_user_pages(1) is
that follow_page() will not trigger faults to get something mapped. So
folio_walk is at least currently not a replacement for get_user_pages(1),
but could likely be extended/reused to achieve something similar in the
future.
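
As an illustration only (not part of this series), such a caller could
combine folio_walk with explicit fault handling and retry, roughly like
break_ksm() ends up doing at the end of this series; fault flags and
error handling are heavily simplified here, and the mmap lock must be
held in read mode throughout:

    struct folio_walk fw;
    struct folio *folio;
    vm_fault_t ret;

    do {
        cond_resched();
        folio = folio_walk_start(&fw, vma, addr, 0);
        if (folio)
            return folio; /* PTL remains held until folio_walk_end() */
        /* Nothing (suitable) is mapped: try to fault it in and retry. */
        ret = handle_mm_fault(vma, addr, 0, NULL);
    } while (!(ret & VM_FAULT_ERROR));
    return NULL;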

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/pagewalk.h |  58 +++++++++++
 mm/pagewalk.c            | 202 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 260 insertions(+)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 27cd1e59ccf7..f5eb5a32aeed 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -130,4 +130,62 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 		      pgoff_t nr, const struct mm_walk_ops *ops,
 		      void *private);
 
+typedef int __bitwise folio_walk_flags_t;
+
+/*
+ * Walk migration entries as well. Careful: a large folio might get split
+ * concurrently.
+ */
+#define FW_MIGRATION			((__force folio_walk_flags_t)BIT(0))
+
+/* Walk shared zeropages (small + huge) as well. */
+#define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(1))
+
+enum folio_walk_level {
+	FW_LEVEL_PTE,
+	FW_LEVEL_PMD,
+	FW_LEVEL_PUD,
+};
+
+/**
+ * struct folio_walk - folio_walk_start() / folio_walk_end() data
+ * @page:	exact folio page referenced (if applicable)
+ * @level:	page table level identifying the entry type
+ * @pte:	pointer to the page table entry (FW_LEVEL_PTE).
+ * @pmd:	pointer to the page table entry (FW_LEVEL_PMD).
+ * @pud:	pointer to the page table entry (FW_LEVEL_PUD).
+ * @ptl:	pointer to the page table lock.
+ *
+ * (see folio_walk_start() documentation for more details)
+ */
+struct folio_walk {
+	/* public */
+	struct page *page;
+	enum folio_walk_level level;
+	union {
+		pte_t *ptep;
+		pud_t *pudp;
+		pmd_t *pmdp;
+	};
+	union {
+		pte_t pte;
+		pud_t pud;
+		pmd_t pmd;
+	};
+	/* private */
+	struct vm_area_struct *vma;
+	spinlock_t *ptl;
+};
+
+struct folio *folio_walk_start(struct folio_walk *fw,
+		struct vm_area_struct *vma, unsigned long addr,
+		folio_walk_flags_t flags);
+
+#define folio_walk_end(__fw, __vma) do { \
+	spin_unlock((__fw)->ptl); \
+	if (likely((__fw)->level == FW_LEVEL_PTE)) \
+		pte_unmap((__fw)->ptep); \
+	vma_pgtable_walk_end(__vma); \
+} while (0)
+
 #endif /* _LINUX_PAGEWALK_H */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ae2f08ce991b..cd79fb3b89e5 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,8 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 /*
  * We want to know the real level where a entry is located ignoring any
@@ -654,3 +656,203 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 
 	return err;
 }
+
+/**
+ * folio_walk_start - walk the page tables to a folio
+ * @fw: filled with information on success.
+ * @vma: the VMA.
+ * @addr: the virtual address to use for the page table walk.
+ * @flags: flags modifying which folios to walk to.
+ *
+ * Walk the page tables using @addr in a given @vma to a mapped folio and
+ * return the folio, making sure that the page table entry referenced by
+ * @addr cannot change until folio_walk_end() was called.
+ *
+ * As default, this function returns only folios that are not special (e.g., not
+ * the zeropage) and never returns folios that are supposed to be ignored by the
+ * VM as documented by vm_normal_page(). If requested, zeropages will be
+ * returned as well.
+ *
+ * As default, this function only considers present page table entries.
+ * If requested, it will also consider migration entries.
+ *
+ * If this function returns NULL it might either indicate "there is nothing" or
+ * "there is nothing suitable".
+ *
+ * On success, @fw is filled and the function returns the folio while the PTL
+ * is still held and folio_walk_end() must be called to clean up,
+ * releasing any held locks. The returned folio must *not* be used after the
+ * call to folio_walk_end(), unless a short-term folio reference is taken before
+ * that call.
+ *
+ * @fw->page will correspond to the page that is effectively referenced by
+ * @addr. However, for migration entries and shared zeropages @fw->page is
+ * set to NULL. Note that large folios might be mapped by multiple page table
+ * entries, and this function will always only lookup a single entry as
+ * specified by @addr, which might or might not cover more than a single page of
+ * the returned folio.
+ *
+ * This function must *not* be used as a naive replacement for
+ * get_user_pages() / pin_user_pages(), especially not to perform DMA or
+ * to carelessly modify page content. This function may *only* be used to grab
+ * short-term folio references, never to grab long-term folio references.
+ *
+ * Using the page table entry pointers in @fw for reading or modifying the
+ * entry should be avoided where possible: however, there might be valid
+ * use cases.
+ *
+ * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
+ * For example, PMD page table sharing might require prior unsharing. Also,
+ * logical hugetlb entries might span multiple physical page table entries,
+ * which *must* be modified in a single operation (set_huge_pte_at(),
+ * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
+ * not correspond to the first physical entry of a logical hugetlb entry.
+ *
+ * The mmap lock must be held in read mode.
+ *
+ * Return: folio pointer on success, otherwise NULL.
+ */
+struct folio *folio_walk_start(struct folio_walk *fw,
+		struct vm_area_struct *vma, unsigned long addr,
+		folio_walk_flags_t flags)
+{
+	unsigned long entry_size;
+	bool expose_page = true;
+	struct page *page;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+
+	mmap_assert_locked(vma->vm_mm);
+	vma_pgtable_walk_begin(vma);
+
+	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
+		goto not_found;
+
+	pgdp = pgd_offset(vma->vm_mm, addr);
+	if (pgd_none_or_clear_bad(pgdp))
+		goto not_found;
+
+	p4dp = p4d_offset(pgdp, addr);
+	if (p4d_none_or_clear_bad(p4dp))
+		goto not_found;
+
+	pudp = pud_offset(p4dp, addr);
+	pud = pudp_get(pudp);
+	if (pud_none(pud))
+		goto not_found;
+	if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pud_leaf(pud)) {
+		ptl = pud_lock(vma->vm_mm, pudp);
+		pud = pudp_get(pudp);
+
+		entry_size = PUD_SIZE;
+		fw->level = FW_LEVEL_PUD;
+		fw->pudp = pudp;
+		fw->pud = pud;
+
+		if (!pud_present(pud) || pud_devmap(pud)) {
+			spin_unlock(ptl);
+			goto not_found;
+		} else if (!pud_leaf(pud)) {
+			spin_unlock(ptl);
+			goto pmd_table;
+		}
+		/*
+		 * TODO: vm_normal_page_pud() will be handy once we want to
+		 * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
+		 */
+		page = pud_page(pud);
+		goto found;
+	}
+
+pmd_table:
+	VM_WARN_ON_ONCE(pud_leaf(*pudp));
+	pmdp = pmd_offset(pudp, addr);
+	pmd = pmdp_get_lockless(pmdp);
+	if (pmd_none(pmd))
+		goto not_found;
+	if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pmd_leaf(pmd)) {
+		ptl = pmd_lock(vma->vm_mm, pmdp);
+		pmd = pmdp_get(pmdp);
+
+		entry_size = PMD_SIZE;
+		fw->level = FW_LEVEL_PMD;
+		fw->pmdp = pmdp;
+		fw->pmd = pmd;
+
+		if (pmd_none(pmd)) {
+			spin_unlock(ptl);
+			goto not_found;
+		} else if (!pmd_leaf(pmd)) {
+			spin_unlock(ptl);
+			goto pte_table;
+		} else if (pmd_present(pmd)) {
+			page = vm_normal_page_pmd(vma, addr, pmd);
+			if (page) {
+				goto found;
+			} else if ((flags & FW_ZEROPAGE) &&
+				    is_huge_zero_pmd(pmd)) {
+				page = pfn_to_page(pmd_pfn(pmd));
+				expose_page = false;
+				goto found;
+			}
+		} else if ((flags & FW_MIGRATION) &&
+			   is_pmd_migration_entry(pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+			page = pfn_swap_entry_to_page(entry);
+			expose_page = false;
+			goto found;
+		}
+		spin_unlock(ptl);
+		goto not_found;
+	}
+
+pte_table:
+	VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp)));
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	if (!ptep)
+		goto not_found;
+	pte = ptep_get(ptep);
+
+	entry_size = PAGE_SIZE;
+	fw->level = FW_LEVEL_PTE;
+	fw->ptep = ptep;
+	fw->pte = pte;
+
+	if (pte_present(pte)) {
+		page = vm_normal_page(vma, addr, pte);
+		if (page)
+			goto found;
+		if ((flags & FW_ZEROPAGE) &&
+		    is_zero_pfn(pte_pfn(pte))) {
+			page = pfn_to_page(pte_pfn(pte));
+			expose_page = false;
+			goto found;
+		}
+	} else if (!pte_none(pte)) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		if ((flags & FW_MIGRATION) &&
+		    is_migration_entry(entry)) {
+			page = pfn_swap_entry_to_page(entry);
+			expose_page = false;
+			goto found;
+		}
+	}
+	pte_unmap_unlock(ptep, ptl);
+not_found:
+	vma_pgtable_walk_end(vma);
+	return NULL;
+found:
+	if (expose_page)
+		/* Note: Offset from the mapped page, not the folio start. */
+		fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT);
+	else
+		fw->page = NULL;
+	fw->ptl = ptl;
+	return page_folio(page);
+}
-- 
2.45.2



* [PATCH v1 03/11] mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 01/11] mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 02/11] mm/pagewalk: introduce folio_walk_start() + folio_walk_end() David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 04/11] mm/migrate: convert add_page_for_migration() " David Hildenbrand
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Let's use folio_walk instead, so we can avoid taking a folio reference
just to read the nid and get rid of another follow_page()/FOLL_DUMP
user. Use FW_ZEROPAGE so we can return "-EFAULT" for it as documented.

The possible return values for follow_page() were confusing, especially
with FOLL_DUMP set. We'll handle it as documented in the man page:

* -EFAULT: This is a zero page or the memory area is not mapped by the
   process.
* -ENOENT: The page is not present.

We'll keep setting -ENOENT for ZONE_DEVICE. Maybe not the right thing to
do, but it likely doesn't really matter (just like for weird devmap,
whereby we fake "not present").

Note that the other errors (-EACCES, -EBUSY, -EIO, -EINVAL, -ENOMEM)
so far only applied when actually moving pages, not when only
querying stats.

We'll effectively drop the "secretmem" check we had in follow_page(), but
that shouldn't really matter here: we're not accessing folio/page content
after all.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/migrate.c | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index aa482c954cb0..b5365a434ba9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
 #include <linux/memory-tiers.h>
+#include <linux/pagewalk.h>
 
 #include <asm/tlbflush.h>
 
@@ -2331,28 +2332,26 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
 		struct vm_area_struct *vma;
-		struct page *page;
+		struct folio_walk fw;
+		struct folio *folio;
 		int err = -EFAULT;
 
 		vma = vma_lookup(mm, addr);
 		if (!vma)
 			goto set_status;
 
-		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
-
-		err = PTR_ERR(page);
-		if (IS_ERR(page))
-			goto set_status;
-
-		err = -ENOENT;
-		if (!page)
-			goto set_status;
-
-		if (!is_zone_device_page(page))
-			err = page_to_nid(page);
-
-		put_page(page);
+		folio = folio_walk_start(&fw, vma, addr, FW_ZEROPAGE);
+		if (folio) {
+			if (is_zero_folio(folio) || is_huge_zero_folio(folio))
+				err = -EFAULT;
+			else if (folio_is_zone_device(folio))
+				err = -ENOENT;
+			else
+				err = folio_nid(folio);
+			folio_walk_end(&fw, vma);
+		} else {
+			err = -ENOENT;
+		}
 set_status:
 		*status = err;
 
-- 
2.45.2



* [PATCH v1 04/11] mm/migrate: convert add_page_for_migration() from follow_page() to folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (2 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 03/11] mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walk David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 05/11] mm/ksm: convert get_mergeable_page() " David Hildenbrand
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Let's use folio_walk instead, so we can avoid taking a folio reference
when we won't even be trying to migrate the folio and to get rid of
another follow_page()/FOLL_DUMP user. Use FW_ZEROPAGE so we can return
"-EFAULT" for it as documented.

We now perform the folio_likely_mapped_shared() check under PTL, which
is what we want: relying on the mapcount and friends after dropping the
PTL does not make too much sense, as the page can get unmapped
concurrently from this process.

Further, we perform the folio isolation under PTL, similar to how we
handle it for MADV_PAGEOUT.

The possible return values for follow_page() were confusing, especially
with FOLL_DUMP set. We'll handle it as documented in the man page:
 * -EFAULT: This is a zero page or the memory area is not mapped by the
    process.
 * -ENOENT: The page is not present.

We'll keep setting -ENOENT for ZONE_DEVICE. Maybe not the right thing to
do, but it likely doesn't really matter (just like for weird devmap,
whereby we fake "not present").

The other errors are left as is, and match the documentation in the man
page.

While at it, rename add_page_for_migration() to
add_folio_for_migration().

We'll lose the "secretmem" check, but that shouldn't really matter
because these folios cannot ever be migrated. Should vma_migratable()
refuse these VMAs? Maybe.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/migrate.c | 100 +++++++++++++++++++++++----------------------------
 1 file changed, 45 insertions(+), 55 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index b5365a434ba9..e1383d9cc944 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2112,76 +2112,66 @@ static int do_move_pages_to_node(struct list_head *pagelist, int node)
 	return err;
 }
 
+static int __add_folio_for_migration(struct folio *folio, int node,
+		struct list_head *pagelist, bool migrate_all)
+{
+	if (is_zero_folio(folio) || is_huge_zero_folio(folio))
+		return -EFAULT;
+
+	if (folio_is_zone_device(folio))
+		return -ENOENT;
+
+	if (folio_nid(folio) == node)
+		return 0;
+
+	if (folio_likely_mapped_shared(folio) && !migrate_all)
+		return -EACCES;
+
+	if (folio_test_hugetlb(folio)) {
+		if (isolate_hugetlb(folio, pagelist))
+			return 1;
+	} else if (folio_isolate_lru(folio)) {
+		list_add_tail(&folio->lru, pagelist);
+		node_stat_mod_folio(folio,
+			NR_ISOLATED_ANON + folio_is_file_lru(folio),
+			folio_nr_pages(folio));
+		return 1;
+	}
+	return -EBUSY;
+}
+
 /*
- * Resolves the given address to a struct page, isolates it from the LRU and
+ * Resolves the given address to a struct folio, isolates it from the LRU and
  * puts it to the given pagelist.
  * Returns:
- *     errno - if the page cannot be found/isolated
+ *     errno - if the folio cannot be found/isolated
  *     0 - when it doesn't have to be migrated because it is already on the
  *         target node
  *     1 - when it has been queued
  */
-static int add_page_for_migration(struct mm_struct *mm, const void __user *p,
+static int add_folio_for_migration(struct mm_struct *mm, const void __user *p,
 		int node, struct list_head *pagelist, bool migrate_all)
 {
 	struct vm_area_struct *vma;
-	unsigned long addr;
-	struct page *page;
+	struct folio_walk fw;
 	struct folio *folio;
-	int err;
+	unsigned long addr;
+	int err = -EFAULT;
 
 	mmap_read_lock(mm);
 	addr = (unsigned long)untagged_addr_remote(mm, p);
 
-	err = -EFAULT;
 	vma = vma_lookup(mm, addr);
-	if (!vma || !vma_migratable(vma))
-		goto out;
-
-	/* FOLL_DUMP to ignore special (like zero) pages */
-	page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
-
-	err = PTR_ERR(page);
-	if (IS_ERR(page))
-		goto out;
-
-	err = -ENOENT;
-	if (!page)
-		goto out;
-
-	folio = page_folio(page);
-	if (folio_is_zone_device(folio))
-		goto out_putfolio;
-
-	err = 0;
-	if (folio_nid(folio) == node)
-		goto out_putfolio;
-
-	err = -EACCES;
-	if (folio_likely_mapped_shared(folio) && !migrate_all)
-		goto out_putfolio;
-
-	err = -EBUSY;
-	if (folio_test_hugetlb(folio)) {
-		if (isolate_hugetlb(folio, pagelist))
-			err = 1;
-	} else {
-		if (!folio_isolate_lru(folio))
-			goto out_putfolio;
-
-		err = 1;
-		list_add_tail(&folio->lru, pagelist);
-		node_stat_mod_folio(folio,
-			NR_ISOLATED_ANON + folio_is_file_lru(folio),
-			folio_nr_pages(folio));
+	if (vma && vma_migratable(vma)) {
+		folio = folio_walk_start(&fw, vma, addr, FW_ZEROPAGE);
+		if (folio) {
+			err = __add_folio_for_migration(folio, node, pagelist,
+							migrate_all);
+			folio_walk_end(&fw, vma);
+		} else {
+			err = -ENOENT;
+		}
 	}
-out_putfolio:
-	/*
-	 * Either remove the duplicate refcount from folio_isolate_lru()
-	 * or drop the folio ref if it was not isolated.
-	 */
-	folio_put(folio);
-out:
 	mmap_read_unlock(mm);
 	return err;
 }
@@ -2275,8 +2265,8 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
 		 * Errors in the page lookup or isolation are not fatal and we simply
 		 * report them via status
 		 */
-		err = add_page_for_migration(mm, p, current_node, &pagelist,
-					     flags & MPOL_MF_MOVE_ALL);
+		err = add_folio_for_migration(mm, p, current_node, &pagelist,
+					      flags & MPOL_MF_MOVE_ALL);
 
 		if (err > 0) {
 			/* The page is successfully queued for migration */
-- 
2.45.2



* [PATCH v1 05/11] mm/ksm: convert get_mergeable_page() from follow_page() to folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (3 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 04/11] mm/migrate: convert add_page_for_migration() " David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 06/11] mm/ksm: convert scan_get_next_rmap_item() " David Hildenbrand
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Let's use folio_walk instead: among other things, this avoids taking
temporary folio references if the folio does not apply anyway, and it
gets rid of one more follow_page() user.

Note that zeropages obviously don't apply here: the old code could just
as well have specified FOLL_DUMP. Anon folios are never secretmem, so we
don't care about losing the check in follow_page().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/ksm.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 14d9e53b1ec2..742b005f3f77 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -767,26 +767,28 @@ static struct page *get_mergeable_page(struct ksm_rmap_item *rmap_item)
 	struct mm_struct *mm = rmap_item->mm;
 	unsigned long addr = rmap_item->address;
 	struct vm_area_struct *vma;
-	struct page *page;
+	struct page *page = NULL;
+	struct folio_walk fw;
+	struct folio *folio;
 
 	mmap_read_lock(mm);
 	vma = find_mergeable_vma(mm, addr);
 	if (!vma)
 		goto out;
 
-	page = follow_page(vma, addr, FOLL_GET);
-	if (IS_ERR_OR_NULL(page))
-		goto out;
-	if (is_zone_device_page(page))
-		goto out_putpage;
-	if (PageAnon(page)) {
+	folio = folio_walk_start(&fw, vma, addr, 0);
+	if (folio) {
+		if (!folio_is_zone_device(folio) &&
+		    folio_test_anon(folio)) {
+			folio_get(folio);
+			page = fw.page;
+		}
+		folio_walk_end(&fw, vma);
+	}
+out:
+	if (page) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
-	} else {
-out_putpage:
-		put_page(page);
-out:
-		page = NULL;
 	}
 	mmap_read_unlock(mm);
 	return page;
-- 
2.45.2



* [PATCH v1 06/11] mm/ksm: convert scan_get_next_rmap_item() from follow_page() to folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (4 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 05/11] mm/ksm: convert get_mergeable_page() " David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() " David Hildenbrand
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Let's use folio_walk instead: among other things, this avoids taking
temporary folio references if the folio obviously does not apply anyway,
and it gets rid of one more follow_page() user. We cannot move all
handling under the PTL, so leave the rmap handling (which implies an
allocation) out.

Note that zeropages obviously don't apply: old code could just have
specified FOLL_DUMP. Further, we don't care about losing the secretmem
check in follow_page(): these are never anon pages and
vma_ksm_compatible() would never consider secretmem vmas
(VM_SHARED | VM_MAYSHARE must be set for secretmem, see secretmem_mmap()).

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/ksm.c | 38 ++++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 742b005f3f77..0f5b2bba4ef0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2564,36 +2564,46 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
 			ksm_scan.address = vma->vm_end;
 
 		while (ksm_scan.address < vma->vm_end) {
+			struct page *tmp_page = NULL;
+			struct folio_walk fw;
+			struct folio *folio;
+
 			if (ksm_test_exit(mm))
 				break;
-			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
-			if (IS_ERR_OR_NULL(*page)) {
-				ksm_scan.address += PAGE_SIZE;
-				cond_resched();
-				continue;
+
+			folio = folio_walk_start(&fw, vma, ksm_scan.address, 0);
+			if (folio) {
+				if (!folio_is_zone_device(folio) &&
+				     folio_test_anon(folio)) {
+					folio_get(folio);
+					tmp_page = fw.page;
+				}
+				folio_walk_end(&fw, vma);
 			}
-			if (is_zone_device_page(*page))
-				goto next_page;
-			if (PageAnon(*page)) {
-				flush_anon_page(vma, *page, ksm_scan.address);
-				flush_dcache_page(*page);
+
+			if (tmp_page) {
+				flush_anon_page(vma, tmp_page, ksm_scan.address);
+				flush_dcache_page(tmp_page);
 				rmap_item = get_next_rmap_item(mm_slot,
 					ksm_scan.rmap_list, ksm_scan.address);
 				if (rmap_item) {
 					ksm_scan.rmap_list =
 							&rmap_item->rmap_list;
 
-					if (should_skip_rmap_item(*page, rmap_item))
+					if (should_skip_rmap_item(tmp_page, rmap_item)) {
+						folio_put(folio);
 						goto next_page;
+					}
 
 					ksm_scan.address += PAGE_SIZE;
-				} else
-					put_page(*page);
+					*page = tmp_page;
+				} else {
+					folio_put(folio);
+				}
 				mmap_read_unlock(mm);
 				return rmap_item;
 			}
 next_page:
-			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
 			cond_resched();
 		}
-- 
2.45.2



* [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (5 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 06/11] mm/ksm: convert scan_get_next_rmap_item() " David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-06  9:46   ` Ryan Roberts
  2024-08-15 10:04   ` Pankaj Raghav
  2024-08-02 15:55 ` [PATCH v1 08/11] s390/uv: convert gmap_destroy_page() " David Hildenbrand
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Let's remove yet another follow_page() user. Note that we have to do the
split without holding the PTL, after folio_walk_end(). We don't care
about losing the secretmem check in follow_page().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0167dc27e365..697fcf89f975 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -40,6 +40,7 @@
 #include <linux/memory-tiers.h>
 #include <linux/compat.h>
 #include <linux/pgalloc_tag.h>
+#include <linux/pagewalk.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -3507,7 +3508,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 	 */
 	for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
 		struct vm_area_struct *vma = vma_lookup(mm, addr);
-		struct page *page;
+		struct folio_walk fw;
 		struct folio *folio;
 
 		if (!vma)
@@ -3519,13 +3520,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 			continue;
 		}
 
-		/* FOLL_DUMP to ignore special (like zero) pages */
-		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
-
-		if (IS_ERR_OR_NULL(page))
+		folio = folio_walk_start(&fw, vma, addr, 0);
+		if (!folio)
 			continue;
 
-		folio = page_folio(page);
 		if (!is_transparent_hugepage(folio))
 			goto next;
 
@@ -3544,13 +3542,19 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 
 		if (!folio_trylock(folio))
 			goto next;
+		folio_get(folio);
+		folio_walk_end(&fw, vma);
 
 		if (!split_folio_to_order(folio, new_order))
 			split++;
 
 		folio_unlock(folio);
-next:
 		folio_put(folio);
+
+		cond_resched();
+		continue;
+next:
+		folio_walk_end(&fw, vma);
 		cond_resched();
 	}
 	mmap_read_unlock(mm);
-- 
2.45.2



* [PATCH v1 08/11] s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (6 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() " David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-07  8:59   ` Claudio Imbrenda
  2024-08-02 15:55 ` [PATCH v1 09/11] s390/mm/fault: convert do_secure_storage_access() " David Hildenbrand
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Let's get rid of another follow_page() user and perform the UV calls
under PTL -- which likely should be fine.

No need for an additional reference while holding the PTL:
uv_destroy_folio() and uv_convert_from_secure_folio() raise the
refcount, so any concurrent make_folio_secure() would see an unexpected
reference and cannot set PG_arch_1 concurrently.

Do we really need a writable PTE? Likely yes, because the "destroy"
part is, in comparison to the export, a destructive operation. So we'll
keep the writability check for now.

We'll lose the secretmem check from follow_page(). Likely we don't care
about that here.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/kernel/uv.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 35ed2aea8891..9646f773208a 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -14,6 +14,7 @@
 #include <linux/memblock.h>
 #include <linux/pagemap.h>
 #include <linux/swap.h>
+#include <linux/pagewalk.h>
 #include <asm/facility.h>
 #include <asm/sections.h>
 #include <asm/uv.h>
@@ -462,9 +463,9 @@ EXPORT_SYMBOL_GPL(gmap_convert_to_secure);
 int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
 {
 	struct vm_area_struct *vma;
+	struct folio_walk fw;
 	unsigned long uaddr;
 	struct folio *folio;
-	struct page *page;
 	int rc;
 
 	rc = -EFAULT;
@@ -483,11 +484,15 @@ int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
 		goto out;
 
 	rc = 0;
-	/* we take an extra reference here */
-	page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET);
-	if (IS_ERR_OR_NULL(page))
+	folio = folio_walk_start(&fw, vma, uaddr, 0);
+	if (!folio)
 		goto out;
-	folio = page_folio(page);
+	/*
+	 * See gmap_make_secure(): large folios cannot be secure. Small
+	 * folio implies FW_LEVEL_PTE.
+	 */
+	if (folio_test_large(folio) || !pte_write(fw.pte))
+		goto out_walk_end;
 	rc = uv_destroy_folio(folio);
 	/*
 	 * Fault handlers can race; it is possible that two CPUs will fault
@@ -500,7 +505,8 @@ int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
 	 */
 	if (rc)
 		rc = uv_convert_from_secure_folio(folio);
-	folio_put(folio);
+out_walk_end:
+	folio_walk_end(&fw, vma);
 out:
 	mmap_read_unlock(gmap->mm);
 	return rc;
-- 
2.45.2



* [PATCH v1 09/11] s390/mm/fault: convert do_secure_storage_access() from follow_page() to folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (7 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 08/11] s390/uv: convert gmap_destroy_page() " David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-07  8:59   ` Claudio Imbrenda
  2024-08-02 15:55 ` [PATCH v1 10/11] mm: remove follow_page() David Hildenbrand
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Let's get rid of another follow_page() user and perform the conversion
under the PTL. Note that this is also what follow_page_pte() ends up doing.

Unfortunately we cannot currently optimize out the additional reference,
because arch_make_folio_accessible() must be called with a raised
refcount to protect against concurrent conversion to secure. We can just
move the arch_make_folio_accessible() under the PTL, like
follow_page_pte() would.

We'll effectively drop the "writable" check implied by FOLL_WRITE:
follow_page_pte() would also not check that when calling
arch_make_folio_accessible(), so there is no good reason for doing that
here.

We'll lose the secretmem check from follow_page() as well, which we
shouldn't really care about here.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/mm/fault.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 8e149ef5e89b..ad8b0d6b77ea 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -34,6 +34,7 @@
 #include <linux/uaccess.h>
 #include <linux/hugetlb.h>
 #include <linux/kfence.h>
+#include <linux/pagewalk.h>
 #include <asm/asm-extable.h>
 #include <asm/asm-offsets.h>
 #include <asm/ptrace.h>
@@ -492,9 +493,9 @@ void do_secure_storage_access(struct pt_regs *regs)
 	union teid teid = { .val = regs->int_parm_long };
 	unsigned long addr = get_fault_address(regs);
 	struct vm_area_struct *vma;
+	struct folio_walk fw;
 	struct mm_struct *mm;
 	struct folio *folio;
-	struct page *page;
 	struct gmap *gmap;
 	int rc;
 
@@ -536,15 +537,18 @@ void do_secure_storage_access(struct pt_regs *regs)
 		vma = find_vma(mm, addr);
 		if (!vma)
 			return handle_fault_error(regs, SEGV_MAPERR);
-		page = follow_page(vma, addr, FOLL_WRITE | FOLL_GET);
-		if (IS_ERR_OR_NULL(page)) {
+		folio = folio_walk_start(&fw, vma, addr, 0);
+		if (!folio) {
 			mmap_read_unlock(mm);
 			break;
 		}
-		folio = page_folio(page);
-		if (arch_make_folio_accessible(folio))
-			send_sig(SIGSEGV, current, 0);
+		/* arch_make_folio_accessible() needs a raised refcount. */
+		folio_get(folio);
+		rc = arch_make_folio_accessible(folio);
 		folio_put(folio);
+		folio_walk_end(&fw, vma);
+		if (rc)
+			send_sig(SIGSEGV, current, 0);
 		mmap_read_unlock(mm);
 		break;
 	case KERNEL_FAULT:
-- 
2.45.2



* [PATCH v1 10/11] mm: remove follow_page()
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (8 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 09/11] s390/mm/fault: convert do_secure_storage_access() " David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-02 15:55 ` [PATCH v1 11/11] mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk David Hildenbrand
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

All users are gone, let's remove it and any leftovers in comments. We'll
leave any FOLL_* / follow_page_*() naming cleanups as future work.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 Documentation/mm/transhuge.rst |  6 +++---
 include/linux/mm.h             |  3 ---
 mm/filemap.c                   |  2 +-
 mm/gup.c                       | 24 +-----------------------
 mm/nommu.c                     |  6 ------
 5 files changed, 5 insertions(+), 36 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index 1ba0ad63246c..a2cd8800d527 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -31,10 +31,10 @@ Design principles
   feature that applies to all dynamic high order allocations in the
   kernel)
 
-get_user_pages and follow_page
-==============================
+get_user_pages and pin_user_pages
+=================================
 
-get_user_pages and follow_page if run on a hugepage, will return the
+get_user_pages and pin_user_pages if run on a hugepage, will return the
 head or tail pages as usual (exactly as they would do on
 hugetlbfs). Most GUP users will only care about the actual physical
 address of the page and its temporary pinning to release after the I/O
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2f6c08b53e4f..ee8cea73d415 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3527,9 +3527,6 @@ static inline vm_fault_t vmf_fs_error(int err)
 	return VM_FAULT_SIGBUS;
 }
 
-struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
-			 unsigned int foll_flags);
-
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
 {
 	if (vm_fault & VM_FAULT_OOM)
diff --git a/mm/filemap.c b/mm/filemap.c
index d62150418b91..4130be74f6fd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -112,7 +112,7 @@
  *    ->swap_lock		(try_to_unmap_one)
  *    ->private_lock		(try_to_unmap_one)
  *    ->i_pages lock		(try_to_unmap_one)
- *    ->lruvec->lru_lock	(follow_page->mark_page_accessed)
+ *    ->lruvec->lru_lock	(follow_page_mask->mark_page_accessed)
  *    ->lruvec->lru_lock	(check_pte_range->isolate_lru_page)
  *    ->private_lock		(folio_remove_rmap_pte->set_page_dirty)
  *    ->i_pages lock		(folio_remove_rmap_pte->set_page_dirty)
diff --git a/mm/gup.c b/mm/gup.c
index 3e8484c893aa..d19884e097fd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1072,28 +1072,6 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	return page;
 }
 
-struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
-			 unsigned int foll_flags)
-{
-	struct follow_page_context ctx = { NULL };
-	struct page *page;
-
-	if (vma_is_secretmem(vma))
-		return NULL;
-
-	if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
-		return NULL;
-
-	/*
-	 * We never set FOLL_HONOR_NUMA_FAULT because callers don't expect
-	 * to fail on PROT_NONE-mapped pages.
-	 */
-	page = follow_page_mask(vma, address, foll_flags, &ctx);
-	if (ctx.pgmap)
-		put_dev_pagemap(ctx.pgmap);
-	return page;
-}
-
 static int get_gate_page(struct mm_struct *mm, unsigned long address,
 		unsigned int gup_flags, struct vm_area_struct **vma,
 		struct page **page)
@@ -2519,7 +2497,7 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
 	 * These flags not allowed to be specified externally to the gup
 	 * interfaces:
 	 * - FOLL_TOUCH/FOLL_PIN/FOLL_TRIED/FOLL_FAST_ONLY are internal only
-	 * - FOLL_REMOTE is internal only and used on follow_page()
+	 * - FOLL_REMOTE is internal only, set in (get|pin)_user_pages_remote()
 	 * - FOLL_UNLOCKABLE is internal only and used if locked is !NULL
 	 */
 	if (WARN_ON_ONCE(gup_flags & INTERNAL_GUP_FLAGS))
diff --git a/mm/nommu.c b/mm/nommu.c
index 40cac1348b40..385b0c15add8 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1578,12 +1578,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	return ret;
 }
 
-struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
-			 unsigned int foll_flags)
-{
-	return NULL;
-}
-
 int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		unsigned long pfn, unsigned long size, pgprot_t prot)
 {
-- 
2.45.2



* [PATCH v1 11/11] mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (9 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 10/11] mm: remove follow_page() David Hildenbrand
@ 2024-08-02 15:55 ` David Hildenbrand
  2024-08-03  5:34 ` [PATCH v1 00/11] mm: replace follow_page() by folio_walk Andrew Morton
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-02 15:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	David Hildenbrand, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

Let's simplify by reusing folio_walk. Keep the existing behavior by
handling migration entries and zeropages.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/ksm.c | 63 ++++++++++++++------------------------------------------
 1 file changed, 16 insertions(+), 47 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 0f5b2bba4ef0..8e53666bc7b0 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -608,47 +608,6 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
 	return atomic_read(&mm->mm_users) == 0;
 }
 
-static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
-			struct mm_walk *walk)
-{
-	struct page *page = NULL;
-	spinlock_t *ptl;
-	pte_t *pte;
-	pte_t ptent;
-	int ret;
-
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
-	if (!pte)
-		return 0;
-	ptent = ptep_get(pte);
-	if (pte_present(ptent)) {
-		page = vm_normal_page(walk->vma, addr, ptent);
-	} else if (!pte_none(ptent)) {
-		swp_entry_t entry = pte_to_swp_entry(ptent);
-
-		/*
-		 * As KSM pages remain KSM pages until freed, no need to wait
-		 * here for migration to end.
-		 */
-		if (is_migration_entry(entry))
-			page = pfn_swap_entry_to_page(entry);
-	}
-	/* return 1 if the page is an normal ksm page or KSM-placed zero page */
-	ret = (page && PageKsm(page)) || is_ksm_zero_pte(ptent);
-	pte_unmap_unlock(pte, ptl);
-	return ret;
-}
-
-static const struct mm_walk_ops break_ksm_ops = {
-	.pmd_entry = break_ksm_pmd_entry,
-	.walk_lock = PGWALK_RDLOCK,
-};
-
-static const struct mm_walk_ops break_ksm_lock_vma_ops = {
-	.pmd_entry = break_ksm_pmd_entry,
-	.walk_lock = PGWALK_WRLOCK,
-};
-
 /*
  * We use break_ksm to break COW on a ksm page by triggering unsharing,
  * such that the ksm page will get replaced by an exclusive anonymous page.
@@ -665,16 +624,26 @@ static const struct mm_walk_ops break_ksm_lock_vma_ops = {
 static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_vma)
 {
 	vm_fault_t ret = 0;
-	const struct mm_walk_ops *ops = lock_vma ?
-				&break_ksm_lock_vma_ops : &break_ksm_ops;
+
+	if (lock_vma)
+		vma_start_write(vma);
 
 	do {
-		int ksm_page;
+		bool ksm_page = false;
+		struct folio_walk fw;
+		struct folio *folio;
 
 		cond_resched();
-		ksm_page = walk_page_range_vma(vma, addr, addr + 1, ops, NULL);
-		if (WARN_ON_ONCE(ksm_page < 0))
-			return ksm_page;
+		folio = folio_walk_start(&fw, vma, addr,
+					 FW_MIGRATION | FW_ZEROPAGE);
+		if (folio) {
+			/* Small folio implies FW_LEVEL_PTE. */
+			if (!folio_test_large(folio) &&
+			    (folio_test_ksm(folio) || is_ksm_zero_pte(fw.pte)))
+				ksm_page = true;
+			folio_walk_end(&fw, vma);
+		}
+
 		if (!ksm_page)
 			return 0;
 		ret = handle_mm_fault(vma, addr,
-- 
2.45.2



* Re: [PATCH v1 00/11] mm: replace follow_page() by folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (10 preceding siblings ...)
  2024-08-02 15:55 ` [PATCH v1 11/11] mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk David Hildenbrand
@ 2024-08-03  5:34 ` Andrew Morton
  2024-08-06 13:42 ` Claudio Imbrenda
  2024-08-07  9:15 ` Claudio Imbrenda
  13 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2024-08-03  5:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Matthew Wilcox (Oracle), Jonathan Corbet, Christian Borntraeger,
	Janosch Frank, Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer

On Fri,  2 Aug 2024 17:55:13 +0200 David Hildenbrand <david@redhat.com> wrote:

> I am not able to test the s390x Secure Execution changes, unfortunately.

I'll add it to -next.  Could the s390 developers please check this?


* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-02 15:55 ` [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() " David Hildenbrand
@ 2024-08-06  9:46   ` Ryan Roberts
  2024-08-06  9:56     ` David Hildenbrand
  2024-08-15 10:04   ` Pankaj Raghav
  1 sibling, 1 reply; 32+ messages in thread
From: Ryan Roberts @ 2024-08-06  9:46 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Gerald Schaefer, Mark Brown

On 02/08/2024 16:55, David Hildenbrand wrote:
> Let's remove yet another follow_page() user. Note that we have to do the
> split without holding the PTL, after folio_walk_end(). We don't care
> about losing the secretmem check in follow_page().

Hi David,

Our (arm64) CI is showing a regression in split_huge_page_test from mm selftests from next-20240805 onwards. Navigating around a couple of other lurking bugs, I was able to bisect to this change (which smells about right).

Newly failing test:

# # ------------------------------
# # running ./split_huge_page_test
# # ------------------------------
# # TAP version 13
# # 1..12
# # Bail out! Still AnonHugePages not split
# # # Planned tests != run tests (12 != 0)
# # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
# # [FAIL]
# not ok 52 split_huge_page_test # exit=1

It's trying to split some pmd-mapped THPs then checking and finding that they are not split. The split is requested via /sys/kernel/debug/split_huge_pages, which I believe ends up in this function you are modifying here. Although I'll admit that looking at the change, there is nothing obviously wrong! Any ideas?

bisect log:

# bad: [1e391b34f6aa043c7afa40a2103163a0ef06d179] Add linux-next specific files for 20240806
git bisect bad 1e391b34f6aa043c7afa40a2103163a0ef06d179
# good: [de9c2c66ad8e787abec7c9d7eff4f8c3cdd28aed] Linux 6.11-rc2
git bisect good de9c2c66ad8e787abec7c9d7eff4f8c3cdd28aed
# bad: [01c2d56f2c52e8af01dfd91af1fe9affc76c4c9e] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
git bisect bad 01c2d56f2c52e8af01dfd91af1fe9affc76c4c9e
# bad: [01c2d56f2c52e8af01dfd91af1fe9affc76c4c9e] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
git bisect bad 01c2d56f2c52e8af01dfd91af1fe9affc76c4c9e
# bad: [3610638e967f32f02c56c7cc8f7d6a815972f8c2] Merge branch 'for-linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux.git
git bisect bad 3610638e967f32f02c56c7cc8f7d6a815972f8c2
# bad: [3610638e967f32f02c56c7cc8f7d6a815972f8c2] Merge branch 'for-linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux.git
git bisect bad 3610638e967f32f02c56c7cc8f7d6a815972f8c2
# bad: [d35ef6c9d106eedff36908c21699e1b7f3e55584] Merge branch 'clang-format' of https://github.com/ojeda/linux.git
git bisect bad d35ef6c9d106eedff36908c21699e1b7f3e55584
# good: [e1a15959d75c9ba4b45e07e37bcf843c85750010] Merge branch 'for-linux-next-fixes' of https://gitlab.freedesktop.org/drm/misc/kernel.git
git bisect good e1a15959d75c9ba4b45e07e37bcf843c85750010
# good: [6d66cb9bdeceb769ce62591f56580ebe80f6267a] mm: swap: add a adaptive full cluster cache reclaim
git bisect good 6d66cb9bdeceb769ce62591f56580ebe80f6267a
# bad: [2b820b576dfc4aa9b65f18b68f468cb5b38ece84] mm: optimization on page allocation when CMA enabled
git bisect bad 2b820b576dfc4aa9b65f18b68f468cb5b38ece84
# bad: [ab70279848c8623027791799492a3f6e7c38a9b2] MIPS: sgi-ip27: drop HAVE_ARCH_NODEDATA_EXTENSION
git bisect bad ab70279848c8623027791799492a3f6e7c38a9b2
# bad: [539bc09ff00b29eb60f3dc8ed2d82ad2050a582d] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
git bisect bad 539bc09ff00b29eb60f3dc8ed2d82ad2050a582d
# good: [1a37544d0e35340ce740d377d7d6c746a84e2aae] include/linux/mmzone.h: clean up watermark accessors
git bisect good 1a37544d0e35340ce740d377d7d6c746a84e2aae
# good: [22adafb60d6e1a607a3d99da90927ddd7df928ad] mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walk
git bisect good 22adafb60d6e1a607a3d99da90927ddd7df928ad
# good: [57e1ccf54dba4dda6d6f0264b76e2b86eec3d401] mm/ksm: convert get_mergeable_page() from follow_page() to folio_walk
git bisect good 57e1ccf54dba4dda6d6f0264b76e2b86eec3d401
# good: [285aa1a963f310530351b0e4a2e64bc4b806e518] mm/ksm: convert scan_get_next_rmap_item() from follow_page() to folio_walk
git bisect good 285aa1a963f310530351b0e4a2e64bc4b806e518
# first bad commit: [539bc09ff00b29eb60f3dc8ed2d82ad2050a582d] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk

Thanks,
Ryan


> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0167dc27e365..697fcf89f975 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -40,6 +40,7 @@
>  #include <linux/memory-tiers.h>
>  #include <linux/compat.h>
>  #include <linux/pgalloc_tag.h>
> +#include <linux/pagewalk.h>
>  
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> @@ -3507,7 +3508,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  	 */
>  	for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
>  		struct vm_area_struct *vma = vma_lookup(mm, addr);
> -		struct page *page;
> +		struct folio_walk fw;
>  		struct folio *folio;
>  
>  		if (!vma)
> @@ -3519,13 +3520,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  			continue;
>  		}
>  
> -		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> -
> -		if (IS_ERR_OR_NULL(page))
> +		folio = folio_walk_start(&fw, vma, addr, 0);
> +		if (!folio)
>  			continue;
>  
> -		folio = page_folio(page);
>  		if (!is_transparent_hugepage(folio))
>  			goto next;
>  
> @@ -3544,13 +3542,19 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  
>  		if (!folio_trylock(folio))
>  			goto next;
> +		folio_get(folio);
> +		folio_walk_end(&fw, vma);
>  
>  		if (!split_folio_to_order(folio, new_order))
>  			split++;
>  
>  		folio_unlock(folio);
> -next:
>  		folio_put(folio);
> +
> +		cond_resched();
> +		continue;
> +next:
> +		folio_walk_end(&fw, vma);
>  		cond_resched();
>  	}
>  	mmap_read_unlock(mm);


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-06  9:46   ` Ryan Roberts
@ 2024-08-06  9:56     ` David Hildenbrand
  2024-08-06 10:03       ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-06  9:56 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Gerald Schaefer, Mark Brown

On 06.08.24 11:46, Ryan Roberts wrote:
> On 02/08/2024 16:55, David Hildenbrand wrote:
>> Let's remove yet another follow_page() user. Note that we have to do the
>> split without holding the PTL, after folio_walk_end(). We don't care
>> about losing the secretmem check in follow_page().
> 
> Hi David,
> 
> Our (arm64) CI is showing a regression in split_huge_page_test from mm selftests from next-20240805 onwards. Navigating around a couple of other lurking bugs, I was able to bisect to this change (which smells about right).
> 
> Newly failing test:
> 
> # # ------------------------------
> # # running ./split_huge_page_test
> # # ------------------------------
> # # TAP version 13
> # # 1..12
> # # Bail out! Still AnonHugePages not split
> # # # Planned tests != run tests (12 != 0)
> # # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
> # # [FAIL]
> # not ok 52 split_huge_page_test # exit=1
> 
> It's trying to split some pmd-mapped THPs then checking and finding that they are not split. The split is requested via /sys/kernel/debug/split_huge_pages, which I believe ends up in this function you are modifying here. Although I'll admit that looking at the change, there is nothing obviously wrong! Any ideas?

Nothing jumps at me as well. Let me fire up the debugger :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-06  9:56     ` David Hildenbrand
@ 2024-08-06 10:03       ` David Hildenbrand
  2024-08-06 10:24         ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-06 10:03 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Gerald Schaefer, Mark Brown

On 06.08.24 11:56, David Hildenbrand wrote:
> On 06.08.24 11:46, Ryan Roberts wrote:
>> On 02/08/2024 16:55, David Hildenbrand wrote:
>>> Let's remove yet another follow_page() user. Note that we have to do the
>>> split without holding the PTL, after folio_walk_end(). We don't care
>>> about losing the secretmem check in follow_page().
>>
>> Hi David,
>>
>> Our (arm64) CI is showing a regression in split_huge_page_test from mm selftests from next-20240805 onwards. Navigating around a couple of other lurking bugs, I was able to bisect to this change (which smells about right).
>>
>> Newly failing test:
>>
>> # # ------------------------------
>> # # running ./split_huge_page_test
>> # # ------------------------------
>> # # TAP version 13
>> # # 1..12
>> # # Bail out! Still AnonHugePages not split
>> # # # Planned tests != run tests (12 != 0)
>> # # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>> # # [FAIL]
>> # not ok 52 split_huge_page_test # exit=1
>>
>> It's trying to split some pmd-mapped THPs then checking and finding that they are not split. The split is requested via /sys/kernel/debug/split_huge_pages, which I believe ends up in this function you are modifying here. Although I'll admit that looking at the change, there is nothing obviously wrong! Any ideas?
> 
> Nothing jumps at me as well. Let me fire up the debugger :)

Ah, very likely the can_split_folio() check expects a raised refcount 
already.
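
To spell the arithmetic out with a toy model (plain integers and a made-up helper, not kernel code): the racy check only passes when every reference that is not otherwise accounted for comes from the folio's mappings, so the number of pins the caller itself holds has to enter the equation.

#include <stdbool.h>

/*
 * Toy version of the racy splittability check. caller_pins is the
 * number of references the caller itself holds: 1 with the old
 * follow_page(FOLL_GET) path, 0 when only folio_walk_start() was used
 * and no folio reference was taken.
 */
static bool can_split_model(int mapcount, int refcount,
			    int extra_pins, int caller_pins)
{
	return mapcount == refcount - extra_pins - caller_pins;
}

With the hard-coded "- 1" and no extra reference taken, an otherwise idle
PMD-mapped THP (mapcount == refcount) looks unsplittable from the debugfs
path, which is exactly what the selftest trips over.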

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-06 10:03       ` David Hildenbrand
@ 2024-08-06 10:24         ` David Hildenbrand
  2024-08-06 11:17           ` Ryan Roberts
  2024-08-06 15:36           ` Zi Yan
  0 siblings, 2 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-06 10:24 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Gerald Schaefer, Mark Brown

On 06.08.24 12:03, David Hildenbrand wrote:
> On 06.08.24 11:56, David Hildenbrand wrote:
>> On 06.08.24 11:46, Ryan Roberts wrote:
>>> On 02/08/2024 16:55, David Hildenbrand wrote:
>>>> Let's remove yet another follow_page() user. Note that we have to do the
>>>> split without holding the PTL, after folio_walk_end(). We don't care
>>>> about losing the secretmem check in follow_page().
>>>
>>> Hi David,
>>>
>>> Our (arm64) CI is showing a regression in split_huge_page_test from mm selftests from next-20240805 onwards. Navigating around a couple of other lurking bugs, I was able to bisect to this change (which smells about right).
>>>
>>> Newly failing test:
>>>
>>> # # ------------------------------
>>> # # running ./split_huge_page_test
>>> # # ------------------------------
>>> # # TAP version 13
>>> # # 1..12
>>> # # Bail out! Still AnonHugePages not split
>>> # # # Planned tests != run tests (12 != 0)
>>> # # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>>> # # [FAIL]
>>> # not ok 52 split_huge_page_test # exit=1
>>>
>>> It's trying to split some pmd-mapped THPs then checking and finding that they are not split. The split is requested via /sys/kernel/debug/split_huge_pages, which I believe ends up in this function you are modifying here. Although I'll admit that looking at the change, there is nothing obviously wrong! Any ideas?
>>
>> Nothing jumps at me as well. Let me fire up the debugger :)
> 
> Ah, very likely the can_split_folio() check expects a raised refcount
> already.

Indeed, the following does the trick! Thanks Ryan, I could have sworn
I ran that selftest as well.

TAP version 13
1..12
ok 1 Split huge pages successful
ok 2 Split PTE-mapped huge pages successful
# Please enable pr_debug in split_huge_pages_in_file() for more info.
# Please check dmesg for more information
ok 3 File-backed THP split test done

...


@Andrew, can you squash the following?


 From e5ea585de3e089ea89bf43d8447ff9fc9b371286 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 6 Aug 2024 12:08:17 +0200
Subject: [PATCH] fixup: mm/huge_memory: convert split_huge_pages_pid() from
  follow_page() to folio_walk

We have to teach can_split_folio() that we are not holding an additional
reference.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  include/linux/huge_mm.h | 4 ++--
  mm/huge_memory.c        | 8 ++++----
  mm/vmscan.c             | 2 +-
  3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..ce44caa40eed 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -314,7 +314,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
  		unsigned long len, unsigned long pgoff, unsigned long flags,
  		vm_flags_t vm_flags);
  
-bool can_split_folio(struct folio *folio, int *pextra_pins);
+bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
  		unsigned int new_order);
  static inline int split_huge_page(struct page *page)
@@ -470,7 +470,7 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
  }
  
  static inline bool
-can_split_folio(struct folio *folio, int *pextra_pins)
+can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
  {
  	return false;
  }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 697fcf89f975..c40b0dcc205b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3021,7 +3021,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
  }
  
  /* Racy check whether the huge page can be split */
-bool can_split_folio(struct folio *folio, int *pextra_pins)
+bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
  {
  	int extra_pins;
  
@@ -3033,7 +3033,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
  		extra_pins = folio_nr_pages(folio);
  	if (pextra_pins)
  		*pextra_pins = extra_pins;
-	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - 1;
+	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - caller_pins;
  }
  
  /*
@@ -3201,7 +3201,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
  	 * Racy check if we can split the page, before unmap_folio() will
  	 * split PMDs
  	 */
-	if (!can_split_folio(folio, &extra_pins)) {
+	if (!can_split_folio(folio, 1, &extra_pins)) {
  		ret = -EAGAIN;
  		goto out_unlock;
  	}
@@ -3537,7 +3537,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
  		 * can be split or not. So skip the check here.
  		 */
  		if (!folio_test_private(folio) &&
-		    !can_split_folio(folio, NULL))
+		    !can_split_folio(folio, 0, NULL))
  			goto next;
  
  		if (!folio_trylock(folio))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 31d13462571e..a332cb80e928 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1227,7 +1227,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
  					goto keep_locked;
  				if (folio_test_large(folio)) {
  					/* cannot split folio, skip it */
-					if (!can_split_folio(folio, NULL))
+					if (!can_split_folio(folio, 1, NULL))
  						goto activate_locked;
  					/*
  					 * Split partially mapped folios right away.
-- 
2.45.2


-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-06 10:24         ` David Hildenbrand
@ 2024-08-06 11:17           ` Ryan Roberts
  2024-08-06 15:36           ` Zi Yan
  1 sibling, 0 replies; 32+ messages in thread
From: Ryan Roberts @ 2024-08-06 11:17 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Claudio Imbrenda,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev, Sven Schnelle,
	Gerald Schaefer, Mark Brown

>>>> It's trying to split some pmd-mapped THPs then checking and finding that
>>>> they are not split. The split is requested via
>>>> /sys/kernel/debug/split_huge_pages, which I believe ends up in this function
>>>> you are modifying here. Although I'll admit that looking at the change,
>>>> there is nothing obviously wrong! Any ideas?
>>>
>>> Nothing jumps at me as well. Let me fire up the debugger :)
>>
>> Ah, very likely the can_split_folio() check expects a raised refcount
>> already.
> 
> Indeed, the following does the trick! Thanks Ryan, I could have sworn
> I ran that selftest as well.

Aha! Thanks for sorting this out so quickly!



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 00/11] mm: replace follow_page() by folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (11 preceding siblings ...)
  2024-08-03  5:34 ` [PATCH v1 00/11] mm: replace follow_page() by folio_walk Andrew Morton
@ 2024-08-06 13:42 ` Claudio Imbrenda
  2024-08-07  9:15 ` Claudio Imbrenda
  13 siblings, 0 replies; 32+ messages in thread
From: Claudio Imbrenda @ 2024-08-06 13:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Sven Schnelle, Gerald Schaefer

On Fri,  2 Aug 2024 17:55:13 +0200
David Hildenbrand <david@redhat.com> wrote:

> Looking into a way of moving the last folio_likely_mapped_shared() call
> in add_folio_for_migration() under the PTL, I found myself removing
> follow_page(). This paves the way for cleaning up all the FOLL_, follow_*
> terminology to just be called "GUP" nowadays.
> 
> The new page table walker will lookup a mapped folio and return to the
> caller with the PTL held, such that the folio cannot get unmapped
> concurrently. Callers can then conditionally decide whether they really
> want to take a short-term folio reference or whether the can simply
> unlock the PTL and be done with it.
> 
> folio_walk is similar to page_vma_mapped_walk(), except that we don't know
> the folio we want to walk to and that we are only walking to exactly one
> PTE/PMD/PUD.
> 
> folio_walk provides access to the pte/pmd/pud (and the referenced folio
> page because things like KSM need that), however, as part of this series
> no page table modifications are performed by users.
> 
> We might be able to convert some other walk_page_range() users that really
> only walk to one address, such as DAMON with
> damon_mkold_ops/damon_young_ops. It might make sense to extend folio_walk
> in the future to optionally fault in a folio (if applicable), such that we
> can replace some get_user_pages() users that really only want to lookup
> a single page/folio under PTL without unconditionally grabbing a folio
> reference.
> 
> I have plans to extend the approach to a range walker that will try
> batching various page table entries (not just folio pages) to be a better
> replace for walk_page_range() -- and users will be able to opt in which
> type of page table entries they want to process -- but that will require
> more work and more thoughts.
> 
> KSM seems to work just fine (ksm_functional_tests selftests) and
> move_pages seems to work (migration selftest). I tested the leaf
> implementation excessively using various hugetlb sizes (64K, 2M, 32M, 1G)
> on arm64 using move_pages and did some more testing on x86-64. Cross
> compiled on a bunch of architectures.
> 
> I am not able to test the s390x Secure Execution changes, unfortunately.

the series looks good; we will do some tests and report back if
everything is ok

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Janosch Frank <frankja@linux.ibm.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> 
> David Hildenbrand (11):
>   mm: provide vm_normal_(page|folio)_pmd() with
>     CONFIG_PGTABLE_HAS_HUGE_LEAVES
>   mm/pagewalk: introduce folio_walk_start() + folio_walk_end()
>   mm/migrate: convert do_pages_stat_array() from follow_page() to
>     folio_walk
>   mm/migrate: convert add_page_for_migration() from follow_page() to
>     folio_walk
>   mm/ksm: convert get_mergeable_page() from follow_page() to folio_walk
>   mm/ksm: convert scan_get_next_rmap_item() from follow_page() to
>     folio_walk
>   mm/huge_memory: convert split_huge_pages_pid() from follow_page() to
>     folio_walk
>   s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk
>   s390/mm/fault: convert do_secure_storage_access() from follow_page()
>     to folio_walk
>   mm: remove follow_page()
>   mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk
> 
>  Documentation/mm/transhuge.rst |   6 +-
>  arch/s390/kernel/uv.c          |  18 ++-
>  arch/s390/mm/fault.c           |  16 ++-
>  include/linux/mm.h             |   3 -
>  include/linux/pagewalk.h       |  58 ++++++++++
>  mm/filemap.c                   |   2 +-
>  mm/gup.c                       |  24 +---
>  mm/huge_memory.c               |  18 +--
>  mm/ksm.c                       | 127 +++++++++------------
>  mm/memory.c                    |   2 +-
>  mm/migrate.c                   | 131 ++++++++++-----------
>  mm/nommu.c                     |   6 -
>  mm/pagewalk.c                  | 202 +++++++++++++++++++++++++++++++++
>  13 files changed, 413 insertions(+), 200 deletions(-)
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-06 10:24         ` David Hildenbrand
  2024-08-06 11:17           ` Ryan Roberts
@ 2024-08-06 15:36           ` Zi Yan
  2024-08-07  9:57             ` David Hildenbrand
  1 sibling, 1 reply; 32+ messages in thread
From: Zi Yan @ 2024-08-06 15:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-doc, kvm, linux-s390,
	linux-fsdevel, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer, Mark Brown


On 6 Aug 2024, at 6:24, David Hildenbrand wrote:

> On 06.08.24 12:03, David Hildenbrand wrote:
>> On 06.08.24 11:56, David Hildenbrand wrote:
>>> On 06.08.24 11:46, Ryan Roberts wrote:
>>>> On 02/08/2024 16:55, David Hildenbrand wrote:
>>>>> Let's remove yet another follow_page() user. Note that we have to do the
>>>>> split without holding the PTL, after folio_walk_end(). We don't care
>>>>> about losing the secretmem check in follow_page().
>>>>
>>>> Hi David,
>>>>
>>>> Our (arm64) CI is showing a regression in split_huge_page_test from mm selftests from next-20240805 onwards. Navigating around a couple of other lurking bugs, I was able to bisect to this change (which smells about right).
>>>>
>>>> Newly failing test:
>>>>
>>>> # # ------------------------------
>>>> # # running ./split_huge_page_test
>>>> # # ------------------------------
>>>> # # TAP version 13
>>>> # # 1..12
>>>> # # Bail out! Still AnonHugePages not split
>>>> # # # Planned tests != run tests (12 != 0)
>>>> # # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>> # # [FAIL]
>>>> # not ok 52 split_huge_page_test # exit=1
>>>>
>>>> It's trying to split some pmd-mapped THPs then checking and finding that they are not split. The split is requested via /sys/kernel/debug/split_huge_pages, which I believe ends up in this function you are modifying here. Although I'll admit that looking at the change, there is nothing obviously wrong! Any ideas?
>>>
>>> Nothing jumps at me as well. Let me fire up the debugger :)
>>
>> Ah, very likely the can_split_folio() check expects a raised refcount
>> already.
>
> Indeed, the following does the trick! Thanks Ryan, I could have sworn
> I ran that selftest as well.
>
> TAP version 13
> 1..12
> ok 1 Split huge pages successful
> ok 2 Split PTE-mapped huge pages successful
> # Please enable pr_debug in split_huge_pages_in_file() for more info.
> # Please check dmesg for more information
> ok 3 File-backed THP split test done
>
> ...
>
>
> @Andrew, can you squash the following?
>
>
> From e5ea585de3e089ea89bf43d8447ff9fc9b371286 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 6 Aug 2024 12:08:17 +0200
> Subject: [PATCH] fixup: mm/huge_memory: convert split_huge_pages_pid() from
>  follow_page() to folio_walk
>
> We have to teach can_split_folio() that we are not holding an additional
> reference.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/huge_mm.h | 4 ++--
>  mm/huge_memory.c        | 8 ++++----
>  mm/vmscan.c             | 2 +-
>  3 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index e25d9ebfdf89..ce44caa40eed 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -314,7 +314,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  		unsigned long len, unsigned long pgoff, unsigned long flags,
>  		vm_flags_t vm_flags);
>  -bool can_split_folio(struct folio *folio, int *pextra_pins);
> +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
>  static inline int split_huge_page(struct page *page)
> @@ -470,7 +470,7 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>  }
>   static inline bool
> -can_split_folio(struct folio *folio, int *pextra_pins)
> +can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>  {
>  	return false;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 697fcf89f975..c40b0dcc205b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3021,7 +3021,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  }
>   /* Racy check whether the huge page can be split */
> -bool can_split_folio(struct folio *folio, int *pextra_pins)
> +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>  {
>  	int extra_pins;
>  @@ -3033,7 +3033,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
>  		extra_pins = folio_nr_pages(folio);
>  	if (pextra_pins)
>  		*pextra_pins = extra_pins;
> -	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - 1;
> +	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - caller_pins;
>  }
>   /*
> @@ -3201,7 +3201,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	 * Racy check if we can split the page, before unmap_folio() will
>  	 * split PMDs
>  	 */
> -	if (!can_split_folio(folio, &extra_pins)) {
> +	if (!can_split_folio(folio, 1, &extra_pins)) {
>  		ret = -EAGAIN;
>  		goto out_unlock;
>  	}
> @@ -3537,7 +3537,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  		 * can be split or not. So skip the check here.
>  		 */
>  		if (!folio_test_private(folio) &&
> -		    !can_split_folio(folio, NULL))
> +		    !can_split_folio(folio, 0, NULL))
>  			goto next;
>   		if (!folio_trylock(folio))

The diff below can skip a folio with private and extra pin(s) early instead
of trying to lock and split it then failing at can_split_folio() inside
split_huge_page_to_list_to_order().

Maybe worth applying on top of yours?


diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a218320a9233..ce992d54f1da 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3532,13 +3532,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
                        goto next;

                total++;
-               /*
-                * For folios with private, split_huge_page_to_list_to_order()
-                * will try to drop it before split and then check if the folio
-                * can be split or not. So skip the check here.
-                */
-               if (!folio_test_private(folio) &&
-                   !can_split_folio(folio, 0, NULL))
+
+               if (!can_split_folio(folio,
+                                    folio_test_private(folio) ? 1 : 0,
+                                    NULL))
                        goto next;

                if (!folio_trylock(folio))

Best Regards,
Yan, Zi


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 08/11] s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk
  2024-08-02 15:55 ` [PATCH v1 08/11] s390/uv: convert gmap_destroy_page() " David Hildenbrand
@ 2024-08-07  8:59   ` Claudio Imbrenda
  0 siblings, 0 replies; 32+ messages in thread
From: Claudio Imbrenda @ 2024-08-07  8:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Sven Schnelle, Gerald Schaefer

On Fri,  2 Aug 2024 17:55:21 +0200
David Hildenbrand <david@redhat.com> wrote:

> Let's get rid of another follow_page() user and perform the UV calls
> under PTL -- which likely should be fine.
> 
> No need for an additional reference while holding the PTL:
> uv_destroy_folio() and uv_convert_from_secure_folio() raise the
> refcount, so any concurrent make_folio_secure() would see an unexpted
> reference and cannot set PG_arch_1 concurrently.
> 
> Do we really need a writable PTE? Likely yes, because the "destroy"
> part is, in comparison to the export, a destructive operation. So we'll
> keep the writability check for now.
> 
> We'll lose the secretmem check from follow_page(). Likely we don't care
> about that here.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>

> ---
>  arch/s390/kernel/uv.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index 35ed2aea8891..9646f773208a 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -14,6 +14,7 @@
>  #include <linux/memblock.h>
>  #include <linux/pagemap.h>
>  #include <linux/swap.h>
> +#include <linux/pagewalk.h>
>  #include <asm/facility.h>
>  #include <asm/sections.h>
>  #include <asm/uv.h>
> @@ -462,9 +463,9 @@ EXPORT_SYMBOL_GPL(gmap_convert_to_secure);
>  int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
>  {
>  	struct vm_area_struct *vma;
> +	struct folio_walk fw;
>  	unsigned long uaddr;
>  	struct folio *folio;
> -	struct page *page;
>  	int rc;
>  
>  	rc = -EFAULT;
> @@ -483,11 +484,15 @@ int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
>  		goto out;
>  
>  	rc = 0;
> -	/* we take an extra reference here */
> -	page = follow_page(vma, uaddr, FOLL_WRITE | FOLL_GET);
> -	if (IS_ERR_OR_NULL(page))
> +	folio = folio_walk_start(&fw, vma, uaddr, 0);
> +	if (!folio)
>  		goto out;
> -	folio = page_folio(page);
> +	/*
> +	 * See gmap_make_secure(): large folios cannot be secure. Small
> +	 * folio implies FW_LEVEL_PTE.
> +	 */
> +	if (folio_test_large(folio) || !pte_write(fw.pte))
> +		goto out_walk_end;
>  	rc = uv_destroy_folio(folio);
>  	/*
>  	 * Fault handlers can race; it is possible that two CPUs will fault
> @@ -500,7 +505,8 @@ int gmap_destroy_page(struct gmap *gmap, unsigned long gaddr)
>  	 */
>  	if (rc)
>  		rc = uv_convert_from_secure_folio(folio);
> -	folio_put(folio);
> +out_walk_end:
> +	folio_walk_end(&fw, vma);
>  out:
>  	mmap_read_unlock(gmap->mm);
>  	return rc;


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 09/11] s390/mm/fault: convert do_secure_storage_access() from follow_page() to folio_walk
  2024-08-02 15:55 ` [PATCH v1 09/11] s390/mm/fault: convert do_secure_storage_access() " David Hildenbrand
@ 2024-08-07  8:59   ` Claudio Imbrenda
  0 siblings, 0 replies; 32+ messages in thread
From: Claudio Imbrenda @ 2024-08-07  8:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Sven Schnelle, Gerald Schaefer

On Fri,  2 Aug 2024 17:55:22 +0200
David Hildenbrand <david@redhat.com> wrote:

> Let's get rid of another follow_page() user and perform the conversion
> under PTL: Note that this is also what follow_page_pte() ends up doing.
> 
> Unfortunately we cannot currently optimize out the additional reference,
> because arch_make_folio_accessible() must be called with a raised
> refcount to protect against concurrent conversion to secure. We can just
> move the arch_make_folio_accessible() under the PTL, like
> follow_page_pte() would.
> 
> We'll effectively drop the "writable" check implied by FOLL_WRITE:
> follow_page_pte() would also not check that when calling
> arch_make_folio_accessible(), so there is no good reason for doing that
> here.
> 
> We'll lose the secretmem check from follow_page() as well, about which
> we shouldn't really care about.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>

> ---
>  arch/s390/mm/fault.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index 8e149ef5e89b..ad8b0d6b77ea 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -34,6 +34,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/hugetlb.h>
>  #include <linux/kfence.h>
> +#include <linux/pagewalk.h>
>  #include <asm/asm-extable.h>
>  #include <asm/asm-offsets.h>
>  #include <asm/ptrace.h>
> @@ -492,9 +493,9 @@ void do_secure_storage_access(struct pt_regs *regs)
>  	union teid teid = { .val = regs->int_parm_long };
>  	unsigned long addr = get_fault_address(regs);
>  	struct vm_area_struct *vma;
> +	struct folio_walk fw;
>  	struct mm_struct *mm;
>  	struct folio *folio;
> -	struct page *page;
>  	struct gmap *gmap;
>  	int rc;
>  
> @@ -536,15 +537,18 @@ void do_secure_storage_access(struct pt_regs *regs)
>  		vma = find_vma(mm, addr);
>  		if (!vma)
>  			return handle_fault_error(regs, SEGV_MAPERR);
> -		page = follow_page(vma, addr, FOLL_WRITE | FOLL_GET);
> -		if (IS_ERR_OR_NULL(page)) {
> +		folio = folio_walk_start(&fw, vma, addr, 0);
> +		if (!folio) {
>  			mmap_read_unlock(mm);
>  			break;
>  		}
> -		folio = page_folio(page);
> -		if (arch_make_folio_accessible(folio))
> -			send_sig(SIGSEGV, current, 0);
> +		/* arch_make_folio_accessible() needs a raised refcount. */
> +		folio_get(folio);
> +		rc = arch_make_folio_accessible(folio);
>  		folio_put(folio);
> +		folio_walk_end(&fw, vma);
> +		if (rc)
> +			send_sig(SIGSEGV, current, 0);
>  		mmap_read_unlock(mm);
>  		break;
>  	case KERNEL_FAULT:


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 00/11] mm: replace follow_page() by folio_walk
  2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
                   ` (12 preceding siblings ...)
  2024-08-06 13:42 ` Claudio Imbrenda
@ 2024-08-07  9:15 ` Claudio Imbrenda
  2024-08-07  9:33   ` David Hildenbrand
  13 siblings, 1 reply; 32+ messages in thread
From: Claudio Imbrenda @ 2024-08-07  9:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Sven Schnelle, Gerald Schaefer

On Fri,  2 Aug 2024 17:55:13 +0200
David Hildenbrand <david@redhat.com> wrote:

> Looking into a way of moving the last folio_likely_mapped_shared() call
> in add_folio_for_migration() under the PTL, I found myself removing
> follow_page(). This paves the way for cleaning up all the FOLL_, follow_*
> terminology to just be called "GUP" nowadays.
> 
> The new page table walker will lookup a mapped folio and return to the
> caller with the PTL held, such that the folio cannot get unmapped
> concurrently. Callers can then conditionally decide whether they really
> want to take a short-term folio reference or whether the can simply
> unlock the PTL and be done with it.
> 
> folio_walk is similar to page_vma_mapped_walk(), except that we don't know
> the folio we want to walk to and that we are only walking to exactly one
> PTE/PMD/PUD.
> 
> folio_walk provides access to the pte/pmd/pud (and the referenced folio
> page because things like KSM need that), however, as part of this series
> no page table modifications are performed by users.
> 
> We might be able to convert some other walk_page_range() users that really
> only walk to one address, such as DAMON with
> damon_mkold_ops/damon_young_ops. It might make sense to extend folio_walk
> in the future to optionally fault in a folio (if applicable), such that we
> can replace some get_user_pages() users that really only want to lookup
> a single page/folio under PTL without unconditionally grabbing a folio
> reference.
> 
> I have plans to extend the approach to a range walker that will try
> batching various page table entries (not just folio pages) to be a better
> replace for walk_page_range() -- and users will be able to opt in which
> type of page table entries they want to process -- but that will require
> more work and more thoughts.
> 
> KSM seems to work just fine (ksm_functional_tests selftests) and
> move_pages seems to work (migration selftest). I tested the leaf
> implementation excessively using various hugetlb sizes (64K, 2M, 32M, 1G)
> on arm64 using move_pages and did some more testing on x86-64. Cross
> compiled on a bunch of architectures.
> 
> I am not able to test the s390x Secure Execution changes, unfortunately.

The whole series looks good to me, but I do not feel confident enough
about all the folio details to actually r-b any of the non-s390
patches. (I do have a few questions, though)

As for the s390 patches: they look fine. I have tested the series on
s390 and nothing caught fire.

We will be able to get more CI coverage once this lands in -next.

> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Janosch Frank <frankja@linux.ibm.com>
> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> 
> David Hildenbrand (11):
>   mm: provide vm_normal_(page|folio)_pmd() with
>     CONFIG_PGTABLE_HAS_HUGE_LEAVES
>   mm/pagewalk: introduce folio_walk_start() + folio_walk_end()
>   mm/migrate: convert do_pages_stat_array() from follow_page() to
>     folio_walk
>   mm/migrate: convert add_page_for_migration() from follow_page() to
>     folio_walk
>   mm/ksm: convert get_mergeable_page() from follow_page() to folio_walk
>   mm/ksm: convert scan_get_next_rmap_item() from follow_page() to
>     folio_walk
>   mm/huge_memory: convert split_huge_pages_pid() from follow_page() to
>     folio_walk
>   s390/uv: convert gmap_destroy_page() from follow_page() to folio_walk
>   s390/mm/fault: convert do_secure_storage_access() from follow_page()
>     to folio_walk
>   mm: remove follow_page()
>   mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk
> 
>  Documentation/mm/transhuge.rst |   6 +-
>  arch/s390/kernel/uv.c          |  18 ++-
>  arch/s390/mm/fault.c           |  16 ++-
>  include/linux/mm.h             |   3 -
>  include/linux/pagewalk.h       |  58 ++++++++++
>  mm/filemap.c                   |   2 +-
>  mm/gup.c                       |  24 +---
>  mm/huge_memory.c               |  18 +--
>  mm/ksm.c                       | 127 +++++++++------------
>  mm/memory.c                    |   2 +-
>  mm/migrate.c                   | 131 ++++++++++-----------
>  mm/nommu.c                     |   6 -
>  mm/pagewalk.c                  | 202 +++++++++++++++++++++++++++++++++
>  13 files changed, 413 insertions(+), 200 deletions(-)
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 02/11] mm/pagewalk: introduce folio_walk_start() + folio_walk_end()
  2024-08-02 15:55 ` [PATCH v1 02/11] mm/pagewalk: introduce folio_walk_start() + folio_walk_end() David Hildenbrand
@ 2024-08-07  9:17   ` Claudio Imbrenda
  2024-08-07  9:31     ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: Claudio Imbrenda @ 2024-08-07  9:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Sven Schnelle, Gerald Schaefer

On Fri,  2 Aug 2024 17:55:15 +0200
David Hildenbrand <david@redhat.com> wrote:

> We want to get rid of follow_page(), and have a more reasonable way to
> just lookup a folio mapped at a certain address, perform some checks while
> still under PTL, and then only conditionally grab a folio reference if
> really required.
> 
> Further, we might want to get rid of some walk_page_range*() users that
> really only want to temporarily lookup a single folio at a single address.
> 
> So let's add a new page table walker that does exactly that, similarly
> to GUP also being able to walk hugetlb VMAs.
> 
> Add folio_walk_end() as a macro for now: the compiler is not easy to
> please with the pte_unmap()->kunmap_local().
> 
> Note that one difference between follow_page() and get_user_pages(1) is
> that follow_page() will not trigger faults to get something mapped. So
> folio_walk is at least currently not a replacement for get_user_pages(1),
> but could likely be extended/reused to achieve something similar in the
> future.

[...]

> +struct folio *folio_walk_start(struct folio_walk *fw,
> +		struct vm_area_struct *vma, unsigned long addr,
> +		folio_walk_flags_t flags)
> +{
> +	unsigned long entry_size;
> +	bool expose_page = true;
> +	struct page *page;
> +	pud_t *pudp, pud;
> +	pmd_t *pmdp, pmd;
> +	pte_t *ptep, pte;
> +	spinlock_t *ptl;
> +	pgd_t *pgdp;
> +	p4d_t *p4dp;
> +
> +	mmap_assert_locked(vma->vm_mm);
> +	vma_pgtable_walk_begin(vma);
> +
> +	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
> +		goto not_found;
> +
> +	pgdp = pgd_offset(vma->vm_mm, addr);
> +	if (pgd_none_or_clear_bad(pgdp))
> +		goto not_found;
> +
> +	p4dp = p4d_offset(pgdp, addr);
> +	if (p4d_none_or_clear_bad(p4dp))
> +		goto not_found;
> +
> +	pudp = pud_offset(p4dp, addr);
> +	pud = pudp_get(pudp);
> +	if (pud_none(pud))
> +		goto not_found;
> +	if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pud_leaf(pud)) {
> +		ptl = pud_lock(vma->vm_mm, pudp);
> +		pud = pudp_get(pudp);
> +
> +		entry_size = PUD_SIZE;
> +		fw->level = FW_LEVEL_PUD;
> +		fw->pudp = pudp;
> +		fw->pud = pud;
> +
> +		if (!pud_present(pud) || pud_devmap(pud)) {
> +			spin_unlock(ptl);
> +			goto not_found;
> +		} else if (!pud_leaf(pud)) {
> +			spin_unlock(ptl);
> +			goto pmd_table;
> +		}
> +		/*
> +		 * TODO: vm_normal_page_pud() will be handy once we want to
> +		 * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
> +		 */
> +		page = pud_page(pud);
> +		goto found;
> +	}
> +
> +pmd_table:
> +	VM_WARN_ON_ONCE(pud_leaf(*pudp));

is this warning necessary? can this actually happen?
and if it can happen, wouldn't it be more reasonable to return NULL?

> +	pmdp = pmd_offset(pudp, addr);
> +	pmd = pmdp_get_lockless(pmdp);
> +	if (pmd_none(pmd))
> +		goto not_found;
> +	if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pmd_leaf(pmd)) {
> +		ptl = pmd_lock(vma->vm_mm, pmdp);
> +		pmd = pmdp_get(pmdp);
> +
> +		entry_size = PMD_SIZE;
> +		fw->level = FW_LEVEL_PMD;
> +		fw->pmdp = pmdp;
> +		fw->pmd = pmd;
> +
> +		if (pmd_none(pmd)) {
> +			spin_unlock(ptl);
> +			goto not_found;
> +		} else if (!pmd_leaf(pmd)) {
> +			spin_unlock(ptl);
> +			goto pte_table;
> +		} else if (pmd_present(pmd)) {
> +			page = vm_normal_page_pmd(vma, addr, pmd);
> +			if (page) {
> +				goto found;
> +			} else if ((flags & FW_ZEROPAGE) &&
> +				    is_huge_zero_pmd(pmd)) {
> +				page = pfn_to_page(pmd_pfn(pmd));
> +				expose_page = false;
> +				goto found;
> +			}
> +		} else if ((flags & FW_MIGRATION) &&
> +			   is_pmd_migration_entry(pmd)) {
> +			swp_entry_t entry = pmd_to_swp_entry(pmd);
> +
> +			page = pfn_swap_entry_to_page(entry);
> +			expose_page = false;
> +			goto found;
> +		}
> +		spin_unlock(ptl);
> +		goto not_found;
> +	}
> +
> +pte_table:
> +	VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp)));

same here

> +	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
> +	if (!ptep)
> +		goto not_found;
> +	pte = ptep_get(ptep);
> +
> +	entry_size = PAGE_SIZE;
> +	fw->level = FW_LEVEL_PTE;
> +	fw->ptep = ptep;
> +	fw->pte = pte;
> +
> +	if (pte_present(pte)) {
> +		page = vm_normal_page(vma, addr, pte);
> +		if (page)
> +			goto found;
> +		if ((flags & FW_ZEROPAGE) &&
> +		    is_zero_pfn(pte_pfn(pte))) {
> +			page = pfn_to_page(pte_pfn(pte));
> +			expose_page = false;
> +			goto found;
> +		}
> +	} else if (!pte_none(pte)) {
> +		swp_entry_t entry = pte_to_swp_entry(pte);
> +
> +		if ((flags & FW_MIGRATION) &&
> +		    is_migration_entry(entry)) {
> +			page = pfn_swap_entry_to_page(entry);
> +			expose_page = false;
> +			goto found;
> +		}
> +	}
> +	pte_unmap_unlock(ptep, ptl);
> +not_found:
> +	vma_pgtable_walk_end(vma);
> +	return NULL;
> +found:
> +	if (expose_page)
> +		/* Note: Offset from the mapped page, not the folio start. */
> +		fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT);
> +	else
> +		fw->page = NULL;
> +	fw->ptl = ptl;
> +	return page_folio(page);
> +}


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 02/11] mm/pagewalk: introduce folio_walk_start() + folio_walk_end()
  2024-08-07  9:17   ` Claudio Imbrenda
@ 2024-08-07  9:31     ` David Hildenbrand
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-07  9:31 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-kernel, linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Sven Schnelle, Gerald Schaefer

On 07.08.24 11:17, Claudio Imbrenda wrote:
> On Fri,  2 Aug 2024 17:55:15 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> We want to get rid of follow_page(), and have a more reasonable way to
>> just lookup a folio mapped at a certain address, perform some checks while
>> still under PTL, and then only conditionally grab a folio reference if
>> really required.
>>
>> Further, we might want to get rid of some walk_page_range*() users that
>> really only want to temporarily lookup a single folio at a single address.
>>
>> So let's add a new page table walker that does exactly that, similarly
>> to GUP also being able to walk hugetlb VMAs.
>>
>> Add folio_walk_end() as a macro for now: the compiler is not easy to
>> please with the pte_unmap()->kunmap_local().
>>
>> Note that one difference between follow_page() and get_user_pages(1) is
>> that follow_page() will not trigger faults to get something mapped. So
>> folio_walk is at least currently not a replacement for get_user_pages(1),
>> but could likely be extended/reused to achieve something similar in the
>> future.
> 

[...]

>> +pmd_table:
>> +	VM_WARN_ON_ONCE(pud_leaf(*pudp));
> 

Thanks for the review!

> is this warning necessary? can this actually happen?
> and if it can happen, wouldn't it be more reasonable to return NULL?

Then we have to turn this into an unconditional WARN_ON_ONCE() that
cannot be compiled out.

It's something that should be found early during testing (like I had a 
bug where I misspelled "CONFIG_PGTABLE_HAS_HUGE_LEAVES" above that took 
me 2h to debug, so I added it ;) ), and shouldn't need runtime checks.

Same for the other one.
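
For context, the difference being weighed here looks roughly like this (paraphrased from memory of include/linux/mmdebug.h, so treat the exact definitions as approximate):

/* Compiled out unless CONFIG_DEBUG_VM is set: */
#ifdef CONFIG_DEBUG_VM
#define VM_WARN_ON_ONCE(cond)	(void)WARN_ON_ONCE(cond)
#else
#define VM_WARN_ON_ONCE(cond)	BUILD_BUG_ON_INVALID(cond)	/* only type-checks cond */
#endif

So the check is free in production builds and only barks during testing
with CONFIG_DEBUG_VM=y, whereas a plain WARN_ON_ONCE() would be compiled
into every kernel.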

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 00/11] mm: replace follow_page() by folio_walk
  2024-08-07  9:15 ` Claudio Imbrenda
@ 2024-08-07  9:33   ` David Hildenbrand
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-07  9:33 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: linux-kernel, linux-mm, linux-doc, kvm, linux-s390, linux-fsdevel,
	Andrew Morton, Matthew Wilcox (Oracle), Jonathan Corbet,
	Christian Borntraeger, Janosch Frank, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Sven Schnelle, Gerald Schaefer

On 07.08.24 11:15, Claudio Imbrenda wrote:
> On Fri,  2 Aug 2024 17:55:13 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Looking into a way of moving the last folio_likely_mapped_shared() call
>> in add_folio_for_migration() under the PTL, I found myself removing
>> follow_page(). This paves the way for cleaning up all the FOLL_, follow_*
>> terminology to just be called "GUP" nowadays.
>>
>> The new page table walker will lookup a mapped folio and return to the
>> caller with the PTL held, such that the folio cannot get unmapped
>> concurrently. Callers can then conditionally decide whether they really
>> want to take a short-term folio reference or whether the can simply
>> unlock the PTL and be done with it.
>>
>> folio_walk is similar to page_vma_mapped_walk(), except that we don't know
>> the folio we want to walk to and that we are only walking to exactly one
>> PTE/PMD/PUD.
>>
>> folio_walk provides access to the pte/pmd/pud (and the referenced folio
>> page because things like KSM need that), however, as part of this series
>> no page table modifications are performed by users.
>>
>> We might be able to convert some other walk_page_range() users that really
>> only walk to one address, such as DAMON with
>> damon_mkold_ops/damon_young_ops. It might make sense to extend folio_walk
>> in the future to optionally fault in a folio (if applicable), such that we
>> can replace some get_user_pages() users that really only want to lookup
>> a single page/folio under PTL without unconditionally grabbing a folio
>> reference.
>>
>> I have plans to extend the approach to a range walker that will try
>> batching various page table entries (not just folio pages) to be a better
>> replace for walk_page_range() -- and users will be able to opt in which
>> type of page table entries they want to process -- but that will require
>> more work and more thoughts.
>>
>> KSM seems to work just fine (ksm_functional_tests selftests) and
>> move_pages seems to work (migration selftest). I tested the leaf
>> implementation excessively using various hugetlb sizes (64K, 2M, 32M, 1G)
>> on arm64 using move_pages and did some more testing on x86-64. Cross
>> compiled on a bunch of architectures.
>>
>> I am not able to test the s390x Secure Execution changes, unfortunately.
> 
> The whole series looks good to me, but I do not feel confident enough
> about all the folio details to actually r-b any of the non-s390
> patches. (I do have a few questions, though)
> 
> As for the s390 patches: they look fine. I have tested the series on
> s390 and nothing caught fire.
> 
> We will be able to get more CI coverage once this lands in -next.

Thanks for the review! Note that it's already in -next.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-06 15:36           ` Zi Yan
@ 2024-08-07  9:57             ` David Hildenbrand
  2024-08-07 14:45               ` Zi Yan
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-07  9:57 UTC (permalink / raw)
  To: Zi Yan
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-doc, kvm, linux-s390,
	linux-fsdevel, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer, Mark Brown

On 06.08.24 17:36, Zi Yan wrote:
> On 6 Aug 2024, at 6:24, David Hildenbrand wrote:
> 
>> On 06.08.24 12:03, David Hildenbrand wrote:
>>> On 06.08.24 11:56, David Hildenbrand wrote:
>>>> On 06.08.24 11:46, Ryan Roberts wrote:
>>>>> On 02/08/2024 16:55, David Hildenbrand wrote:
>>>>>> Let's remove yet another follow_page() user. Note that we have to do the
>>>>>> split without holding the PTL, after folio_walk_end(). We don't care
>>>>>> about losing the secretmem check in follow_page().
>>>>>
>>>>> Hi David,
>>>>>
>>>>> Our (arm64) CI is showing a regression in split_huge_page_test from mm selftests from next-20240805 onwards. Navigating around a couple of other lurking bugs, I was able to bisect to this change (which smells about right).
>>>>>
>>>>> Newly failing test:
>>>>>
>>>>> # # ------------------------------
>>>>> # # running ./split_huge_page_test
>>>>> # # ------------------------------
>>>>> # # TAP version 13
>>>>> # # 1..12
>>>>> # # Bail out! Still AnonHugePages not split
>>>>> # # # Planned tests != run tests (12 != 0)
>>>>> # # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>>> # # [FAIL]
>>>>> # not ok 52 split_huge_page_test # exit=1
>>>>>
>>>>> It's trying to split some pmd-mapped THPs then checking and finding that they are not split. The split is requested via /sys/kernel/debug/split_huge_pages, which I believe ends up in this function you are modifying here. Although I'll admit that looking at the change, there is nothing obviously wrong! Any ideas?
>>>>
>>>> Nothing jumps at me as well. Let me fire up the debugger :)
>>>
>>> Ah, very likely the can_split_folio() check expects a raised refcount
>>> already.
>>
>> Indeed, the following does the trick! Thanks Ryan, I could have sworn
>> I ran that selftest as well.
>>
>> TAP version 13
>> 1..12
>> ok 1 Split huge pages successful
>> ok 2 Split PTE-mapped huge pages successful
>> # Please enable pr_debug in split_huge_pages_in_file() for more info.
>> # Please check dmesg for more information
>> ok 3 File-backed THP split test done
>>
>> ...
>>
>>
>> @Andrew, can you squash the following?
>>
>>
>>  From e5ea585de3e089ea89bf43d8447ff9fc9b371286 Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@redhat.com>
>> Date: Tue, 6 Aug 2024 12:08:17 +0200
>> Subject: [PATCH] fixup: mm/huge_memory: convert split_huge_pages_pid() from
>>   follow_page() to folio_walk
>>
>> We have to teach can_split_folio() that we are not holding an additional
>> reference.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   include/linux/huge_mm.h | 4 ++--
>>   mm/huge_memory.c        | 8 ++++----
>>   mm/vmscan.c             | 2 +-
>>   3 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index e25d9ebfdf89..ce44caa40eed 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -314,7 +314,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>   		unsigned long len, unsigned long pgoff, unsigned long flags,
>>   		vm_flags_t vm_flags);
>>   -bool can_split_folio(struct folio *folio, int *pextra_pins);
>> +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>   int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>   		unsigned int new_order);
>>   static inline int split_huge_page(struct page *page)
>> @@ -470,7 +470,7 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>>   }
>>    static inline bool
>> -can_split_folio(struct folio *folio, int *pextra_pins)
>> +can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>   {
>>   	return false;
>>   }
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 697fcf89f975..c40b0dcc205b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3021,7 +3021,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>   }
>>    /* Racy check whether the huge page can be split */
>> -bool can_split_folio(struct folio *folio, int *pextra_pins)
>> +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>   {
>>   	int extra_pins;
>>   @@ -3033,7 +3033,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
>>   		extra_pins = folio_nr_pages(folio);
>>   	if (pextra_pins)
>>   		*pextra_pins = extra_pins;
>> -	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - 1;
>> +	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - caller_pins;
>>   }
>>    /*
>> @@ -3201,7 +3201,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>   	 * Racy check if we can split the page, before unmap_folio() will
>>   	 * split PMDs
>>   	 */
>> -	if (!can_split_folio(folio, &extra_pins)) {
>> +	if (!can_split_folio(folio, 1, &extra_pins)) {
>>   		ret = -EAGAIN;
>>   		goto out_unlock;
>>   	}
>> @@ -3537,7 +3537,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>>   		 * can be split or not. So skip the check here.
>>   		 */
>>   		if (!folio_test_private(folio) &&
>> -		    !can_split_folio(folio, NULL))
>> +		    !can_split_folio(folio, 0, NULL))
>>   			goto next;
>>    		if (!folio_trylock(folio))
> 
> The diff below can skip a folio with private and extra pin(s) early instead
> of trying to lock and split it then failing at can_split_folio() inside
> split_huge_page_to_list_to_order().
> 
> Maybe worth applying on top of yours?
> 
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a218320a9233..ce992d54f1da 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3532,13 +3532,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>                          goto next;
> 
>                  total++;
> -               /*
> -                * For folios with private, split_huge_page_to_list_to_order()
> -                * will try to drop it before split and then check if the folio
> -                * can be split or not. So skip the check here.
> -                */
> -               if (!folio_test_private(folio) &&
> -                   !can_split_folio(folio, 0, NULL))
> +
> +               if (!can_split_folio(folio,
> +                                    folio_test_private(folio) ? 1 : 0,
> +                                    NULL))

Hmm, it does look a bit odd. It's not something from the caller (caller_pins), but a
folio property. Likely should be handled differently.

In vmscan code, we only call can_split_folio() on anon folios where
folio_test_private() does not apply.

But indeed, in split_huge_page_to_list_to_order() we'd have to fail if
folio_test_private() still applies after filemap_release_folio().

Not sure if that is really better:


diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c40b0dcc205b..7cb743047566 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3026,11 +3026,14 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
         int extra_pins;
  
         /* Additional pins from page cache */
-       if (folio_test_anon(folio))
+       if (folio_test_anon(folio)) {
                 extra_pins = folio_test_swapcache(folio) ?
                                 folio_nr_pages(folio) : 0;
-       else
+       } else {
                 extra_pins = folio_nr_pages(folio);
+               if (unlikely(folio_test_private(folio)))
+                       extra_pins++;
+       }
         if (pextra_pins)
                 *pextra_pins = extra_pins;
         return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - caller_pins;
@@ -3199,9 +3202,11 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
  
         /*
          * Racy check if we can split the page, before unmap_folio() will
-        * split PMDs
+        * split PMDs. filemap_release_folio() already tried to free the
+        * buffers; if that failed, we never get here.
          */
-       if (!can_split_folio(folio, 1, &extra_pins)) {
+       if (WARN_ON_ONCE(folio_test_private(folio)) ||
+           !can_split_folio(folio, 1, &extra_pins)) {
                 ret = -EAGAIN;
                 goto out_unlock;
         }
@@ -3531,13 +3536,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
                         goto next;
  
                 total++;
-               /*
-                * For folios with private, split_huge_page_to_list_to_order()
-                * will try to drop it before split and then check if the folio
-                * can be split or not. So skip the check here.
-                */
-               if (!folio_test_private(folio) &&
-                   !can_split_folio(folio, 0, NULL))
+               if (!can_split_folio(folio, 0, NULL))
                         goto next;
  
                 if (!folio_trylock(folio))


It assumes that folio_set_private() is impossible after filemap_release_folio() succeeded
and we're still holding the folio lock. Then we could even get rid of the WARN_ON_ONCE().
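
Spelled out (just to illustrate the accounting, not a comment I'd add to
the code), for a pagecache folio the variant above checks that

    folio_ref_count(folio) == folio_mapcount(folio)        /* page table mappings */
                            + folio_nr_pages(folio)        /* page cache references */
                            + (folio_test_private(folio) ? 1 : 0) /* e.g. buffer heads */
                            + caller_pins;                 /* 1 in split_huge_page_to_list_to_order() */

so any additional reference (a GUP pin, a concurrent pagecache lookup, ...)
makes can_split_folio() return false.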

-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-07  9:57             ` David Hildenbrand
@ 2024-08-07 14:45               ` Zi Yan
  2024-08-07 14:52                 ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: Zi Yan @ 2024-08-07 14:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-doc, kvm, linux-s390,
	linux-fsdevel, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer, Mark Brown


On 7 Aug 2024, at 5:57, David Hildenbrand wrote:

> On 06.08.24 17:36, Zi Yan wrote:
>> On 6 Aug 2024, at 6:24, David Hildenbrand wrote:
>>
>>> On 06.08.24 12:03, David Hildenbrand wrote:
>>>> On 06.08.24 11:56, David Hildenbrand wrote:
>>>>> On 06.08.24 11:46, Ryan Roberts wrote:
>>>>>> On 02/08/2024 16:55, David Hildenbrand wrote:
>>>>>>> Let's remove yet another follow_page() user. Note that we have to do the
>>>>>>> split without holding the PTL, after folio_walk_end(). We don't care
>>>>>>> about losing the secretmem check in follow_page().
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> Our (arm64) CI is showing a regression in split_huge_page_test from mm selftests from next-20240805 onwards. Navigating around a couple of other lurking bugs, I was able to bisect to this change (which smells about right).
>>>>>>
>>>>>> Newly failing test:
>>>>>>
>>>>>> # # ------------------------------
>>>>>> # # running ./split_huge_page_test
>>>>>> # # ------------------------------
>>>>>> # # TAP version 13
>>>>>> # # 1..12
>>>>>> # # Bail out! Still AnonHugePages not split
>>>>>> # # # Planned tests != run tests (12 != 0)
>>>>>> # # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>>>> # # [FAIL]
>>>>>> # not ok 52 split_huge_page_test # exit=1
>>>>>>
>>>>>> It's trying to split some pmd-mapped THPs then checking and finding that they are not split. The split is requested via /sys/kernel/debug/split_huge_pages, which I believe ends up in this function you are modifying here. Although I'll admit that looking at the change, there is nothing obviously wrong! Any ideas?
>>>>>
>>>>> Nothing jumps at me as well. Let me fire up the debugger :)
>>>>
>>>> Ah, very likely the can_split_folio() check expects a raised refcount
>>>> already.
>>>
>>> Indeed, the following does the trick! Thanks Ryan, I could have sworn
>>> I ran that selftest as well.
>>>
>>> TAP version 13
>>> 1..12
>>> ok 1 Split huge pages successful
>>> ok 2 Split PTE-mapped huge pages successful
>>> # Please enable pr_debug in split_huge_pages_in_file() for more info.
>>> # Please check dmesg for more information
>>> ok 3 File-backed THP split test done
>>>
>>> ...
>>>
>>>
>>> @Andrew, can you squash the following?
>>>
>>>
>>>  From e5ea585de3e089ea89bf43d8447ff9fc9b371286 Mon Sep 17 00:00:00 2001
>>> From: David Hildenbrand <david@redhat.com>
>>> Date: Tue, 6 Aug 2024 12:08:17 +0200
>>> Subject: [PATCH] fixup: mm/huge_memory: convert split_huge_pages_pid() from
>>>   follow_page() to folio_walk
>>>
>>> We have to teach can_split_folio() that we are not holding an additional
>>> reference.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   include/linux/huge_mm.h | 4 ++--
>>>   mm/huge_memory.c        | 8 ++++----
>>>   mm/vmscan.c             | 2 +-
>>>   3 files changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index e25d9ebfdf89..ce44caa40eed 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -314,7 +314,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>   		unsigned long len, unsigned long pgoff, unsigned long flags,
>>>   		vm_flags_t vm_flags);
>>>   -bool can_split_folio(struct folio *folio, int *pextra_pins);
>>> +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>   int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>   		unsigned int new_order);
>>>   static inline int split_huge_page(struct page *page)
>>> @@ -470,7 +470,7 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>>>   }
>>>    static inline bool
>>> -can_split_folio(struct folio *folio, int *pextra_pins)
>>> +can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>   {
>>>   	return false;
>>>   }
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 697fcf89f975..c40b0dcc205b 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3021,7 +3021,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>>   }
>>>    /* Racy check whether the huge page can be split */
>>> -bool can_split_folio(struct folio *folio, int *pextra_pins)
>>> +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>   {
>>>   	int extra_pins;
>>>   @@ -3033,7 +3033,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
>>>   		extra_pins = folio_nr_pages(folio);
>>>   	if (pextra_pins)
>>>   		*pextra_pins = extra_pins;
>>> -	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - 1;
>>> +	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - caller_pins;
>>>   }
>>>    /*
>>> @@ -3201,7 +3201,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>   	 * Racy check if we can split the page, before unmap_folio() will
>>>   	 * split PMDs
>>>   	 */
>>> -	if (!can_split_folio(folio, &extra_pins)) {
>>> +	if (!can_split_folio(folio, 1, &extra_pins)) {
>>>   		ret = -EAGAIN;
>>>   		goto out_unlock;
>>>   	}
>>> @@ -3537,7 +3537,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>>>   		 * can be split or not. So skip the check here.
>>>   		 */
>>>   		if (!folio_test_private(folio) &&
>>> -		    !can_split_folio(folio, NULL))
>>> +		    !can_split_folio(folio, 0, NULL))
>>>   			goto next;
>>>    		if (!folio_trylock(folio))
>>
>> The diff below can skip a folio with private and extra pin(s) early instead
>> of trying to lock and split it then failing at can_split_folio() inside
>> split_huge_page_to_list_to_order().
>>
>> Maybe worth applying on top of yours?
>>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index a218320a9233..ce992d54f1da 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3532,13 +3532,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>>                          goto next;
>>
>>                  total++;
>> -               /*
>> -                * For folios with private, split_huge_page_to_list_to_order()
>> -                * will try to drop it before split and then check if the folio
>> -                * can be split or not. So skip the check here.
>> -                */
>> -               if (!folio_test_private(folio) &&
>> -                   !can_split_folio(folio, 0, NULL))
>> +
>> +               if (!can_split_folio(folio,
>> +                                    folio_test_private(folio) ? 1 : 0,
>> +                                    NULL))
>
> Hmm, it does look a bit odd. It's not something from the caller (caller_pins), but a
> folio property. Likely should be handled differently.
>
> In vmscan code, we only call can_split_folio() on anon folios where
> folio_test_private() does not apply.
>
> But indeed, in split_huge_page_to_list_to_order() we'd have to fail if
> folio_test_private() still applies after filemap_release_folio().
>
> Not sure if that is really better:

Yeah, not worth the code churn to optimize for that debugfs code.

As I looked at this patch and the fix long enough, feel free to add
Reviewed-by: Zi Yan <ziy@nvidia.com>


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-07 14:45               ` Zi Yan
@ 2024-08-07 14:52                 ` David Hildenbrand
  0 siblings, 0 replies; 32+ messages in thread
From: David Hildenbrand @ 2024-08-07 14:52 UTC (permalink / raw)
  To: Zi Yan
  Cc: Ryan Roberts, linux-kernel, linux-mm, linux-doc, kvm, linux-s390,
	linux-fsdevel, Andrew Morton, Matthew Wilcox (Oracle),
	Jonathan Corbet, Christian Borntraeger, Janosch Frank,
	Claudio Imbrenda, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Sven Schnelle, Gerald Schaefer, Mark Brown

On 07.08.24 16:45, Zi Yan wrote:
> On 7 Aug 2024, at 5:57, David Hildenbrand wrote:
> 
>> On 06.08.24 17:36, Zi Yan wrote:
>>> On 6 Aug 2024, at 6:24, David Hildenbrand wrote:
>>>
>>>> On 06.08.24 12:03, David Hildenbrand wrote:
>>>>> On 06.08.24 11:56, David Hildenbrand wrote:
>>>>>> On 06.08.24 11:46, Ryan Roberts wrote:
>>>>>>> On 02/08/2024 16:55, David Hildenbrand wrote:
>>>>>>>> Let's remove yet another follow_page() user. Note that we have to do the
>>>>>>>> split without holding the PTL, after folio_walk_end(). We don't care
>>>>>>>> about losing the secretmem check in follow_page().
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> Our (arm64) CI is showing a regression in split_huge_page_test from mm selftests from next-20240805 onwards. Navigating around a couple of other lurking bugs, I was able to bisect to this change (which smells about right).
>>>>>>>
>>>>>>> Newly failing test:
>>>>>>>
>>>>>>> # # ------------------------------
>>>>>>> # # running ./split_huge_page_test
>>>>>>> # # ------------------------------
>>>>>>> # # TAP version 13
>>>>>>> # # 1..12
>>>>>>> # # Bail out! Still AnonHugePages not split
>>>>>>> # # # Planned tests != run tests (12 != 0)
>>>>>>> # # # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>>>>>>> # # [FAIL]
>>>>>>> # not ok 52 split_huge_page_test # exit=1
>>>>>>>
>>>>>>> It's trying to split some pmd-mapped THPs then checking and finding that they are not split. The split is requested via /sys/kernel/debug/split_huge_pages, which I believe ends up in this function you are modifying here. Although I'll admit that looking at the change, there is nothing obviously wrong! Any ideas?
>>>>>>
>>>>>> Nothing jumps at me as well. Let me fire up the debugger :)
>>>>>
>>>>> Ah, very likely the can_split_folio() check expects a raised refcount
>>>>> already.
>>>>
>>>> Indeed, the following does the trick! Thanks Ryan, I could have sworn
>>>> I ran that selftest as well.
>>>>
>>>> TAP version 13
>>>> 1..12
>>>> ok 1 Split huge pages successful
>>>> ok 2 Split PTE-mapped huge pages successful
>>>> # Please enable pr_debug in split_huge_pages_in_file() for more info.
>>>> # Please check dmesg for more information
>>>> ok 3 File-backed THP split test done
>>>>
>>>> ...
>>>>
>>>>
>>>> @Andrew, can you squash the following?
>>>>
>>>>
>>>>   From e5ea585de3e089ea89bf43d8447ff9fc9b371286 Mon Sep 17 00:00:00 2001
>>>> From: David Hildenbrand <david@redhat.com>
>>>> Date: Tue, 6 Aug 2024 12:08:17 +0200
>>>> Subject: [PATCH] fixup: mm/huge_memory: convert split_huge_pages_pid() from
>>>>    follow_page() to folio_walk
>>>>
>>>> We have to teach can_split_folio() that we are not holding an additional
>>>> reference.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    include/linux/huge_mm.h | 4 ++--
>>>>    mm/huge_memory.c        | 8 ++++----
>>>>    mm/vmscan.c             | 2 +-
>>>>    3 files changed, 7 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index e25d9ebfdf89..ce44caa40eed 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -314,7 +314,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>    		unsigned long len, unsigned long pgoff, unsigned long flags,
>>>>    		vm_flags_t vm_flags);
>>>>    -bool can_split_folio(struct folio *folio, int *pextra_pins);
>>>> +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>    int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>    		unsigned int new_order);
>>>>    static inline int split_huge_page(struct page *page)
>>>> @@ -470,7 +470,7 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>>>>    }
>>>>     static inline bool
>>>> -can_split_folio(struct folio *folio, int *pextra_pins)
>>>> +can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>>    {
>>>>    	return false;
>>>>    }
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 697fcf89f975..c40b0dcc205b 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3021,7 +3021,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>>>    }
>>>>     /* Racy check whether the huge page can be split */
>>>> -bool can_split_folio(struct folio *folio, int *pextra_pins)
>>>> +bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>>    {
>>>>    	int extra_pins;
>>>>    @@ -3033,7 +3033,7 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
>>>>    		extra_pins = folio_nr_pages(folio);
>>>>    	if (pextra_pins)
>>>>    		*pextra_pins = extra_pins;
>>>> -	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - 1;
>>>> +	return folio_mapcount(folio) == folio_ref_count(folio) - extra_pins - caller_pins;
>>>>    }
>>>>     /*
>>>> @@ -3201,7 +3201,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>    	 * Racy check if we can split the page, before unmap_folio() will
>>>>    	 * split PMDs
>>>>    	 */
>>>> -	if (!can_split_folio(folio, &extra_pins)) {
>>>> +	if (!can_split_folio(folio, 1, &extra_pins)) {
>>>>    		ret = -EAGAIN;
>>>>    		goto out_unlock;
>>>>    	}
>>>> @@ -3537,7 +3537,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>>>>    		 * can be split or not. So skip the check here.
>>>>    		 */
>>>>    		if (!folio_test_private(folio) &&
>>>> -		    !can_split_folio(folio, NULL))
>>>> +		    !can_split_folio(folio, 0, NULL))
>>>>    			goto next;
>>>>     		if (!folio_trylock(folio))
>>>
>>> The diff below can skip a folio with private and extra pin(s) early instead
>>> of trying to lock and split it then failing at can_split_folio() inside
>>> split_huge_page_to_list_to_order().
>>>
>>> Maybe worth applying on top of yours?
>>>
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index a218320a9233..ce992d54f1da 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3532,13 +3532,10 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>>>                           goto next;
>>>
>>>                   total++;
>>> -               /*
>>> -                * For folios with private, split_huge_page_to_list_to_order()
>>> -                * will try to drop it before split and then check if the folio
>>> -                * can be split or not. So skip the check here.
>>> -                */
>>> -               if (!folio_test_private(folio) &&
>>> -                   !can_split_folio(folio, 0, NULL))
>>> +
>>> +               if (!can_split_folio(folio,
>>> +                                    folio_test_private(folio) ? 1 : 0,
>>> +                                    NULL))
>>
>> Hmm, it does look a bit odd. It's not something from the caller (caller_pins), but a
>> folio property. Likely should be handled differently.
>>
>> In vmscan code, we only call can_split_folio() on anon folios where
>> folio_test_private() does not apply.
>>
>> But indeed, in split_huge_page_to_list_to_order() we'd have to fail if
>> folio_test_private() still applies after filemap_release_folio().
>>
>> Not sure if that is really better:
> 
> Yeah, not worth the code churn to optimize for that debugfs code.
> 
> As I looked at this patch and the fix long enough, feel free to add
> Reviewed-by: Zi Yan <ziy@nvidia.com>

Thanks! :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-02 15:55 ` [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() " David Hildenbrand
  2024-08-06  9:46   ` Ryan Roberts
@ 2024-08-15 10:04   ` Pankaj Raghav
  2024-08-15 10:20     ` David Hildenbrand
  1 sibling, 1 reply; 32+ messages in thread
From: Pankaj Raghav @ 2024-08-15 10:04 UTC (permalink / raw)
  To: david
  Cc: agordeev, akpm, borntraeger, corbet, frankja, gerald.schaefer,
	gor, hca, imbrenda, kvm, linux-doc, linux-fsdevel, linux-kernel,
	linux-mm, linux-s390, svens, willy

Hi David,

On Fri, Aug 02, 2024 at 05:55:20PM +0200, David Hildenbrand wrote:
>  			continue;
>  		}
>  
> -		/* FOLL_DUMP to ignore special (like zero) pages */
> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> -
> -		if (IS_ERR_OR_NULL(page))
> +		folio = folio_walk_start(&fw, vma, addr, 0);
> +		if (!folio)
>  			continue;
>  
> -		folio = page_folio(page);
>  		if (!is_transparent_hugepage(folio))
>  			goto next;
>  
> @@ -3544,13 +3542,19 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>  
>  		if (!folio_trylock(folio))
>  			goto next;
> +		folio_get(folio);

Shouldn't we lock the folio after we increase the refcount on the folio?
i.e we do folio_get() first and then folio_trylock()?

That is how it was done before (through follow_page) and this patch changes
that. Maybe it doesn't matter? To me increasing the refcount and then
locking sounds more logical but I do see this ordering getting mixed all
over the kernel.

> +		folio_walk_end(&fw, vma);
>  
>  		if (!split_folio_to_order(folio, new_order))
>  			split++;
>  
>  		folio_unlock(folio);
> -next:
>  		folio_put(folio);
> +
> +		cond_resched();
> +		continue;
> +next:
> +		folio_walk_end(&fw, vma);
>  		cond_resched();
>  	}
>  	mmap_read_unlock(mm);
> -- 
> 2.45.2

-- 
Pankaj Raghav


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-15 10:04   ` Pankaj Raghav
@ 2024-08-15 10:20     ` David Hildenbrand
  2024-08-15 13:43       ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-15 10:20 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: agordeev, akpm, borntraeger, corbet, frankja, gerald.schaefer,
	gor, hca, imbrenda, kvm, linux-doc, linux-fsdevel, linux-kernel,
	linux-mm, linux-s390, svens, willy

On 15.08.24 12:04, Pankaj Raghav wrote:
> Hi David,
> 
> On Fri, Aug 02, 2024 at 05:55:20PM +0200, David Hildenbrand wrote:
>>   			continue;
>>   		}
>>   
>> -		/* FOLL_DUMP to ignore special (like zero) pages */
>> -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
>> -
>> -		if (IS_ERR_OR_NULL(page))
>> +		folio = folio_walk_start(&fw, vma, addr, 0);
>> +		if (!folio)
>>   			continue;
>>   
>> -		folio = page_folio(page);
>>   		if (!is_transparent_hugepage(folio))
>>   			goto next;
>>   
>> @@ -3544,13 +3542,19 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
>>   
>>   		if (!folio_trylock(folio))
>>   			goto next;
>> +		folio_get(folio);
> 
> Shouldn't we lock the folio after we increase the refcount on the folio?
> i.e we do folio_get() first and then folio_trylock()?
> 
> That is how it was done before (through follow_page) and this patch changes
> that. Maybe it doesn't matter? To me increasing the refcount and then
> locking sounds more logical but I do see this ordering getting mixed all
> over the kernel.

There is no need to grab a folio reference if we hold an implicit 
reference through the mapping that cannot go away (note that we hold the 
page table lock). Locking the folio is not special in that regard: we 
just have to make sure that the folio cannot get freed concurrently, 
which is the case here.

So here, we really only grab a reference if we have to -- when we are 
about to drop the page table lock and will continue using the folio 
afterwards.
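
In other words -- condensed into a hypothetical helper purely for
illustration, not the actual kernel code -- the pattern from the patch is:

static void walk_and_use_folio(struct vm_area_struct *vma, unsigned long addr)
{
        struct folio_walk fw;
        struct folio *folio;

        folio = folio_walk_start(&fw, vma, addr, 0);
        if (!folio)
                return;

        if (!folio_trylock(folio)) {
                /* No reference was taken, so there is nothing to put. */
                folio_walk_end(&fw, vma);
                return;
        }

        /*
         * The PTL taken by folio_walk_start() keeps the folio mapped and
         * thus alive; only now, right before dropping the PTL, do we take
         * a reference of our own.
         */
        folio_get(folio);
        folio_walk_end(&fw, vma);

        /* ... work on the locked folio, e.g. split it ... */

        folio_unlock(folio);
        folio_put(folio);
}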

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() from follow_page() to folio_walk
  2024-08-15 10:20     ` David Hildenbrand
@ 2024-08-15 13:43       ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 32+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-08-15 13:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pankaj Raghav, agordeev, akpm, borntraeger, corbet, frankja,
	gerald.schaefer, gor, hca, imbrenda, kvm, linux-doc,
	linux-fsdevel, linux-kernel, linux-mm, linux-s390, svens, willy

On Thu, Aug 15, 2024 at 12:20:04PM +0200, David Hildenbrand wrote:
> On 15.08.24 12:04, Pankaj Raghav wrote:
> > Hi David,
> > 
> > On Fri, Aug 02, 2024 at 05:55:20PM +0200, David Hildenbrand wrote:
> > >   			continue;
> > >   		}
> > > -		/* FOLL_DUMP to ignore special (like zero) pages */
> > > -		page = follow_page(vma, addr, FOLL_GET | FOLL_DUMP);
> > > -
> > > -		if (IS_ERR_OR_NULL(page))
> > > +		folio = folio_walk_start(&fw, vma, addr, 0);
> > > +		if (!folio)
> > >   			continue;
> > > -		folio = page_folio(page);
> > >   		if (!is_transparent_hugepage(folio))
> > >   			goto next;
> > > @@ -3544,13 +3542,19 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
> > >   		if (!folio_trylock(folio))
> > >   			goto next;
> > > +		folio_get(folio);
> > 
> > Shouldn't we lock the folio after we increase the refcount on the folio?
> > i.e we do folio_get() first and then folio_trylock()?
> > 
> > That is how it was done before (through follow_page) and this patch changes
> > that. Maybe it doesn't matter? To me increasing the refcount and then
> > locking sounds more logical but I do see this ordering getting mixed all
> > over the kernel.
> 
> There is no need to grab a folio reference if we hold an implicit reference
> through the mapping that cannot go away (note that we hold the page table
> lock). Locking the folio is not special in that regard: we just have to make
> sure that the folio cannot get freed concurrently, which is the case here.
> 
> So here, we really only grab a reference if we have to -- when we are about
> to drop the page table lock and will continue using the folio afterwards.
Got it. Thanks!
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-08-15 13:43 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-08-02 15:55 [PATCH v1 00/11] mm: replace follow_page() by folio_walk David Hildenbrand
2024-08-02 15:55 ` [PATCH v1 01/11] mm: provide vm_normal_(page|folio)_pmd() with CONFIG_PGTABLE_HAS_HUGE_LEAVES David Hildenbrand
2024-08-02 15:55 ` [PATCH v1 02/11] mm/pagewalk: introduce folio_walk_start() + folio_walk_end() David Hildenbrand
2024-08-07  9:17   ` Claudio Imbrenda
2024-08-07  9:31     ` David Hildenbrand
2024-08-02 15:55 ` [PATCH v1 03/11] mm/migrate: convert do_pages_stat_array() from follow_page() to folio_walk David Hildenbrand
2024-08-02 15:55 ` [PATCH v1 04/11] mm/migrate: convert add_page_for_migration() " David Hildenbrand
2024-08-02 15:55 ` [PATCH v1 05/11] mm/ksm: convert get_mergeable_page() " David Hildenbrand
2024-08-02 15:55 ` [PATCH v1 06/11] mm/ksm: convert scan_get_next_rmap_item() " David Hildenbrand
2024-08-02 15:55 ` [PATCH v1 07/11] mm/huge_memory: convert split_huge_pages_pid() " David Hildenbrand
2024-08-06  9:46   ` Ryan Roberts
2024-08-06  9:56     ` David Hildenbrand
2024-08-06 10:03       ` David Hildenbrand
2024-08-06 10:24         ` David Hildenbrand
2024-08-06 11:17           ` Ryan Roberts
2024-08-06 15:36           ` Zi Yan
2024-08-07  9:57             ` David Hildenbrand
2024-08-07 14:45               ` Zi Yan
2024-08-07 14:52                 ` David Hildenbrand
2024-08-15 10:04   ` Pankaj Raghav
2024-08-15 10:20     ` David Hildenbrand
2024-08-15 13:43       ` Pankaj Raghav (Samsung)
2024-08-02 15:55 ` [PATCH v1 08/11] s390/uv: convert gmap_destroy_page() " David Hildenbrand
2024-08-07  8:59   ` Claudio Imbrenda
2024-08-02 15:55 ` [PATCH v1 09/11] s390/mm/fault: convert do_secure_storage_access() " David Hildenbrand
2024-08-07  8:59   ` Claudio Imbrenda
2024-08-02 15:55 ` [PATCH v1 10/11] mm: remove follow_page() David Hildenbrand
2024-08-02 15:55 ` [PATCH v1 11/11] mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk David Hildenbrand
2024-08-03  5:34 ` [PATCH v1 00/11] mm: replace follow_page() by folio_walk Andrew Morton
2024-08-06 13:42 ` Claudio Imbrenda
2024-08-07  9:15 ` Claudio Imbrenda
2024-08-07  9:33   ` David Hildenbrand
