linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock()
@ 2024-09-24  6:09 Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 01/13] mm: pgtable: " Qi Zheng
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:09 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Changes in v4:
 - arm: adjust_pte() use pte_offset_map_rw_nolock()
   (use ptl != vmf->ptl to check if we are using split PTE locks)
   mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
   (move the pte_unmap() backward)
   mm: copy_pte_range() use pte_offset_map_rw_nolock()
   (remove pmd_same() check)
   mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
   (remove pmd_same() check)
   mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
   (move the assignment to pvmw->ptl backward)
 - remove [PATCH v3 14/14] (will be sent as a separate patch)
 - reorder patches
 - collect the Reviewed-bys
 - rebase onto the next-20240923

Changes in v3:
 - change to use VM_WARN_ON_ONCE() instead of BUG_ON() in pte_offset_map_rw_nolock()
   (David Hildenbrand)
 - modify the comment above the pte_offset_map_lock() in [PATCH v2 01/14]
   (David Hildenbrand and Muchun Song)
 - modify the comment above the pte_offset_map_rw_nolock() in [PATCH v2 06/14]
   (David Hildenbrand and Muchun Song)
 - also perform a pmd_same() check in [PATCH v2 08/14] and [PATCH v2 09/14]
   (since we may free the PTE page in retract_page_tables() without holding the
    read lock of mmap_lock)
 - collect the Acked-bys and Reviewed-bys
 - rebase onto the next-20240904

Changes in v2:
 - rename pte_offset_map_{readonly|maywrite}_nolock() to
   pte_offset_map_{ro|rw}_nolock() (LEROY Christophe)
 - make pte_offset_map_rw_nolock() not accept NULL parameters
   (David Hildenbrand)
 - rebase onto the next-20240822

Hi all,

As proposed by David Hildenbrand [1], this series introduces the following two
new helper functions to replace pte_offset_map_nolock().

1. pte_offset_map_ro_nolock()
2. pte_offset_map_rw_nolock()

As the name suggests, pte_offset_map_ro_nolock() is used for the read-only
case. In this case, only read-only operations will be performed on the PTE page
after the PTL is held. The RCU lock in pte_offset_map_nolock() will ensure that
the PTE page will not be freed, and there is no need to worry about whether the
pmd entry is modified. Therefore, pte_offset_map_ro_nolock() is just a renamed
version of pte_offset_map_nolock().
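
For example, a minimal read-only lookup could look like the sketch below. The
wrapper function and its name are made up for illustration; only
pte_offset_map_ro_nolock() and the locking pattern are what this series
introduces:

	/* Hypothetical helper: check whether a PTE is none under its PTL. */
	static bool example_pte_is_none(struct mm_struct *mm, pmd_t *pmd,
					unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte;
		bool ret;

		pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
		if (!pte)
			return false;	/* no PTE table */

		spin_lock(ptl);
		/* Only read the PTE page while the PTL is held. */
		ret = pte_none(ptep_get(pte));
		pte_unmap_unlock(pte, ptl);

		return ret;
	}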

pte_offset_map_rw_nolock() is used for the may-write case. In this case, the
pte or pmd entry may be modified after the PTL is held, so we need to ensure
that the pmd entry has not been modified concurrently. Therefore, in addition to
the name change, it also outputs the pmdval when successful. The users should
make sure the page table is stable, for example by checking pte_same() or by
checking pmd_same() against the output pmdval, before performing the write
operations.
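
As a rough caller-side sketch of the may-write case (again, the function below
is hypothetical; the recheck-and-retry pattern mirrors what the conversions in
this series, e.g. map_pte(), end up doing):

	/* Hypothetical helper: clear a PTE after revalidating the pmd entry. */
	static void example_clear_pte(struct mm_struct *mm, pmd_t *pmd,
				      unsigned long addr)
	{
		spinlock_t *ptl;
		pmd_t pmdval;
		pte_t *pte;

	again:
		pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
		if (!pte)
			return;		/* no PTE table */

		spin_lock(ptl);
		/* The PTE page may have been changed before the PTL was taken. */
		if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
			pte_unmap_unlock(pte, ptl);
			goto again;
		}

		/* The pmd entry is stable, so it is now safe to modify the PTE. */
		pte_clear(mm, addr, pte);
		pte_unmap_unlock(pte, ptl);
	}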

This series will convert all pte_offset_map_nolock() callers to the above two
helper functions one by one, and finally delete it completely.

This is also a preparation for reclaiming empty user PTE page table pages.

This series is based on the next-20240923.

Comments and suggestions are welcome!

Thanks,
Qi

Qi Zheng (13):
  mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
  powerpc: assert_pte_locked() use pte_offset_map_ro_nolock()
  mm: filemap: filemap_fault_recheck_pte_none() use
    pte_offset_map_ro_nolock()
  mm: khugepaged: __collapse_huge_page_swapin() use
    pte_offset_map_ro_nolock()
  arm: adjust_pte() use pte_offset_map_rw_nolock()
  mm: handle_pte_fault() use pte_offset_map_rw_nolock()
  mm: khugepaged: collapse_pte_mapped_thp() use
    pte_offset_map_rw_nolock()
  mm: copy_pte_range() use pte_offset_map_rw_nolock()
  mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
  mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
  mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
  mm: pgtable: remove pte_offset_map_nolock()

 Documentation/mm/split_page_table_lock.rst |  6 ++-
 arch/arm/mm/fault-armv.c                   | 53 +++++++++-------------
 arch/powerpc/mm/pgtable.c                  |  2 +-
 include/linux/mm.h                         |  7 ++-
 mm/filemap.c                               |  4 +-
 mm/khugepaged.c                            | 20 ++++++--
 mm/memory.c                                | 25 ++++++++--
 mm/mremap.c                                | 11 ++++-
 mm/page_vma_mapped.c                       | 25 ++++++++--
 mm/pgtable-generic.c                       | 43 +++++++++++++++---
 mm/userfaultfd.c                           | 15 ++++--
 mm/vmscan.c                                |  9 +++-
 12 files changed, 159 insertions(+), 61 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 01/13] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
@ 2024-09-24  6:09 ` Qi Zheng
  2024-09-24  6:24   ` Muchun Song
  2024-09-24 12:58   ` David Hildenbrand
  2024-09-24  6:09 ` [PATCH v4 02/13] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:09 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:

1) After acquiring PTL, only read-only operations are performed on the PTE
   page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
   that the PTE page will not be freed, and there is no need to worry
   about whether the pmd entry is modified.

2) After acquiring PTL, the pte or pmd entries may be modified. At this
   time, we need to ensure that the pmd entry has not been modified
   concurrently.

To more clearly distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where any
modification operations to the page table may happen after the
corresponding spinlock is held. But the users should make sure the page
table is stable, for example by checking pte_same() or by checking
pmd_same() against the output pmdval, before performing the write
operations.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.

Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/mm/split_page_table_lock.rst |  7 +++
 include/linux/mm.h                         |  5 +++
 mm/pgtable-generic.c                       | 50 ++++++++++++++++++++++
 3 files changed, 62 insertions(+)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
 	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_ro_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
  - pte_offset_map()
 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
  - pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 546a9406859ad..9a4550cd830c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3017,6 +3017,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 
 pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..262b7065a5a2e 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	return pte;
+}
+
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	VM_WARN_ON_ONCE(!pmdvalp);
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	*pmdvalp = pmdval;
+	return pte;
+}
+
 /*
  * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
  * __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * recheck *pmd once the lock is taken; in practice, no callsite needs that -
  * either the mmap_lock for write, or pte_same() check on contents, is enough.
  *
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it.  This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_ro_nolock() provides the correct spinlock
+ * pointer for the page table that it returns. Even after grabbing the spinlock,
+ * we might be looking either at a page table that is still mapped or one that
+ * was unmapped and is about to get freed. But for R/O access this is sufficient.
+ * So it is only applicable for read-only cases where any modification operations
+ * to the page table are not allowed even if the corresponding spinlock is held
+ * afterwards.
+ *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the pmdval.
+ * It is applicable for may-write cases where any modification operations to the
+ * page table may happen after the corresponding spinlock is held afterwards.
+ * But the users should make sure the page table is stable like checking pte_same()
+ * or checking pmd_same() by using the output pmdval before performing the write
+ * operations.
+ *
+ * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will
+ * be read-only/read-write protected.
+ *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
  * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 02/13] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 01/13] mm: pgtable: " Qi Zheng
@ 2024-09-24  6:09 ` Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 03/13] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:09 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In assert_pte_locked(), we just get the ptl and assert that it is held, so
convert it to using pte_offset_map_ro_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 arch/powerpc/mm/pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 7316396e452d8..61df5aed79894 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -398,7 +398,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	 */
 	if (pmd_none(*pmd))
 		return;
-	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
 	BUG_ON(!pte);
 	assert_spin_locked(ptl);
 	pte_unmap(pte);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 03/13] mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_ro_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 01/13] mm: pgtable: " Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 02/13] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
@ 2024-09-24  6:09 ` Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 04/13] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:09 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In filemap_fault_recheck_pte_none(), we just do a pte_none() check, so
convert it to using pte_offset_map_ro_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/filemap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3e46ca45e13dc..6c6ff8722550f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3271,8 +3271,8 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
 		return 0;
 
-	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
-				     &vmf->ptl);
+	ptep = pte_offset_map_ro_nolock(vma->vm_mm, vmf->pmd, vmf->address,
+					&vmf->ptl);
 	if (unlikely(!ptep))
 		return VM_FAULT_NOPAGE;
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 04/13] mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (2 preceding siblings ...)
  2024-09-24  6:09 ` [PATCH v4 03/13] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
@ 2024-09-24  6:09 ` Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 05/13] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:09 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In __collapse_huge_page_swapin(), we just use the ptl for the pte_same() check
in do_swap_page(); in other places, we directly use pte_offset_map_lock(). So
convert it to using pte_offset_map_ro_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/khugepaged.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f9c39898eaff6..6498721d4783a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,7 +1011,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		};
 
 		if (!pte++) {
-			pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+			/*
+			 * Here the ptl is only used to check pte_same() in
+			 * do_swap_page(), so readonly version is enough.
+			 */
+			pte = pte_offset_map_ro_nolock(mm, pmd, address, &ptl);
 			if (!pte) {
 				mmap_read_unlock(mm);
 				result = SCAN_PMD_NULL;
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 05/13] arm: adjust_pte() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (3 preceding siblings ...)
  2024-09-24  6:09 ` [PATCH v4 04/13] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
@ 2024-09-24  6:09 ` Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 06/13] mm: handle_pte_fault() " Qi Zheng
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:09 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In do_adjust_pte(), we may modify the pte entry. The corresponding pmd
entry may have been modified concurrently. Therefore, in order to ensure
the stability of the pmd entry, use pte_offset_map_rw_nolock() to replace
pte_offset_map_nolock(), and do a pmd_same() check after holding the PTL.

All callers of update_mmu_cache_range() hold the vmf->ptl, so we can
determine whether split PTE locks are being used by doing the following,
just as we do elsewhere in the kernel.

	ptl != vmf->ptl

And then we can delete the do_pte_lock() and do_pte_unlock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
Hi David and Muchun, I did not remove your Acked-by and Reviewed-by tags since
there is no functional change.

 arch/arm/mm/fault-armv.c | 53 +++++++++++++++++-----------------------
 1 file changed, 22 insertions(+), 31 deletions(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 831793cd6ff94..2bec87c3327d2 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -61,32 +61,8 @@ static int do_adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	return ret;
 }
 
-#if defined(CONFIG_SPLIT_PTE_PTLOCKS)
-/*
- * If we are using split PTE locks, then we need to take the page
- * lock here.  Otherwise we are using shared mm->page_table_lock
- * which is already locked, thus cannot take it.
- */
-static inline void do_pte_lock(spinlock_t *ptl)
-{
-	/*
-	 * Use nested version here to indicate that we are already
-	 * holding one similar spinlock.
-	 */
-	spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
-}
-
-static inline void do_pte_unlock(spinlock_t *ptl)
-{
-	spin_unlock(ptl);
-}
-#else /* !defined(CONFIG_SPLIT_PTE_PTLOCKS) */
-static inline void do_pte_lock(spinlock_t *ptl) {}
-static inline void do_pte_unlock(spinlock_t *ptl) {}
-#endif /* defined(CONFIG_SPLIT_PTE_PTLOCKS) */
-
 static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
-	unsigned long pfn)
+		      unsigned long pfn, struct vm_fault *vmf)
 {
 	spinlock_t *ptl;
 	pgd_t *pgd;
@@ -94,6 +70,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	pmd_t pmdval;
 	int ret;
 
 	pgd = pgd_offset(vma->vm_mm, address);
@@ -112,20 +89,33 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	if (pmd_none_or_clear_bad(pmd))
 		return 0;
 
+again:
 	/*
 	 * This is called while another page table is mapped, so we
 	 * must use the nested version.  This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+	pte = pte_offset_map_rw_nolock(vma->vm_mm, pmd, address, &pmdval, &ptl);
 	if (!pte)
 		return 0;
 
-	do_pte_lock(ptl);
+	/*
+	 * If we are using split PTE locks, then we need to take the page
+	 * lock here.  Otherwise we are using shared mm->page_table_lock
+	 * which is already locked, thus cannot take it.
+	 */
+	if (ptl != vmf->ptl) {
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+		if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+			pte_unmap_unlock(pte, ptl);
+			goto again;
+		}
+	}
 
 	ret = do_adjust_pte(vma, address, pfn, pte);
 
-	do_pte_unlock(ptl);
+	if (ptl != vmf->ptl)
+		spin_unlock(ptl);
 	pte_unmap(pte);
 
 	return ret;
@@ -133,7 +123,8 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 
 static void
 make_coherent(struct address_space *mapping, struct vm_area_struct *vma,
-	unsigned long addr, pte_t *ptep, unsigned long pfn)
+	      unsigned long addr, pte_t *ptep, unsigned long pfn,
+	      struct vm_fault *vmf)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *mpnt;
@@ -160,7 +151,7 @@ make_coherent(struct address_space *mapping, struct vm_area_struct *vma,
 		if (!(mpnt->vm_flags & VM_MAYSHARE))
 			continue;
 		offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
-		aliases += adjust_pte(mpnt, mpnt->vm_start + offset, pfn);
+		aliases += adjust_pte(mpnt, mpnt->vm_start + offset, pfn, vmf);
 	}
 	flush_dcache_mmap_unlock(mapping);
 	if (aliases)
@@ -203,7 +194,7 @@ void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
 		__flush_dcache_folio(mapping, folio);
 	if (mapping) {
 		if (cache_is_vivt())
-			make_coherent(mapping, vma, addr, ptep, pfn);
+			make_coherent(mapping, vma, addr, ptep, pfn, vmf);
 		else if (vma->vm_flags & VM_EXEC)
 			__flush_icache_all();
 	}
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 06/13] mm: handle_pte_fault() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (4 preceding siblings ...)
  2024-09-24  6:09 ` [PATCH v4 05/13] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-09-24  6:09 ` Qi Zheng
  2024-09-24  6:09 ` [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:09 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In handle_pte_fault(), we may modify the vmf->pte after acquiring the
vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will do the pte_same() check, there is no need to get pmdval to do a
pmd_same() check; just pass a dummy variable to it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/memory.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a245d28787034..6432b636d1ba7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5750,14 +5750,24 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		vmf->pte = NULL;
 		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
 	} else {
+		pmd_t dummy_pmdval;
+
 		/*
 		 * A regular pmd is established and it can't morph into a huge
 		 * pmd by anon khugepaged, since that takes mmap_lock in write
 		 * mode; but shmem or file collapse to THP could still morph
 		 * it into a huge pmd: just retry later if so.
+		 *
+		 * Use the maywrite version to indicate that vmf->pte may be
+		 * modified, but since we will use pte_same() to detect the
+		 * change of the !pte_none() entry, there is no need to recheck
+		 * the pmdval. Here we choose to pass a dummy variable instead
+		 * of NULL, which helps new users think about why this place is
+		 * special.
 		 */
-		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
-						 vmf->address, &vmf->ptl);
+		vmf->pte = pte_offset_map_rw_nolock(vmf->vma->vm_mm, vmf->pmd,
+						    vmf->address, &dummy_pmdval,
+						    &vmf->ptl);
 		if (unlikely(!vmf->pte))
 			return 0;
 		vmf->orig_pte = ptep_get_lockless(vmf->pte);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (5 preceding siblings ...)
  2024-09-24  6:09 ` [PATCH v4 06/13] mm: handle_pte_fault() " Qi Zheng
@ 2024-09-24  6:09 ` Qi Zheng
  2024-09-24  7:14   ` Muchun Song
  2024-09-24  6:10 ` [PATCH v4 08/13] mm: copy_pte_range() " Qi Zheng
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:09 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
this time, the pte_same() check is not performed after the PTL is held. So
we should get pgt_pmd and do a pmd_same() check after the ptl is held.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/khugepaged.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6498721d4783a..8ab79c13d077f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1605,7 +1605,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+	start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
@@ -1613,6 +1613,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	else if (ptl != pml)
 		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+	if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+		goto abort;
+
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1645,7 +1648,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		nr_ptes++;
 	}
 
-	pte_unmap(start_pte);
 	if (!pml)
 		spin_unlock(ptl);
 
@@ -1658,13 +1660,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	/* step 4: remove empty page table */
 	if (!pml) {
 		pml = pmd_lock(mm, pmd);
-		if (ptl != pml)
+		if (ptl != pml) {
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+			if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+				spin_unlock(pml);
+				goto abort;
+			}
+		}
 	}
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
 	if (ptl != pml)
 		spin_unlock(ptl);
+	pte_unmap(start_pte);
 	spin_unlock(pml);
 
 	mmu_notifier_invalidate_range_end(&range);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 08/13] mm: copy_pte_range() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (6 preceding siblings ...)
  2024-09-24  6:09 ` [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
@ 2024-09-24  6:10 ` Qi Zheng
  2024-09-24  7:35   ` Muchun Song
  2024-09-24  6:10 ` [PATCH v4 09/13] mm: mremap: move_ptes() " Qi Zheng
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:10 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In copy_pte_range(), we may modify the src_pte entry after holding the
src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we
already hold the exclusive mmap_lock, and copy_pte_range() and
retract_page_tables() use vma->anon_vma for mutual exclusion, the PTE
page is stable, so there is no need to get pmdval and do a pmd_same() check.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/memory.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6432b636d1ba7..c19cf14e1c565 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1086,6 +1086,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
+	pmd_t dummy_pmdval;
 	pte_t ptent;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress, max_nr, ret = 0;
@@ -1111,7 +1112,15 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		ret = -ENOMEM;
 		goto out;
 	}
-	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+
+	/*
+	 * We already hold the exclusive mmap_lock, the copy_pte_range() and
+	 * retract_page_tables() are using vma->anon_vma to be exclusive, so
+	 * the PTE page is stable, and there is no need to get pmdval and do
+	 * pmd_same() check.
+	 */
+	src_pte = pte_offset_map_rw_nolock(src_mm, src_pmd, addr, &dummy_pmdval,
+					   &src_ptl);
 	if (!src_pte) {
 		pte_unmap_unlock(dst_pte, dst_ptl);
 		/* ret == 0 */
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 09/13] mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (7 preceding siblings ...)
  2024-09-24  6:10 ` [PATCH v4 08/13] mm: copy_pte_range() " Qi Zheng
@ 2024-09-24  6:10 ` Qi Zheng
  2024-09-24  7:39   ` Muchun Song
  2024-09-24  6:10 ` [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:10 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
convert it to using pte_offset_map_rw_nolock(). Now the new_pte is none, so
the hpage_collapse_scan_file() path cannot find it by traversing
file->f_mapping, and there is no concurrency with retract_page_tables(). In
addition, we already hold the exclusive mmap_lock, so this new_pte page is
stable and there is no need to get pmdval and do a pmd_same() check.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/mremap.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 24712f8dbb6b5..9dffd4a5b4d18 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -143,6 +143,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
 	unsigned long len = old_end - old_addr;
+	pmd_t dummy_pmdval;
 	int err = 0;
 
 	/*
@@ -175,7 +176,15 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		err = -EAGAIN;
 		goto out;
 	}
-	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+	/*
+	 * Now new_pte is none, so hpage_collapse_scan_file() path can not find
+	 * this by traversing file->f_mapping, so there is no concurrency with
+	 * retract_page_tables(). In addition, we already hold the exclusive
+	 * mmap_lock, so this new_pte page is stable, so there is no need to get
+	 * pmdval and do pmd_same() check.
+	 */
+	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &dummy_pmdval,
+					   &new_ptl);
 	if (!new_pte) {
 		pte_unmap_unlock(old_pte, old_ptl);
 		err = -EAGAIN;
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (8 preceding siblings ...)
  2024-09-24  6:10 ` [PATCH v4 09/13] mm: mremap: move_ptes() " Qi Zheng
@ 2024-09-24  6:10 ` Qi Zheng
  2024-09-24  8:25   ` Muchun Song
  2024-09-24  6:10 ` [PATCH v4 11/13] mm: userfaultfd: move_pages_pte() " Qi Zheng
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:10 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In the caller of map_pte(), we may modify the pvmw->pte after acquiring
the pvmw->ptl, so convert it to using pte_offset_map_rw_nolock(). At
this time, the pte_same() check is not performed after the pvmw->ptl is
held, so we should get pmdval and do a pmd_same() check to ensure the
stability of pvmw->pmd.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/page_vma_mapped.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa2087..6410f29b37c1b 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 	return false;
 }
 
-static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
+		    spinlock_t **ptlp)
 {
 	pte_t ptent;
+	pmd_t pmdval;
 
 	if (pvmw->flags & PVMW_SYNC) {
 		/* Use the stricter lookup */
@@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 		return !!pvmw->pte;
 	}
 
+again:
 	/*
 	 * It is important to return the ptl corresponding to pte,
 	 * in case *pvmw->pmd changes underneath us; so we need to
@@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 	 * proceeds to loop over next ptes, and finds a match later.
 	 * Though, in most cases, page lock already protects this.
 	 */
-	pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
-					  pvmw->address, ptlp);
+	pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+					     pvmw->address, &pmdval, ptlp);
 	if (!pvmw->pte)
 		return false;
+	*pmdvalp = pmdval;
 
 	ptent = ptep_get(pvmw->pte);
 
@@ -67,8 +71,13 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 	} else if (!pte_present(ptent)) {
 		return false;
 	}
+	spin_lock(*ptlp);
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
+		pte_unmap_unlock(pvmw->pte, *ptlp);
+		goto again;
+	}
 	pvmw->ptl = *ptlp;
-	spin_lock(pvmw->ptl);
+
 	return true;
 }
 
@@ -278,7 +287,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
-		if (!map_pte(pvmw, &ptl)) {
+		if (!map_pte(pvmw, &pmde, &ptl)) {
 			if (!pvmw->pte)
 				goto restart;
 			goto next_pte;
@@ -307,6 +316,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (!pvmw->ptl) {
 			pvmw->ptl = ptl;
 			spin_lock(pvmw->ptl);
+			if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
+				pte_unmap_unlock(pvmw->pte, pvmw->ptl);
+				pvmw->ptl = NULL;
+				pvmw->pte = NULL;
+				goto restart;
+			}
 		}
 		goto this_pte;
 	} while (pvmw->address < end);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 11/13] mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (9 preceding siblings ...)
  2024-09-24  6:10 ` [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
@ 2024-09-24  6:10 ` Qi Zheng
  2024-09-24  6:10 ` [PATCH v4 12/13] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
  2024-09-24  6:10 ` [PATCH v4 13/13] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
  12 siblings, 0 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:10 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring
the ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will use pte_same() to detect the change of the pte entry, there is no
need to get pmdval, so just pass a dummy variable to it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/userfaultfd.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ce13c40626472..48b87c62fc3dd 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1135,7 +1135,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	spinlock_t *src_ptl, *dst_ptl;
 	pte_t *src_pte = NULL;
 	pte_t *dst_pte = NULL;
-
+	pmd_t dummy_pmdval;
 	struct folio *src_folio = NULL;
 	struct anon_vma *src_anon_vma = NULL;
 	struct mmu_notifier_range range;
@@ -1146,7 +1146,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				src_addr, src_addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 retry:
-	dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+	/*
+	 * Use the maywrite version to indicate that dst_pte will be modified,
+	 * but since we will use pte_same() to detect the change of the pte
+	 * entry, there is no need to get pmdval, so just pass a dummy variable
+	 * to it.
+	 */
+	dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dummy_pmdval,
+					   &dst_ptl);
 
 	/* Retry if a huge pmd materialized from under us */
 	if (unlikely(!dst_pte)) {
@@ -1154,7 +1161,9 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		goto out;
 	}
 
-	src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+	/* same as dst_pte */
+	src_pte = pte_offset_map_rw_nolock(mm, src_pmd, src_addr, &dummy_pmdval,
+					   &src_ptl);
 
 	/*
 	 * We held the mmap_lock for reading so MADV_DONTNEED
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 12/13] mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (10 preceding siblings ...)
  2024-09-24  6:10 ` [PATCH v4 11/13] mm: userfaultfd: move_pages_pte() " Qi Zheng
@ 2024-09-24  6:10 ` Qi Zheng
  2024-09-24 14:39   ` David Hildenbrand
  2024-09-24  6:10 ` [PATCH v4 13/13] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
  12 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:10 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In walk_pte_range(), we may modify the pte entry after holding the ptl, so
convert it to using pte_offset_map_rw_nolock(). At this time, the
pte_same() check is not performed after the ptl is held, so we should get
pmdval and do a pmd_same() check to ensure the stability of the pmd entry.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/vmscan.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 749cdc110c745..bdca94e663bc5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3375,8 +3375,10 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	pmd_t pmdval;
 
-	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval,
+				       &ptl);
 	if (!pte)
 		return false;
 	if (!spin_trylock(ptl)) {
@@ -3384,6 +3386,11 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		return false;
 	}
 
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+		pte_unmap_unlock(pte, ptl);
+		return false;
+	}
+
 	arch_enter_lazy_mmu_mode();
 restart:
 	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v4 13/13] mm: pgtable: remove pte_offset_map_nolock()
  2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (11 preceding siblings ...)
  2024-09-24  6:10 ` [PATCH v4 12/13] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
@ 2024-09-24  6:10 ` Qi Zheng
  2024-09-24 13:02   ` David Hildenbrand
  12 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  6:10 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Now that there are no users of pte_offset_map_nolock(), remove it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 Documentation/mm/split_page_table_lock.rst |  3 ---
 include/linux/mm.h                         |  2 --
 mm/pgtable-generic.c                       | 21 ---------------------
 3 files changed, 26 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 08d0e706a32db..581446d4a4eba 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -16,9 +16,6 @@ There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_lock()
 	maps PTE and takes PTE table lock, returns pointer to PTE with
 	pointer to its PTE table lock, or returns NULL if no PTE table;
- - pte_offset_map_nolock()
-	maps PTE, returns pointer to PTE with pointer to its PTE table
-	lock (not taken), or returns NULL if no PTE table;
  - pte_offset_map_ro_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
 	lock (not taken), or returns NULL if no PTE table;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a4550cd830c9..e2a4502ab019b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3015,8 +3015,6 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			unsigned long addr, spinlock_t **ptlp);
 pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long addr, spinlock_t **ptlp);
 pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 262b7065a5a2e..c68aa655b7872 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,18 +305,6 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	return NULL;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			     unsigned long addr, spinlock_t **ptlp)
-{
-	pmd_t pmdval;
-	pte_t *pte;
-
-	pte = __pte_offset_map(pmd, addr, &pmdval);
-	if (likely(pte))
-		*ptlp = pte_lockptr(mm, &pmdval);
-	return pte;
-}
-
 pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long addr, spinlock_t **ptlp)
 {
@@ -374,15 +362,6 @@ pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
  * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
  * afterwards.
  *
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it.  This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns.  In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
- *
  * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
  * but when successful, it also outputs a pointer to the spinlock in ptlp - as
  * pte_offset_map_lock() does, but in this case without locking it.  This helps
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 01/13] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
  2024-09-24  6:09 ` [PATCH v4 01/13] mm: pgtable: " Qi Zheng
@ 2024-09-24  6:24   ` Muchun Song
  2024-09-24 12:58   ` David Hildenbrand
  1 sibling, 0 replies; 29+ messages in thread
From: Muchun Song @ 2024-09-24  6:24 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 24, 2024, at 14:09, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> Currently, the usage of pte_offset_map_nolock() can be divided into the
> following two cases:
> 
> 1) After acquiring PTL, only read-only operations are performed on the PTE
>   page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
>   that the PTE page will not be freed, and there is no need to worry
>   about whether the pmd entry is modified.
> 
> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>   time, we need to ensure that the pmd entry has not been modified
>   concurrently.
> 
> To more clearly distinguish between these two cases, this commit
> introduces two new helper functions to replace pte_offset_map_nolock().
> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
> pmdval when successful. It is applicable for may-write cases where any
> modification operations to the page table may happen after the
> corresponding spinlock is held. But the users should make sure the page
> table is stable, for example by checking pte_same() or by checking
> pmd_same() against the output pmdval, before performing the write
> operations.
> 
> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
> will be read-only/read-write protected.
> 
> Subsequent commits will convert pte_offset_map_nolock() into the above
> two functions one by one, and finally completely delete it.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Reviewed-by: Muchun Song <muchun.song@linux.dev>




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
  2024-09-24  6:09 ` [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
@ 2024-09-24  7:14   ` Muchun Song
  2024-09-24  7:29     ` Qi Zheng
  0 siblings, 1 reply; 29+ messages in thread
From: Muchun Song @ 2024-09-24  7:14 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
> acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
> this time, the pte_same() check is not performed after the PTL is held. So
> we should get pgt_pmd and do a pmd_same() check after the ptl is held.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/khugepaged.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6498721d4783a..8ab79c13d077f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1605,7 +1605,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>    if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>        pml = pmd_lock(mm, pmd);
> 
> -    start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
> +    start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
>    if (!start_pte)        /* mmap_lock + page lock should prevent this */
>        goto abort;
>    if (!pml)
> @@ -1613,6 +1613,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>    else if (ptl != pml)
>        spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> 
> +    if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
> +        goto abort;
> +
>    /* step 2: clear page table and adjust rmap */
>    for (i = 0, addr = haddr, pte = start_pte;
>         i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
> @@ -1645,7 +1648,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>        nr_ptes++;
>    }
> 
> -    pte_unmap(start_pte);
>    if (!pml)
>        spin_unlock(ptl);
> 
> @@ -1658,13 +1660,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>    /* step 4: remove empty page table */
>    if (!pml) {
>        pml = pmd_lock(mm, pmd);
> -        if (ptl != pml)
> +        if (ptl != pml) {
>            spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +            if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
> +                spin_unlock(pml);
> +                goto abort;

This drops the folio reference and the mm counter twice: once at the abort label and once in step 3.

> +            }
> +        }
>    }
>    pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>    pmdp_get_lockless_sync();
>    if (ptl != pml)
>        spin_unlock(ptl);
> +    pte_unmap(start_pte);
>    spin_unlock(pml);

Why not?

pte_unmap_unlock(start_pte, ptl);
if (pml != ptl)
        spin_unlock(pml);

> 
>    mmu_notifier_invalidate_range_end(&range);
> --
> 2.20.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
  2024-09-24  7:14   ` Muchun Song
@ 2024-09-24  7:29     ` Qi Zheng
  2024-09-24  8:52       ` Muchun Song
  0 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  7:29 UTC (permalink / raw)
  To: Muchun Song
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



On 2024/9/24 15:14, Muchun Song wrote:
> 
> 
>> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>> In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
>> acquiring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
>> this time, the pte_same() check is not performed after the PTL is held. So
>> we should get pgt_pmd and do a pmd_same() check after the ptl is held.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/khugepaged.c | 14 +++++++++++---
>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 6498721d4783a..8ab79c13d077f 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1605,7 +1605,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>     if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>>         pml = pmd_lock(mm, pmd);
>>
>> -    start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
>> +    start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
>>     if (!start_pte)        /* mmap_lock + page lock should prevent this */
>>         goto abort;
>>     if (!pml)
>> @@ -1613,6 +1613,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>     else if (ptl != pml)
>>         spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>
>> +    if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>> +        goto abort;
>> +
>>     /* step 2: clear page table and adjust rmap */
>>     for (i = 0, addr = haddr, pte = start_pte;
>>          i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>> @@ -1645,7 +1648,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>         nr_ptes++;
>>     }
>>
>> -    pte_unmap(start_pte);
>>     if (!pml)
>>         spin_unlock(ptl);
>>
>> @@ -1658,13 +1660,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>     /* step 4: remove empty page table */
>>     if (!pml) {
>>         pml = pmd_lock(mm, pmd);
>> -        if (ptl != pml)
>> +        if (ptl != pml) {
>>             spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>> +            if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>> +                spin_unlock(pml);
>> +                goto abort;
> 
> Drop the reference of folio and the mm counter twice at the label of abort and the step 3.

My bad, should set nr_ptes to 0 and call flush_tlb_mm() here, right?

> 
>> +            }
>> +        }
>>     }
>>     pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>>     pmdp_get_lockless_sync();
>>     if (ptl != pml)
>>         spin_unlock(ptl);
>> +    pte_unmap(start_pte);
>>     spin_unlock(pml);
> 
> Why not?
> 
> pte_unmap_unlock(start_pte, ptl);
> if (pml != ptl)
>          spin_unlock(pml);

Both are fine, will do.

Thanks,
Qi

> 
>>
>>     mmu_notifier_invalidate_range_end(&range);
>> --
>> 2.20.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 08/13] mm: copy_pte_range() use pte_offset_map_rw_nolock()
  2024-09-24  6:10 ` [PATCH v4 08/13] mm: copy_pte_range() " Qi Zheng
@ 2024-09-24  7:35   ` Muchun Song
  0 siblings, 0 replies; 29+ messages in thread
From: Muchun Song @ 2024-09-24  7:35 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 24, 2024, at 14:10, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> In copy_pte_range(), we may modify the src_pte entry after holding the
> src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we
> already hold the exclusive mmap_lock, and copy_pte_range() and
> retract_page_tables() use vma->anon_vma for mutual exclusion, the PTE
> page is stable, so there is no need to get pmdval and do a pmd_same() check.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Reviewed-by: Muchun Song <muchun.song@linux.dev>




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 09/13] mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
  2024-09-24  6:10 ` [PATCH v4 09/13] mm: mremap: move_ptes() " Qi Zheng
@ 2024-09-24  7:39   ` Muchun Song
  0 siblings, 0 replies; 29+ messages in thread
From: Muchun Song @ 2024-09-24  7:39 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 24, 2024, at 14:10, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
> convert it to using pte_offset_map_rw_nolock(). Now the new_pte is none, so
> the hpage_collapse_scan_file() path cannot find it by traversing
> file->f_mapping, and there is no concurrency with retract_page_tables(). In
> addition, we already hold the exclusive mmap_lock, so this new_pte page is
> stable and there is no need to get pmdval and do a pmd_same() check.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Reviewed-by: Muchun Song <muchun.song@linux.dev>




^ permalink raw reply	[flat|nested] 29+ messages in thread
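The two commit messages above describe the may-write case in which the PTE page
is already known to be stable, so the pmdval output of pte_offset_map_rw_nolock()
can simply be discarded. As a rough illustration only (the dummy_pmdval name and
the surrounding error handling are hypothetical; the actual copy_pte_range() and
move_ptes() hunks are not quoted in this thread), the pattern is roughly:

	pmd_t dummy_pmdval;
	spinlock_t *ptl;
	pte_t *pte;

	/*
	 * The exclusive mmap_lock (plus, for copy_pte_range(), the anon_vma
	 * exclusion with retract_page_tables()) keeps the PTE page stable,
	 * so the returned pmd value is never rechecked with pmd_same().
	 */
	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &dummy_pmdval, &ptl);
	if (!pte)
		return -EAGAIN;		/* hypothetical error handling */
	spin_lock(ptl);
	/* ... modify PTEs under ptl ... */
	pte_unmap_unlock(pte, ptl);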

* Re: [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  2024-09-24  6:10 ` [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
@ 2024-09-24  8:25   ` Muchun Song
  2024-09-24  8:33     ` Qi Zheng
  0 siblings, 1 reply; 29+ messages in thread
From: Muchun Song @ 2024-09-24  8:25 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> In the caller of map_pte(), we may modify pvmw->pte after acquiring
> pvmw->ptl, so convert it to use pte_offset_map_rw_nolock(). At this point
> the pte_same() check is not performed after pvmw->ptl is held, so we should
> get pmdval and do a pmd_same() check to ensure the stability of pvmw->pmd.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/page_vma_mapped.c | 25 ++++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index ae5cc42aa2087..6410f29b37c1b 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
>    return false;
> }
> 
> -static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
> +static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
> +            spinlock_t **ptlp)
> {
>    pte_t ptent;
> +    pmd_t pmdval;

Why declare a new variable? Can’t we use *pmdvalp instead?

> 
>    if (pvmw->flags & PVMW_SYNC) {
>        /* Use the stricter lookup */
> @@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>        return !!pvmw->pte;
>    }
> 
> +again:
>    /*
>     * It is important to return the ptl corresponding to pte,
>     * in case *pvmw->pmd changes underneath us; so we need to
> @@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>     * proceeds to loop over next ptes, and finds a match later.
>     * Though, in most cases, page lock already protects this.
>     */
> -    pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
> -                      pvmw->address, ptlp);
> +    pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
> +                         pvmw->address, &pmdval, ptlp);
>    if (!pvmw->pte)
>        return false;
> +    *pmdvalp = pmdval;
> 
>    ptent = ptep_get(pvmw->pte);
> 
> @@ -67,8 +71,13 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>    } else if (!pte_present(ptent)) {
>        return false;
>    }
> +    spin_lock(*ptlp);
> +    if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
> +        pte_unmap_unlock(pvmw->pte, *ptlp);
> +        goto again;
> +    }
>    pvmw->ptl = *ptlp;
> -    spin_lock(pvmw->ptl);
> +
>    return true;
> }
> 
> @@ -278,7 +287,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>            step_forward(pvmw, PMD_SIZE);
>            continue;
>        }
> -        if (!map_pte(pvmw, &ptl)) {
> +        if (!map_pte(pvmw, &pmde, &ptl)) {
>            if (!pvmw->pte)
>                goto restart;
>            goto next_pte;
> @@ -307,6 +316,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>        if (!pvmw->ptl) {
>            pvmw->ptl = ptl;
>            spin_lock(pvmw->ptl);
> +            if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
> +                pte_unmap_unlock(pvmw->pte, pvmw->ptl);
> +                pvmw->ptl = NULL;
> +                pvmw->pte = NULL;
> +                goto restart;
> +            }
>        }
>        goto this_pte;
>    } while (pvmw->address < end);
> --
> 2.20.1
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  2024-09-24  8:25   ` Muchun Song
@ 2024-09-24  8:33     ` Qi Zheng
  2024-09-24  8:39       ` Muchun Song
  0 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  8:33 UTC (permalink / raw)
  To: Muchun Song
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



On 2024/9/24 16:25, Muchun Song wrote:
> 
> 
>> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>
>> In the caller of map_pte(), we may modify pvmw->pte after acquiring
>> pvmw->ptl, so convert it to use pte_offset_map_rw_nolock(). At this point
>> the pte_same() check is not performed after pvmw->ptl is held, so we should
>> get pmdval and do a pmd_same() check to ensure the stability of pvmw->pmd.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/page_vma_mapped.c | 25 ++++++++++++++++++++-----
>> 1 file changed, 20 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index ae5cc42aa2087..6410f29b37c1b 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
>>     return false;
>> }
>>
>> -static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>> +static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
>> +            spinlock_t **ptlp)
>> {
>>     pte_t ptent;
>> +    pmd_t pmdval;
> 
> Why declare a new variable? Can’t we use *pmdvalp instead?

It's just a coding habit, both are fine for me.

> 
>>
>>     if (pvmw->flags & PVMW_SYNC) {
>>         /* Use the stricter lookup */
>> @@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>         return !!pvmw->pte;
>>     }
>>
>> +again:
>>     /*
>>      * It is important to return the ptl corresponding to pte,
>>      * in case *pvmw->pmd changes underneath us; so we need to
>> @@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>      * proceeds to loop over next ptes, and finds a match later.
>>      * Though, in most cases, page lock already protects this.
>>      */
>> -    pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>> -                      pvmw->address, ptlp);
>> +    pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>> +                         pvmw->address, &pmdval, ptlp);
>>     if (!pvmw->pte)
>>         return false;
>> +    *pmdvalp = pmdval;
>>
>>     ptent = ptep_get(pvmw->pte);
>>
>> @@ -67,8 +71,13 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>     } else if (!pte_present(ptent)) {
>>         return false;
>>     }
>> +    spin_lock(*ptlp);
>> +    if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
>> +        pte_unmap_unlock(pvmw->pte, *ptlp);
>> +        goto again;
>> +    }
>>     pvmw->ptl = *ptlp;
>> -    spin_lock(pvmw->ptl);
>> +
>>     return true;
>> }
>>
>> @@ -278,7 +287,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>             step_forward(pvmw, PMD_SIZE);
>>             continue;
>>         }
>> -        if (!map_pte(pvmw, &ptl)) {
>> +        if (!map_pte(pvmw, &pmde, &ptl)) {
>>             if (!pvmw->pte)
>>                 goto restart;
>>             goto next_pte;
>> @@ -307,6 +316,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>         if (!pvmw->ptl) {
>>             pvmw->ptl = ptl;
>>             spin_lock(pvmw->ptl);
>> +            if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
>> +                pte_unmap_unlock(pvmw->pte, pvmw->ptl);
>> +                pvmw->ptl = NULL;
>> +                pvmw->pte = NULL;
>> +                goto restart;
>> +            }
>>         }
>>         goto this_pte;
>>     } while (pvmw->address < end);
>> --
>> 2.20.1
>>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  2024-09-24  8:33     ` Qi Zheng
@ 2024-09-24  8:39       ` Muchun Song
  2024-09-24  8:45         ` Qi Zheng
  0 siblings, 1 reply; 29+ messages in thread
From: Muchun Song @ 2024-09-24  8:39 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 24, 2024, at 16:33, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> 
> 
> On 2024/9/24 16:25, Muchun Song wrote:
>>> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> 
>>> In the caller of map_pte(), we may modify pvmw->pte after acquiring
>>> pvmw->ptl, so convert it to use pte_offset_map_rw_nolock(). At this point
>>> the pte_same() check is not performed after pvmw->ptl is held, so we should
>>> get pmdval and do a pmd_same() check to ensure the stability of pvmw->pmd.
>>> 
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>> mm/page_vma_mapped.c | 25 ++++++++++++++++++++-----
>>> 1 file changed, 20 insertions(+), 5 deletions(-)
>>> 
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index ae5cc42aa2087..6410f29b37c1b 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
>>>    return false;
>>> }
>>> 
>>> -static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>> +static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
>>> +            spinlock_t **ptlp)
>>> {
>>>    pte_t ptent;
>>> +    pmd_t pmdval;
>> Why declare a new variable? Can’t we use *pmdvalp instead?
> 
> It's just a coding habit, both are fine for me.

Agreed. But sometimes it can make the code look a little simpler.

> 
>>> 
>>>    if (pvmw->flags & PVMW_SYNC) {
>>>        /* Use the stricter lookup */
>>> @@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>>        return !!pvmw->pte;
>>>    }
>>> 
>>> +again:
>>>    /*
>>>     * It is important to return the ptl corresponding to pte,
>>>     * in case *pvmw->pmd changes underneath us; so we need to
>>> @@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>>     * proceeds to loop over next ptes, and finds a match later.
>>>     * Though, in most cases, page lock already protects this.
>>>     */
>>> -    pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>>> -                      pvmw->address, ptlp);
>>> +    pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>>> +                         pvmw->address, &pmdval, ptlp);
>>>    if (!pvmw->pte)
>>>        return false;
>>> +    *pmdvalp = pmdval;

For instance, here, it is unnecessary if pmdvalp is passed directly to
pte_offset_map_rw_nolock.

>>> 
>>>    ptent = ptep_get(pvmw->pte);
>>> 
>>> @@ -67,8 +71,13 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>>    } else if (!pte_present(ptent)) {
>>>        return false;
>>>    }
>>> +    spin_lock(*ptlp);
>>> +    if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
>>> +        pte_unmap_unlock(pvmw->pte, *ptlp);
>>> +        goto again;
>>> +    }
>>>    pvmw->ptl = *ptlp;
>>> -    spin_lock(pvmw->ptl);
>>> +
>>>    return true;
>>> }
>>> 
>>> @@ -278,7 +287,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>            step_forward(pvmw, PMD_SIZE);
>>>            continue;
>>>        }
>>> -        if (!map_pte(pvmw, &ptl)) {
>>> +        if (!map_pte(pvmw, &pmde, &ptl)) {
>>>            if (!pvmw->pte)
>>>                goto restart;
>>>            goto next_pte;
>>> @@ -307,6 +316,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>        if (!pvmw->ptl) {
>>>            pvmw->ptl = ptl;
>>>            spin_lock(pvmw->ptl);
>>> +            if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
>>> +                pte_unmap_unlock(pvmw->pte, pvmw->ptl);
>>> +                pvmw->ptl = NULL;
>>> +                pvmw->pte = NULL;
>>> +                goto restart;
>>> +            }
>>>        }
>>>        goto this_pte;
>>>    } while (pvmw->address < end);
>>> --
>>> 2.20.1




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  2024-09-24  8:39       ` Muchun Song
@ 2024-09-24  8:45         ` Qi Zheng
  0 siblings, 0 replies; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  8:45 UTC (permalink / raw)
  To: Muchun Song
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



On 2024/9/24 16:39, Muchun Song wrote:
> 
> 
>> On Sep 24, 2024, at 16:33, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>
>>
>>
>> On 2024/9/24 16:25, Muchun Song wrote:
>>>> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>>>
>>>> In the caller of map_pte(), we may modify pvmw->pte after acquiring
>>>> pvmw->ptl, so convert it to use pte_offset_map_rw_nolock(). At this point
>>>> the pte_same() check is not performed after pvmw->ptl is held, so we should
>>>> get pmdval and do a pmd_same() check to ensure the stability of pvmw->pmd.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> mm/page_vma_mapped.c | 25 ++++++++++++++++++++-----
>>>> 1 file changed, 20 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>> index ae5cc42aa2087..6410f29b37c1b 100644
>>>> --- a/mm/page_vma_mapped.c
>>>> +++ b/mm/page_vma_mapped.c
>>>> @@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
>>>>     return false;
>>>> }
>>>>
>>>> -static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>>> +static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
>>>> +            spinlock_t **ptlp)
>>>> {
>>>>     pte_t ptent;
>>>> +    pmd_t pmdval;
>>> Why declare a new variable? Can’t we use *pmdvalp instead?
>>
>> It's just a coding habit, both are fine for me.
> 
> Agreed. But sometimes it can make the code look a little simpler.
> 
>>
>>>>
>>>>     if (pvmw->flags & PVMW_SYNC) {
>>>>         /* Use the stricter lookup */
>>>> @@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>>>         return !!pvmw->pte;
>>>>     }
>>>>
>>>> +again:
>>>>     /*
>>>>      * It is important to return the ptl corresponding to pte,
>>>>      * in case *pvmw->pmd changes underneath us; so we need to
>>>> @@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>>>      * proceeds to loop over next ptes, and finds a match later.
>>>>      * Though, in most cases, page lock already protects this.
>>>>      */
>>>> -    pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>>>> -                      pvmw->address, ptlp);
>>>> +    pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>>>> +                         pvmw->address, &pmdval, ptlp);
>>>>     if (!pvmw->pte)
>>>>         return false;
>>>> +    *pmdvalp = pmdval;
> 
> For instance, here, it is unnecessary if pmdvalp is passed directly to
> pte_offset_map_rw_nolock.

OK, will use pmdvalp directly. ;)

> 
>>>>
>>>>     ptent = ptep_get(pvmw->pte);
>>>>
>>>> @@ -67,8 +71,13 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>>>>     } else if (!pte_present(ptent)) {
>>>>         return false;
>>>>     }
>>>> +    spin_lock(*ptlp);
>>>> +    if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
>>>> +        pte_unmap_unlock(pvmw->pte, *ptlp);
>>>> +        goto again;
>>>> +    }
>>>>     pvmw->ptl = *ptlp;
>>>> -    spin_lock(pvmw->ptl);
>>>> +
>>>>     return true;
>>>> }
>>>>
>>>> @@ -278,7 +287,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>             step_forward(pvmw, PMD_SIZE);
>>>>             continue;
>>>>         }
>>>> -        if (!map_pte(pvmw, &ptl)) {
>>>> +        if (!map_pte(pvmw, &pmde, &ptl)) {
>>>>             if (!pvmw->pte)
>>>>                 goto restart;
>>>>             goto next_pte;
>>>> @@ -307,6 +316,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>         if (!pvmw->ptl) {
>>>>             pvmw->ptl = ptl;
>>>>             spin_lock(pvmw->ptl);
>>>> +            if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
>>>> +                pte_unmap_unlock(pvmw->pte, pvmw->ptl);
>>>> +                pvmw->ptl = NULL;
>>>> +                pvmw->pte = NULL;
>>>> +                goto restart;
>>>> +            }
>>>>         }
>>>>         goto this_pte;
>>>>     } while (pvmw->address < end);
>>>> --
>>>> 2.20.1
> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread
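Putting the agreed simplification together, the map_pte() hunk discussed above
would pass the caller's pmdvalp straight into pte_offset_map_rw_nolock() instead
of going through a local pmdval. The sketch below is based only on the diff
quoted in this thread and is not necessarily the final committed version; the
elided parts are unchanged from the quoted patch:

static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
		    spinlock_t **ptlp)
{
	pte_t ptent;

	/* ... PVMW_SYNC handling unchanged ... */
again:
	/* Write the pmd value straight into the caller's *pmdvalp. */
	pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
					     pvmw->address, pmdvalp, ptlp);
	if (!pvmw->pte)
		return false;

	ptent = ptep_get(pvmw->pte);
	/* ... migration entry / device-private / pte_present() checks unchanged ... */

	spin_lock(*ptlp);
	if (unlikely(!pmd_same(*pmdvalp, pmdp_get_lockless(pvmw->pmd)))) {
		pte_unmap_unlock(pvmw->pte, *ptlp);
		goto again;
	}
	pvmw->ptl = *ptlp;

	return true;
}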

* Re: [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
  2024-09-24  7:29     ` Qi Zheng
@ 2024-09-24  8:52       ` Muchun Song
  2024-09-24  8:57         ` Qi Zheng
  0 siblings, 1 reply; 29+ messages in thread
From: Muchun Song @ 2024-09-24  8:52 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 24, 2024, at 15:29, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> 
> 
> On 2024/9/24 15:14, Muchun Song wrote:
>>> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> In collapse_pte_mapped_thp(), we may modify the pte and pmd entries after
>>> acquiring the ptl, so convert it to use pte_offset_map_rw_nolock(). At this
>>> point the pte_same() check is not performed after the PTL is held, so we
>>> should get pgt_pmd and do a pmd_same() check after the ptl is held.
>>> 
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>> mm/khugepaged.c | 14 +++++++++++---
>>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 6498721d4783a..8ab79c13d077f 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1605,7 +1605,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>    if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>>>        pml = pmd_lock(mm, pmd);
>>> 
>>> -    start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
>>> +    start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
>>>    if (!start_pte)        /* mmap_lock + page lock should prevent this */
>>>        goto abort;
>>>    if (!pml)
>>> @@ -1613,6 +1613,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>    else if (ptl != pml)
>>>        spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>> 
>>> +    if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>> +        goto abort;
>>> +
>>>    /* step 2: clear page table and adjust rmap */
>>>    for (i = 0, addr = haddr, pte = start_pte;
>>>         i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>> @@ -1645,7 +1648,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>        nr_ptes++;
>>>    }
>>> 
>>> -    pte_unmap(start_pte);
>>>    if (!pml)
>>>        spin_unlock(ptl);
>>> 
>>> @@ -1658,13 +1660,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>    /* step 4: remove empty page table */
>>>    if (!pml) {
>>>        pml = pmd_lock(mm, pmd);
>>> -        if (ptl != pml)
>>> +        if (ptl != pml) {
>>>            spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>> +            if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>>> +                spin_unlock(pml);
>>> +                goto abort;
>> This would drop the folio reference and the mm counter twice: once at step 3 and again at the abort label.
> 
> My bad, should set nr_ptes to 0 and call flush_tlb_mm() here, right?

Or add a new label "out" just below the "abort". Then go to out.

> 
>>> +            }
>>> +        }
>>>    }
>>>    pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>>>    pmdp_get_lockless_sync();
>>>    if (ptl != pml)
>>>        spin_unlock(ptl);
>>> +    pte_unmap(start_pte);
>>>    spin_unlock(pml);
>> Why not?
>> pte_unmap_unlock(start_pte, ptl);
>> if (pml != ptl)
>>         spin_unlock(pml);
> 
> Both are fine, will do.
> 
> Thanks,
> Qi
> 
>>> 
>>>    mmu_notifier_invalidate_range_end(&range);
>>> --
>>> 2.20.1




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
  2024-09-24  8:52       ` Muchun Song
@ 2024-09-24  8:57         ` Qi Zheng
  2024-09-24  9:03           ` Muchun Song
  0 siblings, 1 reply; 29+ messages in thread
From: Qi Zheng @ 2024-09-24  8:57 UTC (permalink / raw)
  To: Muchun Song
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



On 2024/9/24 16:52, Muchun Song wrote:
> 
> 
>> On Sep 24, 2024, at 15:29, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>
>>
>>
>> On 2024/9/24 15:14, Muchun Song wrote:
>>>> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>>> In collapse_pte_mapped_thp(), we may modify the pte and pmd entries after
>>>> acquiring the ptl, so convert it to use pte_offset_map_rw_nolock(). At this
>>>> point the pte_same() check is not performed after the PTL is held, so we
>>>> should get pgt_pmd and do a pmd_same() check after the ptl is held.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> mm/khugepaged.c | 14 +++++++++++---
>>>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 6498721d4783a..8ab79c13d077f 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1605,7 +1605,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>     if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>>>>         pml = pmd_lock(mm, pmd);
>>>>
>>>> -    start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
>>>> +    start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
>>>>     if (!start_pte)        /* mmap_lock + page lock should prevent this */
>>>>         goto abort;
>>>>     if (!pml)
>>>> @@ -1613,6 +1613,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>     else if (ptl != pml)
>>>>         spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>>>
>>>> +    if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>>> +        goto abort;
>>>> +
>>>>     /* step 2: clear page table and adjust rmap */
>>>>     for (i = 0, addr = haddr, pte = start_pte;
>>>>          i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>>> @@ -1645,7 +1648,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>         nr_ptes++;
>>>>     }
>>>>
>>>> -    pte_unmap(start_pte);
>>>>     if (!pml)
>>>>         spin_unlock(ptl);
>>>>
>>>> @@ -1658,13 +1660,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>     /* step 4: remove empty page table */
>>>>     if (!pml) {
>>>>         pml = pmd_lock(mm, pmd);
>>>> -        if (ptl != pml)
>>>> +        if (ptl != pml) {
>>>>             spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>>> +            if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>>>> +                spin_unlock(pml);
>>>> +                goto abort;
>>> This would drop the folio reference and the mm counter twice: once at step 3 and again at the abort label.
>>
>> My bad, should set nr_ptes to 0 and call flush_tlb_mm() here, right?
> 
> Or add a new label "out" just below the "abort". Then go to out.

In that case, we also need to call flush_tlb_mm() first, like the
following:

if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
	spin_unlock(pml);
	flush_tlb_mm(mm);
	goto out;
}

> 
>>
>>>> +            }
>>>> +        }
>>>>     }
>>>>     pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>>>>     pmdp_get_lockless_sync();
>>>>     if (ptl != pml)
>>>>         spin_unlock(ptl);
>>>> +    pte_unmap(start_pte);
>>>>     spin_unlock(pml);
>>> Why not?
>>> pte_unmap_unlock(start_pte, ptl);
>>> if (pml != ptl)
>>>          spin_unlock(pml);
>>
>> Both are fine, will do.
>>
>> Thanks,
>> Qi
>>
>>>>
>>>>     mmu_notifier_invalidate_range_end(&range);
>>>> --
>>>> 2.20.1
> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
  2024-09-24  8:57         ` Qi Zheng
@ 2024-09-24  9:03           ` Muchun Song
  0 siblings, 0 replies; 29+ messages in thread
From: Muchun Song @ 2024-09-24  9:03 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 24, 2024, at 16:57, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> 
> 
> On 2024/9/24 16:52, Muchun Song wrote:
>>> On Sep 24, 2024, at 15:29, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> 
>>> 
>>> 
>>> On 2024/9/24 15:14, Muchun Song wrote:
>>>>> On Sep 24, 2024, at 14:11, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>>>> In collapse_pte_mapped_thp(), we may modify the pte and pmd entries after
>>>>> acquiring the ptl, so convert it to use pte_offset_map_rw_nolock(). At this
>>>>> point the pte_same() check is not performed after the PTL is held, so we
>>>>> should get pgt_pmd and do a pmd_same() check after the ptl is held.
>>>>> 
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>> mm/khugepaged.c | 14 +++++++++++---
>>>>> 1 file changed, 11 insertions(+), 3 deletions(-)
>>>>> 
>>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>>> index 6498721d4783a..8ab79c13d077f 100644
>>>>> --- a/mm/khugepaged.c
>>>>> +++ b/mm/khugepaged.c
>>>>> @@ -1605,7 +1605,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>>    if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
>>>>>        pml = pmd_lock(mm, pmd);
>>>>> 
>>>>> -    start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
>>>>> +    start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
>>>>>    if (!start_pte)        /* mmap_lock + page lock should prevent this */
>>>>>        goto abort;
>>>>>    if (!pml)
>>>>> @@ -1613,6 +1613,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>>    else if (ptl != pml)
>>>>>        spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>>>> 
>>>>> +    if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
>>>>> +        goto abort;
>>>>> +
>>>>>    /* step 2: clear page table and adjust rmap */
>>>>>    for (i = 0, addr = haddr, pte = start_pte;
>>>>>         i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>>>>> @@ -1645,7 +1648,6 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>>        nr_ptes++;
>>>>>    }
>>>>> 
>>>>> -    pte_unmap(start_pte);
>>>>>    if (!pml)
>>>>>        spin_unlock(ptl);
>>>>> 
>>>>> @@ -1658,13 +1660,19 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>>>>    /* step 4: remove empty page table */
>>>>>    if (!pml) {
>>>>>        pml = pmd_lock(mm, pmd);
>>>>> -        if (ptl != pml)
>>>>> +        if (ptl != pml) {
>>>>>            spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>>>> +            if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
>>>>> +                spin_unlock(pml);
>>>>> +                goto abort;
>>>> This would drop the folio reference and the mm counter twice: once at step 3 and again at the abort label.
>>> 
>>> My bad, should set nr_ptes to 0 and call flush_tlb_mm() here, right?
>> Or add a new label "out" just below the "abort". Then go to out.
> 
> In that case, we also need to call flush_tlb_mm() first, like the
> following:
> 
> if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
> 	spin_unlock(pml);
> 	flush_tlb_mm(mm);
> 	goto out;
> }

Fine.

> 
>>> 
>>>>> +            }
>>>>> +        }
>>>>>    }
>>>>>    pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>>>>>    pmdp_get_lockless_sync();
>>>>>    if (ptl != pml)
>>>>>        spin_unlock(ptl);
>>>>> +    pte_unmap(start_pte);
>>>>>    spin_unlock(pml);
>>>> Why not?
>>>> pte_unmap_unlock(start_pte, ptl);
>>>> if (pml != ptl)
>>>>         spin_unlock(pml);
>>> 
>>> Both are fine, will do.
>>> 
>>> Thanks,
>>> Qi
>>> 
>>>>> 
>>>>>    mmu_notifier_invalidate_range_end(&range);
>>>>> --
>>>>> 2.20.1




^ permalink raw reply	[flat|nested] 29+ messages in thread
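Based on the exchange above, one possible shape for the step 4 bail-out in
collapse_pte_mapped_thp() is sketched below. It combines Qi's snippet with
Muchun's suggestion of an "out" label placed just below "abort"; the body of
the "abort" block is reconstructed from the discussion rather than quoted from
the source file, so treat this as a direction only, not the final committed code:

	/* step 4: remove empty page table */
	if (!pml) {
		pml = pmd_lock(mm, pmd);
		if (ptl != pml) {
			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
			if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
				spin_unlock(pml);
				/* PTEs were cleared in step 2 but not yet flushed */
				flush_tlb_mm(mm);
				/* skip abort's ref/counter drop: step 3 already did it */
				goto out;
			}
		}
	}
	...
abort:
	if (nr_ptes) {
		flush_tlb_mm(mm);
		folio_ref_sub(folio, nr_ptes);
		add_mm_counter(mm, mm_counter_file(folio), -nr_ptes);
	}
out:
	/* existing unlock / mmu_notifier / folio release path continues here */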

* Re: [PATCH v4 01/13] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
  2024-09-24  6:09 ` [PATCH v4 01/13] mm: pgtable: " Qi Zheng
  2024-09-24  6:24   ` Muchun Song
@ 2024-09-24 12:58   ` David Hildenbrand
  1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2024-09-24 12:58 UTC (permalink / raw)
  To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev

On 24.09.24 08:09, Qi Zheng wrote:
> Currently, the usage of pte_offset_map_nolock() can be divided into the
> following two cases:
> 
> 1) After acquiring PTL, only read-only operations are performed on the PTE
>     page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
>     that the PTE page will not be freed, and there is no need to worry
>     about whether the pmd entry is modified.
> 
> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>     time, we need to ensure that the pmd entry has not been modified
>     concurrently.
> 
> To more clearly distinguish between these two cases, this commit
> introduces two new helper functions to replace pte_offset_map_nolock().
> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
> pmdval when successful. It is applicable to may-write cases where the page
> table may be modified after the corresponding spinlock is held. But users
> should make sure the page table is stable, e.g. by checking pte_same() or
> by checking pmd_same() against the output pmdval, before performing write
> operations.
> 
> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
> will be read-only/read-write protected.
> 
> Subsequent commits will convert pte_offset_map_nolock() into the above
> two functions one by one, and finally completely delete it.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread
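As a concrete reading of the commit message above, the intended calling patterns
for the two helpers might look like the following hypothetical sketch (mm, pmd
and addr stand for whatever the caller already has; this is illustration only,
not code from the series itself):

	pte_t *pte;
	spinlock_t *ptl;
	pmd_t pmdval;

	/* Case 1: only read the PTE page after taking the PTL. */
	pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
	if (pte) {
		spin_lock(ptl);
		/* read PTEs; RCU already guarantees the PTE page is not freed */
		pte_unmap_unlock(pte, ptl);
	}

	/* Case 2: the pte or pmd entry may be written after taking the PTL. */
	pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
	if (pte) {
		spin_lock(ptl);
		if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
			/* the page table changed underneath us: retry or bail out */
			pte_unmap_unlock(pte, ptl);
		} else {
			/* safe to modify the PTEs (and rely on the pmd) here */
			pte_unmap_unlock(pte, ptl);
		}
	}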

* Re: [PATCH v4 13/13] mm: pgtable: remove pte_offset_map_nolock()
  2024-09-24  6:10 ` [PATCH v4 13/13] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
@ 2024-09-24 13:02   ` David Hildenbrand
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2024-09-24 13:02 UTC (permalink / raw)
  To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev

On 24.09.24 08:10, Qi Zheng wrote:
> Now no users are using the pte_offset_map_nolock(), remove it.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 12/13] mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
  2024-09-24  6:10 ` [PATCH v4 12/13] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
@ 2024-09-24 14:39   ` David Hildenbrand
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2024-09-24 14:39 UTC (permalink / raw)
  To: Qi Zheng, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev

On 24.09.24 08:10, Qi Zheng wrote:
> In walk_pte_range(), we may modify the pte entry after holding the ptl, so
> convert it to use pte_offset_map_rw_nolock(). At this point the pte_same()
> check is not performed after the ptl is held, so we should get pmdval and do
> a pmd_same() check to ensure the stability of the pmd entry.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
> ---
>   mm/vmscan.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 749cdc110c745..bdca94e663bc5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3375,8 +3375,10 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>   	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
>   	DEFINE_MAX_SEQ(walk->lruvec);
>   	int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +	pmd_t pmdval;
>   
> -	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
> +	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval,
> +				       &ptl);
>   	if (!pte)
>   		return false;
>   	if (!spin_trylock(ptl)) {
> @@ -3384,6 +3386,11 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
>   		return false;
>   	}
>   
> +	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> +		pte_unmap_unlock(pte, ptl);
> +		return false;
> +	}

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-09-24 14:39 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-24  6:09 [PATCH v4 00/13] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
2024-09-24  6:09 ` [PATCH v4 01/13] mm: pgtable: " Qi Zheng
2024-09-24  6:24   ` Muchun Song
2024-09-24 12:58   ` David Hildenbrand
2024-09-24  6:09 ` [PATCH v4 02/13] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
2024-09-24  6:09 ` [PATCH v4 03/13] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
2024-09-24  6:09 ` [PATCH v4 04/13] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
2024-09-24  6:09 ` [PATCH v4 05/13] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
2024-09-24  6:09 ` [PATCH v4 06/13] mm: handle_pte_fault() " Qi Zheng
2024-09-24  6:09 ` [PATCH v4 07/13] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
2024-09-24  7:14   ` Muchun Song
2024-09-24  7:29     ` Qi Zheng
2024-09-24  8:52       ` Muchun Song
2024-09-24  8:57         ` Qi Zheng
2024-09-24  9:03           ` Muchun Song
2024-09-24  6:10 ` [PATCH v4 08/13] mm: copy_pte_range() " Qi Zheng
2024-09-24  7:35   ` Muchun Song
2024-09-24  6:10 ` [PATCH v4 09/13] mm: mremap: move_ptes() " Qi Zheng
2024-09-24  7:39   ` Muchun Song
2024-09-24  6:10 ` [PATCH v4 10/13] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
2024-09-24  8:25   ` Muchun Song
2024-09-24  8:33     ` Qi Zheng
2024-09-24  8:39       ` Muchun Song
2024-09-24  8:45         ` Qi Zheng
2024-09-24  6:10 ` [PATCH v4 11/13] mm: userfaultfd: move_pages_pte() " Qi Zheng
2024-09-24  6:10 ` [PATCH v4 12/13] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
2024-09-24 14:39   ` David Hildenbrand
2024-09-24  6:10 ` [PATCH v4 13/13] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
2024-09-24 13:02   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).