linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock()
@ 2024-09-04  8:40 Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 01/14] mm: pgtable: " Qi Zheng
                   ` (13 more replies)
  0 siblings, 14 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Changes in v3:
 - change to use VM_WARN_ON_ONCE() instead of BUG_ON() in pte_offset_map_rw_nolock()
   (David Hildenbrand)
 - modify the comment above the pte_offset_map_lock() in [PATCH v2 01/14]
   (David Hildenbrand and Muchun Song)
 - modify the comment above the pte_offset_map_rw_nolock() in [PATCH v2 06/14]
   (David Hildenbrand and Muchun Song)
 - also perform a pmd_same() check in [PATCH v2 08/14] and [PATCH v2 09/14]
   (since we may free the PTE page in retract_page_tables() without holding the
    read lock of mmap_lock)
 - collect the Acked-bys and Reviewed-bys
 - rebase onto the next-20240904

Changes in v2:
 - rename pte_offset_map_{readonly|maywrite}_nolock() to
   pte_offset_map_{ro|rw}_nolock() (LEROY Christophe)
 - make pte_offset_map_rw_nolock() not accept NULL parameters
   (David Hildenbrand)
 - rebase onto the next-20240822

Hi all,

As proposed by David Hildenbrand [1], this series introduces the following two
new helper functions to replace pte_offset_map_nolock().

1. pte_offset_map_ro_nolock()
2. pte_offset_map_rw_nolock()

As the name suggests, pte_offset_map_ro_nolock() is used for read-only
case. In this case, only read-only operations will be performed on PTE page
after the PTL is held. The RCU lock in pte_offset_map_nolock() will ensure that
the PTE page will not be freed, and there is no need to worry about whether the
pmd entry is modified. Therefore pte_offset_map_ro_nolock() is just a renamed
version of pte_offset_map_nolock().

pte_offset_map_rw_nolock() is used for may-write case. In this case, the pte or
pmd entry may be modified after the PTL is held, so we need to ensure that the
pmd entry has not been modified concurrently. So in addition to the name change,
it also outputs the pmdval when successful. The users should make sure the page
table is stable like checking pte_same() or checking pmd_same() by using the
output pmdval before performing the write operations.

This series will convert all pte_offset_map_nolock() into the above two helper
functions one by one, and finally completely delete it.

This also a preparation for reclaiming the empty user PTE page table pages.

This series is based on the next-20240904.

Comments and suggestions are welcome!

Thanks,
Qi

Qi Zheng (14):
  mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
  arm: adjust_pte() use pte_offset_map_rw_nolock()
  powerpc: assert_pte_locked() use pte_offset_map_ro_nolock()
  mm: filemap: filemap_fault_recheck_pte_none() use
    pte_offset_map_ro_nolock()
  mm: khugepaged: __collapse_huge_page_swapin() use
    pte_offset_map_ro_nolock()
  mm: handle_pte_fault() use pte_offset_map_rw_nolock()
  mm: khugepaged: collapse_pte_mapped_thp() use
    pte_offset_map_rw_nolock()
  mm: copy_pte_range() use pte_offset_map_rw_nolock()
  mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
  mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
  mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
  mm: pgtable: remove pte_offset_map_nolock()
  mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()

 Documentation/mm/split_page_table_lock.rst |  6 ++-
 arch/arm/mm/fault-armv.c                   |  9 ++++-
 arch/powerpc/mm/pgtable.c                  |  2 +-
 include/linux/mm.h                         |  7 +++-
 mm/filemap.c                               |  4 +-
 mm/khugepaged.c                            | 39 ++++++++++++++++++--
 mm/memory.c                                | 32 ++++++++++++++--
 mm/mremap.c                                | 20 +++++++++-
 mm/page_vma_mapped.c                       | 24 ++++++++++--
 mm/pgtable-generic.c                       | 43 ++++++++++++++++++----
 mm/userfaultfd.c                           | 15 ++++++--
 mm/vmscan.c                                |  9 ++++-
 12 files changed, 180 insertions(+), 30 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v3 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-06  7:20   ` Muchun Song
  2024-09-04  8:40 ` [PATCH v3 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:

1) After acquiring PTL, only read-only operations are performed on the PTE
   page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
   that the PTE page will not be freed, and there is no need to worry
   about whether the pmd entry is modified.

2) After acquiring PTL, the pte or pmd entries may be modified. At this
   time, we need to ensure that the pmd entry has not been modified
   concurrently.

To more clearing distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
to changing the name to pte_offset_map_rw_nolock(), it also outputs the
pmdval when successful. It is applicable for may-write cases where any
modification operations to the page table may happen after the
corresponding spinlock is held afterwards. But the users should make sure
the page table is stable like checking pte_same() or checking pmd_same()
by using the output pmdval before performing the write operations.

Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
will be read-only/read-write protected.

Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally completely delete it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/mm/split_page_table_lock.rst |  7 +++
 include/linux/mm.h                         |  5 +++
 mm/pgtable-generic.c                       | 50 ++++++++++++++++++++++
 3 files changed, 62 insertions(+)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
 	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_ro_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
  - pte_offset_map()
 	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
  - pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7c74a840249a..1fde9242231c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3006,6 +3006,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 
 pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..262b7065a5a2e 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	return pte;
+}
+
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	VM_WARN_ON_ONCE(!pmdvalp);
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	*pmdvalp = pmdval;
+	return pte;
+}
+
 /*
  * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
  * __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * recheck *pmd once the lock is taken; in practice, no callsite needs that -
  * either the mmap_lock for write, or pte_same() check on contents, is enough.
  *
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
+ * but when successful, it also outputs a pointer to the spinlock in ptlp - as
+ * pte_offset_map_lock() does, but in this case without locking it.  This helps
+ * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
+ * act on a changed *pmd: pte_offset_map_ro_nolock() provides the correct spinlock
+ * pointer for the page table that it returns. Even after grabbing the spinlock,
+ * we might be looking either at a page table that is still mapped or one that
+ * was unmapped and is about to get freed. But for R/O access this is sufficient.
+ * So it is only applicable for read-only cases where any modification operations
+ * to the page table are not allowed even if the corresponding spinlock is held
+ * afterwards.
+ *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the pdmval.
+ * It is applicable for may-write cases where any modification operations to the
+ * page table may happen after the corresponding spinlock is held afterwards.
+ * But the users should make sure the page table is stable like checking pte_same()
+ * or checking pmd_same() by using the output pmdval before performing the write
+ * operations.
+ *
+ * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will
+ * be read-only/read-write protected.
+ *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
  * table, and may not use RCU at all: "outsiders" like khugepaged should avoid
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 01/14] mm: pgtable: " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In do_adjust_pte(), we may modify the pte entry. The corresponding pmd
entry may have been modified concurrently. Therefore, in order to ensure
the stability if pmd entry, use pte_offset_map_rw_nolock() to replace
pte_offset_map_nolock(), and do pmd_same() check after holding the PTL.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 arch/arm/mm/fault-armv.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mm/fault-armv.c b/arch/arm/mm/fault-armv.c
index 831793cd6ff94..de6c7d8a2ddfc 100644
--- a/arch/arm/mm/fault-armv.c
+++ b/arch/arm/mm/fault-armv.c
@@ -94,6 +94,7 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	pmd_t pmdval;
 	int ret;
 
 	pgd = pgd_offset(vma->vm_mm, address);
@@ -112,16 +113,22 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
 	if (pmd_none_or_clear_bad(pmd))
 		return 0;
 
+again:
 	/*
 	 * This is called while another page table is mapped, so we
 	 * must use the nested version.  This also means we need to
 	 * open-code the spin-locking.
 	 */
-	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
+	pte = pte_offset_map_rw_nolock(vma->vm_mm, pmd, address, &pmdval, &ptl);
 	if (!pte)
 		return 0;
 
 	do_pte_lock(ptl);
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+		do_pte_unlock(ptl);
+		pte_unmap(pte);
+		goto again;
+	}
 
 	ret = do_adjust_pte(vma, address, pfn, pte);
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 01/14] mm: pgtable: " Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In assert_pte_locked(), we just get the ptl and assert if it was already
held, so convert it to using pte_offset_map_ro_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 arch/powerpc/mm/pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 7316396e452d8..61df5aed79894 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -398,7 +398,7 @@ void assert_pte_locked(struct mm_struct *mm, unsigned long addr)
 	 */
 	if (pmd_none(*pmd))
 		return;
-	pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
+	pte = pte_offset_map_ro_nolock(mm, pmd, addr, &ptl);
 	BUG_ON(!pte);
 	assert_spin_locked(ptl);
 	pte_unmap(pte);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 04/14] mm: filemap: filemap_fault_recheck_pte_none() use pte_offset_map_ro_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (2 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In filemap_fault_recheck_pte_none(), we just do pte_none() check, so
convert it to using pte_offset_map_ro_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/filemap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 20346df53df3b..216405ba497ea 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3251,8 +3251,8 @@ static vm_fault_t filemap_fault_recheck_pte_none(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
 		return 0;
 
-	ptep = pte_offset_map_nolock(vma->vm_mm, vmf->pmd, vmf->address,
-				     &vmf->ptl);
+	ptep = pte_offset_map_ro_nolock(vma->vm_mm, vmf->pmd, vmf->address,
+					&vmf->ptl);
 	if (unlikely(!ptep))
 		return VM_FAULT_NOPAGE;
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 05/14] mm: khugepaged: __collapse_huge_page_swapin() use pte_offset_map_ro_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (3 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock() Qi Zheng
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In __collapse_huge_page_swapin(), we just use the ptl for pte_same() check
in do_swap_page(). In other places, we directly use pte_offset_map_lock(),
so convert it to using pte_offset_map_ro_nolock().

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/khugepaged.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f9c39898eaff6..6498721d4783a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,7 +1011,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		};
 
 		if (!pte++) {
-			pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
+			/*
+			 * Here the ptl is only used to check pte_same() in
+			 * do_swap_page(), so readonly version is enough.
+			 */
+			pte = pte_offset_map_ro_nolock(mm, pmd, address, &ptl);
 			if (!pte) {
 				mmap_read_unlock(mm);
 				result = SCAN_PMD_NULL;
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (4 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In handle_pte_fault(), we may modify the vmf->pte after acquiring the
vmf->ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
will do the pte_same() check, so there is no need to get pmdval to do
pmd_same() check, just pass a dummy variable to it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 mm/memory.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 42674c0748cba..06674f94b7a4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5498,14 +5498,24 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		vmf->pte = NULL;
 		vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
 	} else {
+		pmd_t dummy_pmdval;
+
 		/*
 		 * A regular pmd is established and it can't morph into a huge
 		 * pmd by anon khugepaged, since that takes mmap_lock in write
 		 * mode; but shmem or file collapse to THP could still morph
 		 * it into a huge pmd: just retry later if so.
+		 *
+		 * Use the maywrite version to indicate that vmf->pte may be
+		 * modified, but since we will use pte_same() to detect the
+		 * change of the !pte_none() entry, there is no need to recheck
+		 * the pmdval. Here we chooes to pass a dummy variable instead
+		 * of NULL, which helps new user think about why this place is
+		 * special.
 		 */
-		vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
-						 vmf->address, &vmf->ptl);
+		vmf->pte = pte_offset_map_rw_nolock(vmf->vma->vm_mm, vmf->pmd,
+						    vmf->address, &dummy_pmdval,
+						    &vmf->ptl);
 		if (unlikely(!vmf->pte))
 			return 0;
 		vmf->orig_pte = ptep_get_lockless(vmf->pte);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 07/14] mm: khugepaged: collapse_pte_mapped_thp() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (5 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock() Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 08/14] mm: copy_pte_range() " Qi Zheng
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In collapse_pte_mapped_thp(), we may modify the pte and pmd entry after
acquring the ptl, so convert it to using pte_offset_map_rw_nolock(). At
this time, the pte_same() check is not performed after the PTL held. So we
should get pgt_pmd and do pmd_same() check after the ptl held.

For the case where the ptl is released first and then the pml is acquired,
the PTE page may have been freed, so we must do pmd_same() check before
reacquiring the ptl.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/khugepaged.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6498721d4783a..a117d35f33aee 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1605,7 +1605,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (userfaultfd_armed(vma) && !(vma->vm_flags & VM_SHARED))
 		pml = pmd_lock(mm, pmd);
 
-	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
+	start_pte = pte_offset_map_rw_nolock(mm, pmd, haddr, &pgt_pmd, &ptl);
 	if (!start_pte)		/* mmap_lock + page lock should prevent this */
 		goto abort;
 	if (!pml)
@@ -1613,6 +1613,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	else if (ptl != pml)
 		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+	if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd))))
+		goto abort;
+
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
@@ -1658,6 +1661,16 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	/* step 4: remove empty page table */
 	if (!pml) {
 		pml = pmd_lock(mm, pmd);
+		/*
+		 * We called pte_unmap() and release the ptl before acquiring
+		 * the pml, which means we left the RCU critical section, so the
+		 * PTE page may have been freed, so we must do pmd_same() check
+		 * before reacquiring the ptl.
+		 */
+		if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+			spin_unlock(pml);
+			goto pmd_change;
+		}
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 	}
@@ -1689,6 +1702,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		pte_unmap_unlock(start_pte, ptl);
 	if (pml && pml != ptl)
 		spin_unlock(pml);
+pmd_change:
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_folio:
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 08/14] mm: copy_pte_range() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (6 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-05  8:57   ` Muchun Song
  2024-09-04  8:40 ` [PATCH v3 09/14] mm: mremap: move_ptes() " Qi Zheng
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In copy_pte_range(), we may modify the src_pte entry after holding the
src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we may
free the PTE page in retract_page_tables() without holding the read lock
of mmap_lock, so we still need to get pmdval and do pmd_same() check after
the ptl is held.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
Hi Muchun, since the code has changed, I dropped your Reviewed-by tag here.

 mm/memory.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 06674f94b7a4e..47974cc4bd7f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1082,6 +1082,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	struct mm_struct *src_mm = src_vma->vm_mm;
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
+	pmd_t pmdval;
 	pte_t ptent;
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress, max_nr, ret = 0;
@@ -1107,13 +1108,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		ret = -ENOMEM;
 		goto out;
 	}
-	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+
+	/*
+	 * Since we may free the PTE page in retract_page_tables() without
+	 * holding the read lock of mmap_lock, so we still need to do a
+	 * pmd_same() check after holding the PTL.
+	 */
+	src_pte = pte_offset_map_rw_nolock(src_mm, src_pmd, addr, &pmdval,
+					   &src_ptl);
 	if (!src_pte) {
 		pte_unmap_unlock(dst_pte, dst_ptl);
 		/* ret == 0 */
 		goto out;
 	}
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(src_pmd)))) {
+		pte_unmap_unlock(src_pte, src_ptl);
+		pte_unmap_unlock(dst_pte, dst_ptl);
+		/* ret == 0 */
+		goto out;
+	}
+
 	orig_src_pte = src_pte;
 	orig_dst_pte = dst_pte;
 	arch_enter_lazy_mmu_mode();
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 09/14] mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (7 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 08/14] mm: copy_pte_range() " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-05  9:25   ` Muchun Song
  2024-09-04  8:40 ` [PATCH v3 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
convert it to using pte_offset_map_rw_nolock(). Since we may free the PTE
page in retract_page_tables() without holding the read lock of mmap_lock,
so we still need to do a pmd_same() check after holding the PTL.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/mremap.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 24712f8dbb6b5..16e54151395ad 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -143,6 +143,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
 	unsigned long len = old_end - old_addr;
+	pmd_t pmdval;
 	int err = 0;
 
 	/*
@@ -175,14 +176,29 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		err = -EAGAIN;
 		goto out;
 	}
-	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+	/*
+	 * Since we may free the PTE page in retract_page_tables() without
+	 * holding the read lock of mmap_lock, so we still need to do a
+	 * pmd_same() check after holding the PTL.
+	 */
+	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &pmdval,
+					   &new_ptl);
 	if (!new_pte) {
 		pte_unmap_unlock(old_pte, old_ptl);
 		err = -EAGAIN;
 		goto out;
 	}
-	if (new_ptl != old_ptl)
+	if (new_ptl != old_ptl) {
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+		if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(new_pmd)))) {
+			pte_unmap_unlock(new_pte, new_ptl);
+			pte_unmap_unlock(old_pte, old_ptl);
+			err = -EAGAIN;
+			goto out;
+		}
+	}
+
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 10/14] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (8 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 09/14] mm: mremap: move_ptes() " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-05 12:07   ` Muchun Song
  2024-09-04  8:40 ` [PATCH v3 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In the caller of map_pte(), we may modify the pvmw->pte after acquiring
the pvmw->ptl, so convert it to using pte_offset_map_rw_nolock(). At
this time, the pte_same() check is not performed after the pvmw->ptl held,
so we should get pmdval and do pmd_same() check to ensure the stability of
pvmw->pmd.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/page_vma_mapped.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa2087..f1d73fd448708 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 	return false;
 }
 
-static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
+		    spinlock_t **ptlp)
 {
 	pte_t ptent;
+	pmd_t pmdval;
 
 	if (pvmw->flags & PVMW_SYNC) {
 		/* Use the stricter lookup */
@@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 		return !!pvmw->pte;
 	}
 
+again:
 	/*
 	 * It is important to return the ptl corresponding to pte,
 	 * in case *pvmw->pmd changes underneath us; so we need to
@@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 	 * proceeds to loop over next ptes, and finds a match later.
 	 * Though, in most cases, page lock already protects this.
 	 */
-	pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
-					  pvmw->address, ptlp);
+	pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+					     pvmw->address, &pmdval, ptlp);
 	if (!pvmw->pte)
 		return false;
+	*pmdvalp = pmdval;
 
 	ptent = ptep_get(pvmw->pte);
 
@@ -69,6 +73,12 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
 	}
 	pvmw->ptl = *ptlp;
 	spin_lock(pvmw->ptl);
+
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
+		spin_unlock(pvmw->ptl);
+		goto again;
+	}
+
 	return true;
 }
 
@@ -278,7 +288,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
-		if (!map_pte(pvmw, &ptl)) {
+		if (!map_pte(pvmw, &pmde, &ptl)) {
 			if (!pvmw->pte)
 				goto restart;
 			goto next_pte;
@@ -307,6 +317,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		if (!pvmw->ptl) {
 			pvmw->ptl = ptl;
 			spin_lock(pvmw->ptl);
+			if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
+				pte_unmap_unlock(pvmw->pte, pvmw->ptl);
+				pvmw->ptl = NULL;
+				pvmw->pte = NULL;
+				goto restart;
+			}
 		}
 		goto this_pte;
 	} while (pvmw->address < end);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 11/14] mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (9 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-05 12:20   ` Muchun Song
  2024-09-04  8:40 ` [PATCH v3 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring
the ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
already do the pte_same() check, there is no need to get pmdval to do
pmd_same() check, just pass a dummy variable to it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/userfaultfd.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ce13c40626472..48b87c62fc3dd 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1135,7 +1135,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	spinlock_t *src_ptl, *dst_ptl;
 	pte_t *src_pte = NULL;
 	pte_t *dst_pte = NULL;
-
+	pmd_t dummy_pmdval;
 	struct folio *src_folio = NULL;
 	struct anon_vma *src_anon_vma = NULL;
 	struct mmu_notifier_range range;
@@ -1146,7 +1146,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				src_addr, src_addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 retry:
-	dst_pte = pte_offset_map_nolock(mm, dst_pmd, dst_addr, &dst_ptl);
+	/*
+	 * Use the maywrite version to indicate that dst_pte will be modified,
+	 * but since we will use pte_same() to detect the change of the pte
+	 * entry, there is no need to get pmdval, so just pass a dummy variable
+	 * to it.
+	 */
+	dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dummy_pmdval,
+					   &dst_ptl);
 
 	/* Retry if a huge pmd materialized from under us */
 	if (unlikely(!dst_pte)) {
@@ -1154,7 +1161,9 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		goto out;
 	}
 
-	src_pte = pte_offset_map_nolock(mm, src_pmd, src_addr, &src_ptl);
+	/* same as dst_pte */
+	src_pte = pte_offset_map_rw_nolock(mm, src_pmd, src_addr, &dummy_pmdval,
+					   &src_ptl);
 
 	/*
 	 * We held the mmap_lock for reading so MADV_DONTNEED
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 12/14] mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (10 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-05 12:23   ` Muchun Song
  2024-09-04  8:40 ` [PATCH v3 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
  2024-09-04  8:40 ` [PATCH v3 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In walk_pte_range(), we may modify the pte entry after holding the ptl, so
convert it to using pte_offset_map_rw_nolock(). At this time, the
pte_same() check is not performed after the ptl held, so we should get
pmdval and do pmd_same() check to ensure the stability of pmd entry.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/vmscan.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9b6a8196f958..36b84e46cd7b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3375,8 +3375,10 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 	struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
 	DEFINE_MAX_SEQ(walk->lruvec);
 	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+	pmd_t pmdval;
 
-	pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+	pte = pte_offset_map_rw_nolock(args->mm, pmd, start & PMD_MASK, &pmdval,
+				       &ptl);
 	if (!pte)
 		return false;
 	if (!spin_trylock(ptl)) {
@@ -3384,6 +3386,11 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
 		return false;
 	}
 
+	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+		pte_unmap_unlock(pte, ptl);
+		return false;
+	}
+
 	arch_enter_lazy_mmu_mode();
 restart:
 	for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 13/14] mm: pgtable: remove pte_offset_map_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (11 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  2024-09-05 12:23   ` Muchun Song
  2024-09-04  8:40 ` [PATCH v3 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng
  13 siblings, 1 reply; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

Now no users are using the pte_offset_map_nolock(), remove it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/mm/split_page_table_lock.rst |  3 ---
 include/linux/mm.h                         |  2 --
 mm/pgtable-generic.c                       | 21 ---------------------
 3 files changed, 26 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 08d0e706a32db..581446d4a4eba 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -16,9 +16,6 @@ There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_lock()
 	maps PTE and takes PTE table lock, returns pointer to PTE with
 	pointer to its PTE table lock, or returns NULL if no PTE table;
- - pte_offset_map_nolock()
-	maps PTE, returns pointer to PTE with pointer to its PTE table
-	lock (not taken), or returns NULL if no PTE table;
  - pte_offset_map_ro_nolock()
 	maps PTE, returns pointer to PTE with pointer to its PTE table
 	lock (not taken), or returns NULL if no PTE table;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1fde9242231c9..5b5a774902bd6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3004,8 +3004,6 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			unsigned long addr, spinlock_t **ptlp);
 pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long addr, spinlock_t **ptlp);
 pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 262b7065a5a2e..c68aa655b7872 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -305,18 +305,6 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 	return NULL;
 }
 
-pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
-			     unsigned long addr, spinlock_t **ptlp)
-{
-	pmd_t pmdval;
-	pte_t *pte;
-
-	pte = __pte_offset_map(pmd, addr, &pmdval);
-	if (likely(pte))
-		*ptlp = pte_lockptr(mm, &pmdval);
-	return pte;
-}
-
 pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long addr, spinlock_t **ptlp)
 {
@@ -374,15 +362,6 @@ pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
  * and disconnected table.  Until pte_unmap(pte) unmaps and rcu_read_unlock()s
  * afterwards.
  *
- * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
- * but when successful, it also outputs a pointer to the spinlock in ptlp - as
- * pte_offset_map_lock() does, but in this case without locking it.  This helps
- * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
- * act on a changed *pmd: pte_offset_map_nolock() provides the correct spinlock
- * pointer for the page table that it returns.  In principle, the caller should
- * recheck *pmd once the lock is taken; in practice, no callsite needs that -
- * either the mmap_lock for write, or pte_same() check on contents, is enough.
- *
  * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
  * but when successful, it also outputs a pointer to the spinlock in ptlp - as
  * pte_offset_map_lock() does, but in this case without locking it.  This helps
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v3 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
                   ` (12 preceding siblings ...)
  2024-09-04  8:40 ` [PATCH v3 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
@ 2024-09-04  8:40 ` Qi Zheng
  13 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-04  8:40 UTC (permalink / raw)
  To: david, hughd, willy, muchun.song, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, Qi Zheng

In retract_page_tables(), we may modify the pmd entry after acquiring the
pml and ptl, so we should also check whether the pmd entry is stable.
Using pte_offset_map_rw_nolock() + pmd_same() to do it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/khugepaged.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a117d35f33aee..318cc3eefb040 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1724,6 +1724,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		spinlock_t *pml;
 		spinlock_t *ptl;
 		bool skipped_uffd = false;
+		pte_t *pte;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1759,11 +1760,25 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 					addr, addr + HPAGE_PMD_SIZE);
 		mmu_notifier_invalidate_range_start(&range);
 
+		pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pgt_pmd, &ptl);
+		if (!pte) {
+			mmu_notifier_invalidate_range_end(&range);
+			continue;
+		}
+
 		pml = pmd_lock(mm, pmd);
-		ptl = pte_lockptr(mm, pmd);
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 
+		if (unlikely(!pmd_same(pgt_pmd, pmdp_get_lockless(pmd)))) {
+			pte_unmap_unlock(pte, ptl);
+			if (ptl != pml)
+				spin_unlock(pml);
+			mmu_notifier_invalidate_range_end(&range);
+			continue;
+		}
+		pte_unmap(pte);
+
 		/*
 		 * Huge page lock is still held, so normally the page table
 		 * must remain empty; and we have already skipped anon_vma
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 08/14] mm: copy_pte_range() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 ` [PATCH v3 08/14] mm: copy_pte_range() " Qi Zheng
@ 2024-09-05  8:57   ` Muchun Song
  2024-09-05 10:55     ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: Muchun Song @ 2024-09-05  8:57 UTC (permalink / raw)
  To: Qi Zheng
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, david,
	hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2



On 2024/9/4 16:40, Qi Zheng wrote:
> In copy_pte_range(), we may modify the src_pte entry after holding the
> src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we may
> free the PTE page in retract_page_tables() without holding the read lock
> of mmap_lock, so we still need to get pmdval and do pmd_same() check after
> the ptl is held.

See commit 3db82b9374ca92, copy_pte_range and retract_page_tables
are using vma->anon_vma to be exclusive.

retract_page_tables()                    copy_page_range()
     vma_interval_tree_foreach()              if (!vma_needs_copy())
         if (READ_ONCE(vma->anon_vma))            return 0;
             continue;                        copy_pte_range()

So I think mmap write lock here is also used for keeping ->anon_vma stable.
And we do not need pmd_same().

Muchun,
Thanks.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Hi Muchun, since the code has changed, I dropped your Reviewed-by tag here.
>
>   mm/memory.c | 18 +++++++++++++++++-
>   1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 06674f94b7a4e..47974cc4bd7f2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1082,6 +1082,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>   	struct mm_struct *src_mm = src_vma->vm_mm;
>   	pte_t *orig_src_pte, *orig_dst_pte;
>   	pte_t *src_pte, *dst_pte;
> +	pmd_t pmdval;
>   	pte_t ptent;
>   	spinlock_t *src_ptl, *dst_ptl;
>   	int progress, max_nr, ret = 0;
> @@ -1107,13 +1108,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>   		ret = -ENOMEM;
>   		goto out;
>   	}
> -	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
> +
> +	/*
> +	 * Since we may free the PTE page in retract_page_tables() without
> +	 * holding the read lock of mmap_lock, so we still need to do a
> +	 * pmd_same() check after holding the PTL.
> +	 */
> +	src_pte = pte_offset_map_rw_nolock(src_mm, src_pmd, addr, &pmdval,
> +					   &src_ptl);
>   	if (!src_pte) {
>   		pte_unmap_unlock(dst_pte, dst_ptl);
>   		/* ret == 0 */
>   		goto out;
>   	}
>   	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> +
> +	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(src_pmd)))) {
> +		pte_unmap_unlock(src_pte, src_ptl);
> +		pte_unmap_unlock(dst_pte, dst_ptl);
> +		/* ret == 0 */
> +		goto out;
> +	}
> +
>   	orig_src_pte = src_pte;
>   	orig_dst_pte = dst_pte;
>   	arch_enter_lazy_mmu_mode();



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 09/14] mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 ` [PATCH v3 09/14] mm: mremap: move_ptes() " Qi Zheng
@ 2024-09-05  9:25   ` Muchun Song
  2024-09-05 10:56     ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: Muchun Song @ 2024-09-05  9:25 UTC (permalink / raw)
  To: Qi Zheng
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, david,
	hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2



On 2024/9/4 16:40, Qi Zheng wrote:
> In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
> convert it to using pte_offset_map_rw_nolock(). Since we may free the PTE
> page in retract_page_tables() without holding the read lock of mmap_lock,
> so we still need to do a pmd_same() check after holding the PTL.

retract_page_tables() and move_ptes() are synchronized with
i_mmap_lock, right?

Muchun,
Thanks.

>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>   mm/mremap.c | 20 ++++++++++++++++++--
>   1 file changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 24712f8dbb6b5..16e54151395ad 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -143,6 +143,7 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>   	spinlock_t *old_ptl, *new_ptl;
>   	bool force_flush = false;
>   	unsigned long len = old_end - old_addr;
> +	pmd_t pmdval;
>   	int err = 0;
>   
>   	/*
> @@ -175,14 +176,29 @@ static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>   		err = -EAGAIN;
>   		goto out;
>   	}
> -	new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
> +	/*
> +	 * Since we may free the PTE page in retract_page_tables() without
> +	 * holding the read lock of mmap_lock, so we still need to do a
> +	 * pmd_same() check after holding the PTL.
> +	 */
> +	new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &pmdval,
> +					   &new_ptl);
>   	if (!new_pte) {
>   		pte_unmap_unlock(old_pte, old_ptl);
>   		err = -EAGAIN;
>   		goto out;
>   	}
> -	if (new_ptl != old_ptl)
> +	if (new_ptl != old_ptl) {
>   		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> +		if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(new_pmd)))) {
> +			pte_unmap_unlock(new_pte, new_ptl);
> +			pte_unmap_unlock(old_pte, old_ptl);
> +			err = -EAGAIN;
> +			goto out;
> +		}
> +	}
> +
>   	flush_tlb_batched_pending(vma->vm_mm);
>   	arch_enter_lazy_mmu_mode();
>   



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 08/14] mm: copy_pte_range() use pte_offset_map_rw_nolock()
  2024-09-05  8:57   ` Muchun Song
@ 2024-09-05 10:55     ` Qi Zheng
  0 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-05 10:55 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, david,
	hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2



On 2024/9/5 16:57, Muchun Song wrote:
> 
> 
> On 2024/9/4 16:40, Qi Zheng wrote:
>> In copy_pte_range(), we may modify the src_pte entry after holding the
>> src_ptl, so convert it to using pte_offset_map_rw_nolock(). Since we may
>> free the PTE page in retract_page_tables() without holding the read lock
>> of mmap_lock, so we still need to get pmdval and do pmd_same() check 
>> after
>> the ptl is held.
> 
> See commit 3db82b9374ca92, copy_pte_range and retract_page_tables
> are using vma->anon_vma to be exclusive.
> 
> retract_page_tables()                    copy_page_range()
>      vma_interval_tree_foreach()              if (!vma_needs_copy())
>          if (READ_ONCE(vma->anon_vma))            return 0;
>              continue;                        copy_pte_range()
> 
> So I think mmap write lock here is also used for keeping ->anon_vma stable.
> And we do not need pmd_same().

Indeed, will change it in v4. Thanks!

> 
> Muchun,
> Thanks.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> Hi Muchun, since the code has changed, I dropped your Reviewed-by tag 
>> here.
>>
>>   mm/memory.c | 18 +++++++++++++++++-
>>   1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 06674f94b7a4e..47974cc4bd7f2 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1082,6 +1082,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, 
>> struct vm_area_struct *src_vma,
>>       struct mm_struct *src_mm = src_vma->vm_mm;
>>       pte_t *orig_src_pte, *orig_dst_pte;
>>       pte_t *src_pte, *dst_pte;
>> +    pmd_t pmdval;
>>       pte_t ptent;
>>       spinlock_t *src_ptl, *dst_ptl;
>>       int progress, max_nr, ret = 0;
>> @@ -1107,13 +1108,28 @@ copy_pte_range(struct vm_area_struct *dst_vma, 
>> struct vm_area_struct *src_vma,
>>           ret = -ENOMEM;
>>           goto out;
>>       }
>> -    src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
>> +
>> +    /*
>> +     * Since we may free the PTE page in retract_page_tables() without
>> +     * holding the read lock of mmap_lock, so we still need to do a
>> +     * pmd_same() check after holding the PTL.
>> +     */
>> +    src_pte = pte_offset_map_rw_nolock(src_mm, src_pmd, addr, &pmdval,
>> +                       &src_ptl);
>>       if (!src_pte) {
>>           pte_unmap_unlock(dst_pte, dst_ptl);
>>           /* ret == 0 */
>>           goto out;
>>       }
>>       spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>> +
>> +    if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(src_pmd)))) {
>> +        pte_unmap_unlock(src_pte, src_ptl);
>> +        pte_unmap_unlock(dst_pte, dst_ptl);
>> +        /* ret == 0 */
>> +        goto out;
>> +    }
>> +
>>       orig_src_pte = src_pte;
>>       orig_dst_pte = dst_pte;
>>       arch_enter_lazy_mmu_mode();
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 09/14] mm: mremap: move_ptes() use pte_offset_map_rw_nolock()
  2024-09-05  9:25   ` Muchun Song
@ 2024-09-05 10:56     ` Qi Zheng
  0 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-05 10:56 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, david,
	hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2



On 2024/9/5 17:25, Muchun Song wrote:
> 
> 
> On 2024/9/4 16:40, Qi Zheng wrote:
>> In move_ptes(), we may modify the new_pte after acquiring the new_ptl, so
>> convert it to using pte_offset_map_rw_nolock(). Since we may free the PTE
>> page in retract_page_tables() without holding the read lock of mmap_lock,
>> so we still need to do a pmd_same() check after holding the PTL.
> 
> retract_page_tables() and move_ptes() are synchronized with
> i_mmap_lock, right?

Right, will remove the pmd_same() check in v4. Thanks!

> 
> Muchun,
> Thanks.
> 
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/mremap.c | 20 ++++++++++++++++++--
>>   1 file changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 24712f8dbb6b5..16e54151395ad 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -143,6 +143,7 @@ static int move_ptes(struct vm_area_struct *vma, 
>> pmd_t *old_pmd,
>>       spinlock_t *old_ptl, *new_ptl;
>>       bool force_flush = false;
>>       unsigned long len = old_end - old_addr;
>> +    pmd_t pmdval;
>>       int err = 0;
>>       /*
>> @@ -175,14 +176,29 @@ static int move_ptes(struct vm_area_struct *vma, 
>> pmd_t *old_pmd,
>>           err = -EAGAIN;
>>           goto out;
>>       }
>> -    new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
>> +    /*
>> +     * Since we may free the PTE page in retract_page_tables() without
>> +     * holding the read lock of mmap_lock, so we still need to do a
>> +     * pmd_same() check after holding the PTL.
>> +     */
>> +    new_pte = pte_offset_map_rw_nolock(mm, new_pmd, new_addr, &pmdval,
>> +                       &new_ptl);
>>       if (!new_pte) {
>>           pte_unmap_unlock(old_pte, old_ptl);
>>           err = -EAGAIN;
>>           goto out;
>>       }
>> -    if (new_ptl != old_ptl)
>> +    if (new_ptl != old_ptl) {
>>           spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>> +
>> +        if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(new_pmd)))) {
>> +            pte_unmap_unlock(new_pte, new_ptl);
>> +            pte_unmap_unlock(old_pte, old_ptl);
>> +            err = -EAGAIN;
>> +            goto out;
>> +        }
>> +    }
>> +
>>       flush_tlb_batched_pending(vma->vm_mm);
>>       arch_enter_lazy_mmu_mode();
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 10/14] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 ` [PATCH v3 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
@ 2024-09-05 12:07   ` Muchun Song
  2024-09-12  9:30     ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: Muchun Song @ 2024-09-05 12:07 UTC (permalink / raw)
  To: Qi Zheng
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, david,
	hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2



On 2024/9/4 16:40, Qi Zheng wrote:
> In the caller of map_pte(), we may modify the pvmw->pte after acquiring
> the pvmw->ptl, so convert it to using pte_offset_map_rw_nolock(). At
> this time, the pte_same() check is not performed after the pvmw->ptl held,
> so we should get pmdval and do pmd_same() check to ensure the stability of
> pvmw->pmd.
>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>   mm/page_vma_mapped.c | 24 ++++++++++++++++++++----
>   1 file changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index ae5cc42aa2087..f1d73fd448708 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -13,9 +13,11 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
>   	return false;
>   }
>   
> -static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
> +static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
> +		    spinlock_t **ptlp)
>   {
>   	pte_t ptent;
> +	pmd_t pmdval;
>   
>   	if (pvmw->flags & PVMW_SYNC) {
>   		/* Use the stricter lookup */
> @@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>   		return !!pvmw->pte;
>   	}
>   
> +again:
>   	/*
>   	 * It is important to return the ptl corresponding to pte,
>   	 * in case *pvmw->pmd changes underneath us; so we need to
> @@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>   	 * proceeds to loop over next ptes, and finds a match later.
>   	 * Though, in most cases, page lock already protects this.
>   	 */
> -	pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
> -					  pvmw->address, ptlp);
> +	pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
> +					     pvmw->address, &pmdval, ptlp);
>   	if (!pvmw->pte)
>   		return false;
> +	*pmdvalp = pmdval;
>   
>   	ptent = ptep_get(pvmw->pte);
>   
> @@ -69,6 +73,12 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
>   	}
>   	pvmw->ptl = *ptlp;
>   	spin_lock(pvmw->ptl);
> +
> +	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
> +		spin_unlock(pvmw->ptl);

Forgot to clear pvmw->ptl? Or how about moving the assignment for it
to the place where the pmd_same check is successful?

> +		goto again;
> +	}
> +

Maybe here is the right place to assign pvmw->ptl.

Muchun,
Thanks.

>   	return true;
>   }
>   
> @@ -278,7 +288,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   			step_forward(pvmw, PMD_SIZE);
>   			continue;
>   		}
> -		if (!map_pte(pvmw, &ptl)) {
> +		if (!map_pte(pvmw, &pmde, &ptl)) {
>   			if (!pvmw->pte)
>   				goto restart;
>   			goto next_pte;
> @@ -307,6 +317,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>   		if (!pvmw->ptl) {
>   			pvmw->ptl = ptl;
>   			spin_lock(pvmw->ptl);
> +			if (unlikely(!pmd_same(pmde, pmdp_get_lockless(pvmw->pmd)))) {
> +				pte_unmap_unlock(pvmw->pte, pvmw->ptl);
> +				pvmw->ptl = NULL;
> +				pvmw->pte = NULL;
> +				goto restart;
> +			}
>   		}
>   		goto this_pte;
>   	} while (pvmw->address < end);



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 11/14] mm: userfaultfd: move_pages_pte() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 ` [PATCH v3 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
@ 2024-09-05 12:20   ` Muchun Song
  0 siblings, 0 replies; 26+ messages in thread
From: Muchun Song @ 2024-09-05 12:20 UTC (permalink / raw)
  To: Qi Zheng
  Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Vlastimil Babka (SUSE), Andrew Morton, Mike Rapoport,
	Vishal Moola, Peter Xu, Ryan Roberts, christophe.leroy2, LKML,
	Linux Memory Management List, linux-arm-kernel, linuxppc-dev



> On Sep 4, 2024, at 16:40, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> In move_pages_pte(), we may modify the dst_pte and src_pte after acquiring
> the ptl, so convert it to using pte_offset_map_rw_nolock(). But since we
> already do the pte_same() check, there is no need to get pmdval to do
> pmd_same() check, just pass a dummy variable to it.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 12/14] mm: multi-gen LRU: walk_pte_range() use pte_offset_map_rw_nolock()
  2024-09-04  8:40 ` [PATCH v3 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
@ 2024-09-05 12:23   ` Muchun Song
  0 siblings, 0 replies; 26+ messages in thread
From: Muchun Song @ 2024-09-05 12:23 UTC (permalink / raw)
  To: Qi Zheng
  Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Vlastimil Babka (SUSE), akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 4, 2024, at 16:40, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> In walk_pte_range(), we may modify the pte entry after holding the ptl, so
> convert it to using pte_offset_map_rw_nolock(). At this time, the
> pte_same() check is not performed after the ptl held, so we should get
> pmdval and do pmd_same() check to ensure the stability of pmd entry.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 13/14] mm: pgtable: remove pte_offset_map_nolock()
  2024-09-04  8:40 ` [PATCH v3 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
@ 2024-09-05 12:23   ` Muchun Song
  0 siblings, 0 replies; 26+ messages in thread
From: Muchun Song @ 2024-09-05 12:23 UTC (permalink / raw)
  To: Qi Zheng
  Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Vlastimil Babka (SUSE), akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2, linux-kernel, linux-mm,
	linux-arm-kernel, linuxppc-dev



> On Sep 4, 2024, at 16:40, Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
> Now no users are using the pte_offset_map_nolock(), remove it.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>

Reviewed-by: Muchun Song <muchun.song@linux.dev>

Thanks.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
  2024-09-04  8:40 ` [PATCH v3 01/14] mm: pgtable: " Qi Zheng
@ 2024-09-06  7:20   ` Muchun Song
  2024-09-12  9:28     ` Qi Zheng
  0 siblings, 1 reply; 26+ messages in thread
From: Muchun Song @ 2024-09-06  7:20 UTC (permalink / raw)
  To: Qi Zheng
  Cc: LKML, Linux Memory Management List, linux-arm-kernel,
	linuxppc-dev, David Hildenbrand, hughd, willy, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2



On 2024/9/4 16:40, Qi Zheng wrote:
> Currently, the usage of pte_offset_map_nolock() can be divided into the
> following two cases:
>
> 1) After acquiring PTL, only read-only operations are performed on the PTE
>     page. In this case, the RCU lock in pte_offset_map_nolock() will ensure
>     that the PTE page will not be freed, and there is no need to worry
>     about whether the pmd entry is modified.
>
> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>     time, we need to ensure that the pmd entry has not been modified
>     concurrently.
>
> To more clearing distinguish between these two cases, this commit
> introduces two new helper functions to replace pte_offset_map_nolock().
> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
> pmdval when successful. It is applicable for may-write cases where any
> modification operations to the page table may happen after the
> corresponding spinlock is held afterwards. But the users should make sure
> the page table is stable like checking pte_same() or checking pmd_same()
> by using the output pmdval before performing the write operations.
>
> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
> will be read-only/read-write protected.
>
> Subsequent commits will convert pte_offset_map_nolock() into the above
> two functions one by one, and finally completely delete it.
>
> Signed-off-by: Qi Zheng<zhengqi.arch@bytedance.com>
> ---
>   Documentation/mm/split_page_table_lock.rst |  7 +++
>   include/linux/mm.h                         |  5 +++
>   mm/pgtable-generic.c                       | 50 ++++++++++++++++++++++
>   3 files changed, 62 insertions(+)
>
> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
> index e4f6972eb6c04..08d0e706a32db 100644
> --- a/Documentation/mm/split_page_table_lock.rst
> +++ b/Documentation/mm/split_page_table_lock.rst
> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
>    - pte_offset_map_nolock()
>   	maps PTE, returns pointer to PTE with pointer to its PTE table
>   	lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_ro_nolock()
> +	maps PTE, returns pointer to PTE with pointer to its PTE table
> +	lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map_rw_nolock()
> +	maps PTE, returns pointer to PTE with pointer to its PTE table
> +	lock (not taken) and the value of its pmd entry, or returns NULL
> +	if no PTE table;
>    - pte_offset_map()
>   	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
>    - pte_unmap()
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a7c74a840249a..1fde9242231c9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3006,6 +3006,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
>   
>   pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>   			unsigned long addr, spinlock_t **ptlp);
> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, spinlock_t **ptlp);
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, pmd_t *pmdvalp,
> +				spinlock_t **ptlp);
>   
>   #define pte_unmap_unlock(pte, ptl)	do {		\
>   	spin_unlock(ptl);				\
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index a78a4adf711ac..262b7065a5a2e 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>   	return pte;
>   }
>   
> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	return pte;
> +}
> +
> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
> +				unsigned long addr, pmd_t *pmdvalp,
> +				spinlock_t **ptlp)
> +{
> +	pmd_t pmdval;
> +	pte_t *pte;
> +
> +	VM_WARN_ON_ONCE(!pmdvalp);
> +	pte = __pte_offset_map(pmd, addr, &pmdval);
> +	if (likely(pte))
> +		*ptlp = pte_lockptr(mm, &pmdval);
> +	*pmdvalp = pmdval;
> +	return pte;
> +}
> +
>   /*
>    * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
>    * __pte_offset_map_lock() below, is usually called with the pmd pointer for
> @@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>    * recheck *pmd once the lock is taken; in practice, no callsite needs that -
>    * either the mmap_lock for write, or pte_same() check on contents, is enough.
>    *
> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
> + * but when successful, it also outputs a pointer to the spinlock in ptlp - as
> + * pte_offset_map_lock() does, but in this case without locking it.  This helps
> + * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that time
> + * act on a changed *pmd: pte_offset_map_ro_nolock() provides the correct spinlock
> + * pointer for the page table that it returns. Even after grabbing the spinlock,
> + * we might be looking either at a page table that is still mapped or one that
> + * was unmapped and is about to get freed. But for R/O access this is sufficient.
> + * So it is only applicable for read-only cases where any modification operations
> + * to the page table are not allowed even if the corresponding spinlock is held
> + * afterwards.
> + *
> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
> + * pte_offset_map_ro_nolock(); but when successful, it also outputs the pdmval.
> + * It is applicable for may-write cases where any modification operations to the
> + * page table may happen after the corresponding spinlock is held afterwards.
> + * But the users should make sure the page table is stable like checking pte_same()
> + * or checking pmd_same() by using the output pmdval before performing the write
> + * operations.

Now, we have two options to make sure the stability of PTE for users
of pte_offset_map_rw_nolock(), in order to ease this operation, how
about proposing a new helper (or two, one for pmd_same, another for
pte_same) like pte_lock_stability (I am not good at naming, maybe
you can) which helps users 1) hold the PTL and 2) check if the PTE is
stable and 3) return true if the PTE stable, otherwise return false.

Muchun,
Thanks.

> + *
> + * Note: "RO" / "RW" expresses the intended semantics, not that the *kmap* will
> + * be read-only/read-write protected.
> + *
>    * Note that free_pgtables(), used after unmapping detached vmas, or when
>    * exiting the whole mm, does not take page table lock before freeing a page
>    * table, and may not use RCU at all: "outsiders" like khugepaged should avoid



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 01/14] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
  2024-09-06  7:20   ` Muchun Song
@ 2024-09-12  9:28     ` Qi Zheng
  0 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-12  9:28 UTC (permalink / raw)
  To: Muchun Song
  Cc: LKML, Linux Memory Management List, linux-arm-kernel,
	linuxppc-dev, David Hildenbrand, hughd, willy, vbabka, akpm, rppt,
	vishal.moola, peterx, ryan.roberts, christophe.leroy2

Hi Muchun,

On 2024/9/6 15:20, Muchun Song wrote:
> 
> 
> On 2024/9/4 16:40, Qi Zheng wrote:
>> Currently, the usage of pte_offset_map_nolock() can be divided into the
>> following two cases:
>>
>> 1) After acquiring PTL, only read-only operations are performed on the 
>> PTE
>>     page. In this case, the RCU lock in pte_offset_map_nolock() will 
>> ensure
>>     that the PTE page will not be freed, and there is no need to worry
>>     about whether the pmd entry is modified.
>>
>> 2) After acquiring PTL, the pte or pmd entries may be modified. At this
>>     time, we need to ensure that the pmd entry has not been modified
>>     concurrently.
>>
>> To more clearing distinguish between these two cases, this commit
>> introduces two new helper functions to replace pte_offset_map_nolock().
>> For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in addition
>> to changing the name to pte_offset_map_rw_nolock(), it also outputs the
>> pmdval when successful. It is applicable for may-write cases where any
>> modification operations to the page table may happen after the
>> corresponding spinlock is held afterwards. But the users should make sure
>> the page table is stable like checking pte_same() or checking pmd_same()
>> by using the output pmdval before performing the write operations.
>>
>> Note: "RO" / "RW" expresses the intended semantics, not that the *kmap*
>> will be read-only/read-write protected.
>>
>> Subsequent commits will convert pte_offset_map_nolock() into the above
>> two functions one by one, and finally completely delete it.
>>
>> Signed-off-by: Qi Zheng<zhengqi.arch@bytedance.com>
>> ---
>>   Documentation/mm/split_page_table_lock.rst |  7 +++
>>   include/linux/mm.h                         |  5 +++
>>   mm/pgtable-generic.c                       | 50 ++++++++++++++++++++++
>>   3 files changed, 62 insertions(+)
>>
>> diff --git a/Documentation/mm/split_page_table_lock.rst 
>> b/Documentation/mm/split_page_table_lock.rst
>> index e4f6972eb6c04..08d0e706a32db 100644
>> --- a/Documentation/mm/split_page_table_lock.rst
>> +++ b/Documentation/mm/split_page_table_lock.rst
>> @@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other 
>> accessor functions:
>>    - pte_offset_map_nolock()
>>       maps PTE, returns pointer to PTE with pointer to its PTE table
>>       lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_ro_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken), or returns NULL if no PTE table;
>> + - pte_offset_map_rw_nolock()
>> +    maps PTE, returns pointer to PTE with pointer to its PTE table
>> +    lock (not taken) and the value of its pmd entry, or returns NULL
>> +    if no PTE table;
>>    - pte_offset_map()
>>       maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
>>    - pte_unmap()
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index a7c74a840249a..1fde9242231c9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3006,6 +3006,11 @@ static inline pte_t *pte_offset_map_lock(struct 
>> mm_struct *mm, pmd_t *pmd,
>>   pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
>>               unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, spinlock_t **ptlp);
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, pmd_t *pmdvalp,
>> +                spinlock_t **ptlp);
>>   #define pte_unmap_unlock(pte, ptl)    do {        \
>>       spin_unlock(ptl);                \
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index a78a4adf711ac..262b7065a5a2e 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct 
>> *mm, pmd_t *pmd,
>>       return pte;
>>   }
>> +pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, spinlock_t **ptlp)
>> +{
>> +    pmd_t pmdval;
>> +    pte_t *pte;
>> +
>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>> +    if (likely(pte))
>> +        *ptlp = pte_lockptr(mm, &pmdval);
>> +    return pte;
>> +}
>> +
>> +pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
>> +                unsigned long addr, pmd_t *pmdvalp,
>> +                spinlock_t **ptlp)
>> +{
>> +    pmd_t pmdval;
>> +    pte_t *pte;
>> +
>> +    VM_WARN_ON_ONCE(!pmdvalp);
>> +    pte = __pte_offset_map(pmd, addr, &pmdval);
>> +    if (likely(pte))
>> +        *ptlp = pte_lockptr(mm, &pmdval);
>> +    *pmdvalp = pmdval;
>> +    return pte;
>> +}
>> +
>>   /*
>>    * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal 
>> implementation
>>    * __pte_offset_map_lock() below, is usually called with the pmd 
>> pointer for
>> @@ -356,6 +383,29 @@ pte_t *pte_offset_map_nolock(struct mm_struct 
>> *mm, pmd_t *pmd,
>>    * recheck *pmd once the lock is taken; in practice, no callsite 
>> needs that -
>>    * either the mmap_lock for write, or pte_same() check on contents, 
>> is enough.
>>    *
>> + * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like 
>> pte_offset_map();
>> + * but when successful, it also outputs a pointer to the spinlock in 
>> ptlp - as
>> + * pte_offset_map_lock() does, but in this case without locking it.  
>> This helps
>> + * the caller to avoid a later pte_lockptr(mm, *pmd), which might by 
>> that time
>> + * act on a changed *pmd: pte_offset_map_ro_nolock() provides the 
>> correct spinlock
>> + * pointer for the page table that it returns. Even after grabbing 
>> the spinlock,
>> + * we might be looking either at a page table that is still mapped or 
>> one that
>> + * was unmapped and is about to get freed. But for R/O access this is 
>> sufficient.
>> + * So it is only applicable for read-only cases where any 
>> modification operations
>> + * to the page table are not allowed even if the corresponding 
>> spinlock is held
>> + * afterwards.
>> + *
>> + * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is 
>> like
>> + * pte_offset_map_ro_nolock(); but when successful, it also outputs 
>> the pdmval.
>> + * It is applicable for may-write cases where any modification 
>> operations to the
>> + * page table may happen after the corresponding spinlock is held 
>> afterwards.
>> + * But the users should make sure the page table is stable like 
>> checking pte_same()
>> + * or checking pmd_same() by using the output pmdval before 
>> performing the write
>> + * operations.
> 
> Now, we have two options to make sure the stability of PTE for users
> of pte_offset_map_rw_nolock(), in order to ease this operation, how
> about proposing a new helper (or two, one for pmd_same, another for
> pte_same) like pte_lock_stability (I am not good at naming, maybe
> you can) which helps users 1) hold the PTL and 2) check if the PTE is
> stable and 3) return true if the PTE stable, otherwise return false.

I've been trying to do this these days, but I found it was not very
convenient.

I introduced the following helpers:

#define __PTE_STABILITY(lock)						\
bool __pte_stability_##lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t *pte,	\
			    pte_t *orig_pte, spinlock_t *ptlp)		\
{									\
	pte_spin_##lock(ptlp);						\
	if (orig_pte) {							\
		VM_WARN_ON_ONCE(pte_none(*orig_pte));			\
		return pte_same(*orig_pte, ptep_get(pte));		\
	}								\
	if (orig_pmd) {							\
		VM_WARN_ON_ONCE(pmd_none(*orig_pmd));			\
		return pmd_same(*orig_pmd, pmdp_get_lockless(pmd));	\
	}								\
	VM_WARN_ON_ONCE(1);						\
	return false;							\
}
__PTE_STABILITY(lock)
__PTE_STABILITY(lock_nested)

static inline bool pte_stability_lock(pmd_t *pmd, pmd_t *orig_pmd, pte_t 
*pte,
				      pte_t *orig_pte, spinlock_t *ptlp)
	__acquires(ptlp)
{
	return __pte_stability_lock(pmd, orig_pmd, pte, orig_pte, ptlp);
}

#ifdef CONFIG_SPLIT_PTE_PTLOCKS
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
                                              pte_t *pte, pte_t *orig_pte,
                                              spinlock_t *ptlp)
         __acquires(ptlp)
{
         return __pte_stability_lock_nested(pmd, orig_pmd, pte, 
orig_pte, ptlp);
}

static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
	__releases(ptlp)
{
	spin_unlock(ptlp);
}
#else
static inline bool pte_stability_lock_nested(pmd_t *pmd, pmd_t *orig_pmd,
                                              pte_t *pte, pte_t *orig_pte,
                                              spinlock_t *ptlp)
{
         return true;
}
static inline void pte_stability_unlock_nested(spinlock_t *ptlp)
{
}
#endif /* CONFIG_SPLIT_PTE_PTLOCKS */

and try to use them with pte_offset_map_rw_nolock() in the following
functions:

1. collapse_pte_mapped_thp
2. handle_pte_fault
3. map_pte
4. move_pages_pte
5. walk_pte_range

For 1, 2 and 3, the conversion is relatively simple, but 2 actually
already does a pte_same() check, so it does not reduce the amount of
code much.

For 4, the pte_same() checks have already been done, and it is not
easy to convert double_pt_lock() to use pte_stability_lock{_nested}().

For 5, it calls spin_trylock(), we should introduce another
pte_stability_trylock() helper for it, but it feels unnecessary.

There are not many places where pte_offset_map_rw_nolock() is called,
and some places have already done pte_same() checks, so maybe open
code is enough and there is no need to introduce more helper function.

Thanks,
Qi

> 
> Muchun,
> Thanks.
> 
>> + *
>> + * Note: "RO" / "RW" expresses the intended semantics, not that the 
>> *kmap* will
>> + * be read-only/read-write protected.
>> + *
>>    * Note that free_pgtables(), used after unmapping detached vmas, or 
>> when
>>    * exiting the whole mm, does not take page table lock before 
>> freeing a page
>>    * table, and may not use RCU at all: "outsiders" like khugepaged 
>> should avoid
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v3 10/14] mm: page_vma_mapped_walk: map_pte() use pte_offset_map_rw_nolock()
  2024-09-05 12:07   ` Muchun Song
@ 2024-09-12  9:30     ` Qi Zheng
  0 siblings, 0 replies; 26+ messages in thread
From: Qi Zheng @ 2024-09-12  9:30 UTC (permalink / raw)
  To: Muchun Song
  Cc: linux-kernel, linux-mm, linux-arm-kernel, linuxppc-dev, david,
	hughd, willy, vbabka, akpm, rppt, vishal.moola, peterx,
	ryan.roberts, christophe.leroy2



On 2024/9/5 20:07, Muchun Song wrote:
> 
> 
> On 2024/9/4 16:40, Qi Zheng wrote:
>> In the caller of map_pte(), we may modify the pvmw->pte after acquiring
>> the pvmw->ptl, so convert it to using pte_offset_map_rw_nolock(). At
>> this time, the pte_same() check is not performed after the pvmw->ptl 
>> held,
>> so we should get pmdval and do pmd_same() check to ensure the 
>> stability of
>> pvmw->pmd.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   mm/page_vma_mapped.c | 24 ++++++++++++++++++++----
>>   1 file changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index ae5cc42aa2087..f1d73fd448708 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -13,9 +13,11 @@ static inline bool not_found(struct 
>> page_vma_mapped_walk *pvmw)
>>       return false;
>>   }
>> -static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t 
>> **ptlp)
>> +static bool map_pte(struct page_vma_mapped_walk *pvmw, pmd_t *pmdvalp,
>> +            spinlock_t **ptlp)
>>   {
>>       pte_t ptent;
>> +    pmd_t pmdval;
>>       if (pvmw->flags & PVMW_SYNC) {
>>           /* Use the stricter lookup */
>> @@ -25,6 +27,7 @@ static bool map_pte(struct page_vma_mapped_walk 
>> *pvmw, spinlock_t **ptlp)
>>           return !!pvmw->pte;
>>       }
>> +again:
>>       /*
>>        * It is important to return the ptl corresponding to pte,
>>        * in case *pvmw->pmd changes underneath us; so we need to
>> @@ -32,10 +35,11 @@ static bool map_pte(struct page_vma_mapped_walk 
>> *pvmw, spinlock_t **ptlp)
>>        * proceeds to loop over next ptes, and finds a match later.
>>        * Though, in most cases, page lock already protects this.
>>        */
>> -    pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>> -                      pvmw->address, ptlp);
>> +    pvmw->pte = pte_offset_map_rw_nolock(pvmw->vma->vm_mm, pvmw->pmd,
>> +                         pvmw->address, &pmdval, ptlp);
>>       if (!pvmw->pte)
>>           return false;
>> +    *pmdvalp = pmdval;
>>       ptent = ptep_get(pvmw->pte);
>> @@ -69,6 +73,12 @@ static bool map_pte(struct page_vma_mapped_walk 
>> *pvmw, spinlock_t **ptlp)
>>       }
>>       pvmw->ptl = *ptlp;
>>       spin_lock(pvmw->ptl);
>> +
>> +    if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pvmw->pmd)))) {
>> +        spin_unlock(pvmw->ptl);
> 
> Forgot to clear pvmw->ptl? Or how about moving the assignment for it
> to the place where the pmd_same check is successful?
> 
>> +        goto again;
>> +    }
>> +
> 
> Maybe here is the right place to assign pvmw->ptl.

Right, will do in the v4.

> 
> Muchun,
> Thanks.
> 
>>       return true;
>>   }
>> @@ -278,7 +288,7 @@ bool page_vma_mapped_walk(struct 
>> page_vma_mapped_walk *pvmw)
>>               step_forward(pvmw, PMD_SIZE);
>>               continue;
>>           }
>> -        if (!map_pte(pvmw, &ptl)) {
>> +        if (!map_pte(pvmw, &pmde, &ptl)) {
>>               if (!pvmw->pte)
>>                   goto restart;
>>               goto next_pte;
>> @@ -307,6 +317,12 @@ bool page_vma_mapped_walk(struct 
>> page_vma_mapped_walk *pvmw)
>>           if (!pvmw->ptl) {
>>               pvmw->ptl = ptl;
>>               spin_lock(pvmw->ptl);
>> +            if (unlikely(!pmd_same(pmde, 
>> pmdp_get_lockless(pvmw->pmd)))) {
>> +                pte_unmap_unlock(pvmw->pte, pvmw->ptl);
>> +                pvmw->ptl = NULL;
>> +                pvmw->pte = NULL;
>> +                goto restart;
>> +            }
>>           }
>>           goto this_pte;
>>       } while (pvmw->address < end);
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-09-12  9:30 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-04  8:40 [PATCH v3 00/14] introduce pte_offset_map_{ro|rw}_nolock() Qi Zheng
2024-09-04  8:40 ` [PATCH v3 01/14] mm: pgtable: " Qi Zheng
2024-09-06  7:20   ` Muchun Song
2024-09-12  9:28     ` Qi Zheng
2024-09-04  8:40 ` [PATCH v3 02/14] arm: adjust_pte() use pte_offset_map_rw_nolock() Qi Zheng
2024-09-04  8:40 ` [PATCH v3 03/14] powerpc: assert_pte_locked() use pte_offset_map_ro_nolock() Qi Zheng
2024-09-04  8:40 ` [PATCH v3 04/14] mm: filemap: filemap_fault_recheck_pte_none() " Qi Zheng
2024-09-04  8:40 ` [PATCH v3 05/14] mm: khugepaged: __collapse_huge_page_swapin() " Qi Zheng
2024-09-04  8:40 ` [PATCH v3 06/14] mm: handle_pte_fault() use pte_offset_map_rw_nolock() Qi Zheng
2024-09-04  8:40 ` [PATCH v3 07/14] mm: khugepaged: collapse_pte_mapped_thp() " Qi Zheng
2024-09-04  8:40 ` [PATCH v3 08/14] mm: copy_pte_range() " Qi Zheng
2024-09-05  8:57   ` Muchun Song
2024-09-05 10:55     ` Qi Zheng
2024-09-04  8:40 ` [PATCH v3 09/14] mm: mremap: move_ptes() " Qi Zheng
2024-09-05  9:25   ` Muchun Song
2024-09-05 10:56     ` Qi Zheng
2024-09-04  8:40 ` [PATCH v3 10/14] mm: page_vma_mapped_walk: map_pte() " Qi Zheng
2024-09-05 12:07   ` Muchun Song
2024-09-12  9:30     ` Qi Zheng
2024-09-04  8:40 ` [PATCH v3 11/14] mm: userfaultfd: move_pages_pte() " Qi Zheng
2024-09-05 12:20   ` Muchun Song
2024-09-04  8:40 ` [PATCH v3 12/14] mm: multi-gen LRU: walk_pte_range() " Qi Zheng
2024-09-05 12:23   ` Muchun Song
2024-09-04  8:40 ` [PATCH v3 13/14] mm: pgtable: remove pte_offset_map_nolock() Qi Zheng
2024-09-05 12:23   ` Muchun Song
2024-09-04  8:40 ` [PATCH v3 14/14] mm: khugepaged: retract_page_tables() use pte_offset_map_rw_nolock() Qi Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).