* [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages
@ 2024-12-04 11:09 Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 01/11] mm: khugepaged: recheck pmd state in retract_page_tables() Qi Zheng
                   ` (12 more replies)
  0 siblings, 13 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

Changes in v4:
 - update the process_addrs.rst in [PATCH v4 01/11]
   (suggested by Lorenzo Stoakes)
 - fix [PATCH v3 4/9] and move it after [PATCH v3 5/9]
   (pointed out by David Hildenbrand)
 - change to use any_skipped instead of rechecking pte_none() to detect empty
   user PTE pages (suggested by David Hildenbrand)
 - rebase onto the next-20241203

Changes in v3:
 - recheck pmd state instead of pmd_same() in retract_page_tables()
   (suggested by Jann Horn)
 - recheck dst_pmd entry in move_pages_pte() (pointed out by Jann Horn)
 - introduce new skip_none_ptes() (suggested by David Hildenbrand)
 - minor changes in [PATCH v2 5/7]
 - remove tlb_remove_table_sync_one() if CONFIG_PT_RECLAIM is enabled.
 - use put_page() instead of free_page_and_swap_cache() in
   __tlb_remove_table_one_rcu() (pointed out by Jann Horn)
 - collect the Reviewed-bys and Acked-bys
 - rebase onto the next-20241112

Changes in v2:
 - fix [PATCH v1 1/7] (Jann Horn)
 - reset force_flush and force_break to false in [PATCH v1 2/7] (Jann Horn)
 - introduce zap_nonpresent_ptes() and do_zap_pte_range()
 - check pte_none() instead of can_reclaim_pt after the processing of PTEs
   (remove [PATCH v1 3/7] and [PATCH v1 4/7])
 - reorder patches
 - rebase onto the next-20241031

Changes in v1:
 - replace [RFC PATCH 1/7] with a separate series (already merged into mm-unstable):
   https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
   (suggested by David Hildenbrand)
 - squash [RFC PATCH 2/7] into [RFC PATCH 4/7]
   (suggested by David Hildenbrand)
 - change to scan and reclaim empty user PTE pages in zap_pte_range()
   (suggested by David Hildenbrand)
 - sent a separate RFC patch to track the tlb flushing issue, and removed
   that part from this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]).
   link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
 - add [PATCH v1 1/7] into this series
 - drop RFC tag
 - rebase onto the next-20241011

Changes in RFC v2:
 - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
   kernel test robot
 - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
   in retract_page_tables() (in [RFC PATCH 4/7])
 - rebase onto the next-20240805

Hi all,

Previously, we tried to use a completely asynchronous method to reclaim empty
user PTE pages [1]. After discussing with David Hildenbrand, we decided to
implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
first step.

So this series aims to synchronously free the empty PTE pages in the
madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in
zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than
madvise(MADV_DONTNEED).
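
For reference, with this series applied the madvise(MADV_DONTNEED) path ends up
doing roughly the following (taken from the mm/madvise.c hunk in
[PATCH v4 09/11]; only this caller sets reclaim_pt, so all other zap paths are
unaffected):

	static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
						unsigned long start, unsigned long end)
	{
		/* Opt in to empty PTE page reclaim for this zap only. */
		struct zap_details details = {
			.reclaim_pt = true,
			.even_cows = true,
		};

		zap_page_range_single(vma, start, end - start, &details);
		return 0;
	}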

In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page
freeing operations. Therefore, if we want to free the empty PTE page in this
path, the most natural way is to add it to mmu_gather as well. Now, if
CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
pages by semi RCU:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

But this is not enough to free the empty PTE page table pages in paths other
than the munmap and exit_mmap paths, because IPI cannot be synchronized with
rcu_read_lock() in pte_offset_map{_lock}(). So we should let the single table
also be freed by RCU, like batch table freeing.
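
For illustration, a lockless PTE table walker is only protected by RCU; a
minimal sketch (pte_offset_map() enters an RCU read-side critical section
internally, and pte_unmap() leaves it again):

	pte_t *pte, ptent;

	pte = pte_offset_map(pmd, addr);	/* takes rcu_read_lock() */
	if (pte) {
		ptent = ptep_get(pte);		/* the PTE page must stay valid here */
		/* ... */
		pte_unmap(pte);			/* drops rcu_read_lock() */
	}

Such readers do not disable IRQs, so an IPI broadcast alone cannot wait for
them, whereas freeing the single table via call_rcu(), as the batch path
already does, can.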

As a first step, this series supports this feature on x86_64 and selects the
newly introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.

For other cases such as madvise(MADV_FREE), consider scanning and freeing empty
PTE pages asynchronously in the future.

This series is based on next-20241203 (which contains the series [2]).

Note: issues related to TLB flushing are not new to this series and are tracked
      in the separate RFC patch [3]. For more context, please refer to this
      thread [4].

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
[4]. https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/

Qi Zheng (11):
  mm: khugepaged: recheck pmd state in retract_page_tables()
  mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
  mm: introduce zap_nonpresent_ptes()
  mm: introduce do_zap_pte_range()
  mm: skip over all consecutive none ptes in do_zap_pte_range()
  mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been
    re-installed
  mm: do_zap_pte_range: return any_skipped information to the caller
  mm: make zap_pte_range() handle full within-PMD range
  mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
  x86: mm: free page table pages by RCU instead of semi RCU
  x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64

 Documentation/mm/process_addrs.rst |   4 +
 arch/x86/Kconfig                   |   1 +
 arch/x86/include/asm/tlb.h         |  20 +++
 arch/x86/kernel/paravirt.c         |   7 +
 arch/x86/mm/pgtable.c              |  10 +-
 include/linux/mm.h                 |   1 +
 include/linux/mm_inline.h          |  11 +-
 include/linux/mm_types.h           |   4 +-
 mm/Kconfig                         |  15 ++
 mm/Makefile                        |   1 +
 mm/internal.h                      |  19 +++
 mm/khugepaged.c                    |  45 +++--
 mm/madvise.c                       |   7 +-
 mm/memory.c                        | 253 ++++++++++++++++++-----------
 mm/mmu_gather.c                    |   9 +-
 mm/pt_reclaim.c                    |  71 ++++++++
 mm/userfaultfd.c                   |  51 ++++--
 17 files changed, 397 insertions(+), 132 deletions(-)
 create mode 100644 mm/pt_reclaim.c

-- 
2.20.1



* [PATCH v4 01/11] mm: khugepaged: recheck pmd state in retract_page_tables()
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 02/11] mm: userfaultfd: recheck dst_pmd entry in move_pages_pte() Qi Zheng
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

In retract_page_tables(), the lock of new_folio is still held, so page
faults will be blocked, which prevents the pte entries from being set
again. So even though the old empty PTE page may be concurrently freed
and a new PTE page is filled into the pmd entry, it is still empty and
can be removed.

So just refactor retract_page_tables() a little bit and recheck the pmd
state after taking the pmd lock.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/mm/process_addrs.rst |  4 +++
 mm/khugepaged.c                    | 45 ++++++++++++++++++++----------
 2 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/Documentation/mm/process_addrs.rst b/Documentation/mm/process_addrs.rst
index 1d416658d7f59..81417fa2ed20b 100644
--- a/Documentation/mm/process_addrs.rst
+++ b/Documentation/mm/process_addrs.rst
@@ -531,6 +531,10 @@ are extra requirements for accessing them:
   new page table has been installed in the same location and filled with
   entries. Writers normally need to take the PTE lock and revalidate that the
   PMD entry still refers to the same PTE-level page table.
+  If the writer does not care whether it is the same PTE-level page table, it
+  can take the PMD lock and revalidate that the contents of the pmd entry still
+  meet the requirements. In particular, this also happens in :c:func:`!retract_page_tables`
+  when handling :c:macro:`!MADV_COLLAPSE`.
 
 To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
 :c:func:`!pte_offset_map` can be used depending on stability requirements.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6f8d46d107b4b..99dc995aac110 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -947,17 +947,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return SCAN_SUCCEED;
 }
 
-static int find_pmd_or_thp_or_none(struct mm_struct *mm,
-				   unsigned long address,
-				   pmd_t **pmd)
+static inline int check_pmd_state(pmd_t *pmd)
 {
-	pmd_t pmde;
+	pmd_t pmde = pmdp_get_lockless(pmd);
 
-	*pmd = mm_find_pmd(mm, address);
-	if (!*pmd)
-		return SCAN_PMD_NULL;
-
-	pmde = pmdp_get_lockless(*pmd);
 	if (pmd_none(pmde))
 		return SCAN_PMD_NONE;
 	if (!pmd_present(pmde))
@@ -971,6 +964,17 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 	return SCAN_SUCCEED;
 }
 
+static int find_pmd_or_thp_or_none(struct mm_struct *mm,
+				   unsigned long address,
+				   pmd_t **pmd)
+{
+	*pmd = mm_find_pmd(mm, address);
+	if (!*pmd)
+		return SCAN_PMD_NULL;
+
+	return check_pmd_state(*pmd);
+}
+
 static int check_pmd_still_valid(struct mm_struct *mm,
 				 unsigned long address,
 				 pmd_t *pmd)
@@ -1720,7 +1724,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		pmd_t *pmd, pgt_pmd;
 		spinlock_t *pml;
 		spinlock_t *ptl;
-		bool skipped_uffd = false;
+		bool success = false;
 
 		/*
 		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
@@ -1757,6 +1761,19 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		mmu_notifier_invalidate_range_start(&range);
 
 		pml = pmd_lock(mm, pmd);
+		/*
+		 * The lock of new_folio is still held, so page faults will be
+		 * blocked, which prevents the pte entries from being set
+		 * again. So even though the old empty PTE page may be
+		 * concurrently freed and a new PTE page is filled into the pmd
+		 * entry, it is still empty and can be removed.
+		 *
+		 * So here we only need to recheck if the state of the pmd
+		 * entry still meets our requirements, rather than checking pmd_same()
+		 * like elsewhere.
+		 */
+		if (check_pmd_state(pmd) != SCAN_SUCCEED)
+			goto drop_pml;
 		ptl = pte_lockptr(mm, pmd);
 		if (ptl != pml)
 			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
@@ -1770,20 +1787,20 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * repeating the anon_vma check protects from one category,
 		 * and repeating the userfaultfd_wp() check from another.
 		 */
-		if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) {
-			skipped_uffd = true;
-		} else {
+		if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
 			pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
 			pmdp_get_lockless_sync();
+			success = true;
 		}
 
 		if (ptl != pml)
 			spin_unlock(ptl);
+drop_pml:
 		spin_unlock(pml);
 
 		mmu_notifier_invalidate_range_end(&range);
 
-		if (!skipped_uffd) {
+		if (success) {
 			mm_dec_nr_ptes(mm);
 			page_table_check_pte_clear_range(mm, addr, pgt_pmd);
 			pte_free_defer(mm, pmd_pgtable(pgt_pmd));
-- 
2.20.1



* [PATCH v4 02/11] mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 01/11] mm: khugepaged: recheck pmd state in retract_page_tables() Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-10  8:41   ` [PATCH v4 02/11 fix] fix: " Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 03/11] mm: introduce zap_nonpresent_ptes() Qi Zheng
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

In move_pages_pte(), since dst_pte needs to be none, the subsequent
pte_same() check cannot prevent the dst_pte page from being freed
concurrently, so we also need to obtain dst_pmdval and recheck pmd_same().
Otherwise, once we support empty PTE page reclamation for anonymous
pages, it may result in moving the src_pte page into the dst_pte page that
is about to be freed by RCU.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/userfaultfd.c | 51 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 60a0be33766ff..8e16dc290ddf1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1020,6 +1020,14 @@ void double_pt_unlock(spinlock_t *ptl1,
 		__release(ptl2);
 }
 
+static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
+				       pte_t orig_dst_pte, pte_t orig_src_pte,
+				       pmd_t *dst_pmd, pmd_t dst_pmdval)
+{
+	return pte_same(ptep_get(src_pte), orig_src_pte) &&
+	       pte_same(ptep_get(dst_pte), orig_dst_pte) &&
+	       pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
+}
 
 static int move_present_pte(struct mm_struct *mm,
 			    struct vm_area_struct *dst_vma,
@@ -1027,6 +1035,7 @@ static int move_present_pte(struct mm_struct *mm,
 			    unsigned long dst_addr, unsigned long src_addr,
 			    pte_t *dst_pte, pte_t *src_pte,
 			    pte_t orig_dst_pte, pte_t orig_src_pte,
+			    pmd_t *dst_pmd, pmd_t dst_pmdval,
 			    spinlock_t *dst_ptl, spinlock_t *src_ptl,
 			    struct folio *src_folio)
 {
@@ -1034,8 +1043,8 @@ static int move_present_pte(struct mm_struct *mm,
 
 	double_pt_lock(dst_ptl, src_ptl);
 
-	if (!pte_same(ptep_get(src_pte), orig_src_pte) ||
-	    !pte_same(ptep_get(dst_pte), orig_dst_pte)) {
+	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
+				 dst_pmd, dst_pmdval)) {
 		err = -EAGAIN;
 		goto out;
 	}
@@ -1071,6 +1080,7 @@ static int move_swap_pte(struct mm_struct *mm,
 			 unsigned long dst_addr, unsigned long src_addr,
 			 pte_t *dst_pte, pte_t *src_pte,
 			 pte_t orig_dst_pte, pte_t orig_src_pte,
+			 pmd_t *dst_pmd, pmd_t dst_pmdval,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl)
 {
 	if (!pte_swp_exclusive(orig_src_pte))
@@ -1078,8 +1088,8 @@ static int move_swap_pte(struct mm_struct *mm,
 
 	double_pt_lock(dst_ptl, src_ptl);
 
-	if (!pte_same(ptep_get(src_pte), orig_src_pte) ||
-	    !pte_same(ptep_get(dst_pte), orig_dst_pte)) {
+	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
+				 dst_pmd, dst_pmdval)) {
 		double_pt_unlock(dst_ptl, src_ptl);
 		return -EAGAIN;
 	}
@@ -1097,13 +1107,14 @@ static int move_zeropage_pte(struct mm_struct *mm,
 			     unsigned long dst_addr, unsigned long src_addr,
 			     pte_t *dst_pte, pte_t *src_pte,
 			     pte_t orig_dst_pte, pte_t orig_src_pte,
+			     pmd_t *dst_pmd, pmd_t dst_pmdval,
 			     spinlock_t *dst_ptl, spinlock_t *src_ptl)
 {
 	pte_t zero_pte;
 
 	double_pt_lock(dst_ptl, src_ptl);
-	if (!pte_same(ptep_get(src_pte), orig_src_pte) ||
-	    !pte_same(ptep_get(dst_pte), orig_dst_pte)) {
+	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
+				 dst_pmd, dst_pmdval)) {
 		double_pt_unlock(dst_ptl, src_ptl);
 		return -EAGAIN;
 	}
@@ -1136,6 +1147,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	pte_t *src_pte = NULL;
 	pte_t *dst_pte = NULL;
 	pmd_t dummy_pmdval;
+	pmd_t dst_pmdval;
 	struct folio *src_folio = NULL;
 	struct anon_vma *src_anon_vma = NULL;
 	struct mmu_notifier_range range;
@@ -1148,11 +1160,11 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 retry:
 	/*
 	 * Use the maywrite version to indicate that dst_pte will be modified,
-	 * but since we will use pte_same() to detect the change of the pte
-	 * entry, there is no need to get pmdval, so just pass a dummy variable
-	 * to it.
+	 * since dst_pte needs to be none, the subsequent pte_same() check
+	 * cannot prevent the dst_pte page from being freed concurrently, so we
+	 * also need to obtain dst_pmdval and recheck pmd_same() later.
 	 */
-	dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dummy_pmdval,
+	dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, &dst_pmdval,
 					   &dst_ptl);
 
 	/* Retry if a huge pmd materialized from under us */
@@ -1161,7 +1173,11 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		goto out;
 	}
 
-	/* same as dst_pte */
+	/*
+	 * Unlike dst_pte, the subsequent pte_same() check can ensure the
+	 * stability of the src_pte page, so there is no need to get pmdval,
+	 * just pass a dummy variable to it.
+	 */
 	src_pte = pte_offset_map_rw_nolock(mm, src_pmd, src_addr, &dummy_pmdval,
 					   &src_ptl);
 
@@ -1213,7 +1229,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			err = move_zeropage_pte(mm, dst_vma, src_vma,
 					       dst_addr, src_addr, dst_pte, src_pte,
 					       orig_dst_pte, orig_src_pte,
-					       dst_ptl, src_ptl);
+					       dst_pmd, dst_pmdval, dst_ptl, src_ptl);
 			goto out;
 		}
 
@@ -1303,8 +1319,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 
 		err = move_present_pte(mm,  dst_vma, src_vma,
 				       dst_addr, src_addr, dst_pte, src_pte,
-				       orig_dst_pte, orig_src_pte,
-				       dst_ptl, src_ptl, src_folio);
+				       orig_dst_pte, orig_src_pte, dst_pmd,
+				       dst_pmdval, dst_ptl, src_ptl, src_folio);
 	} else {
 		entry = pte_to_swp_entry(orig_src_pte);
 		if (non_swap_entry(entry)) {
@@ -1319,10 +1335,9 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			goto out;
 		}
 
-		err = move_swap_pte(mm, dst_addr, src_addr,
-				    dst_pte, src_pte,
-				    orig_dst_pte, orig_src_pte,
-				    dst_ptl, src_ptl);
+		err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte,
+				    orig_dst_pte, orig_src_pte, dst_pmd,
+				    dst_pmdval, dst_ptl, src_ptl);
 	}
 
 out:
-- 
2.20.1



* [PATCH v4 03/11] mm: introduce zap_nonpresent_ptes()
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 01/11] mm: khugepaged: recheck pmd state in retract_page_tables() Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 02/11] mm: userfaultfd: recheck dst_pmd entry in move_pages_pte() Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 04/11] mm: introduce do_zap_pte_range() Qi Zheng
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

Similar to zap_present_ptes(), let's introduce zap_nonpresent_ptes() to
handle non-present ptes, which can improve code readability.

No functional change.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 136 ++++++++++++++++++++++++++++------------------------
 1 file changed, 73 insertions(+), 63 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d5a1b0a6bf1fa..5624c22bb03cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1587,6 +1587,76 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 	return 1;
 }
 
+static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
+		unsigned int max_nr, unsigned long addr,
+		struct zap_details *details, int *rss)
+{
+	swp_entry_t entry;
+	int nr = 1;
+
+	entry = pte_to_swp_entry(ptent);
+	if (is_device_private_entry(entry) ||
+		is_device_exclusive_entry(entry)) {
+		struct page *page = pfn_swap_entry_to_page(entry);
+		struct folio *folio = page_folio(page);
+
+		if (unlikely(!should_zap_folio(details, folio)))
+			return 1;
+		/*
+		 * Both device private/exclusive mappings should only
+		 * work with anonymous page so far, so we don't need to
+		 * consider uffd-wp bit when zap. For more information,
+		 * see zap_install_uffd_wp_if_needed().
+		 */
+		WARN_ON_ONCE(!vma_is_anonymous(vma));
+		rss[mm_counter(folio)]--;
+		if (is_device_private_entry(entry))
+			folio_remove_rmap_pte(folio, page, vma);
+		folio_put(folio);
+	} else if (!non_swap_entry(entry)) {
+		/* Genuine swap entries, hence a private anon pages */
+		if (!should_zap_cows(details))
+			return 1;
+
+		nr = swap_pte_batch(pte, max_nr, ptent);
+		rss[MM_SWAPENTS] -= nr;
+		free_swap_and_cache_nr(entry, nr);
+	} else if (is_migration_entry(entry)) {
+		struct folio *folio = pfn_swap_entry_folio(entry);
+
+		if (!should_zap_folio(details, folio))
+			return 1;
+		rss[mm_counter(folio)]--;
+	} else if (pte_marker_entry_uffd_wp(entry)) {
+		/*
+		 * For anon: always drop the marker; for file: only
+		 * drop the marker if explicitly requested.
+		 */
+		if (!vma_is_anonymous(vma) && !zap_drop_markers(details))
+			return 1;
+	} else if (is_guard_swp_entry(entry)) {
+		/*
+		 * Ordinary zapping should not remove guard PTE
+		 * markers. Only do so if we should remove PTE markers
+		 * in general.
+		 */
+		if (!zap_drop_markers(details))
+			return 1;
+	} else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) {
+		if (!should_zap_cows(details))
+			return 1;
+	} else {
+		/* We should have covered all the swap entry types */
+		pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
+		WARN_ON_ONCE(1);
+	}
+	clear_not_present_full_ptes(vma->vm_mm, addr, pte, nr, tlb->fullmm);
+	zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
+
+	return nr;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1598,7 +1668,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	spinlock_t *ptl;
 	pte_t *start_pte;
 	pte_t *pte;
-	swp_entry_t entry;
 	int nr;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
@@ -1611,8 +1680,6 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	arch_enter_lazy_mmu_mode();
 	do {
 		pte_t ptent = ptep_get(pte);
-		struct folio *folio;
-		struct page *page;
 		int max_nr;
 
 		nr = 1;
@@ -1622,8 +1689,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (need_resched())
 			break;
 
+		max_nr = (end - addr) / PAGE_SIZE;
 		if (pte_present(ptent)) {
-			max_nr = (end - addr) / PAGE_SIZE;
 			nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
 					      addr, details, rss, &force_flush,
 					      &force_break);
@@ -1631,67 +1698,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				addr += nr * PAGE_SIZE;
 				break;
 			}
-			continue;
-		}
-
-		entry = pte_to_swp_entry(ptent);
-		if (is_device_private_entry(entry) ||
-		    is_device_exclusive_entry(entry)) {
-			page = pfn_swap_entry_to_page(entry);
-			folio = page_folio(page);
-			if (unlikely(!should_zap_folio(details, folio)))
-				continue;
-			/*
-			 * Both device private/exclusive mappings should only
-			 * work with anonymous page so far, so we don't need to
-			 * consider uffd-wp bit when zap. For more information,
-			 * see zap_install_uffd_wp_if_needed().
-			 */
-			WARN_ON_ONCE(!vma_is_anonymous(vma));
-			rss[mm_counter(folio)]--;
-			if (is_device_private_entry(entry))
-				folio_remove_rmap_pte(folio, page, vma);
-			folio_put(folio);
-		} else if (!non_swap_entry(entry)) {
-			max_nr = (end - addr) / PAGE_SIZE;
-			nr = swap_pte_batch(pte, max_nr, ptent);
-			/* Genuine swap entries, hence a private anon pages */
-			if (!should_zap_cows(details))
-				continue;
-			rss[MM_SWAPENTS] -= nr;
-			free_swap_and_cache_nr(entry, nr);
-		} else if (is_migration_entry(entry)) {
-			folio = pfn_swap_entry_folio(entry);
-			if (!should_zap_folio(details, folio))
-				continue;
-			rss[mm_counter(folio)]--;
-		} else if (pte_marker_entry_uffd_wp(entry)) {
-			/*
-			 * For anon: always drop the marker; for file: only
-			 * drop the marker if explicitly requested.
-			 */
-			if (!vma_is_anonymous(vma) &&
-			    !zap_drop_markers(details))
-				continue;
-		} else if (is_guard_swp_entry(entry)) {
-			/*
-			 * Ordinary zapping should not remove guard PTE
-			 * markers. Only do so if we should remove PTE markers
-			 * in general.
-			 */
-			if (!zap_drop_markers(details))
-				continue;
-		} else if (is_hwpoison_entry(entry) ||
-			   is_poisoned_swp_entry(entry)) {
-			if (!should_zap_cows(details))
-				continue;
 		} else {
-			/* We should have covered all the swap entry types */
-			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
-			WARN_ON_ONCE(1);
+			nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
+						 addr, details, rss);
 		}
-		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
-		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
 	add_mm_rss_vec(mm, rss);
-- 
2.20.1



* [PATCH v4 04/11] mm: introduce do_zap_pte_range()
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (2 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 03/11] mm: introduce zap_nonpresent_ptes() Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 05/11] mm: skip over all consecutive none ptes in do_zap_pte_range() Qi Zheng
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

This commit introduces do_zap_pte_range() to actually zap the PTEs, which
will help improve code readability and facilitate secondary checking of
the processed PTEs in the future.

No functional change.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Jann Horn <jannh@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 45 ++++++++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5624c22bb03cf..abe07e6bdd1bb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1657,6 +1657,27 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 	return nr;
 }
 
+static inline int do_zap_pte_range(struct mmu_gather *tlb,
+				   struct vm_area_struct *vma, pte_t *pte,
+				   unsigned long addr, unsigned long end,
+				   struct zap_details *details, int *rss,
+				   bool *force_flush, bool *force_break)
+{
+	pte_t ptent = ptep_get(pte);
+	int max_nr = (end - addr) / PAGE_SIZE;
+
+	if (pte_none(ptent))
+		return 1;
+
+	if (pte_present(ptent))
+		return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
+					addr, details, rss, force_flush,
+					force_break);
+
+	return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
+					 details, rss);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1679,28 +1700,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
-		pte_t ptent = ptep_get(pte);
-		int max_nr;
-
-		nr = 1;
-		if (pte_none(ptent))
-			continue;
-
 		if (need_resched())
 			break;
 
-		max_nr = (end - addr) / PAGE_SIZE;
-		if (pte_present(ptent)) {
-			nr = zap_present_ptes(tlb, vma, pte, ptent, max_nr,
-					      addr, details, rss, &force_flush,
-					      &force_break);
-			if (unlikely(force_break)) {
-				addr += nr * PAGE_SIZE;
-				break;
-			}
-		} else {
-			nr = zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr,
-						 addr, details, rss);
+		nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
+				      &force_flush, &force_break);
+		if (unlikely(force_break)) {
+			addr += nr * PAGE_SIZE;
+			break;
 		}
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
-- 
2.20.1



* [PATCH v4 05/11] mm: skip over all consecutive none ptes in do_zap_pte_range()
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (3 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 04/11] mm: introduce do_zap_pte_range() Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 06/11] mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed Qi Zheng
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

Skip over all consecutive none ptes in do_zap_pte_range(), which helps
optimize away need_resched() + force_break + incremental pte/addr
increments etc.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/memory.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index abe07e6bdd1bb..7f8869a22b57c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1665,17 +1665,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
 {
 	pte_t ptent = ptep_get(pte);
 	int max_nr = (end - addr) / PAGE_SIZE;
+	int nr = 0;
 
-	if (pte_none(ptent))
-		return 1;
+	/* Skip all consecutive none ptes */
+	if (pte_none(ptent)) {
+		for (nr = 1; nr < max_nr; nr++) {
+			ptent = ptep_get(pte + nr);
+			if (!pte_none(ptent))
+				break;
+		}
+		max_nr -= nr;
+		if (!max_nr)
+			return nr;
+		pte += nr;
+		addr += nr * PAGE_SIZE;
+	}
 
 	if (pte_present(ptent))
-		return zap_present_ptes(tlb, vma, pte, ptent, max_nr,
-					addr, details, rss, force_flush,
-					force_break);
+		nr += zap_present_ptes(tlb, vma, pte, ptent, max_nr, addr,
+				       details, rss, force_flush, force_break);
+	else
+		nr += zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
+					  details, rss);
 
-	return zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
-					 details, rss);
+	return nr;
 }
 
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
-- 
2.20.1



* [PATCH v4 06/11] mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (4 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 05/11] mm: skip over all consecutive none ptes in do_zap_pte_range() Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 07/11] mm: do_zap_pte_range: return any_skipped information to the caller Qi Zheng
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

In some cases, we'll replace the none pte with an uffd-wp swap special pte
marker when necessary. Let's expose this information to the caller through
the return value, so that subsequent commits can use this information to
detect whether the PTE page is empty.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm_inline.h | 11 +++++++----
 mm/memory.c               | 16 ++++++++++++----
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 1b6a917fffa4b..34e5097182a02 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -564,9 +564,9 @@ static inline pte_marker copy_pte_marker(
  * Must be called with pgtable lock held so that no thread will see the none
  * pte, and if they see it, they'll fault and serialize at the pgtable lock.
  *
- * This function is a no-op if PTE_MARKER_UFFD_WP is not enabled.
+ * Returns true if an uffd-wp pte was installed, false otherwise.
  */
-static inline void
+static inline bool
 pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 			      pte_t *pte, pte_t pteval)
 {
@@ -583,7 +583,7 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 	 * with a swap pte.  There's no way of leaking the bit.
 	 */
 	if (vma_is_anonymous(vma) || !userfaultfd_wp(vma))
-		return;
+		return false;
 
 	/* A uffd-wp wr-protected normal pte */
 	if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
@@ -596,10 +596,13 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 	if (unlikely(pte_swp_uffd_wp_any(pteval)))
 		arm_uffd_pte = true;
 
-	if (unlikely(arm_uffd_pte))
+	if (unlikely(arm_uffd_pte)) {
 		set_pte_at(vma->vm_mm, addr, pte,
 			   make_pte_marker(PTE_MARKER_UFFD_WP));
+		return true;
+	}
 #endif
+	return false;
 }
 
 static inline bool vma_has_recency(struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index 7f8869a22b57c..1f149bc2c0586 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1466,27 +1466,35 @@ static inline bool zap_drop_markers(struct zap_details *details)
 /*
  * This function makes sure that we'll replace the none pte with an uffd-wp
  * swap special pte marker when necessary. Must be with the pgtable lock held.
+ *
+ * Returns true if any uffd-wp pte was installed, false otherwise.
  */
-static inline void
+static inline bool
 zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *pte, int nr,
 			      struct zap_details *details, pte_t pteval)
 {
+	bool was_installed = false;
+
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
 	/* Zap on anonymous always means dropping everything */
 	if (vma_is_anonymous(vma))
-		return;
+		return false;
 
 	if (zap_drop_markers(details))
-		return;
+		return false;
 
 	for (;;) {
 		/* the PFN in the PTE is irrelevant. */
-		pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+		if (pte_install_uffd_wp_if_needed(vma, addr, pte, pteval))
+			was_installed = true;
 		if (--nr == 0)
 			break;
 		pte++;
 		addr += PAGE_SIZE;
 	}
+#endif
+	return was_installed;
 }
 
 static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
-- 
2.20.1



* [PATCH v4 07/11] mm: do_zap_pte_range: return any_skipped information to the caller
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (5 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 06/11] mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 08/11] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

Let the caller of do_zap_pte_range() know whether we skipped zapping ptes or
reinstalled uffd-wp ptes through the any_skipped parameter, so that subsequent
commits can use this information in zap_pte_range() to detect whether the
PTE page can be reclaimed.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/memory.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1f149bc2c0586..fdefa551d1250 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1501,7 +1501,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, struct folio *folio,
 		struct page *page, pte_t *pte, pte_t ptent, unsigned int nr,
 		unsigned long addr, struct zap_details *details, int *rss,
-		bool *force_flush, bool *force_break)
+		bool *force_flush, bool *force_break, bool *any_skipped)
 {
 	struct mm_struct *mm = tlb->mm;
 	bool delay_rmap = false;
@@ -1527,8 +1527,8 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 	arch_check_zapped_pte(vma, ptent);
 	tlb_remove_tlb_entries(tlb, pte, nr, addr);
 	if (unlikely(userfaultfd_pte_wp(vma, ptent)))
-		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details,
-					      ptent);
+		*any_skipped = zap_install_uffd_wp_if_needed(vma, addr, pte,
+							     nr, details, ptent);
 
 	if (!delay_rmap) {
 		folio_remove_rmap_ptes(folio, page, nr, vma);
@@ -1552,7 +1552,7 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
 		unsigned int max_nr, unsigned long addr,
 		struct zap_details *details, int *rss, bool *force_flush,
-		bool *force_break)
+		bool *force_break, bool *any_skipped)
 {
 	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
 	struct mm_struct *mm = tlb->mm;
@@ -1567,15 +1567,17 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 		arch_check_zapped_pte(vma, ptent);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 		if (userfaultfd_pte_wp(vma, ptent))
-			zap_install_uffd_wp_if_needed(vma, addr, pte, 1,
-						      details, ptent);
+			*any_skipped = zap_install_uffd_wp_if_needed(vma, addr,
+						pte, 1, details, ptent);
 		ksm_might_unmap_zero_page(mm, ptent);
 		return 1;
 	}
 
 	folio = page_folio(page);
-	if (unlikely(!should_zap_folio(details, folio)))
+	if (unlikely(!should_zap_folio(details, folio))) {
+		*any_skipped = true;
 		return 1;
+	}
 
 	/*
 	 * Make sure that the common "small folio" case is as fast as possible
@@ -1587,22 +1589,23 @@ static inline int zap_present_ptes(struct mmu_gather *tlb,
 
 		zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, nr,
 				       addr, details, rss, force_flush,
-				       force_break);
+				       force_break, any_skipped);
 		return nr;
 	}
 	zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent, 1, addr,
-			       details, rss, force_flush, force_break);
+			       details, rss, force_flush, force_break, any_skipped);
 	return 1;
 }
 
 static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pte_t *pte, pte_t ptent,
 		unsigned int max_nr, unsigned long addr,
-		struct zap_details *details, int *rss)
+		struct zap_details *details, int *rss, bool *any_skipped)
 {
 	swp_entry_t entry;
 	int nr = 1;
 
+	*any_skipped = true;
 	entry = pte_to_swp_entry(ptent);
 	if (is_device_private_entry(entry) ||
 		is_device_exclusive_entry(entry)) {
@@ -1660,7 +1663,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		WARN_ON_ONCE(1);
 	}
 	clear_not_present_full_ptes(vma->vm_mm, addr, pte, nr, tlb->fullmm);
-	zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
+	*any_skipped = zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
 
 	return nr;
 }
@@ -1669,7 +1672,8 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
 				   struct vm_area_struct *vma, pte_t *pte,
 				   unsigned long addr, unsigned long end,
 				   struct zap_details *details, int *rss,
-				   bool *force_flush, bool *force_break)
+				   bool *force_flush, bool *force_break,
+				   bool *any_skipped)
 {
 	pte_t ptent = ptep_get(pte);
 	int max_nr = (end - addr) / PAGE_SIZE;
@@ -1691,10 +1695,11 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
 
 	if (pte_present(ptent))
 		nr += zap_present_ptes(tlb, vma, pte, ptent, max_nr, addr,
-				       details, rss, force_flush, force_break);
+				       details, rss, force_flush, force_break,
+				       any_skipped);
 	else
 		nr += zap_nonpresent_ptes(tlb, vma, pte, ptent, max_nr, addr,
-					  details, rss);
+					  details, rss, any_skipped);
 
 	return nr;
 }
@@ -1705,6 +1710,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct zap_details *details)
 {
 	bool force_flush = false, force_break = false;
+	bool any_skipped = false;
 	struct mm_struct *mm = tlb->mm;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
@@ -1725,7 +1731,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			break;
 
 		nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
-				      &force_flush, &force_break);
+				      &force_flush, &force_break, &any_skipped);
 		if (unlikely(force_break)) {
 			addr += nr * PAGE_SIZE;
 			break;
-- 
2.20.1



* [PATCH v4 08/11] mm: make zap_pte_range() handle full within-PMD range
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (6 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 07/11] mm: do_zap_pte_range: return any_skipped information to the caller Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

In preparation for reclaiming empty PTE pages, this commit first makes
zap_pte_range() handle the full within-PMD range, so that we can more
easily detect and free PTE pages in this function in subsequent commits.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Jann Horn <jannh@google.com>
---
 mm/memory.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index fdefa551d1250..36a59bea289d1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1718,6 +1718,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *pte;
 	int nr;
 
+retry:
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	init_rss_vec(rss);
 	start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -1757,6 +1758,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	if (force_flush)
 		tlb_flush_mmu(tlb);
 
+	if (addr != end) {
+		cond_resched();
+		force_flush = false;
+		force_break = false;
+		goto retry;
+	}
+
 	return addr;
 }
 
-- 
2.20.1



* [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (7 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 08/11] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 22:36   ` Andrew Morton
  2024-12-06 11:23   ` [PATCH v4 09/11 fix] fix: " Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 10/11] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

Nowadays, in order to pursue high performance, applications mostly use
high-performance user-mode memory allocators such as jemalloc or tcmalloc.
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
physical memory, but neither MADV_DONTNEED nor MADV_FREE releases page table
memory, which may lead to huge page table memory usage.

The following is a memory usage snapshot of one process, which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.

As a first step, this commit aims to synchronously free the empty PTE
pages in the madvise(MADV_DONTNEED) case. We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).

Once an empty PTE page is detected, we first try to take the pmd lock while
still holding the pte lock. If successful, we clear the pmd entry directly
(fast path). Otherwise, we wait until the pte lock is released, then re-take
the pmd and pte locks and loop PTRS_PER_PTE times, checking pte_none(), to
re-detect whether the PTE page is empty, and free it if so (slow path).
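
In pseudo-C, the flow is roughly the following (a condensed sketch of what
mm/pt_reclaim.c below implements; see try_get_and_clear_pmd() and
try_to_free_pte() for the real code):

	/* Fast path: try to take the pmd lock while the pte lock is held. */
	if (spin_trylock(pmd_lockptr(mm, pmd))) {
		pmdval = pmdp_get_lockless(pmd);
		pmd_clear(pmd);
		spin_unlock(pmd_lockptr(mm, pmd));
		/* later: free_pte() -> pte_free_tlb() + mm_dec_nr_ptes() */
	} else {
		/*
		 * Slow path: after the pte lock is dropped, re-take the pmd
		 * and pte locks, recheck that all PTRS_PER_PTE entries are
		 * still pte_none(), and only then clear the pmd and free
		 * the PTE page (see try_to_free_pte()).
		 */
	}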

For other cases such as madvise(MADV_FREE), consider scanning and freeing
empty PTE pages asynchronously in the future.

The following code snippet shows the effect of the optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 KB                         240 KB

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h |  1 +
 mm/Kconfig         | 15 ++++++++++
 mm/Makefile        |  1 +
 mm/internal.h      | 19 +++++++++++++
 mm/madvise.c       |  7 ++++-
 mm/memory.c        | 21 ++++++++++++--
 mm/pt_reclaim.c    | 71 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 132 insertions(+), 3 deletions(-)
 create mode 100644 mm/pt_reclaim.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 12fb3b9334269..8f3c824ee5a77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
 struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 	bool even_cows;			/* Zap COWed private pages too? */
+	bool reclaim_pt;		/* Need reclaim page tables? */
 	zap_flags_t zap_flags;		/* Extra flags for zapping */
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b0168086..7949ab121070f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
           stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
 
+config ARCH_SUPPORTS_PT_RECLAIM
+	def_bool n
+
+config PT_RECLAIM
+	bool "reclaim empty user page table pages"
+	default y
+	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
+	select MMU_GATHER_RCU_TABLE_FREE
+	help
+	  Try to reclaim empty user page table pages in paths other than munmap
+	  and exit_mmap path.
+
+	  Note: now only empty user PTE page table pages will be reclaimed.
+
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index dba52bb0da8ab..850386a67b3e0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -146,3 +146,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
diff --git a/mm/internal.h b/mm/internal.h
index 74713b44bedb6..3958a965e56e1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1545,4 +1545,23 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
 
+/* pt_reclaim.c */
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval);
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb);
+
+#ifdef CONFIG_PT_RECLAIM
+bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+			   struct zap_details *details);
+#else
+static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+					 struct zap_details *details)
+{
+	return false;
+}
+#endif /* CONFIG_PT_RECLAIM */
+
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0ceae57da7dad..49f3a75046f63 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -851,7 +851,12 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range_single(vma, start, end - start, NULL);
+	struct zap_details details = {
+		.reclaim_pt = true,
+		.even_cows = true,
+	};
+
+	zap_page_range_single(vma, start, end - start, &details);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 36a59bea289d1..1fc1f14839916 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 static inline bool should_zap_cows(struct zap_details *details)
 {
 	/* By default, zap all pages */
-	if (!details)
+	if (!details || details->reclaim_pt)
 		return true;
 
 	/* Or, we zap COWed pages only if the caller wants to */
@@ -1710,12 +1710,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct zap_details *details)
 {
 	bool force_flush = false, force_break = false;
-	bool any_skipped = false;
 	struct mm_struct *mm = tlb->mm;
 	int rss[NR_MM_COUNTERS];
 	spinlock_t *ptl;
 	pte_t *start_pte;
 	pte_t *pte;
+	pmd_t pmdval;
+	unsigned long start = addr;
+	bool can_reclaim_pt = reclaim_pt_is_enabled(start, end, details);
+	bool direct_reclaim = false;
 	int nr;
 
 retry:
@@ -1728,17 +1731,24 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	do {
+		bool any_skipped = false;
+
 		if (need_resched())
 			break;
 
 		nr = do_zap_pte_range(tlb, vma, pte, addr, end, details, rss,
 				      &force_flush, &force_break, &any_skipped);
+		if (any_skipped)
+			can_reclaim_pt = false;
 		if (unlikely(force_break)) {
 			addr += nr * PAGE_SIZE;
 			break;
 		}
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
+	if (can_reclaim_pt && addr == end)
+		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
+
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 
@@ -1765,6 +1775,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto retry;
 	}
 
+	if (can_reclaim_pt) {
+		if (direct_reclaim)
+			free_pte(mm, start, tlb, pmdval);
+		else
+			try_to_free_pte(mm, pmd, start, tlb);
+	}
+
 	return addr;
 }
 
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
new file mode 100644
index 0000000000000..6540a3115dde8
--- /dev/null
+++ b/mm/pt_reclaim.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/hugetlb.h>
+#include <asm-generic/tlb.h>
+#include <asm/pgalloc.h>
+
+#include "internal.h"
+
+bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+			   struct zap_details *details)
+{
+	return details && details->reclaim_pt && (end - start >= PMD_SIZE);
+}
+
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+{
+	spinlock_t *pml = pmd_lockptr(mm, pmd);
+
+	if (!spin_trylock(pml))
+		return false;
+
+	*pmdval = pmdp_get_lockless(pmd);
+	pmd_clear(pmd);
+	spin_unlock(pml);
+
+	return true;
+}
+
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval)
+{
+	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+	mm_dec_nr_ptes(mm);
+}
+
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb)
+{
+	pmd_t pmdval;
+	spinlock_t *pml, *ptl;
+	pte_t *start_pte, *pte;
+	int i;
+
+	pml = pmd_lock(mm, pmd);
+	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
+	if (!start_pte)
+		goto out_ptl;
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	/* Check if it is empty PTE page */
+	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+
+	free_pte(mm, addr, tlb, pmdval);
+
+	return;
+out_ptl:
+	if (start_pte)
+		pte_unmap_unlock(start_pte, ptl);
+	if (ptl != pml)
+		spin_unlock(pml);
+}
-- 
2.20.1



* [PATCH v4 10/11] x86: mm: free page table pages by RCU instead of semi RCU
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (8 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-04 11:09 ` [PATCH v4 11/11] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

In this way, the page tables can be traversed locklessly by disabling IRQs in
paths such as fast GUP. But this is not enough to free the empty PTE page
table pages in paths other than the munmap and exit_mmap paths, because IPI
cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().

In preparation for supporting empty PTE page table page reclamation, let the
single table also be freed by RCU, like batch table freeing. Then we can also
use pte_offset_map() etc. to prevent the PTE page from being freed.

Like pte_free_defer(), we can also safely use ptdesc->pt_rcu_head to free
the page table pages:

 - The pt_rcu_head is unioned with pt_list and pmd_huge_pte.

 - For pt_list, it is used to manage the PGD page in x86. Fortunately
   tlb_remove_table() will not be used for freeing PGD pages, so it is safe
   to use pt_rcu_head.

 - For pmd_huge_pte, it is used for THPs, so it is safe.

After applying this patch, if CONFIG_PT_RECLAIM is enabled, the function
call of free_pte() is as follows:

free_pte
  pte_free_tlb
    __pte_free_tlb
      ___pte_free_tlb
        paravirt_tlb_remove_table
          tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
            [no-free-memory slowpath:]
              tlb_table_invalidate
              tlb_remove_table_one
                __tlb_remove_table_one [frees via RCU]
            [fastpath:]
              tlb_table_flush
                tlb_remove_table_free [frees via RCU]
          native_tlb_remove_table [CONFIG_PARAVIRT on native]
            tlb_remove_table [see above]

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/tlb.h | 20 ++++++++++++++++++++
 arch/x86/kernel/paravirt.c |  7 +++++++
 arch/x86/mm/pgtable.c      | 10 +++++++++-
 include/linux/mm_types.h   |  4 +++-
 mm/mmu_gather.c            |  9 ++++++++-
 5 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 4d3c9d00d6b6b..73f0786181cc9 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -34,8 +34,28 @@ static inline void __tlb_remove_table(void *table)
 	free_page_and_swap_cache(table);
 }
 
+#ifdef CONFIG_PT_RECLAIM
+static inline void __tlb_remove_table_one_rcu(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	put_page(page);
+}
+
+static inline void __tlb_remove_table_one(void *table)
+{
+	struct page *page;
+
+	page = table;
+	call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
+}
+#define __tlb_remove_table_one __tlb_remove_table_one
+#endif /* CONFIG_PT_RECLAIM */
+
 static inline void invlpg(unsigned long addr)
 {
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
+
 #endif /* _ASM_X86_TLB_H */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fec3815335558..89688921ea62e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -59,10 +59,17 @@ void __init native_pv_lock_init(void)
 		static_branch_enable(&virt_spin_lock_key);
 }
 
+#ifndef CONFIG_PT_RECLAIM
 static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	tlb_remove_page(tlb, table);
 }
+#else
+static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_table(tlb, table);
+}
+#endif
 
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 5745a354a241c..69a357b15974a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -19,12 +19,20 @@ EXPORT_SYMBOL(physical_mask);
 #endif
 
 #ifndef CONFIG_PARAVIRT
+#ifndef CONFIG_PT_RECLAIM
 static inline
 void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	tlb_remove_page(tlb, table);
 }
-#endif
+#else
+static inline
+void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+	tlb_remove_table(tlb, table);
+}
+#endif /* !CONFIG_PT_RECLAIM */
+#endif /* !CONFIG_PARAVIRT */
 
 gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3a35546bac944..706b3c926a089 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -438,7 +438,9 @@ FOLIO_MATCH(compound_head, _head_2a);
  * struct ptdesc -    Memory descriptor for page tables.
  * @__page_flags:     Same as page flags. Powerpc only.
  * @pt_rcu_head:      For freeing page table pages.
- * @pt_list:          List of used page tables. Used for s390 and x86.
+ * @pt_list:          List of used page tables. Used for s390 gmap shadow pages
+ *                    (which are not linked into the user page tables) and x86
+ *                    pgds.
  * @_pt_pad_1:        Padding that aliases with page's compound head.
  * @pmd_huge_pte:     Protected by ptdesc->ptl, used for THPs.
  * @__page_mapping:   Aliases with page->mapping. Unused for page tables.
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 99b3e9408aa0f..1e21022bcf339 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -311,11 +311,18 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 	}
 }
 
-static void tlb_remove_table_one(void *table)
+#ifndef __tlb_remove_table_one
+static inline void __tlb_remove_table_one(void *table)
 {
 	tlb_remove_table_sync_one();
 	__tlb_remove_table(table);
 }
+#endif
+
+static void tlb_remove_table_one(void *table)
+{
+	__tlb_remove_table_one(table);
+}
 
 static void tlb_table_flush(struct mmu_gather *tlb)
 {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v4 11/11] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (9 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 10/11] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
@ 2024-12-04 11:09 ` Qi Zheng
  2024-12-10  8:44   ` [PATCH v4 12/11] mm: pgtable: make ptlock be freed by RCU Qi Zheng
  2024-12-04 22:49 ` [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Andrew Morton
  2024-12-10  8:57 ` Qi Zheng
  12 siblings, 1 reply; 24+ messages in thread
From: Qi Zheng @ 2024-12-04 11:09 UTC (permalink / raw)
  To: david, jannh, hughd, willy, muchun.song, vbabka, peterx, akpm
  Cc: mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel,
	Qi Zheng

Now x86 fully supports the CONFIG_PT_RECLAIM feature. Since reclaiming
PTE pages is profitable only on 64-bit systems, select
ARCH_SUPPORTS_PT_RECLAIM if X86_64.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: x86@kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 65f8478fe7a96..77f001c6a5679 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -324,6 +324,7 @@ config X86
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
+	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
 
 config INSTRUCTION_DECODER
 	def_bool y
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-12-04 11:09 ` [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
@ 2024-12-04 22:36   ` Andrew Morton
  2024-12-04 22:47     ` Jann Horn
  2024-12-05  3:35     ` Qi Zheng
  2024-12-06 11:23   ` [PATCH v4 09/11 fix] fix: " Qi Zheng
  1 sibling, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2024-12-04 22:36 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, jannh, hughd, willy, muchun.song, vbabka, peterx, mgorman,
	catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel

On Wed,  4 Dec 2024 19:09:49 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:

> Now in order to pursue high performance, applications mostly use some
> high-performance user-mode memory allocators, such as jemalloc or
> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
> release page table memory, which may cause huge page table memory usage.
> 
> The following is a memory usage snapshot of one process which actually
> happened on our server:
> 
>         VIRT:  55t
>         RES:   590g
>         VmPTE: 110g
> 
> In this case, most of the page table entries are empty. For such a PTE
> page where all entries are empty, we can actually free it back to the
> system for others to use.
> 
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
> 
> Once an empty PTE is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).

"wait until the pte lock is released" sounds nasty.  I'm not
immediately seeing the code which does this.  Please provide more
description?

> For other cases such as madvise(MADV_FREE), consider scanning and freeing
> empty PTE pages asynchronously in the future.
> 
> The following code snippet can show the effect of optimization:
> 
>         mmap 50G
>         while (1) {
>                 for (; i < 1024 * 25; i++) {
>                         touch 2M memory
>                         madvise MADV_DONTNEED 2M
>                 }
>         }
> 
> As we can see, the memory usage of VmPTE is reduced:
> 
>                         before                          after
> VIRT                   50.0 GB                        50.0 GB
> RES                     3.1 MB                         3.1 MB
> VmPTE                102640 KB                         240 KB
> 
> ...
>
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>  	  The architecture has hardware support for userspace shadow call
>            stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>  
> +config ARCH_SUPPORTS_PT_RECLAIM
> +	def_bool n
> +
> +config PT_RECLAIM
> +	bool "reclaim empty user page table pages"
> +	default y
> +	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> +	select MMU_GATHER_RCU_TABLE_FREE
> +	help
> +	  Try to reclaim empty user page table pages in paths other than munmap
> +	  and exit_mmap path.
> +
> +	  Note: now only empty user PTE page table pages will be reclaimed.
> +

Why is this optional?  What is the case for permitting PT_RECLAIM to be
disabled?

>  source "mm/damon/Kconfig"
>  
>  endmenu
>
> ...
>
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +		     struct mmu_gather *tlb)
> +{
> +	pmd_t pmdval;
> +	spinlock_t *pml, *ptl;
> +	pte_t *start_pte, *pte;
> +	int i;
> +
> +	pml = pmd_lock(mm, pmd);
> +	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> +	if (!start_pte)
> +		goto out_ptl;
> +	if (ptl != pml)
> +		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +
> +	/* Check if it is empty PTE page */
> +	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> +		if (!pte_none(ptep_get(pte)))
> +			goto out_ptl;
> +	}

Are there any worst-case situations in which we'll spend unacceptable
amounts of time running this loop?

> +	pte_unmap(start_pte);
> +
> +	pmd_clear(pmd);
> +
> +	if (ptl != pml)
> +		spin_unlock(ptl);
> +	spin_unlock(pml);
> +
> +	free_pte(mm, addr, tlb, pmdval);
> +
> +	return;
> +out_ptl:
> +	if (start_pte)
> +		pte_unmap_unlock(start_pte, ptl);
> +	if (ptl != pml)
> +		spin_unlock(pml);
> +}
> -- 
> 2.20.1

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-12-04 22:36   ` Andrew Morton
@ 2024-12-04 22:47     ` Jann Horn
  2024-12-05  3:23       ` Qi Zheng
  2024-12-05  3:35     ` Qi Zheng
  1 sibling, 1 reply; 24+ messages in thread
From: Jann Horn @ 2024-12-04 22:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Qi Zheng, david, hughd, willy, muchun.song, vbabka, peterx,
	mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel

On Wed, Dec 4, 2024 at 11:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed,  4 Dec 2024 19:09:49 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> > As a first step, this commit aims to synchronously free the empty PTE
> > pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> > pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> > cases other than madvise(MADV_DONTNEED).
> >
> > Once an empty PTE is detected, we first try to hold the pmd lock within
> > the pte lock. If successful, we clear the pmd entry directly (fast path).
> > Otherwise, we wait until the pte lock is released, then re-hold the pmd
> > and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> > whether the PTE page is empty and free it (slow path).
>
> "wait until the pte lock is released" sounds nasty.  I'm not
> immediately seeing the code which does this.  Please provide more
> description?

It's worded a bit confusingly, but it's fine; a better description
might be "if try_get_and_clear_pmd() fails to trylock the PMD lock
(against lock order), then later, after we have dropped the PTE lock,
try_to_free_pte() takes the PMD and PTE locks in the proper lock
order".

The "wait until the pte lock is released" part is just supposed to
mean that the try_to_free_pte() call is placed after the point where
the PTE lock has been dropped (which makes it possible to take the PMD
lock). It does not refer to waiting for other threads.
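
To make that concrete, the fast path could be sketched roughly as below
(condensed and hypothetical: the real helper must also revalidate the pmd
before clearing it and handle the case where the PMD and PTE locks are the
same lock):

	static bool try_get_and_clear_pmd_sketch(struct mm_struct *mm,
						 pmd_t *pmd, pmd_t *pmdval)
	{
		spinlock_t *pml = pmd_lockptr(mm, pmd);

		/*
		 * The PTE lock is already held, so only trylock the PMD lock
		 * to avoid inverting the usual pmd -> pte lock order.
		 */
		if (!spin_trylock(pml))
			return false;	/* slow path: try_to_free_pte() later */

		*pmdval = pmdp_get_lockless(pmd);
		pmd_clear(pmd);
		spin_unlock(pml);
		return true;
	}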

> > +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> > +                  struct mmu_gather *tlb)
> > +{
> > +     pmd_t pmdval;
> > +     spinlock_t *pml, *ptl;
> > +     pte_t *start_pte, *pte;
> > +     int i;
> > +
> > +     pml = pmd_lock(mm, pmd);
> > +     start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> > +     if (!start_pte)
> > +             goto out_ptl;
> > +     if (ptl != pml)
> > +             spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > +
> > +     /* Check if it is empty PTE page */
> > +     for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> > +             if (!pte_none(ptep_get(pte)))
> > +                     goto out_ptl;
> > +     }
>
> Are there any worst-case situations in which we'll spend unacceptable
> amounts of time running this loop?

This loop is just over a single page table, that should be no more
expensive than what we already do in other common paths like
zap_pte_range().

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (10 preceding siblings ...)
  2024-12-04 11:09 ` [PATCH v4 11/11] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
@ 2024-12-04 22:49 ` Andrew Morton
  2024-12-04 22:56   ` Jann Horn
  2024-12-05  3:56   ` Qi Zheng
  2024-12-10  8:57 ` Qi Zheng
  12 siblings, 2 replies; 24+ messages in thread
From: Andrew Morton @ 2024-12-04 22:49 UTC (permalink / raw)
  To: Qi Zheng
  Cc: david, jannh, hughd, willy, muchun.song, vbabka, peterx, mgorman,
	catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel

On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:

> 
> ...
>
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclaimation in the case of madvise(MADV_DONTNEED) as the
> first step.

Please help us understand what the other steps are.  Because we don't
want to commit to a particular partial implementation only to later
discover that completing that implementation causes us problems.

> So this series aims to synchronously free the empty PTE pages in
> madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in
> zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than
> madvise(MADV_DONTNEED).
> 
> In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page
> freeing operations. Therefore, if we want to free the empty PTE page in this
> path, the most natural way is to add it to mmu_gather as well. Now, if
> CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
> pages by semi RCU:
> 
>  - batch table freeing: asynchronous free by RCU
>  - single table freeing: IPI + synchronous free
> 
> But this is not enough to free the empty PTE page table pages in paths other
> that munmap and exit_mmap path, because IPI cannot be synchronized with
> rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also
> be freed by RCU like batch table freeing.
> 
> As a first step, we supported this feature on x86_64 and selectd the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
> 
> For other cases such as madvise(MADV_FREE), consider scanning and freeing empty
> PTE pages asynchronously in the future.

Handling MADV_FREE sounds fairly straightforward?



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages
  2024-12-04 22:49 ` [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Andrew Morton
@ 2024-12-04 22:56   ` Jann Horn
  2024-12-05  3:59     ` Qi Zheng
  2024-12-05  3:56   ` Qi Zheng
  1 sibling, 1 reply; 24+ messages in thread
From: Jann Horn @ 2024-12-04 22:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Qi Zheng, david, hughd, willy, muchun.song, vbabka, peterx,
	mgorman, catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel

On Wed, Dec 4, 2024 at 11:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> > But this is not enough to free the empty PTE page table pages in paths other
> > that munmap and exit_mmap path, because IPI cannot be synchronized with
> > rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also
> > be freed by RCU like batch table freeing.
> >
> > As a first step, we supported this feature on x86_64 and selectd the newly
> > introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
> >
> > For other cases such as madvise(MADV_FREE), consider scanning and freeing empty
> > PTE pages asynchronously in the future.
>
> Handling MADV_FREE sounds fairly straightforward?

AFAIU MADV_FREE usually doesn't immediately clear PTEs (except if they
are swap/hwpoison/... PTEs). So the easy thing to do would be to check
whether the page table has become empty within madvise(), but I think
the most likely case would be that PTEs still remain (and will be
asynchronously zapped later when memory pressure causes reclaim, or
something like that).

So I don't see an easy path to doing it for MADV_FREE.
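
In userspace terms, the difference being described is roughly the following
(buf/len stand for any populated mapping):

	/* PTEs are zapped during the call, so an empty PTE page can be
	 * detected and freed right away:
	 */
	madvise(buf, len, MADV_DONTNEED);

	/* Folios are only marked lazy-free and PTEs usually stay present,
	 * so the page table is rarely empty at this point:
	 */
	madvise(buf, len, MADV_FREE);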

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-12-04 22:47     ` Jann Horn
@ 2024-12-05  3:23       ` Qi Zheng
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-05  3:23 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Qi Zheng, david, hughd, willy, muchun.song, vbabka,
	peterx, mgorman, catalin.marinas, will, dave.hansen, luto, peterz,
	x86, lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel



On 2024/12/5 06:47, Jann Horn wrote:
> On Wed, Dec 4, 2024 at 11:36 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Wed,  4 Dec 2024 19:09:49 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> As a first step, this commit aims to synchronously free the empty PTE
>>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
>>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
>>> cases other than madvise(MADV_DONTNEED).
>>>
>>> Once an empty PTE is detected, we first try to hold the pmd lock within
>>> the pte lock. If successful, we clear the pmd entry directly (fast path).
>>> Otherwise, we wait until the pte lock is released, then re-hold the pmd
>>> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
>>> whether the PTE page is empty and free it (slow path).
>>
>> "wait until the pte lock is released" sounds nasty.  I'm not
>> immediately seeing the code which does this.  Please provide more
>> description?
> 
> It's worded a bit confusingly, but it's fine; a better description
> might be "if try_get_and_clear_pmd() fails to trylock the PMD lock
> (against lock order), then later, after we have dropped the PTE lock,
> try_to_free_pte() takes the PMD and PTE locks in the proper lock
> order".
> 
> The "wait until the pte lock is released" part is just supposed to
> mean that the try_to_free_pte() call is placed after the point where
> the PTE lock has been dropped (which makes it possible to take the PMD
> lock). It does not refer to waiting for other threads.

Yes. Thanks for helping to clarify my vague statement!

> 
>>> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
>>> +                  struct mmu_gather *tlb)
>>> +{
>>> +     pmd_t pmdval;
>>> +     spinlock_t *pml, *ptl;
>>> +     pte_t *start_pte, *pte;
>>> +     int i;
>>> +
>>> +     pml = pmd_lock(mm, pmd);
>>> +     start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
>>> +     if (!start_pte)
>>> +             goto out_ptl;
>>> +     if (ptl != pml)
>>> +             spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>>> +
>>> +     /* Check if it is empty PTE page */
>>> +     for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
>>> +             if (!pte_none(ptep_get(pte)))
>>> +                     goto out_ptl;
>>> +     }
>>
>> Are there any worst-case situations in which we'll spend unacceptable
>> amounts of time running this loop?
> 
> This loop is just over a single page table, that should be no more
> expensive than what we already do in other common paths like
> zap_pte_range().

Agree.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-12-04 22:36   ` Andrew Morton
  2024-12-04 22:47     ` Jann Horn
@ 2024-12-05  3:35     ` Qi Zheng
  1 sibling, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-05  3:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: david, jannh, hughd, willy, muchun.song, vbabka, peterx, mgorman,
	catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel



On 2024/12/5 06:36, Andrew Morton wrote:
> On Wed,  4 Dec 2024 19:09:49 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 

[...]

>>
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
>>   	  The architecture has hardware support for userspace shadow call
>>             stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>   
>> +config ARCH_SUPPORTS_PT_RECLAIM
>> +	def_bool n
>> +
>> +config PT_RECLAIM
>> +	bool "reclaim empty user page table pages"
>> +	default y
>> +	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>> +	select MMU_GATHER_RCU_TABLE_FREE
>> +	help
>> +	  Try to reclaim empty user page table pages in paths other than munmap
>> +	  and exit_mmap path.
>> +
>> +	  Note: now only empty user PTE page table pages will be reclaimed.
>> +
> 
> Why is this optional?  What is the case for permitting PT_RECLAIM to be
> disabled?
> 

To reclaim empty PTE pages, we need to free the PTE pages through RCU,
which requires modifying each architecture's implementation. Making it
an option makes it easier to add support architecture by architecture.
For now, we have only added support for x86.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages
  2024-12-04 22:49 ` [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Andrew Morton
  2024-12-04 22:56   ` Jann Horn
@ 2024-12-05  3:56   ` Qi Zheng
  1 sibling, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-05  3:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: david, jannh, hughd, willy, muchun.song, vbabka, peterx, mgorman,
	catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel



On 2024/12/5 06:49, Andrew Morton wrote:
> On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
>>
>> ...
>>
>> Previously, we tried to use a completely asynchronous method to reclaim empty
>> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
>> implement synchronous reclaimation in the case of madvise(MADV_DONTNEED) as the
>> first step.
> 
> Please help us understand what the other steps are.  Because we don't
> want to commit to a particular partial implementation only to later
> discover that completing that implementation causes us problems.

Although it is the first step, it is relatively independent because it
solves the problem (huge PTE memory usage) in the case of
madvise(MADV_DONTNEED), while the other steps are meant to solve the
problem in other cases.

I can briefly describe all the plans in my mind here:

First step
==========

I plan to implement synchronous empty user PTE pages reclamation in
madvise(MADV_DONTNEED) case for the following reasons:

1. It covers most of the known cases. (On ByteDance's servers, all of
    the huge PTE memory usage problems fall into this case.)
2. It helps verify the lock protection scheme and other infrastructure.

This is what this patch does (only x86 is supported for now). Once this
is done, support for more architectures will be added.

Second step
===========

I plan to implement asynchronous reclamation for madvise(MADV_FREE) and
other cases. The initial idea is to mark the vma first, then add the
corresponding mm to a global linked list, and then perform asynchronous
scanning and reclamation during the memory reclaim process.

Third step
==========

Based on the above infrastructure, we may try to reclaim all full-zero
PTE pages (where all pte entries map the zero page), which will be
beneficial for the memory balloon case mentioned by David Hildenbrand.
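
A hedged sketch of the detection step for that case, reusing the per-PTE
scan pattern already used for empty pages (names and placement here are
assumptions, not settled code):

	/* Would this PTE page become freeable if zero-page mappings count? */
	static bool pte_page_is_all_zero(pte_t *start_pte)
	{
		pte_t *pte = start_pte;
		int i;

		for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
			pte_t entry = ptep_get(pte);

			if (pte_none(entry))
				continue;	/* empty slots are fine too */
			if (!pte_present(entry) || !is_zero_pfn(pte_pfn(entry)))
				return false;	/* maps something real */
		}
		return true;
	}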

Another plan
============

Currently, page table modifications are protected by page table locks
(page_table_lock or the split pmd/pte locks), but the life cycle of page
table pages is protected by the mmap_lock (and the vma lock). For more
details, please refer to the recently added
Documentation/mm/process_addrs.rst file.

Currently we free the PTE pages through RCU when CONFIG_PT_RECLAIM is
turned on. In this case, we no longer need to hold the mmap_lock for
read/write operations on the PTE pages.

So maybe we can remove the page tables from the protection of the mmap
lock (which is too coarse-grained), like this:

1. free all levels of page table pages by RCU, not just PTE pages, but
    also pmd, pud, etc.
2. similar to pte_offset_map/pte_unmap, add
    [pmd|pud]_offset_map/[pmd|pud]_unmap, and make them all contain
    rcu_read_lock/rcu_read_unlock, and make them accept failure.

In this way, we no longer need the mmap lock. For readers, such as page
table walkers, we are already inside an RCU critical section. For
writers, we only need to hold the page table lock.

But there is a difficulty here: sleeping is not allowed inside an RCU
critical section, yet the .pmd_entry callback may sleep, for example in
mmu_notifier_invalidate_range_start().

Use SRCU instead? Or use RCU + refcount method? Not sure. But I think
it's an interesting thing to try.
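
Purely for illustration, the hypothetical walker pattern described above
might end up looking like the following (the [pmd|pud]_offset_map and
[pmd|pud]_unmap helpers do not exist today):

	pmd = pmd_offset_map(pud, addr);	/* enters RCU read side, may fail */
	if (!pmd)
		goto retry;
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte) {
		pmd_unmap(pmd);
		goto retry;
	}
	/* ... modify PTEs under ptl, no mmap_lock needed ... */
	pte_unmap_unlock(pte, ptl);
	pmd_unmap(pmd);				/* leaves RCU read side */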

Thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages
  2024-12-04 22:56   ` Jann Horn
@ 2024-12-05  3:59     ` Qi Zheng
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-05  3:59 UTC (permalink / raw)
  To: Jann Horn, Andrew Morton
  Cc: david, hughd, willy, muchun.song, vbabka, peterx, mgorman,
	catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel



On 2024/12/5 06:56, Jann Horn wrote:
> On Wed, Dec 4, 2024 at 11:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> But this is not enough to free the empty PTE page table pages in paths other
>>> that munmap and exit_mmap path, because IPI cannot be synchronized with
>>> rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also
>>> be freed by RCU like batch table freeing.
>>>
>>> As a first step, we supported this feature on x86_64 and selectd the newly
>>> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>>>
>>> For other cases such as madvise(MADV_FREE), consider scanning and freeing empty
>>> PTE pages asynchronously in the future.
>>
>> Handling MADV_FREE sounds fairly straightforward?
> 
> AFAIU MADV_FREE usually doesn't immediately clear PTEs (except if they
> are swap/hwpoison/... PTEs). So the easy thing to do would be to check
> whether the page table has become empty within madvise(), but I think
> the most likely case would be that PTEs still remain (and will be
> asynchronously zapped later when memory pressure causes reclaim, or
> something like that).
> 
> So I don't see an easy path to doing it for MADV_FREE.

+1. Thanks for helping explain!


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v4 09/11 fix] fix: mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
  2024-12-04 11:09 ` [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
  2024-12-04 22:36   ` Andrew Morton
@ 2024-12-06 11:23   ` Qi Zheng
  1 sibling, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-06 11:23 UTC (permalink / raw)
  To: akpm, dan.carpenter; +Cc: linux-mm, linux-kernel, Qi Zheng

Dan Carpenter reported the following warning:

Commit e3aafd2d3551 ("mm: pgtable: reclaim empty PTE page in
madvise(MADV_DONTNEED)") from Dec 4, 2024 (linux-next), leads to the
following Smatch static checker warning:

	mm/pt_reclaim.c:69 try_to_free_pte()
	error: uninitialized symbol 'ptl'.

To fix it, initialize ptl to NULL.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/pt_reclaim.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
index 6540a3115dde8..7e9455a18aae7 100644
--- a/mm/pt_reclaim.c
+++ b/mm/pt_reclaim.c
@@ -36,7 +36,7 @@ void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
 		     struct mmu_gather *tlb)
 {
 	pmd_t pmdval;
-	spinlock_t *pml, *ptl;
+	spinlock_t *pml, *ptl = NULL;
 	pte_t *start_pte, *pte;
 	int i;
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v4 02/11 fix] fix: mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
  2024-12-04 11:09 ` [PATCH v4 02/11] mm: userfaultfd: recheck dst_pmd entry in move_pages_pte() Qi Zheng
@ 2024-12-10  8:41   ` Qi Zheng
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-10  8:41 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, Qi Zheng

The following WARN_ON_ONCE()s can also be expected to be triggered, so
remove them as well.

if (WARN_ON_ONCE(pmd_none(*dst_pmd)) ||  WARN_ON_ONCE(pmd_none(*src_pmd)) ||
    WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) || WARN_ON_ONCE(pmd_trans_huge(*src_pmd))

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/userfaultfd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc9a66ec6a6e4..4527c385935be 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1185,8 +1185,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	}
 
 	/* Sanity checks before the operation */
-	if (WARN_ON_ONCE(pmd_none(*dst_pmd)) ||	WARN_ON_ONCE(pmd_none(*src_pmd)) ||
-	    WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) || WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) {
+	if (pmd_none(*dst_pmd) || pmd_none(*src_pmd) ||
+	    pmd_trans_huge(*dst_pmd) || pmd_trans_huge(*src_pmd)) {
 		err = -EINVAL;
 		goto out;
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v4 12/11] mm: pgtable: make ptlock be freed by RCU
  2024-12-04 11:09 ` [PATCH v4 11/11] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
@ 2024-12-10  8:44   ` Qi Zheng
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-10  8:44 UTC (permalink / raw)
  To: akpm, david, jannh, hughd, muchun.song
  Cc: linux-mm, linux-kernel, Qi Zheng, syzbot+1c58afed1cfd2f57efee

If ALLOC_SPLIT_PTLOCKS is enabled, ptdesc->ptl is a pointer and a ptlock
is allocated for it; that ptlock is freed immediately before the PTE page
is freed. Once we support empty PTE page reclamation, this may result in
the following use-after-free problem:

	CPU 0				CPU 1

					pte_offset_map_rw_nolock(&ptlock)
					--> rcu_read_lock()
	madvise(MADV_DONTNEED)
	--> ptlock_free (free ptlock immediately!)
	    free PTE page via RCU
					/* UAF!! */
					spin_lock(ptlock)

To avoid this problem, make ptlock also be freed by RCU.

Reported-by: syzbot+1c58afed1cfd2f57efee@syzkaller.appspotmail.com
Tested-by: syzbot+1c58afed1cfd2f57efee@syzkaller.appspotmail.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h       |  2 +-
 include/linux/mm_types.h |  9 ++++++++-
 mm/memory.c              | 22 ++++++++++++++++------
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2d38c5867b32..e836ef6291265 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2988,7 +2988,7 @@ void ptlock_free(struct ptdesc *ptdesc);
 
 static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
-	return ptdesc->ptl;
+	return &(ptdesc->ptl->ptl);
 }
 #else /* ALLOC_SPLIT_PTLOCKS */
 static inline void ptlock_cache_init(void)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5d8779997266e..df8f5152644ec 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -434,6 +434,13 @@ FOLIO_MATCH(flags, _flags_2a);
 FOLIO_MATCH(compound_head, _head_2a);
 #undef FOLIO_MATCH
 
+#if ALLOC_SPLIT_PTLOCKS
+struct pt_lock {
+	spinlock_t ptl;
+	struct rcu_head rcu;
+};
+#endif
+
 /**
  * struct ptdesc -    Memory descriptor for page tables.
  * @__page_flags:     Same as page flags. Powerpc only.
@@ -478,7 +485,7 @@ struct ptdesc {
 	union {
 		unsigned long _pt_pad_2;
 #if ALLOC_SPLIT_PTLOCKS
-		spinlock_t *ptl;
+		struct pt_lock *ptl;
 #else
 		spinlock_t ptl;
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index 91900a1479322..b5babc4bc36bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7044,24 +7044,34 @@ static struct kmem_cache *page_ptl_cachep;
 
 void __init ptlock_cache_init(void)
 {
-	page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(spinlock_t), 0,
+	page_ptl_cachep = kmem_cache_create("page->ptl", sizeof(struct pt_lock), 0,
 			SLAB_PANIC, NULL);
 }
 
 bool ptlock_alloc(struct ptdesc *ptdesc)
 {
-	spinlock_t *ptl;
+	struct pt_lock *pt_lock;
 
-	ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
-	if (!ptl)
+	pt_lock = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
+	if (!pt_lock)
 		return false;
-	ptdesc->ptl = ptl;
+	ptdesc->ptl = pt_lock;
 	return true;
 }
 
+static void ptlock_free_rcu(struct rcu_head *head)
+{
+	struct pt_lock *pt_lock;
+
+	pt_lock = container_of(head, struct pt_lock, rcu);
+	kmem_cache_free(page_ptl_cachep, pt_lock);
+}
+
 void ptlock_free(struct ptdesc *ptdesc)
 {
-	kmem_cache_free(page_ptl_cachep, ptdesc->ptl);
+	struct pt_lock *pt_lock = ptdesc->ptl;
+
+	call_rcu(&pt_lock->rcu, ptlock_free_rcu);
 }
 #endif
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages
  2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
                   ` (11 preceding siblings ...)
  2024-12-04 22:49 ` [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Andrew Morton
@ 2024-12-10  8:57 ` Qi Zheng
  12 siblings, 0 replies; 24+ messages in thread
From: Qi Zheng @ 2024-12-10  8:57 UTC (permalink / raw)
  To: akpm
  Cc: david, jannh, hughd, willy, muchun.song, vbabka, peterx, mgorman,
	catalin.marinas, will, dave.hansen, luto, peterz, x86,
	lorenzo.stoakes, zokeefe, rientjes, linux-mm, linux-kernel

Hi Andrew,

I have sent patches [1][2][3] to fix recently reported issues:

[1]. 
https://lore.kernel.org/lkml/20241210084156.89877-1-zhengqi.arch@bytedance.com/
(Fix warning, need to be folded into [PATCH v4 02/11])

[2]. 
https://lore.kernel.org/lkml/20241206112348.51570-1-zhengqi.arch@bytedance.com/
(Fix uninitialized symbol, need to be folded into [PATCH v4 09/11])

[3]. 
https://lore.kernel.org/lkml/20241210084431.91414-1-zhengqi.arch@bytedance.com/
(fix UAF, need to be placed before [PATCH v4 11/11])

If you need me to re-post a complete v5, please let me know.

Thanks,
Qi


On 2024/12/4 19:09, Qi Zheng wrote:
> Changes in v4:
>   - update the process_addrs.rst in [PATCH v4 01/11]
>     (suggested by Lorenzo Stoakes)
>   - fix [PATCH v3 4/9] and move it after [PATCH v3 5/9]
>     (pointed by David Hildenbrand)
>   - change to use any_skipped instead of rechecking pte_none() to detect empty
>     user PTE pages (suggested by David Hildenbrand)
>   - rebase onto the next-20241203
> 
> Changes in v3:
>   - recheck pmd state instead of pmd_same() in retract_page_tables()
>     (suggested by Jann Horn)
>   - recheck dst_pmd entry in move_pages_pte() (pointed by Jann Horn)
>   - introduce new skip_none_ptes() (suggested by David Hildenbrand)
>   - minor changes in [PATCH v2 5/7]
>   - remove tlb_remove_table_sync_one() if CONFIG_PT_RECLAIM is enabled.
>   - use put_page() instead of free_page_and_swap_cache() in
>     __tlb_remove_table_one_rcu() (pointed by Jann Horn)
>   - collect the Reviewed-bys and Acked-bys
>   - rebase onto the next-20241112
> 
> Changes in v2:
>   - fix [PATCH v1 1/7] (Jann Horn)
>   - reset force_flush and force_break to false in [PATCH v1 2/7] (Jann Horn)
>   - introduce zap_nonpresent_ptes() and do_zap_pte_range()
>   - check pte_none() instead of can_reclaim_pt after the processing of PTEs
>     (remove [PATCH v1 3/7] and [PATCH v1 4/7])
>   - reorder patches
>   - rebase onto the next-20241031
> 
> Changes in v1:
>   - replace [RFC PATCH 1/7] with a separate serise (already merge into mm-unstable):
>     https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
>     (suggested by David Hildenbrand)
>   - squash [RFC PATCH 2/7] into [RFC PATCH 4/7]
>     (suggested by David Hildenbrand)
>   - change to scan and reclaim empty user PTE pages in zap_pte_range()
>     (suggested by David Hildenbrand)
>   - sent a separate RFC patch to track the tlb flushing issue, and remove
>     that part form this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]).
>     link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
>   - add [PATCH v1 1/7] into this series
>   - drop RFC tag
>   - rebase onto the next-20241011
> 
> Changes in RFC v2:
>   - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reproted by
>     kernel test robot
>   - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
>     in retract_page_tables() (in [RFC PATCH 4/7])
>   - rebase onto the next-20240805
> 
> Hi all,
> 
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclaimation in the case of madvise(MADV_DONTNEED) as the
> first step.
> 
> So this series aims to synchronously free the empty PTE pages in
> madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in
> zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than
> madvise(MADV_DONTNEED).
> 
> In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page
> freeing operations. Therefore, if we want to free the empty PTE page in this
> path, the most natural way is to add it to mmu_gather as well. Now, if
> CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
> pages by semi RCU:
> 
>   - batch table freeing: asynchronous free by RCU
>   - single table freeing: IPI + synchronous free
> 
> But this is not enough to free the empty PTE page table pages in paths other
> that munmap and exit_mmap path, because IPI cannot be synchronized with
> rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also
> be freed by RCU like batch table freeing.
> 
> As a first step, we supported this feature on x86_64 and selectd the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
> 
> For other cases such as madvise(MADV_FREE), consider scanning and freeing empty
> PTE pages asynchronously in the future.
> 
> This series is based on next-20241112 (which contains the series [2]).
> 
> Note: issues related to TLB flushing are not new to this series and are tracked
>        in the separate RFC patch [3]. And more context please refer to this
>        thread [4].
> 
> Comments and suggestions are welcome!
> 
> Thanks,
> Qi
> 
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
> [2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
> [3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
> [4]. https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/
> 
> Qi Zheng (11):
>    mm: khugepaged: recheck pmd state in retract_page_tables()
>    mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
>    mm: introduce zap_nonpresent_ptes()
>    mm: introduce do_zap_pte_range()
>    mm: skip over all consecutive none ptes in do_zap_pte_range()
>    mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been
>      re-installed
>    mm: do_zap_pte_range: return any_skipped information to the caller
>    mm: make zap_pte_range() handle full within-PMD range
>    mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
>    x86: mm: free page table pages by RCU instead of semi RCU
>    x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
> 
>   Documentation/mm/process_addrs.rst |   4 +
>   arch/x86/Kconfig                   |   1 +
>   arch/x86/include/asm/tlb.h         |  20 +++
>   arch/x86/kernel/paravirt.c         |   7 +
>   arch/x86/mm/pgtable.c              |  10 +-
>   include/linux/mm.h                 |   1 +
>   include/linux/mm_inline.h          |  11 +-
>   include/linux/mm_types.h           |   4 +-
>   mm/Kconfig                         |  15 ++
>   mm/Makefile                        |   1 +
>   mm/internal.h                      |  19 +++
>   mm/khugepaged.c                    |  45 +++--
>   mm/madvise.c                       |   7 +-
>   mm/memory.c                        | 253 ++++++++++++++++++-----------
>   mm/mmu_gather.c                    |   9 +-
>   mm/pt_reclaim.c                    |  71 ++++++++
>   mm/userfaultfd.c                   |  51 ++++--
>   17 files changed, 397 insertions(+), 132 deletions(-)
>   create mode 100644 mm/pt_reclaim.c
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-12-10  8:57 UTC | newest]

Thread overview: 24+ messages (links below jump to the message on this page)
2024-12-04 11:09 [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Qi Zheng
2024-12-04 11:09 ` [PATCH v4 01/11] mm: khugepaged: recheck pmd state in retract_page_tables() Qi Zheng
2024-12-04 11:09 ` [PATCH v4 02/11] mm: userfaultfd: recheck dst_pmd entry in move_pages_pte() Qi Zheng
2024-12-10  8:41   ` [PATCH v4 02/11 fix] fix: " Qi Zheng
2024-12-04 11:09 ` [PATCH v4 03/11] mm: introduce zap_nonpresent_ptes() Qi Zheng
2024-12-04 11:09 ` [PATCH v4 04/11] mm: introduce do_zap_pte_range() Qi Zheng
2024-12-04 11:09 ` [PATCH v4 05/11] mm: skip over all consecutive none ptes in do_zap_pte_range() Qi Zheng
2024-12-04 11:09 ` [PATCH v4 06/11] mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been re-installed Qi Zheng
2024-12-04 11:09 ` [PATCH v4 07/11] mm: do_zap_pte_range: return any_skipped information to the caller Qi Zheng
2024-12-04 11:09 ` [PATCH v4 08/11] mm: make zap_pte_range() handle full within-PMD range Qi Zheng
2024-12-04 11:09 ` [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED) Qi Zheng
2024-12-04 22:36   ` Andrew Morton
2024-12-04 22:47     ` Jann Horn
2024-12-05  3:23       ` Qi Zheng
2024-12-05  3:35     ` Qi Zheng
2024-12-06 11:23   ` [PATCH v4 09/11 fix] fix: " Qi Zheng
2024-12-04 11:09 ` [PATCH v4 10/11] x86: mm: free page table pages by RCU instead of semi RCU Qi Zheng
2024-12-04 11:09 ` [PATCH v4 11/11] x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 Qi Zheng
2024-12-10  8:44   ` [PATCH v4 12/11] mm: pgtable: make ptlock be freed by RCU Qi Zheng
2024-12-04 22:49 ` [PATCH v4 00/11] synchronously scan and reclaim empty user PTE pages Andrew Morton
2024-12-04 22:56   ` Jann Horn
2024-12-05  3:59     ` Qi Zheng
2024-12-05  3:56   ` Qi Zheng
2024-12-10  8:57 ` Qi Zheng
