[RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page

linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Qi Zheng <zhengqi.arch@bytedance.com>
To: akpm@linux-foundation.org, tglx@linutronix.de,
	kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
	david@redhat.com, jgg@nvidia.com, tj@kernel.org,
	dennis@kernel.org, ming.lei@redhat.com
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, songmuchun@bytedance.com,
	zhouchengming@bytedance.com,
	Qi Zheng <zhengqi.arch@bytedance.com>
Subject: [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page
Date: Fri, 29 Apr 2022 21:35:42 +0800	[thread overview]
Message-ID: <20220429133552.33768-9-zhengqi.arch@bytedance.com> (raw)
In-Reply-To: <20220429133552.33768-1-zhengqi.arch@bytedance.com>

Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
physical memory for the following reasons::

 First of all, we should hold as few write locks of mmap_lock as possible,
 since the mmap_lock semaphore has long been a contention point in the
 memory management subsystem. The mmap()/munmap() hold the write lock, and
 the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
 madvise() instead of munmap() to released physical memory can reduce the
 competition of the mmap_lock.

 Secondly, after using madvise() to release physical memory, there is no
 need to build vma and allocate page tables again when accessing the same
 virtual address again, which can also save some time.

The following is the largest user PTE page table memory that can be
allocated by a single user process in a 32-bit and a 64-bit system.

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(for 32-bit, take 3G user address space, 4K page size as an example;
 for 64-bit, take 48-bit address width, 4K page size as an example.)

After using madvise(), everything looks good, but as can be seen from the
above table, a single process can create a large number of PTE page tables
on a 64-bit system, since both of the MADV_DONTNEED and MADV_FREE will not
release page table memory. And before the process exits or calls munmap(),
the kernel cannot reclaim these pages even if these PTE page tables do not
map anything.

To fix the situation, this patchset introduces a percpu_ref for each user
PTE page table page. The following people will hold a percpu_ref::

 The !pte_none() entry, such as regular page table entry that map physical
 pages, or swap entry, or migrate entry, etc.

 Visitor to the PTE page table entries, such as page table walker.

Any ``!pte_none()`` entry and visitor can be regarded as the user of its
PTE page table page. When the percpu_ref is reduced to 0 (need to switch
to atomic mode first to check), it means that no one is using the PTE page
table page, then this free PTE page table page can be reclaimed at this
time.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h       |  9 +++++++-
 include/linux/mm_types.h |  1 +
 include/linux/pte_ref.h  | 29 +++++++++++++++++++++++++
 mm/Makefile              |  2 +-
 mm/pte_ref.c             | 47 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 86 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0afd3b097e90..1a6bc79c351b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/pgtable.h>
 #include <linux/kasan.h>
+#include <linux/pte_ref.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2260,11 +2261,16 @@ static inline void pgtable_init(void)
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
 {
-	if (!ptlock_init(page))
+	if (!pte_ref_init(page))
 		return false;
+	if (!ptlock_init(page))
+		goto free_pte_ref;
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
 	return true;
+free_pte_ref:
+	pte_ref_free(page);
+	return false;
 }
 
 static inline void pgtable_pte_page_dtor(struct page *page)
@@ -2272,6 +2278,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	ptlock_free(page);
 	__ClearPageTable(page);
 	dec_lruvec_page_state(page, NR_PAGETABLE);
+	pte_ref_free(page);
 }
 
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..650bfb22b0e2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -153,6 +153,7 @@ struct page {
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
 				atomic_t pt_frag_refcount; /* powerpc */
+				struct percpu_ref *pte_ref; /* PTE page only */
 			};
 #if ALLOC_SPLIT_PTLOCKS
 			spinlock_t *ptl;
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..d3963a151ca5
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#ifdef CONFIG_FREE_USER_PTE
+
+bool pte_ref_init(pgtable_t pte);
+void pte_ref_free(pgtable_t pte);
+
+#else /* !CONFIG_FREE_USER_PTE */
+
+static inline bool pte_ref_init(pgtable_t pte)
+{
+	return true;
+}
+
+static inline void pte_ref_free(pgtable_t pte)
+{
+}
+
+#endif /* CONFIG_FREE_USER_PTE */
+
+#endif /* _LINUX_PTE_REF_H */
diff --git a/mm/Makefile b/mm/Makefile
index 4cc13f3179a5..b9711510f84f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -54,7 +54,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o $(mmu-y) pte_ref.o
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..52e31be00de4
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+#include <linux/pgtable.h>
+#include <linux/pte_ref.h>
+#include <linux/percpu-refcount.h>
+#include <linux/slab.h>
+
+#ifdef CONFIG_FREE_USER_PTE
+
+static void no_op(struct percpu_ref *r) {}
+
+bool pte_ref_init(pgtable_t pte)
+{
+	struct percpu_ref *pte_ref;
+
+	pte_ref = kmalloc(sizeof(struct percpu_ref), GFP_KERNEL);
+	if (!pte_ref)
+		return false;
+	if (percpu_ref_init(pte_ref, no_op,
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL) < 0)
+		goto free_ref;
+	/* We want to start with the refcount at zero */
+	percpu_ref_put(pte_ref);
+
+	pte->pte_ref = pte_ref;
+	return true;
+free_ref:
+	kfree(pte_ref);
+	return false;
+}
+
+void pte_ref_free(pgtable_t pte)
+{
+	struct percpu_ref *ref = pte->pte_ref;
+	if (!ref)
+		return;
+
+	pte->pte_ref = NULL;
+	percpu_ref_exit(ref);
+	kfree(ref);
+}
+
+#endif /* CONFIG_FREE_USER_PTE */
-- 
2.20.1

next prev parent reply	other threads:[~2022-04-29 13:37 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 07/18] mm: add pte_to_page() helper Qi Zheng
2022-04-29 13:35 ` Qi Zheng [this message]
2022-04-29 13:35 ` [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 12/18] mm: convert to use pte_tryget_map() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Qi Zheng
2022-04-30 13:35   ` Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 18/18] Documentation: add document " Qi Zheng
2022-04-30 13:19   ` Bagas Sanjaya
2022-04-30 13:32     ` Qi Zheng
2022-05-17  8:30 ` [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
2022-05-18 14:51   ` David Hildenbrand
2022-05-18 14:56     ` Matthew Wilcox
2022-05-19  4:03       ` Qi Zheng
2022-05-19  3:58     ` Qi Zheng

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:0afd3b097e9 dfblob:1a6bc79c351 dfblob:8834e38c06a
dfblob:650bfb22b0e dfblob:d3963a151ca dfblob:4cc13f3179a
dfblob:b9711510f84 dfblob:52e31be00de )
 OR (
bs:"[RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220429133552.33768-9-zhengqi.arch@bytedance.com \
    --to=zhengqi.arch@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=dennis@kernel.org \
    --cc=jgg@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mika.penttila@nextfour.com \
    --cc=ming.lei@redhat.com \
    --cc=songmuchun@bytedance.com \
    --cc=tglx@linutronix.de \
    --cc=tj@kernel.org \
    --cc=zhouchengming@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).