From: Usama Arif <usamaarif642@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
roman.gushchin@linux.dev, yuzhao@google.com, david@redhat.com,
baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org,
willy@infradead.org, cerasuolodomenico@gmail.com,
ryncsn@gmail.com, corbet@lwn.net, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, kernel-team@meta.com,
Usama Arif <usamaarif642@gmail.com>
Subject: [PATCH v4 5/6] mm: split underused THPs
Date: Mon, 19 Aug 2024 03:30:58 +0100 [thread overview]
Message-ID: <20240819023145.2415299-6-usamaarif642@gmail.com> (raw)
In-Reply-To: <20240819023145.2415299-1-usamaarif642@gmail.com>
This is an attempt to mitigate the issue of running out of memory when
THP is always enabled. During runtime whenever a THP is being faulted in
(__do_huge_pmd_anonymous_page) or collapsed by khugepaged
(collapse_huge_page), the THP is added to _deferred_list. Whenever
memory reclaim happens in linux, the kernel runs the deferred_split
shrinker which goes through the _deferred_list.
If the folio was partially mapped, the shrinker attempts to split it.
If the folio is not partially mapped, the shrinker checks if the THP
was underused, i.e. how many of the base 4K pages of the entire THP
were zero-filled. If this number goes above a certain threshold (decided by
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the
shrinker will attempt to split that THP. Then at remap time, the pages
that were zero-filled are mapped to the shared zeropage, hence saving
memory.
Suggested-by: Rik van Riel <riel@surriel.com>
Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
Documentation/admin-guide/mm/transhuge.rst | 6 +++
include/linux/khugepaged.h | 1 +
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 60 +++++++++++++++++++++-
mm/khugepaged.c | 3 +-
mm/vmstat.c | 1 +
6 files changed, 69 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..40741b892aff 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -447,6 +447,12 @@ thp_deferred_split_page
splitting it would free up some memory. Pages on split queue are
going to be split under memory pressure.
+thp_underused_split_page
+ is incremented when a huge page on the split queue was split
+ because it was underused. A THP is underused if the number of
+ zero pages in the THP is above a certain threshold
+ (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
+
thp_split_pmd
is incremented every time a PMD split into table of PTEs.
This can happen, for instance, when application calls mprotect() or
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..30baae91b225 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,7 @@
#include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
+extern unsigned int khugepaged_max_ptes_none __read_mostly;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern struct attribute_group khugepaged_attr_group;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aae5c7c5cfb4..aed952d04132 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -105,6 +105,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED,
THP_DEFERRED_SPLIT_PAGE,
+ THP_UNDERUSED_SPLIT_PAGE,
THP_SPLIT_PMD,
THP_SCAN_EXCEED_NONE_PTE,
THP_SCAN_EXCEED_SWAP_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70ee49dfeaad..f5363cf900f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1087,6 +1087,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(vma->vm_mm);
+ deferred_split_folio(folio, false);
spin_unlock(vmf->ptl);
count_vm_event(THP_FAULT_ALLOC);
count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
@@ -3526,6 +3527,39 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
return READ_ONCE(ds_queue->split_queue_len);
}
+static bool thp_underused(struct folio *folio)
+{
+ int num_zero_pages = 0, num_filled_pages = 0;
+ void *kaddr;
+ int i;
+
+ if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
+ return false;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
+ if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
+ num_zero_pages++;
+ if (num_zero_pages > khugepaged_max_ptes_none) {
+ kunmap_local(kaddr);
+ return true;
+ }
+ } else {
+ /*
+ * Another path for early exit once the number
+ * of non-zero filled pages exceeds threshold.
+ */
+ num_filled_pages++;
+ if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+ kunmap_local(kaddr);
+ return false;
+ }
+ }
+ kunmap_local(kaddr);
+ }
+ return false;
+}
+
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
@@ -3559,13 +3593,35 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+ bool did_split = false;
+ bool underused = false;
+
+ if (!folio_test_partially_mapped(folio)) {
+ underused = thp_underused(folio);
+ if (!underused)
+ goto next;
+ }
if (!folio_trylock(folio))
goto next;
- /* split_huge_page() removes page from list on success */
- if (!split_folio(folio))
+ if (!split_folio(folio)) {
+ did_split = true;
+ if (underused)
+ count_vm_event(THP_UNDERUSED_SPLIT_PAGE);
split++;
+ }
folio_unlock(folio);
next:
+ /*
+ * split_folio() removes folio from list on success.
+ * Only add back to the queue if folio is partially mapped.
+ * If thp_underused returns false, or if split_folio fails
+ * in the case it was underused, then consider it used and
+ * don't add it back to split_queue.
+ */
+ if (!did_split && !folio_test_partially_mapped(folio)) {
+ list_del_init(&folio->_deferred_list);
+ ds_queue->split_queue_len--;
+ }
folio_put(folio);
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6c42062478c1..2e138b22d939 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,7 +85,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
*
* Note that these are only respected if collapse was initiated by khugepaged.
*/
-static unsigned int khugepaged_max_ptes_none __read_mostly;
+unsigned int khugepaged_max_ptes_none __read_mostly;
static unsigned int khugepaged_max_ptes_swap __read_mostly;
static unsigned int khugepaged_max_ptes_shared __read_mostly;
@@ -1235,6 +1235,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
+ deferred_split_folio(folio, false);
spin_unlock(pmd_ptl);
folio = NULL;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c3a402ea91f0..6060bb7bbb44 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1384,6 +1384,7 @@ const char * const vmstat_text[] = {
"thp_split_page",
"thp_split_page_failed",
"thp_deferred_split_page",
+ "thp_underused_split_page",
"thp_split_pmd",
"thp_scan_exceed_none_pte",
"thp_scan_exceed_swap_pte",
--
2.43.5
next prev parent reply other threads:[~2024-08-19 2:32 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-08-19 2:30 [PATCH v4 0/6] mm: split underused THPs Usama Arif
2024-08-19 2:30 ` [PATCH v4 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-08-19 2:30 ` [PATCH v4 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
2024-08-19 2:30 ` [PATCH v4 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
2024-08-21 19:09 ` Usama Arif
2024-08-19 2:30 ` [PATCH v4 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
2024-08-19 8:29 ` Barry Song
2024-08-19 8:30 ` Barry Song
2024-08-19 14:17 ` Usama Arif
2024-08-19 19:00 ` Barry Song
2024-08-19 20:16 ` Usama Arif
2024-08-19 21:34 ` Barry Song
2024-08-19 21:55 ` Barry Song
2024-08-20 19:35 ` Usama Arif
2024-08-20 21:30 ` Barry Song
2024-08-21 19:04 ` Usama Arif
2024-08-21 21:25 ` Barry Song
2024-09-01 12:48 ` Bang Li
2024-09-02 10:09 ` Usama Arif
2024-08-19 2:30 ` Usama Arif [this message]
2024-08-19 2:30 ` [PATCH v4 6/6] mm: add sysfs entry to disable splitting underused THPs Usama Arif
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240819023145.2415299-6-usamaarif642@gmail.com \
--to=usamaarif642@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=cerasuolodomenico@gmail.com \
--cc=corbet@lwn.net \
--cc=david@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@meta.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=riel@surriel.com \
--cc=roman.gushchin@linux.dev \
--cc=rppt@kernel.org \
--cc=ryan.roberts@arm.com \
--cc=ryncsn@gmail.com \
--cc=shakeel.butt@linux.dev \
--cc=willy@infradead.org \
--cc=yuzhao@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).