All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Usama Arif <usamaarif642@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
	hannes@cmpxchg.org,  riel@surriel.com, shakeel.butt@linux.dev,
	roman.gushchin@linux.dev,  yuzhao@google.com, david@redhat.com,
	baohua@kernel.org,  ryan.roberts@arm.com, rppt@kernel.org,
	willy@infradead.org,  ryncsn@gmail.com, ak@linux.intel.com,
	cerasuolodomenico@gmail.com,  corbet@lwn.net,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	 kernel-team@meta.com
Subject: Re: [PATCH v3 0/6] mm: split underutilized THPs
Date: Sat, 17 Aug 2024 22:13:27 -0700 (PDT)	[thread overview]
Message-ID: <1e6f3b38-d309-e63f-bca0-5093e152f7d7@google.com> (raw)
In-Reply-To: <20240813120328.1275952-1-usamaarif642@gmail.com>

On Tue, 13 Aug 2024, Usama Arif wrote:

> The current upstream default policy for THP is always. However, Meta
> uses madvise in production as the current THP=always policy vastly
> overprovisions THPs in sparsely accessed memory areas, resulting in
> excessive memory pressure and premature OOM killing.
> Using madvise + relying on khugepaged has certain drawbacks over
> THP=always. Using madvise hints mean THPs aren't "transparent" and
> require userspace changes. Waiting for khugepaged to scan memory and
> collapse pages into THP can be slow and unpredictable in terms of performance
> (i.e. you dont know when the collapse will happen), while production
> environments require predictable performance. If there is enough memory
> available, its better for both performance and predictability to have
> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> to collapse it, and deal with sparsely populated THPs when the system is
> running out of memory.
> 
> This patch-series is an attempt to mitigate the issue of running out of
> memory when THP is always enabled. During runtime whenever a THP is being
> faulted in or collapsed by khugepaged, the THP is added to a list.
> Whenever memory reclaim happens, the kernel runs the deferred_split
> shrinker which goes through the list and checks if the THP was underutilized,
> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> If this number goes above a certain threshold, the shrinker will attempt
> to split that THP. Then at remap time, the pages that were zero-filled are
> mapped to the shared zeropage, hence saving memory. This method avoids the
> downside of wasting memory in areas where THP is sparsely filled when THP
> is always enabled, while still providing the upside THPs like reduced TLB
> misses without having to use madvise.
> 
> Meta production workloads that were CPU bound (>99% CPU utilzation) were
> tested with THP shrinker. The results after 2 hours are as follows:
> 
>                             | THP=madvise |  THP=always   | THP=always
>                             |             |               | + shrinker series
>                             |             |               | + max_ptes_none=409
> -----------------------------------------------------------------------------
> Performance improvement     |      -      |    +1.8%      |     +1.7%
> (over THP=madvise)          |             |               |
> -----------------------------------------------------------------------------
> Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
> -----------------------------------------------------------------------------
> max_ptes_none=409 means that any THP that has more than 409 out of 512
> (80%) zero filled filled pages will be split.
> 
> To test out the patches, the below commands without the shrinker will
> invoke OOM killer immediately and kill stress, but will not fail with
> the shrinker:
> 
> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> echo 20M > /sys/fs/cgroup/test/memory.max
> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> # allocate twice memory.max for each stress worker and touch 40/512 of
> # each THP, i.e. vm-stride 50K.
> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> # killer.
> # Without the shrinker, OOM killer is invoked immediately irrespective
> # of max_ptes_none value and kills stress.
> stress --vm 1 --vm-bytes 40M --vm-stride 50K
> 
> v2 -> v3:
> - Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
> - Use flags argument instead of bools in remove_migration_ptes (Johannes)
> - Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
>   (David Hildenbrand).
> - Split out the last patch of v2 into 3, one for introducing the flag,
>   one for splitting underutilized THPs on _deferred_list and one for adding
>   sysfs entry to disable splitting (David Hildenbrand).
> 
> v1 -> v2:
> - Turn page checks and operations to folio versions in __split_huge_page.
>   This means patches 1 and 2 from v1 are no longer needed.
>   (David Hildenbrand)
> - Map to shared zeropage in all cases if the base page is zero-filled.
>   The uffd selftest was removed.
>   (David Hildenbrand).
> - rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
>   (Rik van Riel).
> - Use unsigned long instead of uint64_t (kernel test robot).
>  
> Alexander Zhu (1):
>   mm: selftest to verify zero-filled pages are mapped to zeropage
> 
> Usama Arif (3):
>   mm: Introduce a pageflag for partially mapped folios
>   mm: split underutilized THPs
>   mm: add sysfs entry to disable splitting underutilized THPs
> 
> Yu Zhao (2):
>   mm: free zapped tail pages when splitting isolated thp
>   mm: remap unused subpages to shared zeropage when splitting isolated
>     thp

Sorry, I don't have time to review this, but notice you're intending
a v4 of the series, so want to bring up four points quickly before that.

1. Even with the two fixes to 1/6 in __split_huge_page(), under load
this series causes system lockup, with interrupts disabled on most CPUs.

The error is in deferred_split_scan(), where the old code just did
a list_splice_tail() under split_queue_lock, but this series ends up
doing more there, including a folio_put(): deadlock when racing, and
that is the final folio_put() which brings refcount down to 0, which
then wants to take split_queue_lock.

The patch I've been using successfully on 6.11-rc3-next-20240816 below:
I do have other problems with current mm commits, so have not been able
to sustain a load for very long, but suspect those problems unrelated
to this series. Please fold this fix, or your own equivalent, into
your next version.

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3270,7 +3270,8 @@ static void __split_huge_page(struct pag
 
 			folio_clear_active(new_folio);
 			folio_clear_unevictable(new_folio);
-			if (!folio_batch_add(&free_folios, folio)) {
+			list_del(&new_folio->lru);
+			if (!folio_batch_add(&free_folios, new_folio)) {
 				mem_cgroup_uncharge_folios(&free_folios);
 				free_unref_folios(&free_folios);
 			}
@@ -3706,42 +3707,37 @@ static unsigned long deferred_split_scan
 		bool did_split = false;
 		bool underutilized = false;
 
-		if (folio_test_partially_mapped(folio))
-			goto split;
-		underutilized = thp_underutilized(folio);
-		if (underutilized)
-			goto split;
-		continue;
-split:
+		if (!folio_test_partially_mapped(folio)) {
+			underutilized = thp_underutilized(folio);
+			if (!underutilized)
+				goto next;
+		}
 		if (!folio_trylock(folio))
-			continue;
-		did_split = !split_folio(folio);
-		folio_unlock(folio);
-		if (did_split) {
-			/* Splitting removed folio from the list, drop reference here */
-			folio_put(folio);
+			goto next;
+		if (!split_folio(folio)) {
+			did_split = true;
 			if (underutilized)
 				count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
 			split++;
 		}
-	}
-
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	/*
-	 * Only add back to the queue if folio is partially mapped.
-	 * If thp_underutilized returns false, or if split_folio fails in
-	 * the case it was underutilized, then consider it used and don't
-	 * add it back to split_queue.
-	 */
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
-		if (folio_test_partially_mapped(folio))
-			list_move(&folio->_deferred_list, &ds_queue->split_queue);
-		else {
+		folio_unlock(folio);
+next:
+		/*
+		 * split_folio() removes folio from list on success.
+		 * Only add back to the queue if folio is partially mapped.
+		 * If thp_underutilized returns false, or if split_folio fails
+		 * in the case it was underutilized, then consider it used and
+		 * don't add it back to split_queue.
+		 */
+		if (!did_split && !folio_test_partially_mapped(folio)) {
 			list_del_init(&folio->_deferred_list);
 			ds_queue->split_queue_len--;
 		}
 		folio_put(folio);
 	}
+
+	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	list_splice_tail(&list, &ds_queue->split_queue);
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
 	/*

2. I don't understand why there needs to be a new PG_partially_mapped
flag, with all its attendant sets and tests and clears all over.  Why
can't deferred_split_scan() detect that case for itself, using the
criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
is commented "Do not use outside of rmap and debug code", and
folio_nr_pages_mapped() is currently only used from mm/debug.c; but
using the info already maintained is preferable to adding a PG_flag
(and perhaps more efficient - skips splitting when _nr_pages_mapped
already fell to 0 and folio will soon be freed).

3. Everything in /sys/kernel/mm/transparent_hugepage/ is about THPs,
so please remove the "thp_" from "thp_low_util_shrinker" -
"shrink_underused" perhaps.  And it needs a brief description in
Documentation/admin-guide/mm/transhuge.rst.

4. Wouldn't "underused" be better than "underutilized" throughout?

Hugh

  parent reply	other threads:[~2024-08-18  5:13 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-13 12:02 [PATCH v3 0/6] mm: split underutilized THPs Usama Arif
2024-08-13 12:02 ` [PATCH v3 1/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
2024-08-15 18:47   ` Kairui Song
2024-08-15 19:16     ` Usama Arif
2024-08-16 16:55       ` Kairui Song
2024-08-16 17:02         ` Usama Arif
2024-08-16 18:11           ` Kairui Song
2024-08-13 12:02 ` [PATCH v3 2/6] mm: remap unused subpages to shared zeropage " Usama Arif
2024-08-13 12:02 ` [PATCH v3 3/6] mm: selftest to verify zero-filled pages are mapped to zeropage Usama Arif
2024-08-13 12:02 ` [PATCH v3 4/6] mm: Introduce a pageflag for partially mapped folios Usama Arif
2024-08-14  3:30   ` Yu Zhao
2024-08-14 10:20     ` Usama Arif
2024-08-14 10:44   ` Barry Song
2024-08-14 10:52     ` Barry Song
2024-08-14 11:11     ` Usama Arif
2024-08-14 11:20       ` Barry Song
2024-08-14 11:26         ` Barry Song
2024-08-14 11:30         ` Usama Arif
2024-08-14 11:10   ` Barry Song
2024-08-14 11:20     ` Usama Arif
2024-08-14 11:23       ` Barry Song
2024-08-14 12:36         ` Usama Arif
2024-08-14 23:05           ` Barry Song
2024-08-15 15:25             ` Usama Arif
2024-08-15 23:30               ` Andrew Morton
2024-08-16  2:50                 ` Yu Zhao
2024-08-15 16:33   ` David Hildenbrand
2024-08-15 17:10     ` Usama Arif
2024-08-15 21:06       ` Barry Song
2024-08-15 21:08       ` David Hildenbrand
2024-08-16 15:44   ` Matthew Wilcox
2024-08-16 16:08     ` Usama Arif
2024-08-16 16:28       ` Matthew Wilcox
2024-08-16 16:41         ` Usama Arif
2024-08-13 12:02 ` [PATCH v3 5/6] mm: split underutilized THPs Usama Arif
2024-08-13 12:02 ` [PATCH v3 6/6] mm: add sysfs entry to disable splitting " Usama Arif
2024-08-13 17:22 ` [PATCH v3 0/6] mm: split " Andi Kleen
2024-08-14 10:13   ` Usama Arif
2024-08-18  5:13 ` Hugh Dickins [this message]
2024-08-18  7:45   ` David Hildenbrand
2024-08-19  2:38     ` Usama Arif
2024-08-19  2:36   ` Usama Arif

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1e6f3b38-d309-e63f-bca0-5093e152f7d7@google.com \
    --to=hughd@google.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=cerasuolodomenico@gmail.com \
    --cc=corbet@lwn.net \
    --cc=david@redhat.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=riel@surriel.com \
    --cc=roman.gushchin@linux.dev \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=ryncsn@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=usamaarif642@gmail.com \
    --cc=willy@infradead.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.