* [PATCH 1/6] Revert "memcg: remove mem_cgroup_uncharge_list()"
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
@ 2024-07-30 12:45 ` Usama Arif
2024-07-30 12:45 ` [PATCH 2/6] Revert "mm: remove free_unref_page_list()" Usama Arif
` (6 subsequent siblings)
7 siblings, 0 replies; 40+ messages in thread
From: Usama Arif @ 2024-07-30 12:45 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
mem_cgroup_uncharge_list will be needed in a later patch for an
optimization to free zapped tail pages when splitting isolated thp.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/memcontrol.h | 12 ++++++++++++
mm/memcontrol.c | 19 +++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 07eadf7ecbba..cbaf0ea1b217 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -713,6 +713,14 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
__mem_cgroup_uncharge(folio);
}
+void __mem_cgroup_uncharge_list(struct list_head *page_list);
+static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
+{
+ if (mem_cgroup_disabled())
+ return;
+ __mem_cgroup_uncharge_list(page_list);
+}
+
void __mem_cgroup_uncharge_folios(struct folio_batch *folios);
static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
{
@@ -1203,6 +1211,10 @@ static inline void mem_cgroup_uncharge(struct folio *folio)
{
}
+static inline void mem_cgroup_uncharge_list(struct list_head *page_list)
+{
+}
+
static inline void mem_cgroup_uncharge_folios(struct folio_batch *folios)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b3ef3a70833..f568b9594c2b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4717,6 +4717,25 @@ void __mem_cgroup_uncharge(struct folio *folio)
uncharge_batch(&ug);
}
+/**
+ * __mem_cgroup_uncharge_list - uncharge a list of pages
+ * @page_list: list of pages to uncharge
+ *
+ * Uncharge a list of pages previously charged with
+ * __mem_cgroup_charge().
+ */
+void __mem_cgroup_uncharge_list(struct list_head *page_list)
+{
+ struct uncharge_gather ug;
+ struct folio *folio;
+
+ uncharge_gather_clear(&ug);
+ list_for_each_entry(folio, page_list, lru)
+ uncharge_folio(folio, &ug);
+ if (ug.memcg)
+ uncharge_batch(&ug);
+}
+
void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
{
struct uncharge_gather ug;
--
2.43.0
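For context, a minimal sketch of how the re-added helper pairs with free_unref_page_list() (re-added in patch 2/6) later in the series; the wrapper name below is hypothetical, patch 3/6 simply open-codes these two calls at the end of __split_huge_page():

/*
 * Illustrative only: free a list of fully-zapped pages after a THP split.
 * Assumes the pages were already isolated onto @pages_to_free with their
 * refcounts frozen, as patch 3/6 does.
 */
static void free_zapped_pages_sketch(struct list_head *pages_to_free)
{
	if (list_empty(pages_to_free))
		return;

	/* Uncharge from the memcg first, then return the pages to the allocator. */
	mem_cgroup_uncharge_list(pages_to_free);
	free_unref_page_list(pages_to_free);
}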
* [PATCH 2/6] Revert "mm: remove free_unref_page_list()"
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
2024-07-30 12:45 ` [PATCH 1/6] Revert "memcg: remove mem_cgroup_uncharge_list()" Usama Arif
@ 2024-07-30 12:45 ` Usama Arif
2024-07-30 12:46 ` [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
` (5 subsequent siblings)
7 siblings, 0 replies; 40+ messages in thread
From: Usama Arif @ 2024-07-30 12:45 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
free_unref_page_list will be needed in a later patch for an
optimization to free zapped tail pages when splitting isolated thp.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
mm/internal.h | 1 +
mm/page_alloc.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h
index 7a3bcc6d95e7..259afe44dc88 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -680,6 +680,7 @@ extern int user_min_free_kbytes;
void free_unref_page(struct page *page, unsigned int order);
void free_unref_folios(struct folio_batch *fbatch);
+void free_unref_page_list(struct list_head *list);
extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aae00ba3b3bd..38832e6b1e6c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2774,6 +2774,24 @@ void free_unref_folios(struct folio_batch *folios)
folio_batch_reinit(folios);
}
+void free_unref_page_list(struct list_head *list)
+{
+ struct folio_batch fbatch;
+
+ folio_batch_init(&fbatch);
+ while (!list_empty(list)) {
+ struct folio *folio = list_first_entry(list, struct folio, lru);
+
+ list_del(&folio->lru);
+ if (folio_batch_add(&fbatch, folio) > 0)
+ continue;
+ free_unref_folios(&fbatch);
+ }
+
+ if (fbatch.nr)
+ free_unref_folios(&fbatch);
+}
+
/*
* split_page takes a non-compound higher-order page, and splits it into
* n (1<<order) sub-pages: page[0..n]
--
2.43.0
* [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
2024-07-30 12:45 ` [PATCH 1/6] Revert "memcg: remove mem_cgroup_uncharge_list()" Usama Arif
2024-07-30 12:45 ` [PATCH 2/6] Revert "mm: remove free_unref_page_list()" Usama Arif
@ 2024-07-30 12:46 ` Usama Arif
2024-07-30 15:14 ` David Hildenbrand
2024-07-30 12:46 ` [PATCH 4/6] mm: don't remap unused subpages " Usama Arif
` (4 subsequent siblings)
7 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-07-30 12:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai, Usama Arif
From: Yu Zhao <yuzhao@google.com>
If a tail page has only two references left, one inherited from the
isolation of its head and the other from lru_add_page_tail() which we
are about to drop, it means this tail page was concurrently zapped.
Then we can safely free it and save page reclaim or migration the
trouble of trying it.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
mm/huge_memory.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0167dc27e365..76a3b6a2b796 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2923,6 +2923,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
unsigned int new_nr = 1 << new_order;
int order = folio_order(folio);
unsigned int nr = 1 << order;
+ LIST_HEAD(pages_to_free);
+ int nr_pages_to_free = 0;
/* complete memcg works before add pages to LRU */
split_page_memcg(head, order, new_order);
@@ -3007,6 +3009,24 @@ static void __split_huge_page(struct page *page, struct list_head *list,
if (subpage == page)
continue;
folio_unlock(new_folio);
+ /*
+ * If a tail page has only two references left, one inherited
+ * from the isolation of its head and the other from
+ * lru_add_page_tail() which we are about to drop, it means this
+ * tail page was concurrently zapped. Then we can safely free it
+ * and save page reclaim or migration the trouble of trying it.
+ */
+ if (list && page_ref_freeze(subpage, 2)) {
+ VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
+ VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
+ VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
+
+ ClearPageActive(subpage);
+ ClearPageUnevictable(subpage);
+ list_move(&subpage->lru, &pages_to_free);
+ nr_pages_to_free++;
+ continue;
+ }
/*
* Subpages may be freed if there wasn't any mapping
@@ -3017,6 +3037,12 @@ static void __split_huge_page(struct page *page, struct list_head *list,
*/
free_page_and_swap_cache(subpage);
}
+
+ if (!nr_pages_to_free)
+ return;
+
+ mem_cgroup_uncharge_list(&pages_to_free);
+ free_unref_page_list(&pages_to_free);
}
/* Racy check whether the huge page can be split */
--
2.43.0
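A note on the page_ref_freeze(subpage, 2) check used above: it only succeeds if the refcount is exactly the expected value and atomically drops it to zero, so a successful freeze both proves that only the two known references remain and takes exclusive ownership of the page. Roughly (a sketch of the semantics, not the kernel's implementation):

/* Illustrative only: the semantics that page_ref_freeze(page, 2) relies on. */
static bool ref_freeze_sketch(atomic_t *refcount, int expected)
{
	/*
	 * Atomically replace the refcount with 0, but only if it still equals
	 * 'expected'; any concurrent get/put makes the cmpxchg fail, so success
	 * implies no one else can be holding or taking a reference.
	 */
	return atomic_cmpxchg(refcount, expected, 0) == expected;
}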
* Re: [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp
2024-07-30 12:46 ` [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
@ 2024-07-30 15:14 ` David Hildenbrand
2024-08-04 19:02 ` Usama Arif
0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-07-30 15:14 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On 30.07.24 14:46, Usama Arif wrote:
> From: Yu Zhao <yuzhao@google.com>
>
> If a tail page has only two references left, one inherited from the
> isolation of its head and the other from lru_add_page_tail() which we
> are about to drop, it means this tail page was concurrently zapped.
> Then we can safely free it and save page reclaim or migration the
> trouble of trying it.
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0167dc27e365..76a3b6a2b796 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2923,6 +2923,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> unsigned int new_nr = 1 << new_order;
> int order = folio_order(folio);
> unsigned int nr = 1 << order;
> + LIST_HEAD(pages_to_free);
> + int nr_pages_to_free = 0;
>
> /* complete memcg works before add pages to LRU */
> split_page_memcg(head, order, new_order);
> @@ -3007,6 +3009,24 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> if (subpage == page)
> continue;
> folio_unlock(new_folio);
> + /*
> + * If a tail page has only two references left, one inherited
> + * from the isolation of its head and the other from
> + * lru_add_page_tail() which we are about to drop, it means this
> + * tail page was concurrently zapped. Then we can safely free it
> + * and save page reclaim or migration the trouble of trying it.
> + */
> + if (list && page_ref_freeze(subpage, 2)) {
> + VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
> + VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
> + VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
> +
No VM_BUG_*, VM_WARN is good enough.
> + ClearPageActive(subpage);
> + ClearPageUnevictable(subpage);
> + list_move(&subpage->lru, &pages_to_free);
Most checks here should operate on new_folio instead of subpage.
--
Cheers,
David / dhildenb
* Re: [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp
2024-07-30 15:14 ` David Hildenbrand
@ 2024-08-04 19:02 ` Usama Arif
2024-08-05 9:00 ` David Hildenbrand
0 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-08-04 19:02 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On 30/07/2024 16:14, David Hildenbrand wrote:
> On 30.07.24 14:46, Usama Arif wrote:
>> From: Yu Zhao <yuzhao@google.com>
>>
>> If a tail page has only two references left, one inherited from the
>> isolation of its head and the other from lru_add_page_tail() which we
>> are about to drop, it means this tail page was concurrently zapped.
>> Then we can safely free it and save page reclaim or migration the
>> trouble of trying it.
>>
>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>> Tested-by: Shuang Zhai <zhais@google.com>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>> mm/huge_memory.c | 26 ++++++++++++++++++++++++++
>> 1 file changed, 26 insertions(+)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 0167dc27e365..76a3b6a2b796 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2923,6 +2923,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>> unsigned int new_nr = 1 << new_order;
>> int order = folio_order(folio);
>> unsigned int nr = 1 << order;
>> + LIST_HEAD(pages_to_free);
>> + int nr_pages_to_free = 0;
>> /* complete memcg works before add pages to LRU */
>> split_page_memcg(head, order, new_order);
>> @@ -3007,6 +3009,24 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>> if (subpage == page)
>> continue;
>> folio_unlock(new_folio);
>> + /*
>> + * If a tail page has only two references left, one inherited
>> + * from the isolation of its head and the other from
>> + * lru_add_page_tail() which we are about to drop, it means this
>> + * tail page was concurrently zapped. Then we can safely free it
>> + * and save page reclaim or migration the trouble of trying it.
>> + */
>> + if (list && page_ref_freeze(subpage, 2)) {
>> + VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
>> + VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
>> + VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
>> +
>
> No VM_BUG_*, VM_WARN is good enough.
>
>> + ClearPageActive(subpage);
>> + ClearPageUnevictable(subpage);
>> + list_move(&subpage->lru, &pages_to_free);
>
> Most checks here should operate on new_folio instead of subpage.
>
>
Do you mean instead of doing the PageLRU, PageCompound and page_mapped check on the subpage, there should be checks on new_folio?
If new_folio is a large folio, then it could be that only some of the subpages were zapped?
Could do below if subpage makes sense
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3305e6d0b90e..abfcd4b7cbba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3041,9 +3041,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
* and save page reclaim or migration the trouble of trying it.
*/
if (list && page_ref_freeze(subpage, 2)) {
- VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
- VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
- VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
+ VM_WARN_ON_ONCE_PAGE(PageLRU(subpage), subpage);
+ VM_WARN_ON_ONCE_PAGE(PageCompound(subpage), subpage);
+ VM_WARN_ON_ONCE_PAGE(page_mapped(subpage), subpage);
ClearPageActive(subpage);
ClearPageUnevictable(subpage);
* Re: [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp
2024-08-04 19:02 ` Usama Arif
@ 2024-08-05 9:00 ` David Hildenbrand
2024-08-06 9:58 ` Usama Arif
0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-08-05 9:00 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On 04.08.24 21:02, Usama Arif wrote:
>
>
> On 30/07/2024 16:14, David Hildenbrand wrote:
>> On 30.07.24 14:46, Usama Arif wrote:
>>> From: Yu Zhao <yuzhao@google.com>
>>>
>>> If a tail page has only two references left, one inherited from the
>>> isolation of its head and the other from lru_add_page_tail() which we
>>> are about to drop, it means this tail page was concurrently zapped.
>>> Then we can safely free it and save page reclaim or migration the
>>> trouble of trying it.
>>>
>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>> Tested-by: Shuang Zhai <zhais@google.com>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>> mm/huge_memory.c | 26 ++++++++++++++++++++++++++
>>> 1 file changed, 26 insertions(+)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 0167dc27e365..76a3b6a2b796 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2923,6 +2923,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>> unsigned int new_nr = 1 << new_order;
>>> int order = folio_order(folio);
>>> unsigned int nr = 1 << order;
>>> + LIST_HEAD(pages_to_free);
>>> + int nr_pages_to_free = 0;
>>> /* complete memcg works before add pages to LRU */
>>> split_page_memcg(head, order, new_order);
>>> @@ -3007,6 +3009,24 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>> if (subpage == page)
>>> continue;
>>> folio_unlock(new_folio);
>>> + /*
>>> + * If a tail page has only two references left, one inherited
>>> + * from the isolation of its head and the other from
>>> + * lru_add_page_tail() which we are about to drop, it means this
>>> + * tail page was concurrently zapped. Then we can safely free it
>>> + * and save page reclaim or migration the trouble of trying it.
>>> + */
>>> + if (list && page_ref_freeze(subpage, 2)) {
>>> + VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
>>> + VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
>>> + VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
>>> +
>>
>> No VM_BUG_*, VM_WARN is good enough.
>>
>>> + ClearPageActive(subpage);
>>> + ClearPageUnevictable(subpage);
>>> + list_move(&subpage->lru, &pages_to_free);
>>
>> Most checks here should operate on new_folio instead of subpage.
>>
>>
> Do you mean instead of doing the PageLRU, PageCompound and page_mapped check on the subpage, there should be checks on new_folio?
> If new_folio is a large folio, then it could be that only some of the subpages were zapped?
We do a:
struct folio *new_folio = page_folio(subpage);
Then:
PageLRU() will end up getting translated to
folio_test_lru(page_folio(subpage))
page_mapped() will end up getting translated to
folio_mapped(page_folio(subpage))
PageCompound() is essentially a folio_test_large() check.
So what stops us from doing
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
folio_clear_active(new_folio);
folio_clear_unevictable(new_folio);
...
?
The page_ref_freeze() should make sure that we don't have a tail page of
a large folio. Tail pages would have a refcount of 0.
Or what am I missing?
>
> Could do below if subpage makes sense
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3305e6d0b90e..abfcd4b7cbba 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3041,9 +3041,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> * and save page reclaim or migration the trouble of trying it.
> */
> if (list && page_ref_freeze(subpage, 2)) {
> - VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
> - VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
> - VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
> + VM_WARN_ON_ONCE_PAGE(PageLRU(subpage), subpage);
> + VM_WARN_ON_ONCE_PAGE(PageCompound(subpage), subpage);
> + VM_WARN_ON_ONCE_PAGE(page_mapped(subpage), subpage);
>
> ClearPageActive(subpage);
> ClearPageUnevictable(subpage);
>
--
Cheers,
David / dhildenb
* Re: [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp
2024-08-05 9:00 ` David Hildenbrand
@ 2024-08-06 9:58 ` Usama Arif
0 siblings, 0 replies; 40+ messages in thread
From: Usama Arif @ 2024-08-06 9:58 UTC (permalink / raw)
To: David Hildenbrand
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai, akpm, linux-mm
On 05/08/2024 10:00, David Hildenbrand wrote:
> On 04.08.24 21:02, Usama Arif wrote:
>>
>>
>> On 30/07/2024 16:14, David Hildenbrand wrote:
>>> On 30.07.24 14:46, Usama Arif wrote:
>>>> From: Yu Zhao <yuzhao@google.com>
>>>>
>>>> If a tail page has only two references left, one inherited from the
>>>> isolation of its head and the other from lru_add_page_tail() which we
>>>> are about to drop, it means this tail page was concurrently zapped.
>>>> Then we can safely free it and save page reclaim or migration the
>>>> trouble of trying it.
>>>>
>>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>>> Tested-by: Shuang Zhai <zhais@google.com>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>> ---
>>>> mm/huge_memory.c | 26 ++++++++++++++++++++++++++
>>>> 1 file changed, 26 insertions(+)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 0167dc27e365..76a3b6a2b796 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2923,6 +2923,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>>> unsigned int new_nr = 1 << new_order;
>>>> int order = folio_order(folio);
>>>> unsigned int nr = 1 << order;
>>>> + LIST_HEAD(pages_to_free);
>>>> + int nr_pages_to_free = 0;
>>>> /* complete memcg works before add pages to LRU */
>>>> split_page_memcg(head, order, new_order);
>>>> @@ -3007,6 +3009,24 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>>> if (subpage == page)
>>>> continue;
>>>> folio_unlock(new_folio);
>>>> + /*
>>>> + * If a tail page has only two references left, one inherited
>>>> + * from the isolation of its head and the other from
>>>> + * lru_add_page_tail() which we are about to drop, it means this
>>>> + * tail page was concurrently zapped. Then we can safely free it
>>>> + * and save page reclaim or migration the trouble of trying it.
>>>> + */
>>>> + if (list && page_ref_freeze(subpage, 2)) {
>>>> + VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
>>>> + VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
>>>> + VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
>>>> +
>>>
>>> No VM_BUG_*, VM_WARN is good enough.
>>>
>>>> + ClearPageActive(subpage);
>>>> + ClearPageUnevictable(subpage);
>>>> + list_move(&subpage->lru, &pages_to_free);
>>>
>>> Most checks here should operate on new_folio instead of subpage.
>>>
>>>
>> Do you mean instead of doing the PageLRU, PageCompound and page_mapped check on the subpage, there should be checks on new_folio?
>> If new_folio is a large folio, then it could be that only some of the subpages were zapped?
>
> We do a:
>
> struct folio *new_folio = page_folio(subpage);
>
> Then:
>
> PageLRU() will end up getting translated to folio_test_lru(page_folio(subpage))
>
> page_mapped() will end up getting translated to
> folio_mapped(page_folio(subpage))
>
> PageCompound() is essentially a folio_test_large() check.
>
> So what stops us from doing
>
> VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
> VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
> VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
>
> folio_clear_active(new_folio);
> folio_clear_unevictable(new_folio);
> ...
>
> ?
>
> The page_ref_freeze() should make sure that we don't have a tail page of
> a large folio. Tail pages would have a refcount of 0.
>
> Or what am I missing?
>
Yes you are right. For some reason I was thinking tail pages would be able to reach this path.
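Putting the two review points together (VM_WARN_* instead of VM_BUG_*, and operating on new_folio rather than subpage), the hunk under discussion would end up looking roughly like this; a sketch of the agreed direction, not the final version:

	if (list && page_ref_freeze(subpage, 2)) {
		/* A successful freeze at 2 rules out tail pages of a large folio. */
		VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
		VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
		VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);

		folio_clear_active(new_folio);
		folio_clear_unevictable(new_folio);
		list_move(&new_folio->lru, &pages_to_free);
		nr_pages_to_free++;
		continue;
	}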
* [PATCH 4/6] mm: don't remap unused subpages when splitting isolated thp
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
` (2 preceding siblings ...)
2024-07-30 12:46 ` [PATCH 3/6] mm: free zapped tail pages when splitting isolated thp Usama Arif
@ 2024-07-30 12:46 ` Usama Arif
2024-07-30 18:07 ` Rik van Riel
2024-07-30 12:46 ` [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages Usama Arif
` (3 subsequent siblings)
7 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-07-30 12:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai, Usama Arif
From: Yu Zhao <yuzhao@google.com>
Here, "unused" means containing only zeros and being inaccessible to
userspace. When splitting an isolated thp under reclaim or migration,
there is no need to remap its unused subpages because they can be
faulted in anew. Not remapping them avoids writeback or copying during
reclaim or migration. This is particularly helpful when the internal
fragmentation of a thp is high, i.e., it has many untouched subpages.
This is also a prerequisite for the THP low-utilization shrinker
introduced in later patches, where underutilized THPs are split and
the zero-filled split pages are freed, saving memory.
Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/rmap.h | 2 +-
mm/huge_memory.c | 8 ++---
mm/migrate.c | 73 +++++++++++++++++++++++++++++++++++++++-----
mm/migrate_device.c | 4 +--
4 files changed, 72 insertions(+), 15 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0978c64f49d8..805ab09057ed 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -745,7 +745,7 @@ int folio_mkclean(struct folio *);
int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
struct vm_area_struct *vma);
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked, bool unmap_unused);
/*
* rmap_walk_control: To control rmap traversing for specific needs
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 76a3b6a2b796..892467d85f3a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2775,7 +2775,7 @@ bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
return false;
}
-static void remap_page(struct folio *folio, unsigned long nr)
+static void remap_page(struct folio *folio, unsigned long nr, bool unmap_unused)
{
int i = 0;
@@ -2783,7 +2783,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
if (!folio_test_anon(folio))
return;
for (;;) {
- remove_migration_ptes(folio, folio, true);
+ remove_migration_ptes(folio, folio, true, unmap_unused);
i += folio_nr_pages(folio);
if (i >= nr)
break;
@@ -2993,7 +2993,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
if (nr_dropped)
shmem_uncharge(folio->mapping->host, nr_dropped);
- remap_page(folio, nr);
+ remap_page(folio, nr, PageAnon(head));
/*
* set page to its compound_head when split to non order-0 pages, so
@@ -3286,7 +3286,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
if (mapping)
xas_unlock(&xas);
local_irq_enable();
- remap_page(folio, folio_nr_pages(folio));
+ remap_page(folio, folio_nr_pages(folio), false);
ret = -EAGAIN;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index b273bac0d5ae..f4f06bdded70 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -177,13 +177,61 @@ void putback_movable_pages(struct list_head *l)
}
}
+static bool try_to_unmap_unused(struct page_vma_mapped_walk *pvmw,
+ struct folio *folio,
+ unsigned long idx)
+{
+ struct page *page = folio_page(folio, idx);
+ void *addr;
+ bool dirty;
+ pte_t newpte;
+
+ VM_BUG_ON_PAGE(PageCompound(page), page);
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
+
+ if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED))
+ return false;
+
+ /*
+ * The pmd entry mapping the old thp was flushed and the pte mapping
+ * this subpage has been non present. Therefore, this subpage is
+ * inaccessible. We don't need to remap it if it contains only zeros.
+ */
+ addr = kmap_local_page(page);
+ dirty = memchr_inv(addr, 0, PAGE_SIZE);
+ kunmap_local(addr);
+
+ if (dirty)
+ return false;
+
+ pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
+
+ if (userfaultfd_armed(pvmw->vma)) {
+ newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+ pvmw->vma->vm_page_prot));
+ ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
+ set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
+ }
+
+ dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
+ return true;
+}
+
+struct rmap_walk_arg {
+ struct folio *folio;
+ bool unmap_unused;
+};
+
/*
* Restore a potential migration pte to a working pte entry
*/
static bool remove_migration_pte(struct folio *folio,
- struct vm_area_struct *vma, unsigned long addr, void *old)
+ struct vm_area_struct *vma, unsigned long addr, void *arg)
{
- DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+ struct rmap_walk_arg *rmap_walk_arg = arg;
+ DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
while (page_vma_mapped_walk(&pvmw)) {
rmap_t rmap_flags = RMAP_NONE;
@@ -207,6 +255,8 @@ static bool remove_migration_pte(struct folio *folio,
continue;
}
#endif
+ if (rmap_walk_arg->unmap_unused && try_to_unmap_unused(&pvmw, folio, idx))
+ continue;
folio_get(folio);
pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
@@ -285,13 +335,20 @@ static bool remove_migration_pte(struct folio *folio,
* Get rid of all migration entries and replace them by
* references to the indicated page.
*/
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked, bool unmap_unused)
{
+ struct rmap_walk_arg rmap_walk_arg = {
+ .folio = src,
+ .unmap_unused = unmap_unused,
+ };
+
struct rmap_walk_control rwc = {
.rmap_one = remove_migration_pte,
- .arg = src,
+ .arg = &rmap_walk_arg,
};
+ VM_BUG_ON_FOLIO(unmap_unused && src != dst, src);
+
if (locked)
rmap_walk_locked(dst, &rwc);
else
@@ -904,7 +961,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
* At this point we know that the migration attempt cannot
* be successful.
*/
- remove_migration_ptes(folio, folio, false);
+ remove_migration_ptes(folio, folio, false, false);
rc = mapping->a_ops->writepage(&folio->page, &wbc);
@@ -1068,7 +1125,7 @@ static void migrate_folio_undo_src(struct folio *src,
struct list_head *ret)
{
if (page_was_mapped)
- remove_migration_ptes(src, src, false);
+ remove_migration_ptes(src, src, false, false);
/* Drop an anon_vma reference if we took one */
if (anon_vma)
put_anon_vma(anon_vma);
@@ -1306,7 +1363,7 @@ static int migrate_folio_move(free_folio_t put_new_folio, unsigned long private,
lru_add_drain();
if (old_page_state & PAGE_WAS_MAPPED)
- remove_migration_ptes(src, dst, false);
+ remove_migration_ptes(src, dst, false, false);
out_unlock_both:
folio_unlock(dst);
@@ -1444,7 +1501,7 @@ static int unmap_and_move_huge_page(new_folio_t get_new_folio,
if (page_was_mapped)
remove_migration_ptes(src,
- rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+ rc == MIGRATEPAGE_SUCCESS ? dst : src, false, false);
unlock_put_anon:
folio_unlock(dst);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa..a1630d8e0d95 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -424,7 +424,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
continue;
folio = page_folio(page);
- remove_migration_ptes(folio, folio, false);
+ remove_migration_ptes(folio, folio, false, false);
src_pfns[i] = 0;
folio_unlock(folio);
@@ -837,7 +837,7 @@ void migrate_device_finalize(unsigned long *src_pfns,
src = page_folio(page);
dst = page_folio(newpage);
- remove_migration_ptes(src, dst, false);
+ remove_migration_ptes(src, dst, false, false);
folio_unlock(src);
if (is_zone_device_page(page))
--
2.43.0
* Re: [PATCH 4/6] mm: don't remap unused subpages when splitting isolated thp
2024-07-30 12:46 ` [PATCH 4/6] mm: don't remap unused subpages " Usama Arif
@ 2024-07-30 18:07 ` Rik van Riel
2024-07-31 17:08 ` Usama Arif
0 siblings, 1 reply; 40+ messages in thread
From: Rik van Riel @ 2024-07-30 18:07 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On Tue, 2024-07-30 at 13:46 +0100, Usama Arif wrote:
>
> + /*
> + * The pmd entry mapping the old thp was flushed and the pte mapping
> + * this subpage has been non present. Therefore, this subpage is
> + * inaccessible. We don't need to remap it if it contains only zeros.
> + */
> + addr = kmap_local_page(page);
> + dirty = memchr_inv(addr, 0, PAGE_SIZE);
> + kunmap_local(addr);
> +
> + if (dirty)
> + return false;
>
A minor nitpick here. The word dirty has a few different meanings
in memory management already.
Could it be clearer to use something like "contains_data" ?
--
All Rights Reversed.
* Re: [PATCH 4/6] mm: don't remap unused subpages when splitting isolated thp
2024-07-30 18:07 ` Rik van Riel
@ 2024-07-31 17:08 ` Usama Arif
0 siblings, 0 replies; 40+ messages in thread
From: Usama Arif @ 2024-07-31 17:08 UTC (permalink / raw)
To: Rik van Riel, akpm, linux-mm
Cc: hannes, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Shuang Zhai
On 30/07/2024 19:07, Rik van Riel wrote:
> On Tue, 2024-07-30 at 13:46 +0100, Usama Arif wrote:
>>
>> + /*
>> + * The pmd entry mapping the old thp was flushed and the pte mapping
>> + * this subpage has been non present. Therefore, this subpage is
>> + * inaccessible. We don't need to remap it if it contains only zeros.
>> + */
>> + addr = kmap_local_page(page);
>> + dirty = memchr_inv(addr, 0, PAGE_SIZE);
>> + kunmap_local(addr);
>> +
>> + if (dirty)
>> + return false;
>>
>
> A minor nitpick here. The word dirty has a few different meanings
> in memory management already.
>
> Could it be clearer to use something like "contains_data" ?
>
Thanks, yes makes much more sense, will use contains_data in the next revision.
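For reference, the rename being agreed on amounts to something like this inside try_to_unmap_unused() (a sketch of the planned follow-up, not a posted patch):

	bool contains_data;

	/* memchr_inv() returns NULL when the whole page is zero-filled. */
	addr = kmap_local_page(page);
	contains_data = memchr_inv(addr, 0, PAGE_SIZE);
	kunmap_local(addr);

	if (contains_data)
		return false;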
* [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
` (3 preceding siblings ...)
2024-07-30 12:46 ` [PATCH 4/6] mm: don't remap unused subpages " Usama Arif
@ 2024-07-30 12:46 ` Usama Arif
2024-07-30 18:10 ` Rik van Riel
2024-08-01 4:45 ` kernel test robot
2024-07-30 12:46 ` [PATCH 6/6] mm: split underutilized THPs Usama Arif
` (2 subsequent siblings)
7 siblings, 2 replies; 40+ messages in thread
From: Usama Arif @ 2024-07-30 12:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Alexander Zhu, Usama Arif
From: Alexander Zhu <alexlzhu@fb.com>
Self tests to verify the RssAnon value to make sure zero
pages are not remapped except in the case of userfaultfd.
Also includes a self test for the userfaultfd use case.
Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
.../selftests/mm/split_huge_page_test.c | 113 ++++++++++++++++++
tools/testing/selftests/mm/vm_util.c | 22 ++++
tools/testing/selftests/mm/vm_util.h | 1 +
3 files changed, 136 insertions(+)
diff --git a/tools/testing/selftests/mm/split_huge_page_test.c b/tools/testing/selftests/mm/split_huge_page_test.c
index e5e8dafc9d94..da271ad6ff11 100644
--- a/tools/testing/selftests/mm/split_huge_page_test.c
+++ b/tools/testing/selftests/mm/split_huge_page_test.c
@@ -17,6 +17,8 @@
#include <malloc.h>
#include <stdbool.h>
#include <time.h>
+#include <sys/syscall.h>
+#include <linux/userfaultfd.h>
#include "vm_util.h"
#include "../kselftest.h"
@@ -84,6 +86,115 @@ static void write_debugfs(const char *fmt, ...)
write_file(SPLIT_DEBUGFS, input, ret + 1);
}
+static char *allocate_zero_filled_hugepage(size_t len)
+{
+ char *result;
+ size_t i;
+
+ result = memalign(pmd_pagesize, len);
+ if (!result) {
+ printf("Fail to allocate memory\n");
+ exit(EXIT_FAILURE);
+ }
+
+ madvise(result, len, MADV_HUGEPAGE);
+
+ for (i = 0; i < len; i++)
+ result[i] = (char)0;
+
+ return result;
+}
+
+static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len)
+{
+ uint64_t rss_anon_before, rss_anon_after;
+ size_t i;
+
+ if (!check_huge_anon(one_page, 4, pmd_pagesize)) {
+ printf("No THP is allocated\n");
+ exit(EXIT_FAILURE);
+ }
+
+ rss_anon_before = rss_anon();
+ if (!rss_anon_before) {
+ printf("No RssAnon is allocated before split\n");
+ exit(EXIT_FAILURE);
+ }
+
+ /* split all THPs */
+ write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
+ (uint64_t)one_page + len, 0);
+
+ for (i = 0; i < len; i++)
+ if (one_page[i] != (char)0) {
+ printf("%ld byte corrupted\n", i);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!check_huge_anon(one_page, 0, pmd_pagesize)) {
+ printf("Still AnonHugePages not split\n");
+ exit(EXIT_FAILURE);
+ }
+
+ rss_anon_after = rss_anon();
+ if (rss_anon_after >= rss_anon_before) {
+ printf("Incorrect RssAnon value. Before: %ld After: %ld\n",
+ rss_anon_before, rss_anon_after);
+ exit(EXIT_FAILURE);
+ }
+}
+
+void split_pmd_zero_pages(void)
+{
+ char *one_page;
+ int nr_hpages = 4;
+ size_t len = nr_hpages * pmd_pagesize;
+
+ one_page = allocate_zero_filled_hugepage(len);
+ verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+ printf("Split zero filled huge pages successful\n");
+ free(one_page);
+}
+
+void split_pmd_zero_pages_uffd(void)
+{
+ char *one_page;
+ int nr_hpages = 4;
+ size_t len = nr_hpages * pmd_pagesize;
+ long uffd; /* userfaultfd file descriptor */
+ struct uffdio_api uffdio_api;
+ struct uffdio_register uffdio_register;
+
+ /* Create and enable userfaultfd object. */
+
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd == -1) {
+ perror("userfaultfd");
+ exit(1);
+ }
+
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = 0;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
+ perror("ioctl-UFFDIO_API");
+ exit(1);
+ }
+
+ one_page = allocate_zero_filled_hugepage(len);
+
+ uffdio_register.range.start = (unsigned long)one_page;
+ uffdio_register.range.len = len;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
+ perror("ioctl-UFFDIO_REGISTER");
+ exit(1);
+ }
+
+ verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+ printf("Split zero filled huge pages with uffd successful\n");
+ free(one_page);
+}
+
void split_pmd_thp(void)
{
char *one_page;
@@ -431,6 +542,8 @@ int main(int argc, char **argv)
fd_size = 2 * pmd_pagesize;
+ split_pmd_zero_pages();
+ split_pmd_zero_pages_uffd();
split_pmd_thp();
split_pte_mapped_thp();
split_file_backed_thp();
diff --git a/tools/testing/selftests/mm/vm_util.c b/tools/testing/selftests/mm/vm_util.c
index 5a62530da3b5..7b7e763ba8e3 100644
--- a/tools/testing/selftests/mm/vm_util.c
+++ b/tools/testing/selftests/mm/vm_util.c
@@ -12,6 +12,7 @@
#define PMD_SIZE_FILE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
#define SMAP_FILE_PATH "/proc/self/smaps"
+#define STATUS_FILE_PATH "/proc/self/status"
#define MAX_LINE_LENGTH 500
unsigned int __page_size;
@@ -171,6 +172,27 @@ uint64_t read_pmd_pagesize(void)
return strtoul(buf, NULL, 10);
}
+uint64_t rss_anon(void)
+{
+ uint64_t rss_anon = 0;
+ FILE *fp;
+ char buffer[MAX_LINE_LENGTH];
+
+ fp = fopen(STATUS_FILE_PATH, "r");
+ if (!fp)
+ ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH);
+
+ if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer)))
+ goto err_out;
+
+ if (sscanf(buffer, "RssAnon:%10ld kB", &rss_anon) != 1)
+ ksft_exit_fail_msg("Reading status error\n");
+
+err_out:
+ fclose(fp);
+ return rss_anon;
+}
+
bool __check_huge(void *addr, char *pattern, int nr_hpages,
uint64_t hpage_size)
{
diff --git a/tools/testing/selftests/mm/vm_util.h b/tools/testing/selftests/mm/vm_util.h
index 9007c420d52c..71b75429f4a5 100644
--- a/tools/testing/selftests/mm/vm_util.h
+++ b/tools/testing/selftests/mm/vm_util.h
@@ -39,6 +39,7 @@ unsigned long pagemap_get_pfn(int fd, char *start);
void clear_softdirty(void);
bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
uint64_t read_pmd_pagesize(void);
+uint64_t rss_anon(void);
bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size);
bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size);
bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size);
--
2.43.0
* Re: [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages
2024-07-30 12:46 ` [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages Usama Arif
@ 2024-07-30 18:10 ` Rik van Riel
2024-08-01 4:45 ` kernel test robot
1 sibling, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2024-07-30 18:10 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Alexander Zhu
On Tue, 2024-07-30 at 13:46 +0100, Usama Arif wrote:
> From: Alexander Zhu <alexlzhu@fb.com>
>
> Self tests to verify the RssAnon value to make sure zero
> pages are not remapped except in the case of userfaultfd.
> Also includes a self test for the userfaultfd use case.
>
> Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Acked-by: Rik van Riel <riel@surriel.com>
--
All Rights Reversed.
* Re: [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages
2024-07-30 12:46 ` [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages Usama Arif
2024-07-30 18:10 ` Rik van Riel
@ 2024-08-01 4:45 ` kernel test robot
1 sibling, 0 replies; 40+ messages in thread
From: kernel test robot @ 2024-08-01 4:45 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: oe-kbuild-all, hannes, riel, shakeel.butt, roman.gushchin, yuzhao,
david, baohua, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team, Alexander Zhu,
Usama Arif
Hi Usama,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Usama-Arif/Revert-memcg-remove-mem_cgroup_uncharge_list/20240730-223949
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240730125346.1580150-6-usamaarif642%40gmail.com
patch subject: [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages
:::::: branch date: 32 hours ago
:::::: commit date: 32 hours ago
compiler: gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240801/202408010618.lgnamdZd-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/r/202408010618.lgnamdZd-lkp@intel.com/
All warnings (new ones prefixed by >>):
vm_util.c: In function 'rss_anon':
>> vm_util.c:188:41: warning: format '%ld' expects argument of type 'long int *', but argument 3 has type 'uint64_t *' {aka 'long long unsigned int *'} [-Wformat=]
188 | if (sscanf(buffer, "RssAnon:%10ld kB", &rss_anon) != 1)
| ~~~~^ ~~~~~~~~~
| | |
| | uint64_t * {aka long long unsigned int *}
| long int *
| %10lld
vim +188 tools/testing/selftests/mm/vm_util.c
642bc52aed9c99 tools/testing/selftests/vm/vm_util.c Muhammad Usama Anjum 2022-04-28 174
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 175 uint64_t rss_anon(void)
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 176 {
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 177 uint64_t rss_anon = 0;
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 178 FILE *fp;
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 179 char buffer[MAX_LINE_LENGTH];
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 180
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 181 fp = fopen(STATUS_FILE_PATH, "r");
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 182 if (!fp)
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 183 ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH);
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 184
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 185 if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer)))
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 186 goto err_out;
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 187
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 @188 if (sscanf(buffer, "RssAnon:%10ld kB", &rss_anon) != 1)
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 189 ksft_exit_fail_msg("Reading status error\n");
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 190
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 191 err_out:
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 192 fclose(fp);
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 193 return rss_anon;
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 194 }
15bd39f288be91 tools/testing/selftests/mm/vm_util.c Alexander Zhu 2024-07-30 195
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
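One possible way to address the warning is to use the <inttypes.h> scan macro so the conversion matches uint64_t on every architecture; a sketch of such a fix (the eventual fix in a later revision may differ):

#include <inttypes.h>

	/*
	 * uint64_t is 'unsigned long long' here, so "%10ld" mismatches;
	 * SCNu64 expands to the correct length modifier for sscanf().
	 */
	if (sscanf(buffer, "RssAnon:%10" SCNu64 " kB", &rss_anon) != 1)
		ksft_exit_fail_msg("Reading status error\n");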
* [PATCH 6/6] mm: split underutilized THPs
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
` (4 preceding siblings ...)
2024-07-30 12:46 ` [PATCH 5/6] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages Usama Arif
@ 2024-07-30 12:46 ` Usama Arif
2024-07-30 13:59 ` Randy Dunlap
2024-07-30 14:35 ` [PATCH 0/6] " David Hildenbrand
2024-08-01 6:09 ` Yu Zhao
7 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-07-30 12:46 UTC (permalink / raw)
To: akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team, Usama Arif
This is an attempt to mitigate the issue of running out of memory when THP
is always enabled. At runtime, whenever a THP is faulted in
(__do_huge_pmd_anonymous_page) or collapsed by khugepaged
(collapse_huge_page), the THP is added to _deferred_list. Whenever memory
reclaim happens in Linux, the kernel runs the deferred_split shrinker,
which walks the _deferred_list.
If the folio was partially mapped, the shrinker attempts to split it.
A new boolean is added to distinguish between partially mapped folios
and others on the deferred_list at split time in deferred_split_scan.
It is needed because __folio_remove_rmap decrements the folio mapcount
elements, so without the boolean it would not be possible to tell
partially mapped folios apart from the others in deferred_split_scan.
If folio->_partially_mapped is not set, the shrinker checks whether the THP
is underutilized, i.e. how many of its base 4K pages are zero-filled. If this
number goes above a certain threshold (decided by
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the shrinker
will attempt to split that THP. Then at remap time, the pages that were
zero-filled are not remapped, hence saving memory.
Suggested-by: Rik van Riel <riel@surriel.com>
Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
Documentation/admin-guide/mm/transhuge.rst | 6 ++
include/linux/huge_mm.h | 4 +-
include/linux/khugepaged.h | 1 +
include/linux/mm_types.h | 2 +
include/linux/vm_event_item.h | 1 +
mm/huge_memory.c | 118 ++++++++++++++++++---
mm/hugetlb.c | 1 +
mm/internal.h | 4 +-
mm/khugepaged.c | 3 +-
mm/memcontrol.c | 3 +-
mm/migrate.c | 3 +-
mm/rmap.c | 2 +-
mm/vmscan.c | 3 +-
mm/vmstat.c | 1 +
14 files changed, 130 insertions(+), 22 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..24eec1c03ad8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -447,6 +447,12 @@ thp_deferred_split_page
splitting it would free up some memory. Pages on split queue are
going to be split under memory pressure.
+thp_underutilized_split_page
+ is incremented when a huge page on the split queue was split
+ because it was underutilized. A THP is underutilized if the
+ number of zero pages in the THP are above a certain threshold
+ (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
+
thp_split_pmd
is incremented every time a PMD split into table of PTEs.
This can happen, for instance, when application calls mprotect() or
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..00af84aa88ea 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -321,7 +321,7 @@ static inline int split_huge_page(struct page *page)
{
return split_huge_page_to_list_to_order(page, NULL, 0);
}
-void deferred_split_folio(struct folio *folio);
+void deferred_split_folio(struct folio *folio, bool partially_mapped);
void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long address, bool freeze, struct folio *folio);
@@ -484,7 +484,7 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-static inline void deferred_split_folio(struct folio *folio) {}
+static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
#define split_huge_pmd(__vma, __pmd, __address) \
do { } while (0)
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..30baae91b225 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,7 @@
#include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
+extern unsigned int khugepaged_max_ptes_none __read_mostly;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern struct attribute_group khugepaged_attr_group;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..443026cf763e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -311,6 +311,7 @@ typedef struct {
* @_hugetlb_cgroup_rsvd: Do not use directly, use accessor in hugetlb_cgroup.h.
* @_hugetlb_hwpoison: Do not use directly, call raw_hwp_list_head().
* @_deferred_list: Folios to be split under memory pressure.
+ * @_partially_mapped: Folio was partially mapped.
* @_unused_slab_obj_exts: Placeholder to match obj_exts in struct slab.
*
* A folio is a physically, virtually and logically contiguous set
@@ -393,6 +394,7 @@ struct folio {
unsigned long _head_2a;
/* public: */
struct list_head _deferred_list;
+ bool _partially_mapped;
/* private: the union with struct page is transitional */
};
struct page __page_2;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aae5c7c5cfb4..bf1470a7a737 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -105,6 +105,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_SPLIT_PAGE,
THP_SPLIT_PAGE_FAILED,
THP_DEFERRED_SPLIT_PAGE,
+ THP_UNDERUTILIZED_SPLIT_PAGE,
THP_SPLIT_PMD,
THP_SCAN_EXCEED_NONE_PTE,
THP_SCAN_EXCEED_SWAP_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 892467d85f3a..3305e6d0b90e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -73,6 +73,7 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
struct shrink_control *sc);
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc);
+static bool split_underutilized_thp = true;
static atomic_t huge_zero_refcount;
struct folio *huge_zero_folio __read_mostly;
@@ -438,6 +439,27 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj,
static struct kobj_attribute hpage_pmd_size_attr =
__ATTR_RO(hpage_pmd_size);
+static ssize_t split_underutilized_thp_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return sysfs_emit(buf, "%d\n", split_underutilized_thp);
+}
+
+static ssize_t split_underutilized_thp_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err = kstrtobool(buf, &split_underutilized_thp);
+
+ if (err < 0)
+ return err;
+
+ return count;
+}
+
+static struct kobj_attribute split_underutilized_thp_attr = __ATTR(
+ thp_low_util_shrinker, 0644, split_underutilized_thp_show, split_underutilized_thp_store);
+
static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
@@ -446,6 +468,7 @@ static struct attribute *hugepage_attr[] = {
#ifdef CONFIG_SHMEM
&shmem_enabled_attr.attr,
#endif
+ &split_underutilized_thp_attr.attr,
NULL,
};
@@ -1002,6 +1025,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(vma->vm_mm);
+ deferred_split_folio(folio, false);
spin_unlock(vmf->ptl);
count_vm_event(THP_FAULT_ALLOC);
count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
@@ -3259,6 +3283,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
* page_deferred_list.
*/
list_del_init(&folio->_deferred_list);
+ folio->_partially_mapped = false;
}
spin_unlock(&ds_queue->split_queue_lock);
if (mapping) {
@@ -3315,11 +3340,12 @@ void __folio_undo_large_rmappable(struct folio *folio)
if (!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
list_del_init(&folio->_deferred_list);
+ folio->_partially_mapped = false;
}
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}
-void deferred_split_folio(struct folio *folio)
+void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
#ifdef CONFIG_MEMCG
@@ -3334,6 +3360,9 @@ void deferred_split_folio(struct folio *folio)
if (folio_order(folio) <= 1)
return;
+ if (!partially_mapped && !split_underutilized_thp)
+ return;
+
/*
* The try_to_unmap() in page reclaim path might reach here too,
* this may cause a race condition to corrupt deferred split queue.
@@ -3347,14 +3376,14 @@ void deferred_split_folio(struct folio *folio)
if (folio_test_swapcache(folio))
return;
- if (!list_empty(&folio->_deferred_list))
- return;
-
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ folio->_partially_mapped = partially_mapped;
if (list_empty(&folio->_deferred_list)) {
- if (folio_test_pmd_mappable(folio))
- count_vm_event(THP_DEFERRED_SPLIT_PAGE);
- count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ if (partially_mapped) {
+ if (folio_test_pmd_mappable(folio))
+ count_vm_event(THP_DEFERRED_SPLIT_PAGE);
+ count_mthp_stat(folio_order(folio), MTHP_STAT_SPLIT_DEFERRED);
+ }
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
@@ -3379,6 +3408,39 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
return READ_ONCE(ds_queue->split_queue_len);
}
+static bool thp_underutilized(struct folio *folio)
+{
+ int num_zero_pages = 0, num_filled_pages = 0;
+ void *kaddr;
+ int i;
+
+ if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
+ return false;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
+ if (memchr_inv(kaddr, 0, PAGE_SIZE) == NULL) {
+ num_zero_pages++;
+ if (num_zero_pages > khugepaged_max_ptes_none) {
+ kunmap_local(kaddr);
+ return true;
+ }
+ } else {
+ /*
+ * Another path for early exit once the number
+ * of non-zero filled pages exceeds threshold.
+ */
+ num_filled_pages++;
+ if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+ kunmap_local(kaddr);
+ return false;
+ }
+ }
+ kunmap_local(kaddr);
+ }
+ return false;
+}
+
static unsigned long deferred_split_scan(struct shrinker *shrink,
struct shrink_control *sc)
{
@@ -3403,6 +3465,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
} else {
/* We lost race with folio_put() */
list_del_init(&folio->_deferred_list);
+ folio->_partially_mapped = false;
ds_queue->split_queue_len--;
}
if (!--sc->nr_to_scan)
@@ -3411,18 +3474,45 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+ bool did_split = false;
+ bool underutilized = false;
+
+ if (folio->_partially_mapped)
+ goto split;
+ underutilized = thp_underutilized(folio);
+ if (underutilized)
+ goto split;
+ continue;
+split:
if (!folio_trylock(folio))
- goto next;
- /* split_huge_page() removes page from list on success */
- if (!split_folio(folio))
- split++;
+ continue;
+ did_split = !split_folio(folio);
folio_unlock(folio);
-next:
- folio_put(folio);
+ if (did_split) {
+ /* Splitting removed folio from the list, drop reference here */
+ folio_put(folio);
+ if (underutilized)
+ count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
+ split++;
+ }
}
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- list_splice_tail(&list, &ds_queue->split_queue);
+ /*
+ * Only add back to the queue if folio->_partially_mapped is set.
+ * If thp_underutilized returns false, or if split_folio fails in
+ * the case it was underutilized, then consider it used and don't
+ * add it back to split_queue.
+ */
+ list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+ if (folio->_partially_mapped)
+ list_move(&folio->_deferred_list, &ds_queue->split_queue);
+ else {
+ list_del_init(&folio->_deferred_list);
+ ds_queue->split_queue_len--;
+ }
+ folio_put(folio);
+ }
spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5a32157ca309..df2da47d0637 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1758,6 +1758,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
free_gigantic_folio(folio, huge_page_order(h));
} else {
INIT_LIST_HEAD(&folio->_deferred_list);
+ folio->_partially_mapped = false;
folio_put(folio);
}
}
diff --git a/mm/internal.h b/mm/internal.h
index 259afe44dc88..8fc072cc3023 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -657,8 +657,10 @@ static inline void prep_compound_head(struct page *page, unsigned int order)
atomic_set(&folio->_entire_mapcount, -1);
atomic_set(&folio->_nr_pages_mapped, 0);
atomic_set(&folio->_pincount, 0);
- if (order > 1)
+ if (order > 1) {
INIT_LIST_HEAD(&folio->_deferred_list);
+ folio->_partially_mapped = false;
+ }
}
static inline void prep_compound_tail(struct page *head, int tail_idx)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f3b3db104615..5a434fdbc1ef 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,7 +85,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
*
* Note that these are only respected if collapse was initiated by khugepaged.
*/
-static unsigned int khugepaged_max_ptes_none __read_mostly;
+unsigned int khugepaged_max_ptes_none __read_mostly;
static unsigned int khugepaged_max_ptes_swap __read_mostly;
static unsigned int khugepaged_max_ptes_shared __read_mostly;
@@ -1235,6 +1235,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
+ deferred_split_folio(folio, false);
spin_unlock(pmd_ptl);
folio = NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f568b9594c2b..2ee61d619d86 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4651,7 +4651,8 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
!folio_test_hugetlb(folio) &&
- !list_empty(&folio->_deferred_list), folio);
+ !list_empty(&folio->_deferred_list) &&
+ folio->_partially_mapped, folio);
/*
* Nobody should be changing or seriously looking at
diff --git a/mm/migrate.c b/mm/migrate.c
index f4f06bdded70..2731ac20ff33 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1734,7 +1734,8 @@ static int migrate_pages_batch(struct list_head *from,
* use _deferred_list.
*/
if (nr_pages > 2 &&
- !list_empty(&folio->_deferred_list)) {
+ !list_empty(&folio->_deferred_list) &&
+ folio->_partially_mapped) {
if (try_split_folio(folio, split_folios) == 0) {
nr_failed++;
stats->nr_thp_failed += is_thp;
diff --git a/mm/rmap.c b/mm/rmap.c
index 2630bde38640..1b5418121965 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1582,7 +1582,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
*/
if (folio_test_anon(folio) && partially_mapped &&
list_empty(&folio->_deferred_list))
- deferred_split_folio(folio);
+ deferred_split_folio(folio, true);
}
__folio_mod_stat(folio, -nr, -nr_pmdmapped);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c89d0551655e..1bee9b1262f6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1233,7 +1233,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* Split partially mapped folios right away.
* We can free the unmapped pages without IO.
*/
- if (data_race(!list_empty(&folio->_deferred_list)) &&
+ if (data_race(!list_empty(&folio->_deferred_list) &&
+ folio->_partially_mapped) &&
split_folio_to_list(folio, folio_list))
goto activate_locked;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 5082431dad28..525fad4a1d6d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1367,6 +1367,7 @@ const char * const vmstat_text[] = {
"thp_split_page",
"thp_split_page_failed",
"thp_deferred_split_page",
+ "thp_underutilized_split_page",
"thp_split_pmd",
"thp_scan_exceed_none_pte",
"thp_scan_exceed_swap_pte",
--
2.43.0
^ permalink raw reply related [flat|nested] 40+ messages in thread* Re: [PATCH 6/6] mm: split underutilized THPs
2024-07-30 12:46 ` [PATCH 6/6] mm: split underutilized THPs Usama Arif
@ 2024-07-30 13:59 ` Randy Dunlap
0 siblings, 0 replies; 40+ messages in thread
From: Randy Dunlap @ 2024-07-30 13:59 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, david, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 7/30/24 5:46 AM, Usama Arif wrote:
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index 058485daf186..24eec1c03ad8 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -447,6 +447,12 @@ thp_deferred_split_page
> splitting it would free up some memory. Pages on split queue are
> going to be split under memory pressure.
>
> +thp_underutilized_split_page
> + is incremented when a huge page on the split queue was split
> + because it was underutilized. A THP is underutilized if the
> + number of zero pages in the THP are above a certain threshold
in the THP is above
(if the number ... is)
> + (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
> +
--
~Randy
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
` (5 preceding siblings ...)
2024-07-30 12:46 ` [PATCH 6/6] mm: split underutilized THPs Usama Arif
@ 2024-07-30 14:35 ` David Hildenbrand
2024-07-30 15:14 ` Usama Arif
2024-08-01 6:09 ` Yu Zhao
7 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-07-30 14:35 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 30.07.24 14:45, Usama Arif wrote:
> The current upstream default policy for THP is always. However, Meta
> uses madvise in production as the current THP=always policy vastly
> overprovisions THPs in sparsely accessed memory areas, resulting in
> excessive memory pressure and premature OOM killing.
> Using madvise + relying on khugepaged has certain drawbacks over
> THP=always. Using madvise hints mean THPs aren't "transparent" and
> require userspace changes. Waiting for khugepaged to scan memory and
> collapse pages into THP can be slow and unpredictable in terms of performance
> (i.e. you dont know when the collapse will happen), while production
> environments require predictable performance. If there is enough memory
> available, its better for both performance and predictability to have
> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> to collapse it, and deal with sparsely populated THPs when the system is
> running out of memory.
>
> This patch-series is an attempt to mitigate the issue of running out of
> memory when THP is always enabled. During runtime whenever a THP is being
> faulted in or collapsed by khugepaged, the THP is added to a list.
> Whenever memory reclaim happens, the kernel runs the deferred_split
> shrinker which goes through the list and checks if the THP was underutilized,
> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> If this number goes above a certain threshold, the shrinker will attempt
> to split that THP. Then at remap time, the pages that were zero-filled are
> not remapped, hence saving memory. This method avoids the downside of
> wasting memory in areas where THP is sparsely filled when THP is always
> enabled, while still providing the upside THPs like reduced TLB misses without
> having to use madvise.
>
> Meta production workloads that were CPU bound (>99% CPU utilzation) were
> tested with THP shrinker. The results after 2 hours are as follows:
>
>                          | THP=madvise | THP=always    | THP=always
>                          |             |               |  + shrinker series
>                          |             |               |  + max_ptes_none=409
> -----------------------------------------------------------------------------
> Performance improvement  | -           | +1.8%         | +1.7%
> (over THP=madvise)       |             |               |
> -----------------------------------------------------------------------------
> Memory usage             | 54.6G       | 58.8G (+7.7%) | 55.9G (+2.4%)
> -----------------------------------------------------------------------------
> max_ptes_none=409 means that any THP that has more than 409 out of 512
> (80%) zero filled filled pages will be split.
>
> To test out the patches, the below commands without the shrinker will
> invoke OOM killer immediately and kill stress, but will not fail with
> the shrinker:
>
> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> echo 20M > /sys/fs/cgroup/test/memory.max
> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> # allocate twice memory.max for each stress worker and touch 40/512 of
> # each THP, i.e. vm-stride 50K.
> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> # killer.
> # Without the shrinker, OOM killer is invoked immediately irrespective
> # of max_ptes_none value and kill stress.
> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>
> Patches 1-2 add back helper functions that were previously removed
> to operate on page lists (needed by patch 3).
> Patch 3 is an optimization to free zapped tail pages rather than
> waiting for page reclaim or migration.
> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
> subpages when splitting THP.
> Patches 6 adds support for THP shrinker.
>
> (This patch-series restarts the work on having a THP shrinker in kernel
> originally done in
> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
> The THP shrinker in this series is significantly different than the
> original one, hence its labelled v1 (although the prerequisite to not
> remap clean subpages is the same).)
As shared previously, there is one issue with uffd (even when currently
not active for a VMA!), where we must not zap present page table entries.
Something that is always possible (assuming no GUP pins, of course) is
replacing the zero-filled subpages with shared zeropages.
Is that being done in this patch set already, or are we creating
pte_none() entries?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-30 14:35 ` [PATCH 0/6] " David Hildenbrand
@ 2024-07-30 15:14 ` Usama Arif
2024-07-30 15:19 ` Usama Arif
0 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-07-30 15:14 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 30/07/2024 15:35, David Hildenbrand wrote:
> On 30.07.24 14:45, Usama Arif wrote:
>> [...]
>
> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>
> Something that is always possible (assuming no GUP pins of course, which) is replacing the zero-filled subpages by shared zeropages.
>
> Is that being done in this patch set already, or are we creating pte_none() entries?
>
I think that's done in patch 4/6. In try_to_unmap_unused(), we have the below, which I think does what you are suggesting, i.e. point to the shared zeropage and not clear the PTE for a uffd-armed VMA.
	if (userfaultfd_armed(pvmw->vma)) {
		newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
					       pvmw->vma->vm_page_prot));
		ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
	}
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-30 15:14 ` Usama Arif
@ 2024-07-30 15:19 ` Usama Arif
2024-07-30 16:11 ` David Hildenbrand
0 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-07-30 15:19 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 30/07/2024 16:14, Usama Arif wrote:
>
>
> On 30/07/2024 15:35, David Hildenbrand wrote:
>> On 30.07.24 14:45, Usama Arif wrote:
>>> [...]
>>
>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>
>> Something that is always possible (assuming no GUP pins of course, which) is replacing the zero-filled subpages by shared zeropages.
>>
>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>
>
> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>
> if (userfaultfd_armed(pvmw->vma)) {
> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
> pvmw->vma->vm_page_prot));
> ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
> }
Ah, are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if uffd is going to be used on the VMA later? I think something like the below would work in that case.
diff --git a/mm/migrate.c b/mm/migrate.c
index 2731ac20ff33..52aa4770fbed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -206,14 +206,10 @@ static bool try_to_unmap_unused(struct page_vma_mapped_walk *pvmw,
 	if (dirty)
 		return false;
 
-	pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
-
-	if (userfaultfd_armed(pvmw->vma)) {
-		newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
-					       pvmw->vma->vm_page_prot));
-		ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
-		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
-	}
+	newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+				       pvmw->vma->vm_page_prot));
+	ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
+	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
 
 	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
 	return true;
^ permalink raw reply related [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-30 15:19 ` Usama Arif
@ 2024-07-30 16:11 ` David Hildenbrand
2024-07-30 17:22 ` Usama Arif
0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-07-30 16:11 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 30.07.24 17:19, Usama Arif wrote:
>
>
> On 30/07/2024 16:14, Usama Arif wrote:
>>
>>
>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>> [...]
>>>
>>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>>
>>> Something that is always possible (assuming no GUP pins of course, which) is replacing the zero-filled subpages by shared zeropages.
>>>
>>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>>
>>
>> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>>
>> if (userfaultfd_armed(pvmw->vma)) {
>> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>> pvmw->vma->vm_page_prot));
>> ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>> }
>
>
> Ah are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if its uffd? I think something like below would work in that case.
I remember one ugly case in QEMU with postcopy live migration where we
must not zap zero-filled pages. I am not 100% sure regarding THP (whether
it could be enabled at that point), but imagine the following:
1) mmap(), enable THP
2) Migrate a bunch of pages from the source during precopy (writing to
the memory). Might end up creating THPs (during fault/khugepaged)
3) Register UFFD on the VMA
4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
5) Discard any pages that have been re-dirtied or not migrated yet
6) Migrate-on-demand any holes using uffd
If we discard zero-filled pages between 2) and 3), we might get wrong
uffd notifications in 6) for pages that have already been migrated.
I'll have to check if that actually happens in that sequence in QEMU: if
QEMU would disable THP right before 2) we would be safe. But I recall
that it is not the case :/
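
As an illustrative aside, a minimal userspace sketch of steps 3) and 6) might
look like the below (this is not QEMU's code; helper names and error handling
are simplified and only meant as assumptions for illustration). The point is
that only pte_none entries raise MISSING events: a subpage that still maps
something, e.g. the shared zeropage, never reaches the handler, while an entry
zapped to pte_none does.

	/* Sketch only: arm uffd in MISSING mode on an already-populated area. */
	#include <fcntl.h>
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int uffd_arm(void *area, size_t len)
	{
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg = {
			.range = { .start = (unsigned long)area, .len = len },
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

		if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
		    ioctl(uffd, UFFDIO_REGISTER, &reg))
			return -1;
		return uffd;
	}

	/* Step 6): resolve one MISSING fault by copying in the migrated page. */
	static int uffd_place(int uffd, const struct uffd_msg *msg, void *src,
			      size_t pgsz)
	{
		struct uffdio_copy copy = {
			.dst = msg->arg.pagefault.address & ~((__u64)pgsz - 1),
			.src = (unsigned long)src,
			.len = pgsz,
		};

		return ioctl(uffd, UFFDIO_COPY, &copy);
	}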
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-30 16:11 ` David Hildenbrand
@ 2024-07-30 17:22 ` Usama Arif
2024-07-30 20:25 ` David Hildenbrand
0 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-07-30 17:22 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 30/07/2024 17:11, David Hildenbrand wrote:
> On 30.07.24 17:19, Usama Arif wrote:
>>
>>
>> On 30/07/2024 16:14, Usama Arif wrote:
>>>
>>>
>>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>>> [...]
>>>>
>>>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>>>
>>>> Something that is always possible (assuming no GUP pins of course, which) is replacing the zero-filled subpages by shared zeropages.
>>>>
>>>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>>>
>>>
>>> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>>>
>>> if (userfaultfd_armed(pvmw->vma)) {
>>> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>> pvmw->vma->vm_page_prot));
>>> ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>>> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>> }
>>
>>
>> Ah are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if its uffd? I think something like below would work in that case.
>
> I remember one ugly case in QEMU with postcopy live-migration where we must not zap zero-filled pages. I am not 100% regarding THP (if it could be enabled at that point), but imagine the following
>
> 1) mmap(), enable THP
> 2) Migrate a bunch of pages from the source during precopy (writing to
> the memory). Might end up creating THPs (during fault/khugepaged)
> 3) Register UFFD on the VMA
> 4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
> 5) Discard any pages that have been re-dirtied or not migrated yet
> 6) Migrate-on-demand any holes using uffd
>
>
> If we discard zero-filled pages between 2) and 3) we might get wrong uffd notifications in 6 for pages that have already been migrated).
>
> I'll have to check if that actually happens in that sequence in QEMU: if QEMU would disable THP right before 2) we would be safe. But I recall that it is not the case :/
>
>
Thanks for the example!
Just to understand the issue better (I am not very familiar with the live-migration code): the problem is only for zero-filled pages that were already migrated, right? If a THP is created, one of its subpages is a zero-filled page that was already migrated, and the THP is split before the VMA is armed with uffd, then userfaultfd_armed(pvmw->vma) will return false at split time and the entry will become pte_none. Afterwards, when the destination faults on it, uffd will see a cleared PTE and request the zero-page from the source again.

If I understand the example correctly, the below diff over patch 6 should be good, i.e. just point to the empty_zero_page instead of doing pte_clear? This should still use the same amount of memory, although ptep_clear_flush means it might be slightly more expensive.
diff --git a/mm/migrate.c b/mm/migrate.c
index 2731ac20ff33..52aa4770fbed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -206,14 +206,10 @@ static bool try_to_unmap_unused(struct page_vma_mapped_walk *pvmw,
 	if (dirty)
 		return false;
 
-	pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
-
-	if (userfaultfd_armed(pvmw->vma)) {
-		newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
-					       pvmw->vma->vm_page_prot));
-		ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
-		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
-	}
+	newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+				       pvmw->vma->vm_page_prot));
+	ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
+	set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
 
 	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio));
 	return true;
^ permalink raw reply related [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-30 17:22 ` Usama Arif
@ 2024-07-30 20:25 ` David Hildenbrand
2024-07-31 17:01 ` Usama Arif
0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-07-30 20:25 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 30.07.24 19:22, Usama Arif wrote:
>
>
> On 30/07/2024 17:11, David Hildenbrand wrote:
>> On 30.07.24 17:19, Usama Arif wrote:
>>>
>>>
>>> On 30/07/2024 16:14, Usama Arif wrote:
>>>>
>>>>
>>>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>>>> [...]
>>>>>
>>>>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>>>>
>>>>> Something that is always possible (assuming no GUP pins of course, which) is replacing the zero-filled subpages by shared zeropages.
>>>>>
>>>>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>>>>
>>>>
>>>> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>>>>
>>>> if (userfaultfd_armed(pvmw->vma)) {
>>>> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>>> pvmw->vma->vm_page_prot));
>>>> ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>>>> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>>> }
>>>
>>>
>>> Ah are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if its uffd? I think something like below would work in that case.
>>
>> I remember one ugly case in QEMU with postcopy live-migration where we must not zap zero-filled pages. I am not 100% regarding THP (if it could be enabled at that point), but imagine the following
>>
>> 1) mmap(), enable THP
>> 2) Migrate a bunch of pages from the source during precopy (writing to
>> the memory). Might end up creating THPs (during fault/khugepaged)
>> 3) Register UFFD on the VMA
>> 4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
>> 5) Discard any pages that have been re-dirtied or not migrated yet
>> 6) Migrate-on-demand any holes using uffd
>>
>>
>> If we discard zero-filled pages between 2) and 3) we might get wrong uffd notifications in 6 for pages that have already been migrated).
>>
>> I'll have to check if that actually happens in that sequence in QEMU: if QEMU would disable THP right before 2) we would be safe. But I recall that it is not the case :/
>>
>>
>
> Thanks for the example!
>
> Just to understand the issue better, as I am not very familiar with live-migration code, the problem is only for zero-filled pages that were migrated, right? If a THP is created and a subpage of it was a zero-page that was migrated and its split before VMA is armed with uffd, userfaultfd_armed(pvmw->vma) will return false when splitting and it will become pte_none. And afterwards when the destination faults on it, uffd will see that its pte_clear and will request the zero-page back from source. Uffd will then have to get the page again from source.
Something like that, but I recall that if it detects that it already
migrated the page, things will go wrong.
IIRC QEMU maintains receive bitmaps about which pages it already
received+placed. Staring at migrate_send_rp_req_pages(), QEMU ignores
the uffd notification if it finds that the page was already received and
would not ask the source again.
So if we turn some PTEs that map zero-filled pages into pte_none, we could
run into trouble, because we will never try requesting/placing that page
again and our VM will just be stuck forever.
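
To make that failure mode concrete, here is an illustrative sketch (not
QEMU's actual code; the structure and names are assumptions for illustration
only) of a destination-side fault handler that consults a received bitmap:

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	/* One bit per target page, set once the page has been received+placed. */
	struct ramblock {
		unsigned long *receivedmap;
		size_t page_size;
	};

	static bool page_received(const struct ramblock *rb, size_t page_nr)
	{
		size_t bits = 8 * sizeof(unsigned long);

		return (rb->receivedmap[page_nr / bits] >> (page_nr % bits)) & 1;
	}

	/* Called for each uffd MISSING fault reported on the destination. */
	static void handle_missing_fault(struct ramblock *rb, uint64_t rb_offset)
	{
		if (page_received(rb, rb_offset / rb->page_size)) {
			/*
			 * Already migrated: nothing is requested from the source.
			 * If the kernel meanwhile turned this page's PTE into
			 * pte_none, nobody will ever place it again and the
			 * faulting thread stays blocked forever.
			 */
			return;
		}
		/* ... otherwise ask the source to resend this page ... */
	}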
I wonder if other uffd users could similarly be affected. But maybe they
don't place pages into the VMA before registering uffd.
I'll try to double-check when QEMU would disable THP. But it could also
be that that behavior changed between QEMU versions.
>
> If I understand the example correctly, the below diff over patch 6 should be good? i.e. just point to the empty_zero_page instead of doing pte_clear. This should still use the same amount of memory, although ptep_clear_flush means it might be slighly more expensive.
There are rare cases where we must not use the shared zeropage
(mm_forbids_zeropage()); that must be handled.
I assume we could optimize here if uffd is not configured in. Further,
what QEMU does is sense right at the beginning whether uffd is even
supported, by temporarily registering uffd on some other VMA. That could
be used as an indication ("ever used uffd -> don't turn PTEs none"). But
again, no idea what other uffd users might be relying on :/
I wonder if some applications could rely on anon memory not suddenly
"vanishing" from the PTEs (for example, anything relying on pagemap, like
tools/testing/selftests/mm/cow.c does). I don't think a lot of
applications would do that, though.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-30 20:25 ` David Hildenbrand
@ 2024-07-31 17:01 ` Usama Arif
2024-07-31 17:51 ` David Hildenbrand
0 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-07-31 17:01 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 30/07/2024 21:25, David Hildenbrand wrote:
> On 30.07.24 19:22, Usama Arif wrote:
>>
>>
>> On 30/07/2024 17:11, David Hildenbrand wrote:
>>> On 30.07.24 17:19, Usama Arif wrote:
>>>>
>>>>
>>>> On 30/07/2024 16:14, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>>>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>>>>> [...]
>>>>>>
>>>>>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>>>>>
>>>>>> Something that is always possible (assuming no GUP pins of course, which) is replacing the zero-filled subpages by shared zeropages.
>>>>>>
>>>>>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>>>>>
>>>>>
>>>>> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>>>>>
>>>>> if (userfaultfd_armed(pvmw->vma)) {
>>>>> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>>>> pvmw->vma->vm_page_prot));
>>>>> ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>>>>> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>>>> }
>>>>
>>>>
>>>> Ah are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if its uffd? I think something like below would work in that case.
>>>
>>> I remember one ugly case in QEMU with postcopy live-migration where we must not zap zero-filled pages. I am not 100% regarding THP (if it could be enabled at that point), but imagine the following
>>>
>>> 1) mmap(), enable THP
>>> 2) Migrate a bunch of pages from the source during precopy (writing to
>>> the memory). Might end up creating THPs (during fault/khugepaged)
>>> 3) Register UFFD on the VMA
>>> 4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
>>> 5) Discard any pages that have been re-dirtied or not migrated yet
>>> 6) Migrate-on-demand any holes using uffd
>>>
>>>
>>> If we discard zero-filled pages between 2) and 3) we might get wrong uffd notifications in 6 for pages that have already been migrated).
>>>
>>> I'll have to check if that actually happens in that sequence in QEMU: if QEMU would disable THP right before 2) we would be safe. But I recall that it is not the case :/
>>>
>>>
>>
>> Thanks for the example!
>>
>> Just to understand the issue better, as I am not very familiar with live-migration code, the problem is only for zero-filled pages that were migrated, right? If a THP is created and a subpage of it was a zero-page that was migrated and its split before VMA is armed with uffd, userfaultfd_armed(pvmw->vma) will return false when splitting and it will become pte_none. And afterwards when the destination faults on it, uffd will see that its pte_clear and will request the zero-page back from source. Uffd will then have to get the page again from source.
>
> Something like that, but I recall that if it detects that it already migrated the page stuff will go wrong.
>
> IIRC QEMU maintains receive bitmaps about which pages it already received+placed. Staring at migrate_send_rp_req_pages(), QEMU ignores the uffd notification if it finds that the page was already received and would not ask the source again.
>
> So if we turn some PTEs that map zeroed-pages into pte-none we could run into trouble, because we will never try requesting/placing that page again and our VM will just be stuck forever.
>
> I wonder if other uffd users could similarly be affected. But maybe they don't place pages into the VMA before registering uffd.
>
> I'll try to double-check when QEMU would disable THP. But it could also be that that behavior changed between QEMU versions.
>
>>
>> If I understand the example correctly, the below diff over patch 6 should be good? i.e. just point to the empty_zero_page instead of doing pte_clear. This should still use the same amount of memory, although ptep_clear_flush means it might be slighly more expensive.
>
> There are rare cases where we must not use the shared zeropage (mm_forbids_zeropage()), that must be handled.
>
> I assume we could optimize here if uffd is not configured in. Further, what QEMU does is sense right at the beginning by temporarily registering uffd on some other VMA if it is even supported. That could be used as an indication ("ever used uffd -> don't turn PTEs none"). But again, no idea what other uffd users might be relying on :/
>
> I wonder if some applications could rely on anon memory not suddenly "vanishing" from the PTEs (for example relying on pagemap, like tools/testing/selftests/mm/cow.c does). I don't think a lot of applications would do that, though.
>
I had a deeper look at how QEMU handles migration and at how the kernel handles THPs in other places with respect to uffd. I hope I understood the migration code correctly :)
QEMU:
There are 2 types of "migrations" [1], migration_thread and bg_migration_thread
- migration_thread supports postcopy live migration, but as far as I can tell from the function [2], it doesn't use uffd.
- bg_migration_thread [3]: this isn't really live migration but a background snapshot. There is no postcopy and it works similarly to precopy. From what I understand, uffd is registered before any pages are migrated [4].
So the way these 2 threads work in QEMU is: it's either postcopy without uffd, or a precopy-style snapshot with uffd registered at the start. If that is the case, the current patch in this series is good, i.e. only use the zeropage if userfaultfd_armed(pvmw->vma), otherwise pte_clear. I can't see anywhere in QEMU a postcopy-with-uffd path that does all the steps 1-6 you mentioned above, but I might not be looking at the right place?
Kernel:
From a kernel perspective, if we look at khugepaged [5], it only takes into account whether the VMA is currently registered with uffd. So maybe this scenario can happen?
1) THP is enabled on the VMA, UFFD is not yet registered.
2) dest accesses a 4K page (let's call it 4Kpage1), and khugepaged collapses that page into a 2M THP because the following 511 4K pages were pte_none [5], as the VMA is not uffd-armed yet.
3) UFFD is registered + MADV_NOHUGEPAGE is set. No further collapse will happen, but the THPs created in step 2 will remain; MADV_NOHUGEPAGE doesn't split existing THPs.
4) dest tries to access 4Kpage2, the address right after 4Kpage1. As khugepaged collapsed it, there won't be a pagefault, and dest will read it as a zero-filled page, even if the UFFD handler wanted to provide a page filled with non-zero data. I think anyone writing a uffd handler would expect a pagefault here so they could supply that data, and would not expect dest to see a zero-filled page.
What I am trying to say with the above example is:
- UFFD registration + MADV_NOHUGEPAGE should be done by all applications before any data in the region is accessed. Using THPs and accessing data before UFFD registration + MADV_NOHUGEPAGE can lead to unexpected behaviour and is wrong?
- even in the current kernel code in other places like khugepaged, it's only checked whether uffd is currently enabled. It is not tracked whether it was ever enabled on any VMA.
Thanks for pointing to mm_forbids_zeropage. Incorporating that into the code, and if I am (hopefully :)) right about QEMU and the kernel above, then I believe the right code should be:
if (userfaultfd_armed(pvmw->vma) && mm_forbids_zeropage(pvmw->vma->vm_mm))
        return false;

if (!userfaultfd_armed(pvmw->vma)) {
        pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
} else {
        newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
                                       pvmw->vma->vm_page_prot));
        ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
        set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
}
[1] https://github.com/qemu/qemu/blob/4e56e89d6c81589cc47cf5811f570c67889bd18a/migration/migration.c#L3817
[2] https://github.com/qemu/qemu/blob/4e56e89d6c81589cc47cf5811f570c67889bd18a/migration/migration.c#L3455
[3] https://github.com/qemu/qemu/blob/4e56e89d6c81589cc47cf5811f570c67889bd18a/migration/migration.c#L3591
[4] https://github.com/qemu/qemu/blob/4e56e89d6c81589cc47cf5811f570c67889bd18a/migration/migration.c#L3675
[5] https://github.com/torvalds/linux/blob/master/mm/khugepaged.c#L1307
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-31 17:01 ` Usama Arif
@ 2024-07-31 17:51 ` David Hildenbrand
2024-07-31 20:41 ` Usama Arif
0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-07-31 17:51 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 31.07.24 19:01, Usama Arif wrote:
>
>
> On 30/07/2024 21:25, David Hildenbrand wrote:
>> On 30.07.24 19:22, Usama Arif wrote:
>>>
>>>
>>> On 30/07/2024 17:11, David Hildenbrand wrote:
>>>> On 30.07.24 17:19, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 30/07/2024 16:14, Usama Arif wrote:
>>>>>>
>>>>>>
>>>>>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>>>>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>>>>> The current upstream default policy for THP is always. However, Meta
>>>>>>>> uses madvise in production as the current THP=always policy vastly
>>>>>>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>>>>>>> excessive memory pressure and premature OOM killing.
>>>>>>>> Using madvise + relying on khugepaged has certain drawbacks over
>>>>>>>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>>>>>>>> require userspace changes. Waiting for khugepaged to scan memory and
>>>>>>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>>>>>>> (i.e. you dont know when the collapse will happen), while production
>>>>>>>> environments require predictable performance. If there is enough memory
>>>>>>>> available, its better for both performance and predictability to have
>>>>>>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>>>>>>> to collapse it, and deal with sparsely populated THPs when the system is
>>>>>>>> running out of memory.
>>>>>>>>
>>>>>>>> This patch-series is an attempt to mitigate the issue of running out of
>>>>>>>> memory when THP is always enabled. During runtime whenever a THP is being
>>>>>>>> faulted in or collapsed by khugepaged, the THP is added to a list.
>>>>>>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>>>>>>> shrinker which goes through the list and checks if the THP was underutilized,
>>>>>>>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>>>>>>>> If this number goes above a certain threshold, the shrinker will attempt
>>>>>>>> to split that THP. Then at remap time, the pages that were zero-filled are
>>>>>>>> not remapped, hence saving memory. This method avoids the downside of
>>>>>>>> wasting memory in areas where THP is sparsely filled when THP is always
>>>>>>>> enabled, while still providing the upside THPs like reduced TLB misses without
>>>>>>>> having to use madvise.
>>>>>>>>
>>>>>>>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>>>>>>>> tested with THP shrinker. The results after 2 hours are as follows:
>>>>>>>>
>>>>>>>> | THP=madvise | THP=always | THP=always
>>>>>>>> | | | + shrinker series
>>>>>>>> | | | + max_ptes_none=409
>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>> Performance improvement | - | +1.8% | +1.7%
>>>>>>>> (over THP=madvise) | | |
>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>>>>>>>> (80%) zero filled filled pages will be split.
>>>>>>>>
>>>>>>>> To test out the patches, the below commands without the shrinker will
>>>>>>>> invoke OOM killer immediately and kill stress, but will not fail with
>>>>>>>> the shrinker:
>>>>>>>>
>>>>>>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>>>>>> mkdir /sys/fs/cgroup/test
>>>>>>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>>>>>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>>>>>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>>>>>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>>>>>>> # each THP, i.e. vm-stride 50K.
>>>>>>>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>>>>>>>> # killer.
>>>>>>>> # Without the shrinker, OOM killer is invoked immediately irrespective
>>>>>>>> # of max_ptes_none value and kill stress.
>>>>>>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>>>>>>
>>>>>>>> Patches 1-2 add back helper functions that were previously removed
>>>>>>>> to operate on page lists (needed by patch 3).
>>>>>>>> Patch 3 is an optimization to free zapped tail pages rather than
>>>>>>>> waiting for page reclaim or migration.
>>>>>>>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>>>>>>>> subpages when splitting THP.
>>>>>>>> Patches 6 adds support for THP shrinker.
>>>>>>>>
>>>>>>>> (This patch-series restarts the work on having a THP shrinker in kernel
>>>>>>>> originally done in
>>>>>>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>>>>>>>> The THP shrinker in this series is significantly different than the
>>>>>>>> original one, hence its labelled v1 (although the prerequisite to not
>>>>>>>> remap clean subpages is the same).)
>>>>>>>
>>>>>>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>>>>>>
>>>>>>> Something that is always possible (assuming no GUP pins, of course) is replacing the zero-filled subpages by shared zeropages.
>>>>>>>
>>>>>>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>>>>>>
>>>>>>
>>>>>> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>>>>>>
>>>>>> if (userfaultfd_armed(pvmw->vma)) {
>>>>>> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>>>>> pvmw->vma->vm_page_prot));
>>>>>> ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>>>>>> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>>>>> }
>>>>>
>>>>>
>>>>> Ah are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if its uffd? I think something like below would work in that case.
>>>>
>>>> I remember one ugly case in QEMU with postcopy live-migration where we must not zap zero-filled pages. I am not 100% regarding THP (if it could be enabled at that point), but imagine the following
>>>>
>>>> 1) mmap(), enable THP
>>>> 2) Migrate a bunch of pages from the source during precopy (writing to
>>>> the memory). Might end up creating THPs (during fault/khugepaged)
>>>> 3) Register UFFD on the VMA
>>>> 4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
>>>> 5) Discard any pages that have been re-dirtied or not migrated yet
>>>> 6) Migrate-on-demand any holes using uffd
>>>>
>>>>
>>>> If we discard zero-filled pages between 2) and 3) we might get wrong uffd notifications in 6 for pages that have already been migrated).
>>>>
>>>> I'll have to check if that actually happens in that sequence in QEMU: if QEMU would disable THP right before 2) we would be safe. But I recall that it is not the case :/
>>>>
>>>>
>>>
>>> Thanks for the example!
>>>
>>> Just to understand the issue better, as I am not very familiar with live-migration code, the problem is only for zero-filled pages that were migrated, right? If a THP is created and a subpage of it was a zero-page that was migrated and its split before VMA is armed with uffd, userfaultfd_armed(pvmw->vma) will return false when splitting and it will become pte_none. And afterwards when the destination faults on it, uffd will see that its pte_clear and will request the zero-page back from source. Uffd will then have to get the page again from source.
>>
>> Something like that, but I recall that if it detects that it already migrated the page stuff will go wrong.
>>
>> IIRC QEMU maintains receive bitmaps about which pages it already received+placed. Staring at migrate_send_rp_req_pages(), QEMU ignores the uffd notification if it finds that the page was already received and would not ask the source again.
>>
>> So if we turn some PTEs that map zeroed-pages into pte-none we could run into trouble, because we will never try requesting/placing that page again and our VM will just be stuck forever.
>>
>> I wonder if other uffd users could similarly be affected. But maybe they don't place pages into the VMA before registering uffd.
>>
>> I'll try to double-check when QEMU would disable THP. But it could also be that that behavior changed between QEMU versions.
>>
>>>
>>> If I understand the example correctly, the below diff over patch 6 should be good? i.e. just point to the empty_zero_page instead of doing pte_clear. This should still use the same amount of memory, although ptep_clear_flush means it might be slighly more expensive.
>>
>> There are rare cases where we must not use the shared zeropage (mm_forbids_zeropage()), that must be handled.
>>
>> I assume we could optimize here if uffd is not configured in. Further, what QEMU does is sense right at the beginning by temporarily registering uffd on some other VMA if it is even supported. That could be used as an indication ("ever used uffd -> don't turn PTEs none"). But again, no idea what other uffd users might be relying on :/
>>
>> I wonder if some applications could rely on anon memory not suddenly "vanishing" from the PTEs (for example relying on pagemap, like tools/testing/selftests/mm/cow.c does). I don't think a lot of applications would do that, though.
>>
>
>
> I had a deeper look at how QEMU handles migration and at how kernel handles THPs in other places with respect to uffd. I hope I understood the migration code correctly :)
>
> QEMU:
> There are 2 types of "migrations" [1], migration_thread and bg_migration_thread
> - migration_thread supports postcopy live migration, but doesn't use uffd. So postcopy live migration doesn't use uffd as far as I can tell looking at the function [2].
> - bg_migration_thread [3]: this isn't really live migration, but background snapshot. There is no postcopy and it works similar to pre-copy. From what I understand, uffd is registered before any pages are migrated [4].
>
Ignore the second one (bg_migration_thread). What QEMU ends up using is precopy + postcopy. You
can start precopy migration and decide at some point that you want to
switch to postcopy (this is what I trigger below, let me know if you
want a simple way to reproduce it).
> So the way these 2 threads work in qemu is: its either postcopy without uffd, or precopy-style snapshot with uffd registered at the start, then the current patch in this series is good, i.e. only use zeropage if userfaultfd_armed(pvmw->vma), otherwise pte_clear. I can't see postcopy with uffd anywhere in qemu which doesn all the steps 1-6 that you mentioned above, but I might not be looking at the right place?
"I can't see postcopy with uffd anywhere" -- most of it lives in
migration/postcopy-ram.c.
See postcopy_ram_incoming_setup() where most of the UFFD setup logic
happens.
>
> Kernel:
> From a kernel perspective, if we look at khugepaged [5], it just takes into account if the vma is currently registered as uffd. So maybe this scenario can happen?
> 1) THP is enabled on VMA, UFFD is not yet registered.
> 2) dest accesses a 4K page (lets call it 4Kpage1), khugepaged collapses that page into 2M THP as the following 511 4K pages were pte_none [5], as VMA is not uffd armed yet.
> 3) UFFD is registered + MADV_NOHUGEPAGE is set. No further collapse will happen, but the THPs that were created in step 2 will remain. MADV_NOHUGEPAGE doesn't split existing THP created in 2.
> 4) dest tries to access 4Kpage2, the address right after 4Kpage1. As khugepaged collapsed it, there won't be a pagefault, and dest will read it as a zero-filled page, even if UFFD handler wanted to give some non-zero data filled page. I think anyone who wrote the uffd handler would want there to be a pagefault and give the non-zero data filled page and would not expect dest to see a zero-filled page.
>
> What I am trying to say with the above example is:
> - UFFD registeration + MADV_NOHUGEPAGE should be done by all applications before any data in the region is accessed. Using THPs and accessing data before UFFD registeration + MADV_NOHUGEPAGE can lead to unexpected behaviour and is wrong?
The important point here is that you discard (MADV_DONTNEED) any pages
you want to fault+place later lazily. That must happen after
MADV_NOHUGEPAGE but before registering+running userfaultfd.
Then you don't care if page faults / khugepaged gave you THPs before
that. In fact, you want to happily use THPs wherever possible and not
disable them unconditionally right from the start.
> - even in the current kernel code in other places like khugepaged, its only checked if uffd is enabled currently. It is not tracked if it was ever enabled on any VMA.
Right, see above, this all works if you MADV_DONTNEED the pages you want
to get faults to after disabling THPs.
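To make the ordering concrete, here is a rough sketch with the plain syscalls (not the actual QEMU code; the function name and the area/len parameters are made up for the example, and error handling is omitted):
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
/* Sketch only: "area"/"len" stand for the region we want to handle lazily. */
static void prepare_lazy_region(void *area, size_t len)
{
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)area, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        int uffd;
        /* 1) No new THPs may form in this range from here on. */
        madvise(area, len, MADV_NOHUGEPAGE);
        /* 2) Discard the pages we want to fault+place lazily later
         *    (QEMU discards only selected pages; discarding the whole
         *    range keeps the sketch simple). */
        madvise(area, len, MADV_DONTNEED);
        /* 3) Only now create and register userfaultfd on the range. */
        uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        ioctl(uffd, UFFDIO_API, &api);
        ioctl(uffd, UFFDIO_REGISTER, &reg);
}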
I just added a bunch of quick printfs to QEMU and ran a precopy+postcopy
live migration. Looks like my assumption was right:
On the destination:
Writing received pages during precopy # ram_load_precopy()
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
Discarding pages # loadvm_postcopy_ram_handle_discard()
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Registering UFFD # postcopy_ram_incoming_setup()
Let me know if you need more information.
> Thanks for pointing to mm_forbids_zeropage. Incorporating that into the code, and if I am (hopefully :)) right about qemu and kernel above, then I believe the right code should be:
I'm afraid you are not right about the qemu code :)
>
> if (userfaultfd_armed(pvmw->vma) && mm_forbids_zeropage(pvmw->vma->vm_mm))
> return false;
>
> if (!userfaultfd_armed(pvmw->vma)) {
> pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
> } else {
> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
> pvmw->vma->vm_page_prot));
> ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
> }
>
>
>
> [1] https://github.com/qemu/qemu/blob/4e56e89d6c81589cc47cf5811f570c67889bd18a/migration/migration.c#L3817
> [2] https://github.com/qemu/qemu/blob/4e56e89d6c81589cc47cf5811f570c67889bd18a/migration/migration.c#L3455
> [3] https://github.com/qemu/qemu/blob/4e56e89d6c81589cc47cf5811f570c67889bd18a/migration/migration.c#L3591
> [4] https://github.com/qemu/qemu/blob/4e56e89d6c81589cc47cf5811f570c67889bd18a/migration/migration.c#L3675
> [5] https://github.com/torvalds/linux/blob/master/mm/khugepaged.c#L1307
>
>
>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-31 17:51 ` David Hildenbrand
@ 2024-07-31 20:41 ` Usama Arif
2024-08-01 6:36 ` David Hildenbrand
0 siblings, 1 reply; 40+ messages in thread
From: Usama Arif @ 2024-07-31 20:41 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 31/07/2024 18:51, David Hildenbrand wrote:
> On 31.07.24 19:01, Usama Arif wrote:
>>
>>
>> On 30/07/2024 21:25, David Hildenbrand wrote:
>>> On 30.07.24 19:22, Usama Arif wrote:
>>>>
>>>>
>>>> On 30/07/2024 17:11, David Hildenbrand wrote:
>>>>> On 30.07.24 17:19, Usama Arif wrote:
>>>>>>
>>>>>>
>>>>>> On 30/07/2024 16:14, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 30/07/2024 15:35, David Hildenbrand wrote:
>>>>>>>> On 30.07.24 14:45, Usama Arif wrote:
>>>>>>>>> The current upstream default policy for THP is always. However, Meta
>>>>>>>>> uses madvise in production as the current THP=always policy vastly
>>>>>>>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>>>>>>>> excessive memory pressure and premature OOM killing.
>>>>>>>>> Using madvise + relying on khugepaged has certain drawbacks over
>>>>>>>>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>>>>>>>>> require userspace changes. Waiting for khugepaged to scan memory and
>>>>>>>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>>>>>>>> (i.e. you dont know when the collapse will happen), while production
>>>>>>>>> environments require predictable performance. If there is enough memory
>>>>>>>>> available, its better for both performance and predictability to have
>>>>>>>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>>>>>>>> to collapse it, and deal with sparsely populated THPs when the system is
>>>>>>>>> running out of memory.
>>>>>>>>>
>>>>>>>>> This patch-series is an attempt to mitigate the issue of running out of
>>>>>>>>> memory when THP is always enabled. During runtime whenever a THP is being
>>>>>>>>> faulted in or collapsed by khugepaged, the THP is added to a list.
>>>>>>>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>>>>>>>> shrinker which goes through the list and checks if the THP was underutilized,
>>>>>>>>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>>>>>>>>> If this number goes above a certain threshold, the shrinker will attempt
>>>>>>>>> to split that THP. Then at remap time, the pages that were zero-filled are
>>>>>>>>> not remapped, hence saving memory. This method avoids the downside of
>>>>>>>>> wasting memory in areas where THP is sparsely filled when THP is always
>>>>>>>>> enabled, while still providing the upside THPs like reduced TLB misses without
>>>>>>>>> having to use madvise.
>>>>>>>>>
>>>>>>>>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>>>>>>>>> tested with THP shrinker. The results after 2 hours are as follows:
>>>>>>>>>
>>>>>>>>> | THP=madvise | THP=always | THP=always
>>>>>>>>> | | | + shrinker series
>>>>>>>>> | | | + max_ptes_none=409
>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>> Performance improvement | - | +1.8% | +1.7%
>>>>>>>>> (over THP=madvise) | | |
>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
>>>>>>>>> -----------------------------------------------------------------------------
>>>>>>>>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>>>>>>>>> (80%) zero filled filled pages will be split.
>>>>>>>>>
>>>>>>>>> To test out the patches, the below commands without the shrinker will
>>>>>>>>> invoke OOM killer immediately and kill stress, but will not fail with
>>>>>>>>> the shrinker:
>>>>>>>>>
>>>>>>>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>>>>>>> mkdir /sys/fs/cgroup/test
>>>>>>>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>>>>>>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>>>>>>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>>>>>>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>>>>>>>> # each THP, i.e. vm-stride 50K.
>>>>>>>>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>>>>>>>>> # killer.
>>>>>>>>> # Without the shrinker, OOM killer is invoked immediately irrespective
>>>>>>>>> # of max_ptes_none value and kill stress.
>>>>>>>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>>>>>>>
>>>>>>>>> Patches 1-2 add back helper functions that were previously removed
>>>>>>>>> to operate on page lists (needed by patch 3).
>>>>>>>>> Patch 3 is an optimization to free zapped tail pages rather than
>>>>>>>>> waiting for page reclaim or migration.
>>>>>>>>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>>>>>>>>> subpages when splitting THP.
>>>>>>>>> Patches 6 adds support for THP shrinker.
>>>>>>>>>
>>>>>>>>> (This patch-series restarts the work on having a THP shrinker in kernel
>>>>>>>>> originally done in
>>>>>>>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>>>>>>>>> The THP shrinker in this series is significantly different than the
>>>>>>>>> original one, hence its labelled v1 (although the prerequisite to not
>>>>>>>>> remap clean subpages is the same).)
>>>>>>>>
>>>>>>>> As shared previously, there is one issue with uffd (even when currently not active for a VMA!), where we must not zap present page table entries.
>>>>>>>>
>>>>>>>> Something that is always possible (assuming no GUP pins, of course) is replacing the zero-filled subpages by shared zeropages.
>>>>>>>>
>>>>>>>> Is that being done in this patch set already, or are we creating pte_none() entries?
>>>>>>>>
>>>>>>>
>>>>>>> I think thats done in Patch 4/6. In function try_to_unmap_unused, we have below which I think does what you are suggesting? i.e. point to shared zeropage and not clear pte for uffd armed vma.
>>>>>>>
>>>>>>> if (userfaultfd_armed(pvmw->vma)) {
>>>>>>> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>>>>>> pvmw->vma->vm_page_prot));
>>>>>>> ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
>>>>>>> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>>>>>> }
>>>>>>
>>>>>>
>>>>>> Ah are you suggesting userfaultfd_armed(pvmw->vma) will evaluate to false even if its uffd? I think something like below would work in that case.
>>>>>
>>>>> I remember one ugly case in QEMU with postcopy live-migration where we must not zap zero-filled pages. I am not 100% regarding THP (if it could be enabled at that point), but imagine the following
>>>>>
>>>>> 1) mmap(), enable THP
>>>>> 2) Migrate a bunch of pages from the source during precopy (writing to
>>>>> the memory). Might end up creating THPs (during fault/khugepaged)
>>>>> 3) Register UFFD on the VMA
>>>>> 4) Disable new THPs from forming via MADV_NOHUGEPAGE on the VMA
>>>>> 5) Discard any pages that have been re-dirtied or not migrated yet
>>>>> 6) Migrate-on-demand any holes using uffd
>>>>>
>>>>>
>>>>> If we discard zero-filled pages between 2) and 3) we might get wrong uffd notifications in 6 for pages that have already been migrated).
>>>>>
>>>>> I'll have to check if that actually happens in that sequence in QEMU: if QEMU would disable THP right before 2) we would be safe. But I recall that it is not the case :/
>>>>>
>>>>>
>>>>
>>>> Thanks for the example!
>>>>
>>>> Just to understand the issue better, as I am not very familiar with live-migration code, the problem is only for zero-filled pages that were migrated, right? If a THP is created and a subpage of it was a zero-page that was migrated and its split before VMA is armed with uffd, userfaultfd_armed(pvmw->vma) will return false when splitting and it will become pte_none. And afterwards when the destination faults on it, uffd will see that its pte_clear and will request the zero-page back from source. Uffd will then have to get the page again from source.
>>>
>>> Something like that, but I recall that if it detects that it already migrated the page stuff will go wrong.
>>>
>>> IIRC QEMU maintains receive bitmaps about which pages it already received+placed. Staring at migrate_send_rp_req_pages(), QEMU ignores the uffd notification if it finds that the page was already received and would not ask the source again.
>>>
>>> So if we turn some PTEs that map zeroed-pages into pte-none we could run into trouble, because we will never try requesting/placing that page again and our VM will just be stuck forever.
>>>
>>> I wonder if other uffd users could similarly be affected. But maybe they don't place pages into the VMA before registering uffd.
>>>
>>> I'll try to double-check when QEMU would disable THP. But it could also be that that behavior changed between QEMU versions.
>>>
>>>>
>>>> If I understand the example correctly, the below diff over patch 6 should be good? i.e. just point to the empty_zero_page instead of doing pte_clear. This should still use the same amount of memory, although ptep_clear_flush means it might be slighly more expensive.
>>>
>>> There are rare cases where we must not use the shared zeropage (mm_forbids_zeropage()), that must be handled.
>>>
>>> I assume we could optimize here if uffd is not configured in. Further, what QEMU does is sense right at the beginning by temporarily registering uffd on some other VMA if it is even supported. That could be used as an indication ("ever used uffd -> don't turn PTEs none"). But again, no idea what other uffd users might be relying on :/
>>>
>>> I wonder if some applications could rely on anon memory not suddenly "vanishing" from the PTEs (for example relying on pagemap, like tools/testing/selftests/mm/cow.c does). I don't think a lot of applications would do that, though.
>>>
>>
>>
>> I had a deeper look at how QEMU handles migration and at how kernel handles THPs in other places with respect to uffd. I hope I understood the migration code correctly :)
>>
>> QEMU:
>> There are 2 types of "migrations" [1], migration_thread and bg_migration_thread
>> - migration_thread supports postcopy live migration, but doesn't use uffd. So postcopy live migration doesn't use uffd as far as I can tell looking at the function [2].
>> - bg_migration_thread [3]: this isn't really live migration, but background snapshot. There is no postcopy and it works similar to pre-copy. From what I understand, uffd is registered before any pages are migrated [4].
>>
>
> Ignore the second. What QEMU ends up using is precopy + postcopy. You can start precopy migration and decide at some point that you want to switch to postcopy (this is what I trigger below, let me know if you want a simple way to reproduce it).
>
>> So the way these 2 threads work in qemu is: its either postcopy without uffd, or precopy-style snapshot with uffd registered at the start, then the current patch in this series is good, i.e. only use zeropage if userfaultfd_armed(pvmw->vma), otherwise pte_clear. I can't see postcopy with uffd anywhere in qemu which doesn all the steps 1-6 that you mentioned above, but I might not be looking at the right place?
>
> "I can't see postcopy with uffd anywhere" -- most of it lives in migration/postcopy-ram.c.
>
> See postcopy_ram_incoming_setup() where most of the UFFD setup logic happens.
>
Thanks for pointing to this, I did not look at this before :( postcopy_ram_incoming_setup clears it up.
>>
>> Kernel:
>> From a kernel perspective, if we look at khugepaged [5], it just takes into account if the vma is currently registered as uffd. So maybe this scenario can happen?
>> 1) THP is enabled on VMA, UFFD is not yet registered.
>> 2) dest accesses a 4K page (lets call it 4Kpage1), khugepaged collapses that page into 2M THP as the following 511 4K pages were pte_none [5], as VMA is not uffd armed yet.
>> 3) UFFD is registered + MADV_NOHUGEPAGE is set. No further collapse will happen, but the THPs that were created in step 2 will remain. MADV_NOHUGEPAGE doesn't split existing THP created in 2.
>> 4) dest tries to access 4Kpage2, the address right after 4Kpage1. As khugepaged collapsed it, there won't be a pagefault, and dest will read it as a zero-filled page, even if UFFD handler wanted to give some non-zero data filled page. I think anyone who wrote the uffd handler would want there to be a pagefault and give the non-zero data filled page and would not expect dest to see a zero-filled page.
>>
>> What I am trying to say with the above example is:
>> - UFFD registeration + MADV_NOHUGEPAGE should be done by all applications before any data in the region is accessed. Using THPs and accessing data before UFFD registeration + MADV_NOHUGEPAGE can lead to unexpected behaviour and is wrong?
>
> The important point here is that you discard (MADV_DONTNEED) any pages you want to fault+place later lazily. That must happen after MADV_NOHUGEPAGE but before registering+running userfaultfd.
>
> Then you don't care if page faults / kuhgepaged gave you THPs before that. In fact, you want to happily use THPs wherever possible and not
> disable them unconditionally right from the start.
>
>> - even in the current kernel code in other places like khugepaged, its only checked if uffd is enabled currently. It is not tracked if it was ever enabled on any VMA.
>
> Right, see above, this all works if you MADV_DONTNEED the pages you want to get faults to after disabling THPs.
>
>
> I just added a bunch of quick printfs to QEMU and ran a precopy+postcopy live migration. Looks like my assumption was right:
>
> On the destination:
>
> Writing received pages during precopy # ram_load_precopy()
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
> Discarding pages # loadvm_postcopy_ram_handle_discard()
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Registering UFFD # postcopy_ram_incoming_setup()
>
Thanks for this, yes it makes sense after you mentioned postcopy_ram_incoming_setup.
postcopy_ram_incoming_setup happens in the Listen phase, which is after the discard phase, so I was able to follow in the QEMU code the same sequence of events that the prints above show.
>
> Let me know if you need more information.
>
>> Thanks for pointing to mm_forbids_zeropage. Incorporating that into the code, and if I am (hopefully :)) right about qemu and kernel above, then I believe the right code should be:
>
> I'm afraid you are not right about the qemu code :)
>
Yes, and I also didn't consider MADV_DONTNEED! Thanks for explaining both of these things clearly. It's clear that pte_clear won't work in this case.
We don't need to clear the PTE; we can just use the zeropage for all cases. The original series from Alex did a TLB flush, but looking further at the code, that's not needed: try_to_migrate() flushes the TLB and installs migration entries, which are not ‘present’ and so will never be TLB-cached, and remove_migration_ptes() restores the page pointers, so no TLB flushing is needed there. When using the zeropage, we don't need to make a distinction between uffd being used or not, i.e. we can just do the below:
if (contains_data || mm_forbids_zeropage(pvmw->vma->vm_mm))
        return false;

newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
                               pvmw->vma->vm_page_prot));
set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-31 20:41 ` Usama Arif
@ 2024-08-01 6:36 ` David Hildenbrand
2024-08-04 23:04 ` Usama Arif
[not found] ` <20240806172830.GD322282@cmpxchg.org>
0 siblings, 2 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-08-01 6:36 UTC (permalink / raw)
To: Usama Arif, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
>> I just added a bunch of quick printfs to QEMU and ran a precopy+postcopy live migration. Looks like my assumption was right:
>>
>> On the destination:
>>
>> Writing received pages during precopy # ram_load_precopy()
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
>> Discarding pages # loadvm_postcopy_ram_handle_discard()
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Registering UFFD # postcopy_ram_incoming_setup()
>>
>
> Thanks for this, yes it makes sense after you mentioned postcopy_ram_incoming_setup.
> postcopy_ram_incoming_setup happens in the Listen phase, which is after the discard phase, so I was able to follow in code in qemu the same sequence of events that the above prints show.
I just added another printf to postcopy_ram_supported_by_host(), where
we temporarily do a UFFDIO_REGISTER on some test area.
Sensing UFFD support # postcopy_ram_supported_by_host()
Sensing UFFD support
Writing received pages during precopy # ram_load_precopy()
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Writing received pages during precopy
Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
Discarding pages # loadvm_postcopy_ram_handle_discard()
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Discarding pages
Registering UFFD # postcopy_ram_incoming_setup()
We could think about using this "ever used uffd" to avoid the shared
zeropage in most processes.
Of course, there might be other applications where that wouldn't work,
but I think this behavior (write to area before enabling uffd) might be
fairly QEMU specific already.
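For context, that sensing probe is essentially "create a userfaultfd and temporarily UFFDIO_REGISTER a scratch page", which is why such a process would count as having "ever used uffd" well before the real VMA is armed. A minimal sketch of the pattern (not the actual postcopy_ram_supported_by_host() code, error handling trimmed):
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
static bool uffd_probe(void)
{
        struct uffdio_api api = { .api = UFFD_API };
        struct uffdio_register reg;
        size_t len = sysconf(_SC_PAGESIZE);
        void *area;
        bool ok = false;
        int uffd;
        uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        if (uffd < 0)
                return false;
        if (ioctl(uffd, UFFDIO_API, &api))
                goto out;
        /* Temporarily register a throwaway page, then undo everything. */
        area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (area == MAP_FAILED)
                goto out;
        reg.range.start = (unsigned long)area;
        reg.range.len = len;
        reg.mode = UFFDIO_REGISTER_MODE_MISSING;
        ok = !ioctl(uffd, UFFDIO_REGISTER, &reg);
        munmap(area, len);
out:
        close(uffd);
        return ok;
}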
Avoiding the shared zeropage has the benefit that a later write fault
won't have to do a TLB flush and can simply install a fresh anon page.
>>
>> Let me know if you need more information.
>>
>>> Thanks for pointing to mm_forbids_zeropage. Incorporating that into the code, and if I am (hopefully :)) right about qemu and kernel above, then I believe the right code should be:
>>
>> I'm afraid you are not right about the qemu code :)
>>
>
> Yes, and also didn't consider MADV_DONTNEED! Thanks for explaining both of these things clearly. Its clear that pte_clear won't work in this case.
>
> We don't need to clear_pte, just use zero_page for all cases. The original series from Alex did tlb flush, but looking further at the code, thats not needed. try_to_migrate() flushes tlb and installs migration entries which are not ‘present’ so will never be tlb cached. remove_migration_ptes() restores page pointers so tlb flushing is not needed. When using zeropage, we don't need make a distinction if uffd is used or not. i.e. we can just do below:
>
> if (contains_data || mm_forbids_zeropage(pvmw->vma->vm_mm))
It's worth noting that on s390x, MMs that forbid the zeropage also have
THPs disabled. So we shouldn't really run into that all that often (of
course, it's subject to change in the future, so we better have this
check here).
> return false;
>
> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
> pvmw->vma->vm_page_prot));
>
> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
We're replacing a present page by another present page without doing a
TLB flush in between. I *think* this should be fine because the new
present page is R/O and cannot possibly be written to.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 6:36 ` David Hildenbrand
@ 2024-08-04 23:04 ` Usama Arif
[not found] ` <20240806172830.GD322282@cmpxchg.org>
1 sibling, 0 replies; 40+ messages in thread
From: Usama Arif @ 2024-08-04 23:04 UTC (permalink / raw)
To: David Hildenbrand, akpm, linux-mm
Cc: hannes, riel, shakeel.butt, roman.gushchin, yuzhao, baohua,
ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 01/08/2024 07:36, David Hildenbrand wrote:
>>> I just added a bunch of quick printfs to QEMU and ran a precopy+postcopy live migration. Looks like my assumption was right:
>>>
>>> On the destination:
>>>
>>> Writing received pages during precopy # ram_load_precopy()
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Writing received pages during precopy
>>> Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
>>> Discarding pages # loadvm_postcopy_ram_handle_discard()
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Discarding pages
>>> Registering UFFD # postcopy_ram_incoming_setup()
>>>
>>
>> Thanks for this, yes it makes sense after you mentioned postcopy_ram_incoming_setup.
>> postcopy_ram_incoming_setup happens in the Listen phase, which is after the discard phase, so I was able to follow in code in qemu the same sequence of events that the above prints show.
>
>
> I just added another printf to postcopy_ram_supported_by_host(), where we temporarily do a UFFDIO_REGISTER on some test area.
>
> Sensing UFFD support # postcopy_ram_supported_by_host()
> Sensing UFFD support
> Writing received pages during precopy # ram_load_precopy()
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Writing received pages during precopy
> Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
> Discarding pages # loadvm_postcopy_ram_handle_discard()
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Discarding pages
> Registering UFFD # postcopy_ram_incoming_setup()
>
> We could think about using this "ever user uffd" to avoid the shared zeropage in most processes.
>
> Of course, there might be other applications where that wouldn't work, but I think this behavior (write to area before enabling uffd) might be fairly QEMU specific already.
>
> Avoiding the shared zeropage has the benefit that a later write fault won't have to do a TLB flush and can simply install a fresh anon page.
>
I checked CRIU and that does a check at the start as well before attempting to use uffd: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L1349
If writing to an area before enabling uffd is likely to be QEMU-specific, then you make a good point about clearing the PTE instead of using the shared zeropage (to avoid the TLB flush), as long as that is skipped whenever uffd has ever been used.
I think "ever used uffd" would need to be tracked in mm_struct. This also won't cause an issue if the check is done in a parent process and the actual use is in a forked process, as copy_mm should take care of it.
The possibilities would then be:
1) Have a new bit in mm->flags, set it in new_userfaultfd and test it in try_to_unmap_unused, but unfortunately all the bits in mm->flags are taken (a rough sketch of what this could look like is below).
2) We could use mm->def_flags, as it looks like there is an unused bit (0x800) just before VM_UFFD_WP. But that makes the code confusing, as def_flags is used to initialize the default flags for VMAs and is not supposed to be used as an "mm flag".
3) Introduce mm->flags2 and set/test it as in 1. This would add an 8-byte overhead to every mm_struct.
I am not sure either 2 or 3 is acceptable upstream, unless there is a need for more flags in the near future and the 8-byte overhead starts to make sense. Maybe we go with the shared zeropage?
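For reference, a rough sketch of what option 1 could look like if a free bit existed (the MMF_EVER_UFFD name and bit value are made up here, and the hunks are illustrative rather than a real patch):
/* Hypothetical bit: in reality all current mm->flags bits are taken. */
#define MMF_EVER_UFFD           28

/* In new_userfaultfd() (fs/userfaultfd.c): mark the mm the first time any
 * userfaultfd is created for it, even if no VMA ever gets registered. */
        set_bit(MMF_EVER_UFFD, &current->mm->flags);

/* In try_to_unmap_unused(): only create pte_none() entries if this mm has
 * never touched uffd; otherwise fall back to the shared zeropage. */
        if (!test_bit(MMF_EVER_UFFD, &pvmw->vma->vm_mm->flags)) {
                pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
        } else {
                newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
                                               pvmw->vma->vm_page_prot));
                set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
        }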
^ permalink raw reply [flat|nested] 40+ messages in thread[parent not found: <20240806172830.GD322282@cmpxchg.org>]
* Re: [PATCH 0/6] mm: split underutilized THPs
[not found] ` <20240806172830.GD322282@cmpxchg.org>
@ 2024-08-06 17:33 ` David Hildenbrand
0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-08-06 17:33 UTC (permalink / raw)
To: Johannes Weiner
Cc: Usama Arif, akpm, linux-mm, riel, shakeel.butt, roman.gushchin,
yuzhao, baohua, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On 06.08.24 19:28, Johannes Weiner wrote:
> On Thu, Aug 01, 2024 at 08:36:32AM +0200, David Hildenbrand wrote:
>> I just added another printf to postcopy_ram_supported_by_host(), where
>> we temporarily do a UFFDIO_REGISTER on some test area.
>>
>> Sensing UFFD support # postcopy_ram_supported_by_host()
>> Sensing UFFD support
>> Writing received pages during precopy # ram_load_precopy()
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Writing received pages during precopy
>> Disabling THP: MADV_NOHUGEPAGE # postcopy_ram_prepare_discard()
>> Discarding pages # loadvm_postcopy_ram_handle_discard()
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Discarding pages
>> Registering UFFD # postcopy_ram_incoming_setup()
>>
>> We could think about using this "ever user uffd" to avoid the shared
>> zeropage in most processes.
>>
>> Of course, there might be other applications where that wouldn't work,
>> but I think this behavior (write to area before enabling uffd) might be
>> fairly QEMU specific already.
>
> It makes me a bit uneasy to hardcode this into the kernel. It's fairly
> specific to qemu/criu, and won't protect usecases that behave slightly
> differently.
>
> It would also give userfaultfd users that aren't susceptible to this
> particular scenario a different code path.
True, but let's be honest, it's not like there are a million userfaultfd
users out there :)
Anyhow ...
>
>> Avoiding the shared zeropage has the benefit that a later write fault
>> won't have to do a TLB flush and can simply install a fresh anon page.
>
> That's true - although if that happens frequently, it's something we
> might want to tune the shrinker for anyway. If subpages do get used
> later, we probably shouldn't have split the THP to begin with.
>
> IMO the safest bet would be to use the zero page unconditionally.
>
... I don't disagree. It also smells more like an optimization on top,
if ever.
>>> return false;
>>>
>>> newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
>>> pvmw->vma->vm_page_prot));
>>>
>>> set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
>>
>> We're replacing a present page by another present page without doing a
>> TLB flush in between. I *think* this should be fine because the new
>> present page is R/O and cannot possibly be written to.
>
> It's safe because it's replacing a migration entry. The TLB was
> flushed when that was installed, and since the migration pte is not
> marked present it couldn't have re-established a TLB entry.
Oh, right, we're dealing with a migration entry.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-07-30 12:45 [PATCH 0/6] mm: split underutilized THPs Usama Arif
` (6 preceding siblings ...)
2024-07-30 14:35 ` [PATCH 0/6] " David Hildenbrand
@ 2024-08-01 6:09 ` Yu Zhao
2024-08-01 15:47 ` David Hildenbrand
` (2 more replies)
7 siblings, 3 replies; 40+ messages in thread
From: Yu Zhao @ 2024-08-01 6:09 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> The current upstream default policy for THP is always. However, Meta
> uses madvise in production as the current THP=always policy vastly
> overprovisions THPs in sparsely accessed memory areas, resulting in
> excessive memory pressure and premature OOM killing.
> Using madvise + relying on khugepaged has certain drawbacks over
> THP=always. Using madvise hints mean THPs aren't "transparent" and
> require userspace changes. Waiting for khugepaged to scan memory and
> collapse pages into THP can be slow and unpredictable in terms of performance
> (i.e. you dont know when the collapse will happen), while production
> environments require predictable performance. If there is enough memory
> available, its better for both performance and predictability to have
> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> to collapse it, and deal with sparsely populated THPs when the system is
> running out of memory.
>
> This patch-series is an attempt to mitigate the issue of running out of
> memory when THP is always enabled. During runtime whenever a THP is being
> faulted in or collapsed by khugepaged, the THP is added to a list.
> Whenever memory reclaim happens, the kernel runs the deferred_split
> shrinker which goes through the list and checks if the THP was underutilized,
> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> If this number goes above a certain threshold, the shrinker will attempt
> to split that THP. Then at remap time, the pages that were zero-filled are
> not remapped, hence saving memory. This method avoids the downside of
> wasting memory in areas where THP is sparsely filled when THP is always
> enabled, while still providing the upside THPs like reduced TLB misses without
> having to use madvise.
>
> Meta production workloads that were CPU bound (>99% CPU utilzation) were
> tested with THP shrinker. The results after 2 hours are as follows:
>
> | THP=madvise | THP=always | THP=always
> | | | + shrinker series
> | | | + max_ptes_none=409
> -----------------------------------------------------------------------------
> Performance improvement | - | +1.8% | +1.7%
> (over THP=madvise) | | |
> -----------------------------------------------------------------------------
> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
> -----------------------------------------------------------------------------
> max_ptes_none=409 means that any THP that has more than 409 out of 512
> (80%) zero filled filled pages will be split.
>
> To test out the patches, the below commands without the shrinker will
> invoke OOM killer immediately and kill stress, but will not fail with
> the shrinker:
>
> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> echo 20M > /sys/fs/cgroup/test/memory.max
> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> # allocate twice memory.max for each stress worker and touch 40/512 of
> # each THP, i.e. vm-stride 50K.
> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> # killer.
> # Without the shrinker, OOM killer is invoked immediately irrespective
> # of max_ptes_none value and kill stress.
> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>
> Patches 1-2 add back helper functions that were previously removed
> to operate on page lists (needed by patch 3).
> Patch 3 is an optimization to free zapped tail pages rather than
> waiting for page reclaim or migration.
> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
> subpages when splitting THP.
> Patches 6 adds support for THP shrinker.
>
> (This patch-series restarts the work on having a THP shrinker in kernel
> originally done in
> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
> The THP shrinker in this series is significantly different than the
> original one, hence its labelled v1 (although the prerequisite to not
> remap clean subpages is the same).)
>
> Alexander Zhu (1):
> mm: add selftests to split_huge_page() to verify unmap/zap of zero
> pages
>
> Usama Arif (3):
> Revert "memcg: remove mem_cgroup_uncharge_list()"
> Revert "mm: remove free_unref_page_list()"
> mm: split underutilized THPs
>
> Yu Zhao (2):
> mm: free zapped tail pages when splitting isolated thp
> mm: don't remap unused subpages when splitting isolated thp
I would recommend shatter [1] instead of splitting so that
1) whoever underutilized their THPs get punished for the overhead;
2) underutilized THPs are kept intact and can be reused by others.
[1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
^ permalink raw reply	[flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 6:09 ` Yu Zhao
@ 2024-08-01 15:47 ` David Hildenbrand
2024-08-04 21:54 ` Yu Zhao
2024-08-01 16:22 ` Usama Arif
2024-08-06 17:38 ` Johannes Weiner
2 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2024-08-01 15:47 UTC (permalink / raw)
To: Yu Zhao, Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 01.08.24 08:09, Yu Zhao wrote:
> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> The current upstream default policy for THP is always. However, Meta
>> uses madvise in production as the current THP=always policy vastly
>> overprovisions THPs in sparsely accessed memory areas, resulting in
>> excessive memory pressure and premature OOM killing.
>> Using madvise + relying on khugepaged has certain drawbacks over
>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>> require userspace changes. Waiting for khugepaged to scan memory and
>> collapse pages into THP can be slow and unpredictable in terms of performance
>> (i.e. you dont know when the collapse will happen), while production
>> environments require predictable performance. If there is enough memory
>> available, its better for both performance and predictability to have
>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>> to collapse it, and deal with sparsely populated THPs when the system is
>> running out of memory.
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled. During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>> If this number goes above a certain threshold, the shrinker will attempt
>> to split that THP. Then at remap time, the pages that were zero-filled are
>> not remapped, hence saving memory. This method avoids the downside of
>> wasting memory in areas where THP is sparsely filled when THP is always
>> enabled, while still providing the upside THPs like reduced TLB misses without
>> having to use madvise.
>>
>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>> tested with THP shrinker. The results after 2 hours are as follows:
>>
>> | THP=madvise | THP=always | THP=always
>> | | | + shrinker series
>> | | | + max_ptes_none=409
>> -----------------------------------------------------------------------------
>> Performance improvement | - | +1.8% | +1.7%
>> (over THP=madvise) | | |
>> -----------------------------------------------------------------------------
>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
>> -----------------------------------------------------------------------------
>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>> (80%) zero filled filled pages will be split.
>>
>> To test out the patches, the below commands without the shrinker will
>> invoke OOM killer immediately and kill stress, but will not fail with
>> the shrinker:
>>
>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>> mkdir /sys/fs/cgroup/test
>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>> echo 20M > /sys/fs/cgroup/test/memory.max
>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>> # allocate twice memory.max for each stress worker and touch 40/512 of
>> # each THP, i.e. vm-stride 50K.
>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>> # killer.
>> # Without the shrinker, OOM killer is invoked immediately irrespective
>> # of max_ptes_none value and kill stress.
>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>
>> Patches 1-2 add back helper functions that were previously removed
>> to operate on page lists (needed by patch 3).
>> Patch 3 is an optimization to free zapped tail pages rather than
>> waiting for page reclaim or migration.
>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>> subpages when splitting THP.
>> Patches 6 adds support for THP shrinker.
>>
>> (This patch-series restarts the work on having a THP shrinker in kernel
>> originally done in
>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>> The THP shrinker in this series is significantly different than the
>> original one, hence its labelled v1 (although the prerequisite to not
>> remap clean subpages is the same).)
>>
>> Alexander Zhu (1):
>> mm: add selftests to split_huge_page() to verify unmap/zap of zero
>> pages
>>
>> Usama Arif (3):
>> Revert "memcg: remove mem_cgroup_uncharge_list()"
>> Revert "mm: remove free_unref_page_list()"
>> mm: split underutilized THPs
>>
>> Yu Zhao (2):
>> mm: free zapped tail pages when splitting isolated thp
>> mm: don't remap unused subpages when splitting isolated thp
>
> I would recommend shatter [1] instead of splitting so that
> 1) whoever underutilized their THPs get punished for the overhead;
> 2) underutilized THPs are kept intact and can be reused by others.
>
> [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
>
Do you have any plans to upstream the shattering also during "ordinary"
deferred splitting?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 15:47 ` David Hildenbrand
@ 2024-08-04 21:54 ` Yu Zhao
2024-08-05 1:32 ` Rik van Riel
0 siblings, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2024-08-04 21:54 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, akpm, linux-mm, hannes, riel, shakeel.butt,
roman.gushchin, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
On Thu, Aug 1, 2024 at 9:47 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.08.24 08:09, Yu Zhao wrote:
> > On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >> The current upstream default policy for THP is always. However, Meta
> >> uses madvise in production as the current THP=always policy vastly
> >> overprovisions THPs in sparsely accessed memory areas, resulting in
> >> excessive memory pressure and premature OOM killing.
> >> Using madvise + relying on khugepaged has certain drawbacks over
> >> THP=always. Using madvise hints mean THPs aren't "transparent" and
> >> require userspace changes. Waiting for khugepaged to scan memory and
> >> collapse pages into THP can be slow and unpredictable in terms of performance
> >> (i.e. you dont know when the collapse will happen), while production
> >> environments require predictable performance. If there is enough memory
> >> available, its better for both performance and predictability to have
> >> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> >> to collapse it, and deal with sparsely populated THPs when the system is
> >> running out of memory.
> >>
> >> This patch-series is an attempt to mitigate the issue of running out of
> >> memory when THP is always enabled. During runtime whenever a THP is being
> >> faulted in or collapsed by khugepaged, the THP is added to a list.
> >> Whenever memory reclaim happens, the kernel runs the deferred_split
> >> shrinker which goes through the list and checks if the THP was underutilized,
> >> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> >> If this number goes above a certain threshold, the shrinker will attempt
> >> to split that THP. Then at remap time, the pages that were zero-filled are
> >> not remapped, hence saving memory. This method avoids the downside of
> >> wasting memory in areas where THP is sparsely filled when THP is always
> >> enabled, while still providing the upside THPs like reduced TLB misses without
> >> having to use madvise.
> >>
> >> Meta production workloads that were CPU bound (>99% CPU utilzation) were
> >> tested with THP shrinker. The results after 2 hours are as follows:
> >>
> >> | THP=madvise | THP=always | THP=always
> >> | | | + shrinker series
> >> | | | + max_ptes_none=409
> >> -----------------------------------------------------------------------------
> >> Performance improvement | - | +1.8% | +1.7%
> >> (over THP=madvise) | | |
> >> -----------------------------------------------------------------------------
> >> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
> >> -----------------------------------------------------------------------------
> >> max_ptes_none=409 means that any THP that has more than 409 out of 512
> >> (80%) zero filled filled pages will be split.
> >>
> >> To test out the patches, the below commands without the shrinker will
> >> invoke OOM killer immediately and kill stress, but will not fail with
> >> the shrinker:
> >>
> >> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> >> mkdir /sys/fs/cgroup/test
> >> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> >> echo 20M > /sys/fs/cgroup/test/memory.max
> >> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> >> # allocate twice memory.max for each stress worker and touch 40/512 of
> >> # each THP, i.e. vm-stride 50K.
> >> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> >> # killer.
> >> # Without the shrinker, OOM killer is invoked immediately irrespective
> >> # of max_ptes_none value and kill stress.
> >> stress --vm 1 --vm-bytes 40M --vm-stride 50K
> >>
> >> Patches 1-2 add back helper functions that were previously removed
> >> to operate on page lists (needed by patch 3).
> >> Patch 3 is an optimization to free zapped tail pages rather than
> >> waiting for page reclaim or migration.
> >> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
> >> subpages when splitting THP.
> >> Patches 6 adds support for THP shrinker.
> >>
> >> (This patch-series restarts the work on having a THP shrinker in kernel
> >> originally done in
> >> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
> >> The THP shrinker in this series is significantly different than the
> >> original one, hence its labelled v1 (although the prerequisite to not
> >> remap clean subpages is the same).)
> >>
> >> Alexander Zhu (1):
> >> mm: add selftests to split_huge_page() to verify unmap/zap of zero
> >> pages
> >>
> >> Usama Arif (3):
> >> Revert "memcg: remove mem_cgroup_uncharge_list()"
> >> Revert "mm: remove free_unref_page_list()"
> >> mm: split underutilized THPs
> >>
> >> Yu Zhao (2):
> >> mm: free zapped tail pages when splitting isolated thp
> >> mm: don't remap unused subpages when splitting isolated thp
> >
> > I would recommend shatter [1] instead of splitting so that
> > 1) whoever underutilized their THPs get punished for the overhead;
> > 2) underutilized THPs are kept intact and can be reused by others.
> >
> > [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
> >
>
> Do you have any plans to upstream the shattering also during "ordinary"
> deferred splitting?
Yes, once we finish verifying it in our production.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-04 21:54 ` Yu Zhao
@ 2024-08-05 1:32 ` Rik van Riel
2024-08-05 19:51 ` Yu Zhao
0 siblings, 1 reply; 40+ messages in thread
From: Rik van Riel @ 2024-08-05 1:32 UTC (permalink / raw)
To: Yu Zhao, David Hildenbrand
Cc: Usama Arif, akpm, linux-mm, hannes, shakeel.butt, roman.gushchin,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On Sun, 2024-08-04 at 15:54 -0600, Yu Zhao wrote:
> On Thu, Aug 1, 2024 at 9:47 AM David Hildenbrand <david@redhat.com>
> wrote:
> >
> > On 01.08.24 08:09, Yu Zhao wrote:
> > >
> > > I would recommend shatter [1] instead of splitting so that
> > > 1) whoever underutilized their THPs get punished for the
> > > overhead;
> > > 2) underutilized THPs are kept intact and can be reused by
> > > others.
> > >
> > > [1]
> > > https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
> > >
> >
> > Do you have any plans to upstream the shattering also during
> > "ordinary"
> > deferred splitting?
>
> Yes, once we finish verifying it in our production.
>
Shattering does seem like a nice improvement to the THP shrinker!
However, given that the shattering code is still being verified,
and the THP shrinker policy will no doubt need some tuning once
more real world workloads get thrown at it, would it make sense
to do those two things in parallel?
We could move forward with the THP shrinker as-is today, and use
the increased exposure it gets to fine tune the shrinking policy,
and then move it over to using the shattering code once that is
ready.
Is there any good reason to serialize these two things?
--
All Rights Reversed.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-05 1:32 ` Rik van Riel
@ 2024-08-05 19:51 ` Yu Zhao
0 siblings, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2024-08-05 19:51 UTC (permalink / raw)
To: Rik van Riel
Cc: David Hildenbrand, Usama Arif, akpm, linux-mm, hannes,
shakeel.butt, roman.gushchin, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
On Sun, Aug 4, 2024 at 7:33 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Sun, 2024-08-04 at 15:54 -0600, Yu Zhao wrote:
> > On Thu, Aug 1, 2024 at 9:47 AM David Hildenbrand <david@redhat.com>
> > wrote:
> > >
> > > On 01.08.24 08:09, Yu Zhao wrote:
> > > >
> > > > I would recommend shatter [1] instead of splitting so that
> > > > 1) whoever underutilized their THPs get punished for the
> > > > overhead;
> > > > 2) underutilized THPs are kept intact and can be reused by
> > > > others.
> > > >
> > > > [1]
> > > > https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
> > > >
> > >
> > > Do you have any plans to upstream the shattering also during
> > > "ordinary"
> > > deferred splitting?
> >
> > Yes, once we finish verifying it in our production.
> >
>
> Shattering does seem like a nice improvement to the THP shrinker!
>
> However, given that the shattering code is still being verified,
> and the THP shrinker policy will no doubt need some tuning once
> more real world workloads get thrown at it, would it make sense
> to do those two things in parallel?
>
> We could move forward with the THP shrinker as-is today, and use
> the increased exposure it gets to fine tune the shrinking policy,
> and then move it over to using the shattering code once that is
> ready.
>
> Is there any good reason to serialize these two things?
I'm fine with whichever way you prefer: if you are eager to try
shattering in your production environment, I'd be incentivized to
throw in extra engineers and get it ready for you asap.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 6:09 ` Yu Zhao
2024-08-01 15:47 ` David Hildenbrand
@ 2024-08-01 16:22 ` Usama Arif
2024-08-01 16:27 ` David Hildenbrand
2024-08-04 23:23 ` Yu Zhao
2024-08-06 17:38 ` Johannes Weiner
2 siblings, 2 replies; 40+ messages in thread
From: Usama Arif @ 2024-08-01 16:22 UTC (permalink / raw)
To: Yu Zhao
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 01/08/2024 07:09, Yu Zhao wrote:
> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> The current upstream default policy for THP is always. However, Meta
>> uses madvise in production as the current THP=always policy vastly
>> overprovisions THPs in sparsely accessed memory areas, resulting in
>> excessive memory pressure and premature OOM killing.
>> Using madvise + relying on khugepaged has certain drawbacks over
>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>> require userspace changes. Waiting for khugepaged to scan memory and
>> collapse pages into THP can be slow and unpredictable in terms of performance
>> (i.e. you dont know when the collapse will happen), while production
>> environments require predictable performance. If there is enough memory
>> available, its better for both performance and predictability to have
>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>> to collapse it, and deal with sparsely populated THPs when the system is
>> running out of memory.
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled. During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>> If this number goes above a certain threshold, the shrinker will attempt
>> to split that THP. Then at remap time, the pages that were zero-filled are
>> not remapped, hence saving memory. This method avoids the downside of
>> wasting memory in areas where THP is sparsely filled when THP is always
>> enabled, while still providing the upside THPs like reduced TLB misses without
>> having to use madvise.
>>
>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>> tested with THP shrinker. The results after 2 hours are as follows:
>>
>> | THP=madvise | THP=always | THP=always
>> | | | + shrinker series
>> | | | + max_ptes_none=409
>> -----------------------------------------------------------------------------
>> Performance improvement | - | +1.8% | +1.7%
>> (over THP=madvise) | | |
>> -----------------------------------------------------------------------------
>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
>> -----------------------------------------------------------------------------
>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>> (80%) zero filled filled pages will be split.
>>
>> To test out the patches, the below commands without the shrinker will
>> invoke OOM killer immediately and kill stress, but will not fail with
>> the shrinker:
>>
>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>> mkdir /sys/fs/cgroup/test
>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>> echo 20M > /sys/fs/cgroup/test/memory.max
>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>> # allocate twice memory.max for each stress worker and touch 40/512 of
>> # each THP, i.e. vm-stride 50K.
>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>> # killer.
>> # Without the shrinker, OOM killer is invoked immediately irrespective
>> # of max_ptes_none value and kill stress.
>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>
>> Patches 1-2 add back helper functions that were previously removed
>> to operate on page lists (needed by patch 3).
>> Patch 3 is an optimization to free zapped tail pages rather than
>> waiting for page reclaim or migration.
>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>> subpages when splitting THP.
>> Patches 6 adds support for THP shrinker.
>>
>> (This patch-series restarts the work on having a THP shrinker in kernel
>> originally done in
>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>> The THP shrinker in this series is significantly different than the
>> original one, hence its labelled v1 (although the prerequisite to not
>> remap clean subpages is the same).)
>>
>> Alexander Zhu (1):
>> mm: add selftests to split_huge_page() to verify unmap/zap of zero
>> pages
>>
>> Usama Arif (3):
>> Revert "memcg: remove mem_cgroup_uncharge_list()"
>> Revert "mm: remove free_unref_page_list()"
>> mm: split underutilized THPs
>>
>> Yu Zhao (2):
>> mm: free zapped tail pages when splitting isolated thp
>> mm: don't remap unused subpages when splitting isolated thp
>
> I would recommend shatter [1] instead of splitting so that
> 1) whoever underutilized their THPs get punished for the overhead;
> 2) underutilized THPs are kept intact and can be reused by others.
>
> [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
The objective of this series is to reduce memory usage while trying to keep the performance benefits you get from using THP=always. Punishing any application's performance is the opposite of what I am trying to do here.
For example, if there is only one main application running in production and it is using the majority of the THPs, then reducing its performance doesn't make sense.
Also, while going through the commit, I found the line "The advantage of shattering is that it keeps the original THP intact" a bit confusing. I am guessing the THP is freed? I.e. if a 2M THP has 10 non-zero-filled base pages and the rest are zero-filled, then after shattering we will have 10*4K of memory and not 2M+10*4K? Is it the case that the THP is reused at the next fault?
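For anyone who wants to reproduce the premise of such a sparsely populated THP, a minimal userspace sketch is below (illustrative only; it assumes a THP-capable kernel with THP set to always, or to madvise together with the MADV_HUGEPAGE hint used here, and it reuses the ~50K stride from the reproducer in the cover letter). If a 2M page is available at fault time, the AnonHugePages line it prints is typically 2048 kB even though only about 40 of the 512 base pages hold data:

#define _GNU_SOURCE
/* Illustrative only: produce one mostly zero-filled 2M THP and report
 * the AnonHugePages field of the mapping from /proc/self/smaps. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_2M   (2UL << 20)
#define STRIDE  (50UL << 10)

int main(void)
{
	/* Over-allocate so a 2M-aligned start can be picked inside the mapping. */
	char *raw = mmap(NULL, 2 * SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return 1;
	char *buf = (char *)(((unsigned long)raw + SZ_2M - 1) & ~(SZ_2M - 1));

	madvise(buf, SZ_2M, MADV_HUGEPAGE);

	/* Touch ~40 of the 512 base pages: enough to fault the THP in,
	 * sparse enough that most of it stays zero-filled. */
	for (unsigned long off = 0; off < SZ_2M; off += STRIDE)
		buf[off] = 1;

	/* Find the VMA containing buf and print its AnonHugePages field. */
	char line[256];
	FILE *f = fopen("/proc/self/smaps", "r");
	int in_vma = 0;

	while (f && fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		if (sscanf(line, "%lx-%lx ", &start, &end) == 2)
			in_vma = (unsigned long)buf >= start &&
				 (unsigned long)buf < end;
		else if (in_vma && !strncmp(line, "AnonHugePages:", 14))
			fputs(line, stdout);
	}
	if (f)
		fclose(f);
	return 0;
}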
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 16:22 ` Usama Arif
@ 2024-08-01 16:27 ` David Hildenbrand
2024-08-04 19:10 ` Usama Arif
2024-08-04 23:32 ` Yu Zhao
2024-08-04 23:23 ` Yu Zhao
1 sibling, 2 replies; 40+ messages in thread
From: David Hildenbrand @ 2024-08-01 16:27 UTC (permalink / raw)
To: Usama Arif, Yu Zhao
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 01.08.24 18:22, Usama Arif wrote:
>
>
> On 01/08/2024 07:09, Yu Zhao wrote:
>> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>
>>> The current upstream default policy for THP is always. However, Meta
>>> uses madvise in production as the current THP=always policy vastly
>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>> excessive memory pressure and premature OOM killing.
>>> Using madvise + relying on khugepaged has certain drawbacks over
>>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>>> require userspace changes. Waiting for khugepaged to scan memory and
>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>> (i.e. you dont know when the collapse will happen), while production
>>> environments require predictable performance. If there is enough memory
>>> available, its better for both performance and predictability to have
>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>> to collapse it, and deal with sparsely populated THPs when the system is
>>> running out of memory.
>>>
>>> This patch-series is an attempt to mitigate the issue of running out of
>>> memory when THP is always enabled. During runtime whenever a THP is being
>>> faulted in or collapsed by khugepaged, the THP is added to a list.
>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>> shrinker which goes through the list and checks if the THP was underutilized,
>>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>>> If this number goes above a certain threshold, the shrinker will attempt
>>> to split that THP. Then at remap time, the pages that were zero-filled are
>>> not remapped, hence saving memory. This method avoids the downside of
>>> wasting memory in areas where THP is sparsely filled when THP is always
>>> enabled, while still providing the upside THPs like reduced TLB misses without
>>> having to use madvise.
>>>
>>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>>> tested with THP shrinker. The results after 2 hours are as follows:
>>>
>>> | THP=madvise | THP=always | THP=always
>>> | | | + shrinker series
>>> | | | + max_ptes_none=409
>>> -----------------------------------------------------------------------------
>>> Performance improvement | - | +1.8% | +1.7%
>>> (over THP=madvise) | | |
>>> -----------------------------------------------------------------------------
>>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
>>> -----------------------------------------------------------------------------
>>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>>> (80%) zero filled filled pages will be split.
>>>
>>> To test out the patches, the below commands without the shrinker will
>>> invoke OOM killer immediately and kill stress, but will not fail with
>>> the shrinker:
>>>
>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>> mkdir /sys/fs/cgroup/test
>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>> # each THP, i.e. vm-stride 50K.
>>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>>> # killer.
>>> # Without the shrinker, OOM killer is invoked immediately irrespective
>>> # of max_ptes_none value and kill stress.
>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>
>>> Patches 1-2 add back helper functions that were previously removed
>>> to operate on page lists (needed by patch 3).
>>> Patch 3 is an optimization to free zapped tail pages rather than
>>> waiting for page reclaim or migration.
>>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>>> subpages when splitting THP.
>>> Patches 6 adds support for THP shrinker.
>>>
>>> (This patch-series restarts the work on having a THP shrinker in kernel
>>> originally done in
>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>>> The THP shrinker in this series is significantly different than the
>>> original one, hence its labelled v1 (although the prerequisite to not
>>> remap clean subpages is the same).)
>>>
>>> Alexander Zhu (1):
>>> mm: add selftests to split_huge_page() to verify unmap/zap of zero
>>> pages
>>>
>>> Usama Arif (3):
>>> Revert "memcg: remove mem_cgroup_uncharge_list()"
>>> Revert "mm: remove free_unref_page_list()"
>>> mm: split underutilized THPs
>>>
>>> Yu Zhao (2):
>>> mm: free zapped tail pages when splitting isolated thp
>>> mm: don't remap unused subpages when splitting isolated thp
>>
>> I would recommend shatter [1] instead of splitting so that
>> 1) whoever underutilized their THPs get punished for the overhead;
>> 2) underutilized THPs are kept intact and can be reused by others.
>>
>> [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
>
> The objective of this series is to reduce memory usage, while trying to keep the performance benefits you get of using THP=always. Punishing any applications performance is the opposite of what I am trying to do here.
> For e.g. if there is only one main application running in production, and its using majority of the THPs, then reducing its performance doesn't make sense.
>
I'm not sure if there would really be a performance degradation
regarding the THP, after all we zap PTEs either way.
Shattering will take longer because real migration is involved IIUC.
> Also, just going through the commit, and found the line "The advantage of shattering is that it keeps the original THP intact" a bit confusing. I am guessing the THP is freed? i.e. if a 2M THP has 10 non-zero filled base pages and the rest are zero-filled, then after shattering we will have 10*4K memory and not 2M+10*4K? Is it the case the THP is reused at next fault?
The idea is (as I understand it) to free the full THP back to the buddy,
replacing the individual pieces that are kept with freshly allocated
order-0 pages from the buddy.
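A rough userspace analogy of that difference (illustrative only; none of the kernel's buddy, rmap or migration machinery is modelled, and the shatter() helper below is just a stand-in name, not a kernel function): with a split, the populated 4K chunks stay where they are and only the unused ones are handed back, fragmenting the 2M range; with a shatter, the populated chunks are copied into separately allocated 4K buffers and the whole contiguous 2M block is returned in one piece.

/* Toy model only: a 2M "THP" is an aligned 2M allocation, its subpages
 * are 4K chunks of it. This just shows what ends up freed by a shatter. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ  4096UL

/* "Shatter": copy the used subpages out to fresh 4K allocations, then
 * free the whole 2M block, so the contiguous block survives as a unit. */
static void **shatter(char *thp, const int *used, int nr_used)
{
	void **kept = calloc(nr_used, sizeof(*kept));

	for (int i = 0; i < nr_used; i++) {
		kept[i] = aligned_alloc(PAGE_SZ, PAGE_SZ);    /* "order-0 page" */
		memcpy(kept[i], thp + used[i] * PAGE_SZ, PAGE_SZ); /* "migration" */
	}
	free(thp);                               /* 2M block returned whole */
	return kept;
}

int main(void)
{
	char *thp = aligned_alloc(2UL << 20, 2UL << 20);
	int used[10] = { 0, 7, 50, 100, 123, 200, 300, 400, 480, 511 };

	if (!thp)
		return 1;
	memset(thp, 0, 2UL << 20);
	for (int i = 0; i < 10; i++)
		thp[used[i] * PAGE_SZ] = 1;      /* 10 populated subpages */

	void **kept = shatter(thp, used, 10);
	printf("kept 10 x 4K copies, freed one contiguous 2M block\n");

	for (int i = 0; i < 10; i++)
		free(kept[i]);
	free(kept);
	return 0;
}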
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 16:27 ` David Hildenbrand
@ 2024-08-04 19:10 ` Usama Arif
2024-08-04 23:32 ` Yu Zhao
1 sibling, 0 replies; 40+ messages in thread
From: Usama Arif @ 2024-08-04 19:10 UTC (permalink / raw)
To: David Hildenbrand, Yu Zhao
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 01/08/2024 17:27, David Hildenbrand wrote:
> On 01.08.24 18:22, Usama Arif wrote:
>>
>>
>> On 01/08/2024 07:09, Yu Zhao wrote:
>>> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>> The current upstream default policy for THP is always. However, Meta
>>>> uses madvise in production as the current THP=always policy vastly
>>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>>> excessive memory pressure and premature OOM killing.
>>>> Using madvise + relying on khugepaged has certain drawbacks over
>>>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>>>> require userspace changes. Waiting for khugepaged to scan memory and
>>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>>> (i.e. you dont know when the collapse will happen), while production
>>>> environments require predictable performance. If there is enough memory
>>>> available, its better for both performance and predictability to have
>>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>>> to collapse it, and deal with sparsely populated THPs when the system is
>>>> running out of memory.
>>>>
>>>> This patch-series is an attempt to mitigate the issue of running out of
>>>> memory when THP is always enabled. During runtime whenever a THP is being
>>>> faulted in or collapsed by khugepaged, the THP is added to a list.
>>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>>> shrinker which goes through the list and checks if the THP was underutilized,
>>>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>>>> If this number goes above a certain threshold, the shrinker will attempt
>>>> to split that THP. Then at remap time, the pages that were zero-filled are
>>>> not remapped, hence saving memory. This method avoids the downside of
>>>> wasting memory in areas where THP is sparsely filled when THP is always
>>>> enabled, while still providing the upside THPs like reduced TLB misses without
>>>> having to use madvise.
>>>>
>>>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>>>> tested with THP shrinker. The results after 2 hours are as follows:
>>>>
>>>> | THP=madvise | THP=always | THP=always
>>>> | | | + shrinker series
>>>> | | | + max_ptes_none=409
>>>> -----------------------------------------------------------------------------
>>>> Performance improvement | - | +1.8% | +1.7%
>>>> (over THP=madvise) | | |
>>>> -----------------------------------------------------------------------------
>>>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
>>>> -----------------------------------------------------------------------------
>>>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>>>> (80%) zero filled filled pages will be split.
>>>>
>>>> To test out the patches, the below commands without the shrinker will
>>>> invoke OOM killer immediately and kill stress, but will not fail with
>>>> the shrinker:
>>>>
>>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>> mkdir /sys/fs/cgroup/test
>>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>>> # each THP, i.e. vm-stride 50K.
>>>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>>>> # killer.
>>>> # Without the shrinker, OOM killer is invoked immediately irrespective
>>>> # of max_ptes_none value and kill stress.
>>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>>
>>>> Patches 1-2 add back helper functions that were previously removed
>>>> to operate on page lists (needed by patch 3).
>>>> Patch 3 is an optimization to free zapped tail pages rather than
>>>> waiting for page reclaim or migration.
>>>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>>>> subpages when splitting THP.
>>>> Patches 6 adds support for THP shrinker.
>>>>
>>>> (This patch-series restarts the work on having a THP shrinker in kernel
>>>> originally done in
>>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>>>> The THP shrinker in this series is significantly different than the
>>>> original one, hence its labelled v1 (although the prerequisite to not
>>>> remap clean subpages is the same).)
>>>>
>>>> Alexander Zhu (1):
>>>> mm: add selftests to split_huge_page() to verify unmap/zap of zero
>>>> pages
>>>>
>>>> Usama Arif (3):
>>>> Revert "memcg: remove mem_cgroup_uncharge_list()"
>>>> Revert "mm: remove free_unref_page_list()"
>>>> mm: split underutilized THPs
>>>>
>>>> Yu Zhao (2):
>>>> mm: free zapped tail pages when splitting isolated thp
>>>> mm: don't remap unused subpages when splitting isolated thp
>>>
>>> I would recommend shatter [1] instead of splitting so that
>>> 1) whoever underutilized their THPs get punished for the overhead;
>>> 2) underutilized THPs are kept intact and can be reused by others.
>>>
>>> [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
>>
>> The objective of this series is to reduce memory usage, while trying to keep the performance benefits you get of using THP=always. Punishing any applications performance is the opposite of what I am trying to do here.
>> For e.g. if there is only one main application running in production, and its using majority of the THPs, then reducing its performance doesn't make sense.
>>
>
> I'm not sure if there would really be a performance degradation regarding the THP, after all we zap PTEs either way.
>
By performance I meant time/CPU used up for migration.
> Shattering will take longer because real migration is involved IIUC.
>
Yes, so that's what I want to avoid. If the system is CPU bound, like the production workload I am testing, then spending any cycles on migration is going to make time/CPU performance worse.
Also, shattering isn't merged upstream, and it wouldn't make sense to make this series dependent on shattering.
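For a rough sense of the raw copy cost being traded off here, a userspace micro-benchmark sketch is below (illustrative only: it times nothing but the memcpy of ten 4K subpages, so it is a lower bound that ignores the rmap walks, page table updates and TLB flushes that real migration also pays for):

/* Rough lower bound on the copy cost of "migrating" the ~10 populated
 * subpages of one mostly-empty 2M THP. Userspace memcpy only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ 4096UL
#define N_PAGES 10
#define ITERS   100000

int main(void)
{
	char *src = aligned_alloc(PAGE_SZ, N_PAGES * PAGE_SZ);
	char *dst = aligned_alloc(PAGE_SZ, N_PAGES * PAGE_SZ);
	struct timespec t0, t1;

	if (!src || !dst)
		return 1;
	memset(src, 1, N_PAGES * PAGE_SZ);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int it = 0; it < ITERS; it++) {
		src[0] = (char)it;	/* keep the copies from being elided */
		for (int i = 0; i < N_PAGES; i++)
			memcpy(dst + i * PAGE_SZ, src + i * PAGE_SZ, PAGE_SZ);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
		    (double)(t1.tv_nsec - t0.tv_nsec);
	printf("~%.0f ns to copy %d x 4K pages (dst[0]=%d)\n",
	       ns / ITERS, N_PAGES, dst[0]);
	free(src);
	free(dst);
	return 0;
}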
>> Also, just going through the commit, and found the line "The advantage of shattering is that it keeps the original THP intact" a bit confusing. I am guessing the THP is freed? i.e. if a 2M THP has 10 non-zero filled base pages and the rest are zero-filled, then after shattering we will have 10*4K memory and not 2M+10*4K? Is it the case the THP is reused at next fault?
>
> The idea is (as I understand it) to free the full THP abck to the buddy, replacing the individual pieces that are kept to freshly allocated order-0 pages from the buddy.
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 16:27 ` David Hildenbrand
2024-08-04 19:10 ` Usama Arif
@ 2024-08-04 23:32 ` Yu Zhao
1 sibling, 0 replies; 40+ messages in thread
From: Yu Zhao @ 2024-08-04 23:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Usama Arif, akpm, linux-mm, hannes, riel, shakeel.butt,
roman.gushchin, baohua, ryan.roberts, rppt, willy,
cerasuolodomenico, corbet, linux-kernel, linux-doc, kernel-team
On Thu, Aug 1, 2024 at 10:27 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.08.24 18:22, Usama Arif wrote:
> >
> >
> > On 01/08/2024 07:09, Yu Zhao wrote:
> >> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>>
> >>> The current upstream default policy for THP is always. However, Meta
> >>> uses madvise in production as the current THP=always policy vastly
> >>> overprovisions THPs in sparsely accessed memory areas, resulting in
> >>> excessive memory pressure and premature OOM killing.
> >>> Using madvise + relying on khugepaged has certain drawbacks over
> >>> THP=always. Using madvise hints mean THPs aren't "transparent" and
> >>> require userspace changes. Waiting for khugepaged to scan memory and
> >>> collapse pages into THP can be slow and unpredictable in terms of performance
> >>> (i.e. you dont know when the collapse will happen), while production
> >>> environments require predictable performance. If there is enough memory
> >>> available, its better for both performance and predictability to have
> >>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> >>> to collapse it, and deal with sparsely populated THPs when the system is
> >>> running out of memory.
> >>>
> >>> This patch-series is an attempt to mitigate the issue of running out of
> >>> memory when THP is always enabled. During runtime whenever a THP is being
> >>> faulted in or collapsed by khugepaged, the THP is added to a list.
> >>> Whenever memory reclaim happens, the kernel runs the deferred_split
> >>> shrinker which goes through the list and checks if the THP was underutilized,
> >>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> >>> If this number goes above a certain threshold, the shrinker will attempt
> >>> to split that THP. Then at remap time, the pages that were zero-filled are
> >>> not remapped, hence saving memory. This method avoids the downside of
> >>> wasting memory in areas where THP is sparsely filled when THP is always
> >>> enabled, while still providing the upside THPs like reduced TLB misses without
> >>> having to use madvise.
> >>>
> >>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
> >>> tested with THP shrinker. The results after 2 hours are as follows:
> >>>
> >>> | THP=madvise | THP=always | THP=always
> >>> | | | + shrinker series
> >>> | | | + max_ptes_none=409
> >>> -----------------------------------------------------------------------------
> >>> Performance improvement | - | +1.8% | +1.7%
> >>> (over THP=madvise) | | |
> >>> -----------------------------------------------------------------------------
> >>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
> >>> -----------------------------------------------------------------------------
> >>> max_ptes_none=409 means that any THP that has more than 409 out of 512
> >>> (80%) zero filled filled pages will be split.
> >>>
> >>> To test out the patches, the below commands without the shrinker will
> >>> invoke OOM killer immediately and kill stress, but will not fail with
> >>> the shrinker:
> >>>
> >>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> >>> mkdir /sys/fs/cgroup/test
> >>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> >>> echo 20M > /sys/fs/cgroup/test/memory.max
> >>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> >>> # allocate twice memory.max for each stress worker and touch 40/512 of
> >>> # each THP, i.e. vm-stride 50K.
> >>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> >>> # killer.
> >>> # Without the shrinker, OOM killer is invoked immediately irrespective
> >>> # of max_ptes_none value and kill stress.
> >>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
> >>>
> >>> Patches 1-2 add back helper functions that were previously removed
> >>> to operate on page lists (needed by patch 3).
> >>> Patch 3 is an optimization to free zapped tail pages rather than
> >>> waiting for page reclaim or migration.
> >>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
> >>> subpages when splitting THP.
> >>> Patches 6 adds support for THP shrinker.
> >>>
> >>> (This patch-series restarts the work on having a THP shrinker in kernel
> >>> originally done in
> >>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
> >>> The THP shrinker in this series is significantly different than the
> >>> original one, hence its labelled v1 (although the prerequisite to not
> >>> remap clean subpages is the same).)
> >>>
> >>> Alexander Zhu (1):
> >>> mm: add selftests to split_huge_page() to verify unmap/zap of zero
> >>> pages
> >>>
> >>> Usama Arif (3):
> >>> Revert "memcg: remove mem_cgroup_uncharge_list()"
> >>> Revert "mm: remove free_unref_page_list()"
> >>> mm: split underutilized THPs
> >>>
> >>> Yu Zhao (2):
> >>> mm: free zapped tail pages when splitting isolated thp
> >>> mm: don't remap unused subpages when splitting isolated thp
> >>
> >> I would recommend shatter [1] instead of splitting so that
> >> 1) whoever underutilized their THPs get punished for the overhead;
> >> 2) underutilized THPs are kept intact and can be reused by others.
> >>
> >> [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
> >
> > The objective of this series is to reduce memory usage, while trying to keep the performance benefits you get of using THP=always. Punishing any applications performance is the opposite of what I am trying to do here.
> > For e.g. if there is only one main application running in production, and its using majority of the THPs, then reducing its performance doesn't make sense.
> >
>
> I'm not sure if there would really be a performance degradation
> regarding the THP, after all we zap PTEs either way.
>
> Shattering will take longer because real migration is involved IIUC.
Correct, and that's by design. Also using it in the THP shrinker path
isn't a problem.
> > Also, just going through the commit, and found the line "The advantage of shattering is that it keeps the original THP intact" a bit confusing. I am guessing the THP is freed? i.e. if a 2M THP has 10 non-zero filled base pages and the rest are zero-filled, then after shattering we will have 10*4K memory and not 2M+10*4K? Is it the case the THP is reused at next fault?
>
> The idea is (as I understand it) to free the full THP abck to the buddy,
> replacing the individual pieces that are kept to freshly allocated
> order-0 pages from the buddy.
Correct, and this is essential to our problem: we are under memory
pressure with THP=always. Under this condition, we need to compare
shatter with split + compaction, not with split alone.
To summarize, the ideal use cases are:
1. split for THP=always with unlimited memory.
2. shatter for THP=always under memory pressure.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 16:22 ` Usama Arif
2024-08-01 16:27 ` David Hildenbrand
@ 2024-08-04 23:23 ` Yu Zhao
2024-08-06 11:18 ` Usama Arif
1 sibling, 1 reply; 40+ messages in thread
From: Yu Zhao @ 2024-08-04 23:23 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On Thu, Aug 1, 2024 at 10:22 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 01/08/2024 07:09, Yu Zhao wrote:
> > On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >> The current upstream default policy for THP is always. However, Meta
> >> uses madvise in production as the current THP=always policy vastly
> >> overprovisions THPs in sparsely accessed memory areas, resulting in
> >> excessive memory pressure and premature OOM killing.
> >> Using madvise + relying on khugepaged has certain drawbacks over
> >> THP=always. Using madvise hints mean THPs aren't "transparent" and
> >> require userspace changes. Waiting for khugepaged to scan memory and
> >> collapse pages into THP can be slow and unpredictable in terms of performance
> >> (i.e. you dont know when the collapse will happen), while production
> >> environments require predictable performance. If there is enough memory
> >> available, its better for both performance and predictability to have
> >> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> >> to collapse it, and deal with sparsely populated THPs when the system is
> >> running out of memory.
> >>
> >> This patch-series is an attempt to mitigate the issue of running out of
> >> memory when THP is always enabled. During runtime whenever a THP is being
> >> faulted in or collapsed by khugepaged, the THP is added to a list.
> >> Whenever memory reclaim happens, the kernel runs the deferred_split
> >> shrinker which goes through the list and checks if the THP was underutilized,
> >> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> >> If this number goes above a certain threshold, the shrinker will attempt
> >> to split that THP. Then at remap time, the pages that were zero-filled are
> >> not remapped, hence saving memory. This method avoids the downside of
> >> wasting memory in areas where THP is sparsely filled when THP is always
> >> enabled, while still providing the upside THPs like reduced TLB misses without
> >> having to use madvise.
> >>
> >> Meta production workloads that were CPU bound (>99% CPU utilzation) were
> >> tested with THP shrinker. The results after 2 hours are as follows:
> >>
> >> | THP=madvise | THP=always | THP=always
> >> | | | + shrinker series
> >> | | | + max_ptes_none=409
> >> -----------------------------------------------------------------------------
> >> Performance improvement | - | +1.8% | +1.7%
> >> (over THP=madvise) | | |
> >> -----------------------------------------------------------------------------
> >> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
> >> -----------------------------------------------------------------------------
> >> max_ptes_none=409 means that any THP that has more than 409 out of 512
> >> (80%) zero filled filled pages will be split.
> >>
> >> To test out the patches, the below commands without the shrinker will
> >> invoke OOM killer immediately and kill stress, but will not fail with
> >> the shrinker:
> >>
> >> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> >> mkdir /sys/fs/cgroup/test
> >> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> >> echo 20M > /sys/fs/cgroup/test/memory.max
> >> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> >> # allocate twice memory.max for each stress worker and touch 40/512 of
> >> # each THP, i.e. vm-stride 50K.
> >> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> >> # killer.
> >> # Without the shrinker, OOM killer is invoked immediately irrespective
> >> # of max_ptes_none value and kill stress.
> >> stress --vm 1 --vm-bytes 40M --vm-stride 50K
> >>
> >> Patches 1-2 add back helper functions that were previously removed
> >> to operate on page lists (needed by patch 3).
> >> Patch 3 is an optimization to free zapped tail pages rather than
> >> waiting for page reclaim or migration.
> >> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
> >> subpages when splitting THP.
> >> Patches 6 adds support for THP shrinker.
> >>
> >> (This patch-series restarts the work on having a THP shrinker in kernel
> >> originally done in
> >> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
> >> The THP shrinker in this series is significantly different than the
> >> original one, hence its labelled v1 (although the prerequisite to not
> >> remap clean subpages is the same).)
> >>
> >> Alexander Zhu (1):
> >> mm: add selftests to split_huge_page() to verify unmap/zap of zero
> >> pages
> >>
> >> Usama Arif (3):
> >> Revert "memcg: remove mem_cgroup_uncharge_list()"
> >> Revert "mm: remove free_unref_page_list()"
> >> mm: split underutilized THPs
> >>
> >> Yu Zhao (2):
> >> mm: free zapped tail pages when splitting isolated thp
> >> mm: don't remap unused subpages when splitting isolated thp
> >
> > I would recommend shatter [1] instead of splitting so that
> > 1) whoever underutilized their THPs get punished for the overhead;
> > 2) underutilized THPs are kept intact and can be reused by others.
> >
> > [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
>
> The objective of this series is to reduce memory usage, while trying to keep the performance benefits you get of using THP=always.
Of course.
> Punishing any applications performance is the opposite of what I am trying to do here.
For applications that prefer THP=always, you would punish them more by
using split.
> For e.g. if there is only one main application running in production, and its using majority of the THPs, then reducing its performance doesn't make sense.
Exactly, and that's why I recommended shatter.
Let's walk through the big picture, and hopefully you'll agree.
Applications prefer THP=always because they want to allocate THPs. As
you mentioned above, the majority of their memory would be backed by
THPs, highly utilized.
You also mentioned that those applications can run into memory
pressure or even OOMs, which I agree, and this is essentially what we
are trying to solve here. Otherwise, with unlimited memory, we
wouldn't need to worry about internal fragmentation in this context.
So on one hand, we want to allocate THPs; on the other, we run into
memory pressure. It's obvious that splitting under this specific
condition can't fully solve our problem -- after splitting, we still
have to do compaction to fulfill new THP allocation requests.
Theoretically, splitting plus compaction is more expensive than
shattering itself: expressing the efficiency in
compact_success/(compact_success+fail), the latter is 100%; the former
is nowhere near it, and our experiments agree with this.
If applications opt for direct compaction, they'd pay for THP
allocation latency; if they don't want to wait, i.e., with background
compaction, they'd pay for less THP coverage. So they are punished
either way, not in the THP shrinker path, but in their allocation
path. In comparison, shattering wins in both cases, as I explained
above.
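For reference, that efficiency figure can be read off a running system, assuming a kernel built with compaction support (the compact_success and compact_fail counters in /proc/vmstat); a small sketch:

/* Print compaction efficiency = compact_success / (compact_success +
 * compact_fail), the metric referred to above, from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long val, ok = 0, fail = 0;

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "compact_success"))
			ok = val;
		else if (!strcmp(name, "compact_fail"))
			fail = val;
	}
	fclose(f);

	if (ok + fail)
		printf("compaction efficiency: %.1f%% (%llu/%llu)\n",
		       100.0 * ok / (ok + fail), ok, ok + fail);
	else
		printf("no direct compaction attempts recorded\n");
	return 0;
}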
> Also, just going through the commit, and found the line "The advantage of shattering is that it keeps the original THP intact" a bit confusing. I am guessing the THP is freed?
Yes, so that we don't have to do compaction.
> i.e. if a 2M THP has 10 non-zero filled base pages and the rest are zero-filled, then after shattering we will have 10*4K memory and not 2M+10*4K?
Correct.
> Is it the case the THP is reused at next fault?
Yes, and this is central to our condition: we are under memory
pressure with THP=always.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-04 23:23 ` Yu Zhao
@ 2024-08-06 11:18 ` Usama Arif
0 siblings, 0 replies; 40+ messages in thread
From: Usama Arif @ 2024-08-06 11:18 UTC (permalink / raw)
To: Yu Zhao
Cc: akpm, linux-mm, hannes, riel, shakeel.butt, roman.gushchin, david,
baohua, ryan.roberts, rppt, willy, cerasuolodomenico, corbet,
linux-kernel, linux-doc, kernel-team
On 05/08/2024 00:23, Yu Zhao wrote:
> On Thu, Aug 1, 2024 at 10:22 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 01/08/2024 07:09, Yu Zhao wrote:
>>> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>> The current upstream default policy for THP is always. However, Meta
>>>> uses madvise in production as the current THP=always policy vastly
>>>> overprovisions THPs in sparsely accessed memory areas, resulting in
>>>> excessive memory pressure and premature OOM killing.
>>>> Using madvise + relying on khugepaged has certain drawbacks over
>>>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>>>> require userspace changes. Waiting for khugepaged to scan memory and
>>>> collapse pages into THP can be slow and unpredictable in terms of performance
>>>> (i.e. you dont know when the collapse will happen), while production
>>>> environments require predictable performance. If there is enough memory
>>>> available, its better for both performance and predictability to have
>>>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>>>> to collapse it, and deal with sparsely populated THPs when the system is
>>>> running out of memory.
>>>>
>>>> This patch-series is an attempt to mitigate the issue of running out of
>>>> memory when THP is always enabled. During runtime whenever a THP is being
>>>> faulted in or collapsed by khugepaged, the THP is added to a list.
>>>> Whenever memory reclaim happens, the kernel runs the deferred_split
>>>> shrinker which goes through the list and checks if the THP was underutilized,
>>>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>>>> If this number goes above a certain threshold, the shrinker will attempt
>>>> to split that THP. Then at remap time, the pages that were zero-filled are
>>>> not remapped, hence saving memory. This method avoids the downside of
>>>> wasting memory in areas where THP is sparsely filled when THP is always
>>>> enabled, while still providing the upside THPs like reduced TLB misses without
>>>> having to use madvise.
>>>>
>>>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>>>> tested with THP shrinker. The results after 2 hours are as follows:
>>>>
>>>> | THP=madvise | THP=always | THP=always
>>>> | | | + shrinker series
>>>> | | | + max_ptes_none=409
>>>> -----------------------------------------------------------------------------
>>>> Performance improvement | - | +1.8% | +1.7%
>>>> (over THP=madvise) | | |
>>>> -----------------------------------------------------------------------------
>>>> Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
>>>> -----------------------------------------------------------------------------
>>>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>>>> (80%) zero filled filled pages will be split.
>>>>
>>>> To test out the patches, the below commands without the shrinker will
>>>> invoke OOM killer immediately and kill stress, but will not fail with
>>>> the shrinker:
>>>>
>>>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>>>> mkdir /sys/fs/cgroup/test
>>>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>>>> echo 20M > /sys/fs/cgroup/test/memory.max
>>>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>>>> # allocate twice memory.max for each stress worker and touch 40/512 of
>>>> # each THP, i.e. vm-stride 50K.
>>>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>>>> # killer.
>>>> # Without the shrinker, OOM killer is invoked immediately irrespective
>>>> # of max_ptes_none value and kill stress.
>>>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>>>
>>>> Patches 1-2 add back helper functions that were previously removed
>>>> to operate on page lists (needed by patch 3).
>>>> Patch 3 is an optimization to free zapped tail pages rather than
>>>> waiting for page reclaim or migration.
>>>> Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
>>>> subpages when splitting THP.
>>>> Patches 6 adds support for THP shrinker.
>>>>
>>>> (This patch-series restarts the work on having a THP shrinker in kernel
>>>> originally done in
>>>> https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
>>>> The THP shrinker in this series is significantly different than the
>>>> original one, hence its labelled v1 (although the prerequisite to not
>>>> remap clean subpages is the same).)
>>>>
>>>> Alexander Zhu (1):
>>>> mm: add selftests to split_huge_page() to verify unmap/zap of zero
>>>> pages
>>>>
>>>> Usama Arif (3):
>>>> Revert "memcg: remove mem_cgroup_uncharge_list()"
>>>> Revert "mm: remove free_unref_page_list()"
>>>> mm: split underutilized THPs
>>>>
>>>> Yu Zhao (2):
>>>> mm: free zapped tail pages when splitting isolated thp
>>>> mm: don't remap unused subpages when splitting isolated thp
>>>
>>> I would recommend shatter [1] instead of splitting so that
>>> 1) whoever underutilized their THPs get punished for the overhead;
>>> 2) underutilized THPs are kept intact and can be reused by others.
>>>
>>> [1] https://lore.kernel.org/20240229183436.4110845-3-yuzhao@google.com/
>>
>> The objective of this series is to reduce memory usage, while trying to keep the performance benefits you get of using THP=always.
>
> Of course.
>
>> Punishing any applications performance is the opposite of what I am trying to do here.
>
> For applications that prefer THP=always, you would punish them more by
> using split.
>
>> For e.g. if there is only one main application running in production, and its using majority of the THPs, then reducing its performance doesn't make sense.
>
> Exactly, and that's why I recommended shatter.
>
> Let's walk through the big picture, and hopefully you'll agree.
>
> Applications prefer THP=always because they want to allocate THPs. As
> you mentioned above, the majority of their memory would be backed by
> THPs, highly utilized.
>
> You also mentioned that those applications can run into memory
> pressure or even OOMs, which I agree with, and this is essentially what we
> are trying to solve here. Otherwise, with unlimited memory, we
> wouldn't need to worry about internal fragmentation in this context.
>
> So on one hand, we want to allocate THPs; on the other, we run into
> memory pressure. It's obvious that splitting under this specific
> condition can't fully solve our problem -- after splitting, we still
> have to do compaction to fulfill new THP allocation requests.
> Theoretically, splitting plus compaction is more expensive than
> shattering itself: expressing the efficiency as
> compact_success/(compact_success+compact_fail), shattering is 100%;
> splitting plus compaction is nowhere near that, and our experiments
> agree with this.
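
(For reference, this ratio can be computed on a running system as
compact_success / (compact_success + compact_fail) using the counters
of the same names in /proc/vmstat; e.g. 1,500 successes against 500
failures would give a 75% success rate.)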
>
> If applications opt for direct compaction, they'd pay for THP
> allocation latency; if they don't want to wait, i.e., with background
> compaction, they'd pay with lower THP coverage. So they are punished
> either way, not in the THP shrinker path, but in their allocation
> path. In comparison, shattering wins in both cases, as I explained
> above.
Thanks for the explanation. It makes the reasoning behind shattering much clearer, and it explains why it could be an improvement over splitting.
As Rik mentioned, I think it's best to parallelize the efforts of the THP low-utilization shrinker and shattering, as we already have the low-utilization shrinker tested and ready with splitting.
I did have a question about shattering; I will mention it here, but it is probably best to move the discussion over to the shattering patches you sent.
If the system is close to running out of memory and the shrinker has a very large number of folios to shatter, the first THPs that are shattered will consume order-0 pages for migrating their non-zero-filled subpages. The system could then run out of order-0 pages for the later THPs, so wouldn't the THPs that were preserved earlier in the shrinker pass need to be split to provide the order-0 pages required for shattering the later THPs?
The cost then becomes the cost of migration, plus the cost of splitting the THPs preserved earlier by the shrinker to provide 4K pages for migration (which means the advantage of preserving those THPs is lost?), plus the cost of compaction when we later want THPs.
Hopefully I have understood correctly what would happen with shattering in this case.
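
(A purely illustrative back-of-the-envelope version of the concern,
with made-up numbers: if the shrinker queue holds 1,000 underutilized
THPs with ~40 non-zero subpages each, shattering them needs on the
order of 1,000 * 40 = 40,000 order-0 pages, i.e. ~160M, up front for
migration. If the free lists cannot supply that while we are already
under memory pressure, the THPs preserved by the earlier shatters look
like the obvious source of order-0 pages, and the work spent keeping
them intact would be lost.)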
>
>> Also, just going through the commit, I found the line "The advantage of shattering is that it keeps the original THP intact" a bit confusing. I am guessing the THP is freed?
>
> Yes, so that we don't have to do compaction.
>
>> i.e. if a 2M THP has 10 non-zero filled base pages and the rest are zero-filled, then after shattering we will have 10*4K memory and not 2M+10*4K?
>
> Correct.
>
>> Is it the case the THP is reused at next fault?
>
> Yes, and this is central to our condition: we are under memory
> pressure with THP=always.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH 0/6] mm: split underutilized THPs
2024-08-01 6:09 ` Yu Zhao
2024-08-01 15:47 ` David Hildenbrand
2024-08-01 16:22 ` Usama Arif
@ 2024-08-06 17:38 ` Johannes Weiner
2 siblings, 0 replies; 40+ messages in thread
From: Johannes Weiner @ 2024-08-06 17:38 UTC (permalink / raw)
To: Yu Zhao
Cc: Usama Arif, akpm, linux-mm, riel, shakeel.butt, roman.gushchin,
david, baohua, ryan.roberts, rppt, willy, cerasuolodomenico,
corbet, linux-kernel, linux-doc, kernel-team
On Thu, Aug 01, 2024 at 12:09:16AM -0600, Yu Zhao wrote:
> On Tue, Jul 30, 2024 at 6:54 AM Usama Arif <usamaarif642@gmail.com> wrote:
> >
> > The current upstream default policy for THP is always. However, Meta
> > uses madvise in production as the current THP=always policy vastly
> > overprovisions THPs in sparsely accessed memory areas, resulting in
> > excessive memory pressure and premature OOM killing.
> > Using madvise + relying on khugepaged has certain drawbacks over
> > THP=always. Using madvise hints means THPs aren't "transparent" and
> > require userspace changes. Waiting for khugepaged to scan memory and
> > collapse pages into THP can be slow and unpredictable in terms of performance
> > (i.e. you don't know when the collapse will happen), while production
> > environments require predictable performance. If there is enough memory
> > available, it's better for both performance and predictability to have
> > a THP from fault time, i.e. THP=always rather than wait for khugepaged
> > to collapse it, and deal with sparsely populated THPs when the system is
> > running out of memory.
> >
> > This patch-series is an attempt to mitigate the issue of running out of
> > memory when THP is always enabled. During runtime whenever a THP is being
> > faulted in or collapsed by khugepaged, the THP is added to a list.
> > Whenever memory reclaim happens, the kernel runs the deferred_split
> > shrinker which goes through the list and checks if the THP was underutilized,
> > i.e. how many of the base 4K pages of the entire THP were zero-filled.
> > If this number goes above a certain threshold, the shrinker will attempt
> > to split that THP. Then at remap time, the pages that were zero-filled are
> > not remapped, hence saving memory. This method avoids the downside of
> > wasting memory in areas where THP is sparsely filled when THP is always
> > enabled, while still providing the upsides of THPs like reduced TLB misses without
> > having to use madvise.
> >
> > Meta production workloads that were CPU bound (>99% CPU utilization) were
> > tested with THP shrinker. The results after 2 hours are as follows:
> >
> > | THP=madvise | THP=always | THP=always
> > | | | + shrinker series
> > | | | + max_ptes_none=409
> > -----------------------------------------------------------------------------
> > Performance improvement | - | +1.8% | +1.7%
> > (over THP=madvise) | | |
> > -----------------------------------------------------------------------------
> > Memory usage | 54.6G | 58.8G (+7.7%) | 55.9G (+2.4%)
> > -----------------------------------------------------------------------------
> > max_ptes_none=409 means that any THP that has more than 409 out of 512
> > (80%) zero-filled pages will be split.
> >
> > To test out the patches, the below commands without the shrinker will
> > invoke OOM killer immediately and kill stress, but will not fail with
> > the shrinker:
> >
> > echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> > mkdir /sys/fs/cgroup/test
> > echo $$ > /sys/fs/cgroup/test/cgroup.procs
> > echo 20M > /sys/fs/cgroup/test/memory.max
> > echo 0 > /sys/fs/cgroup/test/memory.swap.max
> > # allocate twice memory.max for each stress worker and touch 40/512 of
> > # each THP, i.e. vm-stride 50K.
> > # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> > # killer.
> > # Without the shrinker, OOM killer is invoked immediately irrespective
> > # of max_ptes_none value and kill stress.
> > stress --vm 1 --vm-bytes 40M --vm-stride 50K
> >
> > Patches 1-2 add back helper functions that were previously removed
> > to operate on page lists (needed by patch 3).
> > Patch 3 is an optimization to free zapped tail pages rather than
> > waiting for page reclaim or migration.
> > Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
> > subpages when splitting THP.
> > Patch 6 adds support for the THP shrinker.
> >
> > (This patch-series restarts the work on having a THP shrinker in kernel
> > originally done in
> > https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
> > The THP shrinker in this series is significantly different from the
> > original one, hence it's labelled v1 (although the prerequisite to not
> > remap clean subpages is the same).)
> >
> > Alexander Zhu (1):
> > mm: add selftests to split_huge_page() to verify unmap/zap of zero
> > pages
> >
> > Usama Arif (3):
> > Revert "memcg: remove mem_cgroup_uncharge_list()"
> > Revert "mm: remove free_unref_page_list()"
> > mm: split underutilized THPs
> >
> > Yu Zhao (2):
> > mm: free zapped tail pages when splitting isolated thp
> > mm: don't remap unused subpages when splitting isolated thp
>
> I would recommend shatter [1] instead of splitting so that
I agree with Rik; this seems like a possible optimization, not a
prerequisite.
> 1) whoever underutilized their THPs gets punished for the overhead;
Is that true? The downgrade is done in a shrinker. With or without
shattering, the compaction effort will be on the allocation side.
> 2) underutilized THPs are kept intact and can be reused by others.
If migration of the subpages is possible, then compaction can clear
the block as quickly as shattering can. The only difference is that
compaction would do the work on-demand, whereas shattering would do it
unconditionally, whether a THP has been requested or not...
Anyway, I think it'd be better to keep those discussions separate.
^ permalink raw reply [flat|nested] 40+ messages in thread