* [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
@ 2025-09-05 14:11 David Hildenbrand
2025-09-05 14:37 ` Zi Yan
` (6 more replies)
0 siblings, 7 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 14:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Lorenzo Stoakes,
Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts,
Dev Jain, Barry Song, Usama Arif
We added an early exit in thp_underused(), probably to avoid scanning
pages when there is no chance for success.
However, assume we have max_ptes_none = 511 (default).
Nothing should stop us from freeing all pages part of a THP that
is completely zero (512) and khugepaged will for sure not try to
instantiate a THP in that case (512 shared zeropages).
This can just trivially happen if someone writes a single 0 byte into a
PMD area, or of course, when data ends up being zero later.
So let's remove that early exit.
Do we want to CC stable? Hm, not sure. Probably not urgent.
Note that, as default, the THP shrinker is active
(/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
THPs are added to the deferred split lists. However, with the
max_ptes_none default we would never scan them. We would not do that. If
that's not desirable, we should just disable the shrinker as default,
also not adding all THPs to the deferred split lists.
Easy to reproduce:
1) Allocate some THPs filled with 0s
<prog.c>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
const size_t size = 1024*1024*1024;
int main(void)
{
size_t offs;
char *area;
area = mmap(0, size, PROT_READ | PROT_WRITE,
MAP_ANON | MAP_PRIVATE, -1, 0);
if (area == MAP_FAILED) {
printf("mmap failed\n");
exit(-1);
}
madvise(area, size, MADV_HUGEPAGE);
for (offs = 0; offs < size; offs += getpagesize())
area[offs] = 0;
pause();
}
<\prog.c>
2) Trigger the shrinker
E.g., memory pressure through memhog
3) Observe that THPs are not getting reclaimed
$ cat /proc/`pgrep prog`/smaps_rollup
Would list ~1GiB of AnonHugePages. With this fix, they would get
reclaimed as expected.
Fixes: dafff3f4c850 ("mm: split underused THPs")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26cedfcd74189..aa3ed7a86435b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
void *kaddr;
int i;
- if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
- return false;
-
for (i = 0; i < folio_nr_pages(folio); i++) {
kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
--
2.50.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:11 [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default David Hildenbrand
@ 2025-09-05 14:37 ` Zi Yan
2025-09-05 14:39 ` David Hildenbrand
2025-09-05 14:43 ` Usama Arif
2025-09-05 14:40 ` Usama Arif
` (5 subsequent siblings)
6 siblings, 2 replies; 31+ messages in thread
From: Zi Yan @ 2025-09-05 14:37 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Usama Arif
On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
> #include <string.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> const size_t size = 1024*1024*1024;
>
> int main(void)
> {
> size_t offs;
> char *area;
>
> area = mmap(0, size, PROT_READ | PROT_WRITE,
> MAP_ANON | MAP_PRIVATE, -1, 0);
> if (area == MAP_FAILED) {
> printf("mmap failed\n");
> exit(-1);
> }
> madvise(area, size, MADV_HUGEPAGE);
>
> for (offs = 0; offs < size; offs += getpagesize())
> area[offs] = 0;
> pause();
> }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/huge_memory.c | 3 ---
> 1 file changed, 3 deletions(-)
>
LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
I also notice that thp_underused() checks num_zero_pages directly
against khugepaged_max_ptes_none. This means mTHPs will never be regarded
as underused. A similar issue you are discussing in Nico’s khugepaged
mTHP support. Maybe checks against these khugepaged_max* variables
should be calculated based on nr_pages of a large folio, like
making these variables a ratio in other discussion.
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:37 ` Zi Yan
@ 2025-09-05 14:39 ` David Hildenbrand
2025-09-06 6:35 ` Lance Yang
2025-09-05 14:43 ` Usama Arif
1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 14:39 UTC (permalink / raw)
To: Zi Yan
Cc: linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Usama Arif
On 05.09.25 16:37, Zi Yan wrote:
> On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
>
>> We added an early exit in thp_underused(), probably to avoid scanning
>> pages when there is no chance for success.
>>
>> However, assume we have max_ptes_none = 511 (default).
>>
>> Nothing should stop us from freeing all pages part of a THP that
>> is completely zero (512) and khugepaged will for sure not try to
>> instantiate a THP in that case (512 shared zeropages).
>>
>> This can just trivially happen if someone writes a single 0 byte into a
>> PMD area, or of course, when data ends up being zero later.
>>
>> So let's remove that early exit.
>>
>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>>
>> Note that, as default, the THP shrinker is active
>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>> THPs are added to the deferred split lists. However, with the
>> max_ptes_none default we would never scan them. We would not do that. If
>> that's not desirable, we should just disable the shrinker as default,
>> also not adding all THPs to the deferred split lists.
>>
>> Easy to reproduce:
>>
>> 1) Allocate some THPs filled with 0s
>>
>> <prog.c>
>> #include <string.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>>
>> const size_t size = 1024*1024*1024;
>>
>> int main(void)
>> {
>> size_t offs;
>> char *area;
>>
>> area = mmap(0, size, PROT_READ | PROT_WRITE,
>> MAP_ANON | MAP_PRIVATE, -1, 0);
>> if (area == MAP_FAILED) {
>> printf("mmap failed\n");
>> exit(-1);
>> }
>> madvise(area, size, MADV_HUGEPAGE);
>>
>> for (offs = 0; offs < size; offs += getpagesize())
>> area[offs] = 0;
>> pause();
>> }
>> <\prog.c>
>>
>> 2) Trigger the shrinker
>>
>> E.g., memory pressure through memhog
>>
>> 3) Observe that THPs are not getting reclaimed
>>
>> $ cat /proc/`pgrep prog`/smaps_rollup
>>
>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>> reclaimed as expected.
>>
>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> mm/huge_memory.c | 3 ---
>> 1 file changed, 3 deletions(-)
>>
> LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
>
> I also notice that thp_underused() checks num_zero_pages directly
> against khugepaged_max_ptes_none. This means mTHPs will never be regarded
> as underused. A similar issue you are discussing in Nico’s khugepaged
> mTHP support. Maybe checks against these khugepaged_max* variables
> should be calculated based on nr_pages of a large folio, like
> making these variables a ratio in other discussion.
Yeah, factoring that out and cleaning it up is my next step.
But note that mTHPs are not a candidate for the shrinker right now. (see
my explanation in reply to Nicos patch)
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:11 [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default David Hildenbrand
2025-09-05 14:37 ` Zi Yan
@ 2025-09-05 14:40 ` Usama Arif
2025-09-05 14:46 ` David Hildenbrand
2025-09-14 14:04 ` Dev Jain
2025-09-05 15:02 ` Usama Arif
` (4 subsequent siblings)
6 siblings, 2 replies; 31+ messages in thread
From: Usama Arif @ 2025-09-05 14:40 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 15:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
Yes, that was the main reason.
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
Agree with this point for this patch.
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
The reason I did this is for the case if you change max_ptes_none after the THP is added
to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
so that its considered for splitting.
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
> #include <string.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> const size_t size = 1024*1024*1024;
>
> int main(void)
> {
> size_t offs;
> char *area;
>
> area = mmap(0, size, PROT_READ | PROT_WRITE,
> MAP_ANON | MAP_PRIVATE, -1, 0);
> if (area == MAP_FAILED) {
> printf("mmap failed\n");
> exit(-1);
> }
> madvise(area, size, MADV_HUGEPAGE);
>
> for (offs = 0; offs < size; offs += getpagesize())
> area[offs] = 0;
> pause();
> }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/huge_memory.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd74189..aa3ed7a86435b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
> void *kaddr;
> int i;
>
> - if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> - return false;
> -
I do agree with your usecase, but I am really worried about the amount of
work and cpu time the THP shrinker will consume when max_ptes_none is 511
(I dont have any numbers to back up my worry :)), and its less likely that
we will have these completely zeroed out THPs (again no numbers to back up
this statement). We have the huge_zero_folio as well which is installed on read.
> for (i = 0; i < folio_nr_pages(folio); i++) {
> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:37 ` Zi Yan
2025-09-05 14:39 ` David Hildenbrand
@ 2025-09-05 14:43 ` Usama Arif
2025-09-05 14:47 ` David Hildenbrand
1 sibling, 1 reply; 31+ messages in thread
From: Usama Arif @ 2025-09-05 14:43 UTC (permalink / raw)
To: Zi Yan, David Hildenbrand
Cc: linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song
On 05/09/2025 15:37, Zi Yan wrote:
> On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
>
>> We added an early exit in thp_underused(), probably to avoid scanning
>> pages when there is no chance for success.
>>
>> However, assume we have max_ptes_none = 511 (default).
>>
>> Nothing should stop us from freeing all pages part of a THP that
>> is completely zero (512) and khugepaged will for sure not try to
>> instantiate a THP in that case (512 shared zeropages).
>>
>> This can just trivially happen if someone writes a single 0 byte into a
>> PMD area, or of course, when data ends up being zero later.
>>
>> So let's remove that early exit.
>>
>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>>
>> Note that, as default, the THP shrinker is active
>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>> THPs are added to the deferred split lists. However, with the
>> max_ptes_none default we would never scan them. We would not do that. If
>> that's not desirable, we should just disable the shrinker as default,
>> also not adding all THPs to the deferred split lists.
>>
>> Easy to reproduce:
>>
>> 1) Allocate some THPs filled with 0s
>>
>> <prog.c>
>> #include <string.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>>
>> const size_t size = 1024*1024*1024;
>>
>> int main(void)
>> {
>> size_t offs;
>> char *area;
>>
>> area = mmap(0, size, PROT_READ | PROT_WRITE,
>> MAP_ANON | MAP_PRIVATE, -1, 0);
>> if (area == MAP_FAILED) {
>> printf("mmap failed\n");
>> exit(-1);
>> }
>> madvise(area, size, MADV_HUGEPAGE);
>>
>> for (offs = 0; offs < size; offs += getpagesize())
>> area[offs] = 0;
>> pause();
>> }
>> <\prog.c>
>>
>> 2) Trigger the shrinker
>>
>> E.g., memory pressure through memhog
>>
>> 3) Observe that THPs are not getting reclaimed
>>
>> $ cat /proc/`pgrep prog`/smaps_rollup
>>
>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>> reclaimed as expected.
>>
>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> mm/huge_memory.c | 3 ---
>> 1 file changed, 3 deletions(-)
>>
> LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
>
> I also notice that thp_underused() checks num_zero_pages directly
> against khugepaged_max_ptes_none. This means mTHPs will never be regarded
> as underused. A similar issue you are discussing in Nico’s khugepaged
> mTHP support. Maybe checks against these khugepaged_max* variables
> should be calculated based on nr_pages of a large folio, like
> making these variables a ratio in other discussion.
I unfortunately didnt follow the series in the latest revisions.
In the earlier revisions, I think it was decided to not add mTHPs to shrinker
as a start, as there are diminshing returns for smaller THPs and having a lot
of smaller mTHPs in the deferred list might mean that we get to PMD mapped THPs
a lot slower?
>
> Best Regards,
> Yan, Zi
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:40 ` Usama Arif
@ 2025-09-05 14:46 ` David Hildenbrand
2025-09-05 14:53 ` Usama Arif
2025-09-14 14:04 ` Dev Jain
1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 14:46 UTC (permalink / raw)
To: Usama Arif, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
[...]
>
> The reason I did this is for the case if you change max_ptes_none after the THP is added
> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
> so that its considered for splitting.
Yeah, I was assuming that was the reason why the shrinker is enabled as
default.
But in any sane system, the admin would enable the shrinker early. If
not, we can look into handling it differently.
>
>> Easy to reproduce:
>>
>> 1) Allocate some THPs filled with 0s
>>
>> <prog.c>
>> #include <string.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <sys/mman.h>
>>
>> const size_t size = 1024*1024*1024;
>>
>> int main(void)
>> {
>> size_t offs;
>> char *area;
>>
>> area = mmap(0, size, PROT_READ | PROT_WRITE,
>> MAP_ANON | MAP_PRIVATE, -1, 0);
>> if (area == MAP_FAILED) {
>> printf("mmap failed\n");
>> exit(-1);
>> }
>> madvise(area, size, MADV_HUGEPAGE);
>>
>> for (offs = 0; offs < size; offs += getpagesize())
>> area[offs] = 0;
>> pause();
>> }
>> <\prog.c>
>>
>> 2) Trigger the shrinker
>>
>> E.g., memory pressure through memhog
>>
>> 3) Observe that THPs are not getting reclaimed
>>
>> $ cat /proc/`pgrep prog`/smaps_rollup
>>
>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>> reclaimed as expected.
>>
>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> mm/huge_memory.c | 3 ---
>> 1 file changed, 3 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 26cedfcd74189..aa3ed7a86435b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>> void *kaddr;
>> int i;
>>
>> - if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>> - return false;
>> -
>
> I do agree with your usecase, but I am really worried about the amount of
> work and cpu time the THP shrinker will consume when max_ptes_none is 511
> (I dont have any numbers to back up my worry :)), and its less likely that
> we will have these completely zeroed out THPs (again no numbers to back up
> this statement).
Then then shrinker shall be deactivated as default if that becomes a
problem.
Fortunately you documented the desired semantics:
"All THPs at fault and collapse time will be added to _deferred_list,
and will therefore be split under memory pressure if they are considered
"underused". A THP is underused if the number of zero-filled pages in
the THP is above max_ptes_none (see below)."
> We have the huge_zero_folio as well which is installed on read.
Yes, only if the huge zero folio is not available. Which will then also
get properly reclaimed.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:43 ` Usama Arif
@ 2025-09-05 14:47 ` David Hildenbrand
2025-09-05 14:58 ` Usama Arif
0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 14:47 UTC (permalink / raw)
To: Usama Arif, Zi Yan
Cc: linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song
On 05.09.25 16:43, Usama Arif wrote:
>
>
> On 05/09/2025 15:37, Zi Yan wrote:
>> On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
>>
>>> We added an early exit in thp_underused(), probably to avoid scanning
>>> pages when there is no chance for success.
>>>
>>> However, assume we have max_ptes_none = 511 (default).
>>>
>>> Nothing should stop us from freeing all pages part of a THP that
>>> is completely zero (512) and khugepaged will for sure not try to
>>> instantiate a THP in that case (512 shared zeropages).
>>>
>>> This can just trivially happen if someone writes a single 0 byte into a
>>> PMD area, or of course, when data ends up being zero later.
>>>
>>> So let's remove that early exit.
>>>
>>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>>>
>>> Note that, as default, the THP shrinker is active
>>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>>> THPs are added to the deferred split lists. However, with the
>>> max_ptes_none default we would never scan them. We would not do that. If
>>> that's not desirable, we should just disable the shrinker as default,
>>> also not adding all THPs to the deferred split lists.
>>>
>>> Easy to reproduce:
>>>
>>> 1) Allocate some THPs filled with 0s
>>>
>>> <prog.c>
>>> #include <string.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>>
>>> const size_t size = 1024*1024*1024;
>>>
>>> int main(void)
>>> {
>>> size_t offs;
>>> char *area;
>>>
>>> area = mmap(0, size, PROT_READ | PROT_WRITE,
>>> MAP_ANON | MAP_PRIVATE, -1, 0);
>>> if (area == MAP_FAILED) {
>>> printf("mmap failed\n");
>>> exit(-1);
>>> }
>>> madvise(area, size, MADV_HUGEPAGE);
>>>
>>> for (offs = 0; offs < size; offs += getpagesize())
>>> area[offs] = 0;
>>> pause();
>>> }
>>> <\prog.c>
>>>
>>> 2) Trigger the shrinker
>>>
>>> E.g., memory pressure through memhog
>>>
>>> 3) Observe that THPs are not getting reclaimed
>>>
>>> $ cat /proc/`pgrep prog`/smaps_rollup
>>>
>>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>>> reclaimed as expected.
>>>
>>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>> mm/huge_memory.c | 3 ---
>>> 1 file changed, 3 deletions(-)
>>>
>> LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
>>
>> I also notice that thp_underused() checks num_zero_pages directly
>> against khugepaged_max_ptes_none. This means mTHPs will never be regarded
>> as underused. A similar issue you are discussing in Nico’s khugepaged
>> mTHP support. Maybe checks against these khugepaged_max* variables
>> should be calculated based on nr_pages of a large folio, like
>> making these variables a ratio in other discussion.
>
> I unfortunately didnt follow the series in the latest revisions.
>
> In the earlier revisions, I think it was decided to not add mTHPs to shrinker
> as a start, as there are diminshing returns for smaller THPs and having a lot
> of smaller mTHPs in the deferred list might mean that we get to PMD mapped THPs
> a lot slower?
Probably we would want lists per order etc.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:46 ` David Hildenbrand
@ 2025-09-05 14:53 ` Usama Arif
2025-09-05 14:58 ` David Hildenbrand
0 siblings, 1 reply; 31+ messages in thread
From: Usama Arif @ 2025-09-05 14:53 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 15:46, David Hildenbrand wrote:
> [...]
>
>>
>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>> so that its considered for splitting.
>
> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>
> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
Yes, I do this as well, i.e. have a low value from the start.
Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>
>>> Easy to reproduce:
>>>
>>> 1) Allocate some THPs filled with 0s
>>>
>>> <prog.c>
>>> #include <string.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>> #include <sys/mman.h>
>>>
>>> const size_t size = 1024*1024*1024;
>>>
>>> int main(void)
>>> {
>>> size_t offs;
>>> char *area;
>>>
>>> area = mmap(0, size, PROT_READ | PROT_WRITE,
>>> MAP_ANON | MAP_PRIVATE, -1, 0);
>>> if (area == MAP_FAILED) {
>>> printf("mmap failed\n");
>>> exit(-1);
>>> }
>>> madvise(area, size, MADV_HUGEPAGE);
>>>
>>> for (offs = 0; offs < size; offs += getpagesize())
>>> area[offs] = 0;
>>> pause();
>>> }
>>> <\prog.c>
>>>
>>> 2) Trigger the shrinker
>>>
>>> E.g., memory pressure through memhog
>>>
>>> 3) Observe that THPs are not getting reclaimed
>>>
>>> $ cat /proc/`pgrep prog`/smaps_rollup
>>>
>>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>>> reclaimed as expected.
>>>
>>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>> mm/huge_memory.c | 3 ---
>>> 1 file changed, 3 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 26cedfcd74189..aa3ed7a86435b 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>>> void *kaddr;
>>> int i;
>>> - if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>>> - return false;
>>> -
>>
>> I do agree with your usecase, but I am really worried about the amount of
>> work and cpu time the THP shrinker will consume when max_ptes_none is 511
>> (I dont have any numbers to back up my worry :)), and its less likely that
>> we will have these completely zeroed out THPs (again no numbers to back up
>> this statement).
>
> Then then shrinker shall be deactivated as default if that becomes a problem.
>
> Fortunately you documented the desired semantics:
>
> "All THPs at fault and collapse time will be added to _deferred_list,
> and will therefore be split under memory pressure if they are considered
> "underused". A THP is underused if the number of zero-filled pages in
> the THP is above max_ptes_none (see below)."
>
>> We have the huge_zero_folio as well which is installed on read.
>
> Yes, only if the huge zero folio is not available. Which will then also get properly reclaimed.
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:47 ` David Hildenbrand
@ 2025-09-05 14:58 ` Usama Arif
0 siblings, 0 replies; 31+ messages in thread
From: Usama Arif @ 2025-09-05 14:58 UTC (permalink / raw)
To: David Hildenbrand, Zi Yan
Cc: linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song
On 05/09/2025 15:47, David Hildenbrand wrote:
> On 05.09.25 16:43, Usama Arif wrote:
>>
>>
>> On 05/09/2025 15:37, Zi Yan wrote:
>>> On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
>>>
>>>> We added an early exit in thp_underused(), probably to avoid scanning
>>>> pages when there is no chance for success.
>>>>
>>>> However, assume we have max_ptes_none = 511 (default).
>>>>
>>>> Nothing should stop us from freeing all pages part of a THP that
>>>> is completely zero (512) and khugepaged will for sure not try to
>>>> instantiate a THP in that case (512 shared zeropages).
>>>>
>>>> This can just trivially happen if someone writes a single 0 byte into a
>>>> PMD area, or of course, when data ends up being zero later.
>>>>
>>>> So let's remove that early exit.
>>>>
>>>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>>>>
>>>> Note that, as default, the THP shrinker is active
>>>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>>>> THPs are added to the deferred split lists. However, with the
>>>> max_ptes_none default we would never scan them. We would not do that. If
>>>> that's not desirable, we should just disable the shrinker as default,
>>>> also not adding all THPs to the deferred split lists.
>>>>
>>>> Easy to reproduce:
>>>>
>>>> 1) Allocate some THPs filled with 0s
>>>>
>>>> <prog.c>
>>>> #include <string.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <unistd.h>
>>>> #include <sys/mman.h>
>>>>
>>>> const size_t size = 1024*1024*1024;
>>>>
>>>> int main(void)
>>>> {
>>>> size_t offs;
>>>> char *area;
>>>>
>>>> area = mmap(0, size, PROT_READ | PROT_WRITE,
>>>> MAP_ANON | MAP_PRIVATE, -1, 0);
>>>> if (area == MAP_FAILED) {
>>>> printf("mmap failed\n");
>>>> exit(-1);
>>>> }
>>>> madvise(area, size, MADV_HUGEPAGE);
>>>>
>>>> for (offs = 0; offs < size; offs += getpagesize())
>>>> area[offs] = 0;
>>>> pause();
>>>> }
>>>> <\prog.c>
>>>>
>>>> 2) Trigger the shrinker
>>>>
>>>> E.g., memory pressure through memhog
>>>>
>>>> 3) Observe that THPs are not getting reclaimed
>>>>
>>>> $ cat /proc/`pgrep prog`/smaps_rollup
>>>>
>>>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>>>> reclaimed as expected.
>>>>
>>>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>>> Cc: Nico Pache <npache@redhat.com>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Dev Jain <dev.jain@arm.com>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>> mm/huge_memory.c | 3 ---
>>>> 1 file changed, 3 deletions(-)
>>>>
>>> LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
>>>
>>> I also notice that thp_underused() checks num_zero_pages directly
>>> against khugepaged_max_ptes_none. This means mTHPs will never be regarded
>>> as underused. A similar issue you are discussing in Nico’s khugepaged
>>> mTHP support. Maybe checks against these khugepaged_max* variables
>>> should be calculated based on nr_pages of a large folio, like
>>> making these variables a ratio in other discussion.
>>
>> I unfortunately didnt follow the series in the latest revisions.
>>
>> In the earlier revisions, I think it was decided to not add mTHPs to shrinker
>> as a start, as there are diminshing returns for smaller THPs and having a lot
>> of smaller mTHPs in the deferred list might mean that we get to PMD mapped THPs
>> a lot slower?
>
> Probably we would want lists per order etc.
>
Yes that makes sense! and we start with the highest order list.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:53 ` Usama Arif
@ 2025-09-05 14:58 ` David Hildenbrand
2025-09-05 15:01 ` Usama Arif
0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 14:58 UTC (permalink / raw)
To: Usama Arif, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05.09.25 16:53, Usama Arif wrote:
>
>
> On 05/09/2025 15:46, David Hildenbrand wrote:
>> [...]
>>
>>>
>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>> so that its considered for splitting.
>>
>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>
>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>
> Yes, I do this as well, i.e. have a low value from the start.
>
> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
> the usecase you are describing below, but we wont encounter the increased CPU usage.>
I don't really see why we should do that.
If the shrinker is a problem than the shrinker should be disabled. But
if it is enabled, we should be shrinking as documented.
Without more magic around our THP toggles (we want less) :)
Shrinking happens when we are under memory pressure, so I am not really
sure how relevant the scanning bit is, and if it is relevant enought to
change the shrinker default.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:58 ` David Hildenbrand
@ 2025-09-05 15:01 ` Usama Arif
2025-09-05 15:04 ` David Hildenbrand
0 siblings, 1 reply; 31+ messages in thread
From: Usama Arif @ 2025-09-05 15:01 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 15:58, David Hildenbrand wrote:
> On 05.09.25 16:53, Usama Arif wrote:
>>
>>
>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>> [...]
>>>
>>>>
>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>> so that its considered for splitting.
>>>
>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>
>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>
>> Yes, I do this as well, i.e. have a low value from the start.
>>
>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>
> I don't really see why we should do that.
>
> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>
> Without more magic around our THP toggles (we want less) :)
>
> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>
yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:11 [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default David Hildenbrand
2025-09-05 14:37 ` Zi Yan
2025-09-05 14:40 ` Usama Arif
@ 2025-09-05 15:02 ` Usama Arif
2025-09-05 15:11 ` David Hildenbrand
` (3 subsequent siblings)
6 siblings, 0 replies; 31+ messages in thread
From: Usama Arif @ 2025-09-05 15:02 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 15:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
> #include <string.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> const size_t size = 1024*1024*1024;
>
> int main(void)
> {
> size_t offs;
> char *area;
>
> area = mmap(0, size, PROT_READ | PROT_WRITE,
> MAP_ANON | MAP_PRIVATE, -1, 0);
> if (area == MAP_FAILED) {
> printf("mmap failed\n");
> exit(-1);
> }
> madvise(area, size, MADV_HUGEPAGE);
>
> for (offs = 0; offs < size; offs += getpagesize())
> area[offs] = 0;
> pause();
> }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Usama Arif <usamaarif642@gmail.com>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:01 ` Usama Arif
@ 2025-09-05 15:04 ` David Hildenbrand
2025-09-05 15:16 ` Usama Arif
0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 15:04 UTC (permalink / raw)
To: Usama Arif, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05.09.25 17:01, Usama Arif wrote:
>
>
> On 05/09/2025 15:58, David Hildenbrand wrote:
>> On 05.09.25 16:53, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>> [...]
>>>>
>>>>>
>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>> so that its considered for splitting.
>>>>
>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>
>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>
>>> Yes, I do this as well, i.e. have a low value from the start.
>>>
>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>
>> I don't really see why we should do that.
>>
>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>
>> Without more magic around our THP toggles (we want less) :)
>>
>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>
>
> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
BTW, I was also wondering if we should just always add all THP to the
deferred split list, and make the split toggle just affect whether we
process them or not (scan or not).
I mean, as a default we add all of them to the list already right now,
even though nothing would ever get reclaimed as default.
What's your take?
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:11 [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default David Hildenbrand
` (2 preceding siblings ...)
2025-09-05 15:02 ` Usama Arif
@ 2025-09-05 15:11 ` David Hildenbrand
2025-09-05 15:30 ` Lorenzo Stoakes
` (2 subsequent siblings)
6 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 15:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Usama Arif
On 05.09.25 16:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
Just a note that I will send a v2 with an updated patch description
after the discussions here settled.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:04 ` David Hildenbrand
@ 2025-09-05 15:16 ` Usama Arif
2025-09-05 15:28 ` David Hildenbrand
0 siblings, 1 reply; 31+ messages in thread
From: Usama Arif @ 2025-09-05 15:16 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 16:04, David Hildenbrand wrote:
> On 05.09.25 17:01, Usama Arif wrote:
>>
>>
>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>
>>>>
>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>> [...]
>>>>>
>>>>>>
>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>> so that its considered for splitting.
>>>>>
>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>
>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>
>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>
>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>
>>> I don't really see why we should do that.
>>>
>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>
>>> Without more magic around our THP toggles (we want less) :)
>>>
>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>
>>
>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>
> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>
> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>
> What's your take?
>
hmm I probably didnt understand what you meant to say here:
we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
In deferred_split_folio, if split_underused_thp is false, we dont add them to the list (unless partially_mapped).
Unless you are referring to non pmd mapped THPs?
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:16 ` Usama Arif
@ 2025-09-05 15:28 ` David Hildenbrand
2025-09-05 15:53 ` Usama Arif
0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 15:28 UTC (permalink / raw)
To: Usama Arif, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05.09.25 17:16, Usama Arif wrote:
>
>
> On 05/09/2025 16:04, David Hildenbrand wrote:
>> On 05.09.25 17:01, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>> [...]
>>>>>>
>>>>>>>
>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>> so that its considered for splitting.
>>>>>>
>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>
>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>
>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>
>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>
>>>> I don't really see why we should do that.
>>>>
>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>
>>>> Without more magic around our THP toggles (we want less) :)
>>>>
>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>
>>>
>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>
>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>
>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>
>> What's your take?
>>
>
> hmm I probably didnt understand what you meant to say here:
> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
This is what I mean:
commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
Author: David Hildenbrand <david@redhat.com>
Date: Fri Sep 5 17:22:01 2025 +0200
mm/huge_memory: always add THPs to the deferred split list
When disabling the shrinker and then re-enabling it, any anon THPs
allocated in the meantime.
That also means that we cannot disable the shrinker as default during
boot, because we would miss some THPs later when enabling it.
So always add them to the deferred split list, and only skip the
scanning if the shrinker is disabled.
This is effectively what we do on all systems out there already, unless
they disable the shrinker.
Signed-off-by: David Hildenbrand <david@redhat.com>
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aa3ed7a86435b..3ee857c1d3754 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
if (folio_order(folio) <= 1)
return;
- if (!partially_mapped && !split_underused_thp)
- return;
-
/*
* Exclude swapcache: originally to avoid a corrupt deferred split
* queue. Nowadays that is fully prevented by memcg1_swapout();
@@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
bool underused = false;
if (!folio_test_partially_mapped(folio)) {
+ if (!split_underused_thp)
+ goto next;
underused = thp_underused(folio);
if (!underused)
goto next;
--
Cheers
David / dhildenb
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:11 [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default David Hildenbrand
` (3 preceding siblings ...)
2025-09-05 15:11 ` David Hildenbrand
@ 2025-09-05 15:30 ` Lorenzo Stoakes
2025-09-05 15:36 ` David Hildenbrand
2025-09-06 6:39 ` Lance Yang
2025-09-08 2:16 ` Baolin Wang
6 siblings, 1 reply; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-09-05 15:30 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Andrew Morton, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Usama Arif
On Fri, Sep 05, 2025 at 04:11:37PM +0200, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
Freeing 'all pages which are part of a THP' rather?
> is completely zero (512) and khugepaged will for sure not try to
that is -> that are?
> instantiate a THP in that case (512 shared zeropages).
But if you write faulted they're not the zero page? And how are they shared? I
mean be being dumb here.
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
Surely this is worth having?
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
Nit but 'we would not do that' is kinda duplicative here :)
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
> #include <string.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> const size_t size = 1024*1024*1024;
>
> int main(void)
> {
> size_t offs;
> char *area;
>
> area = mmap(0, size, PROT_READ | PROT_WRITE,
> MAP_ANON | MAP_PRIVATE, -1, 0);
> if (area == MAP_FAILED) {
> printf("mmap failed\n");
> exit(-1);
> }
> madvise(area, size, MADV_HUGEPAGE);
>
> for (offs = 0; offs < size; offs += getpagesize())
> area[offs] = 0;
> pause();
> }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
Yikes!
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/huge_memory.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd74189..aa3ed7a86435b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
> void *kaddr;
> int i;
>
> - if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> - return false;
> -
> for (i = 0; i < folio_nr_pages(folio); i++) {
> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:30 ` Lorenzo Stoakes
@ 2025-09-05 15:36 ` David Hildenbrand
2025-09-08 11:32 ` Lorenzo Stoakes
0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 15:36 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, linux-mm, Andrew Morton, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Usama Arif
On 05.09.25 17:30, Lorenzo Stoakes wrote:
> On Fri, Sep 05, 2025 at 04:11:37PM +0200, David Hildenbrand wrote:
>> We added an early exit in thp_underused(), probably to avoid scanning
>> pages when there is no chance for success.
>>
>> However, assume we have max_ptes_none = 511 (default).
>>
>> Nothing should stop us from freeing all pages part of a THP that
>
> Freeing 'all pages which are part of a THP' rather?
I'm German, I don't know what I'm doing. :D
>
>> is completely zero (512) and khugepaged will for sure not try to
>
> that is -> that are?
the THP is zero?
>
>> instantiate a THP in that case (512 shared zeropages).
>
> But if you write faulted they're not the zero page? And how are they shared? I
> mean be being dumb here.
The shrinker will replace zeroed pages by the shared zeropages.
>
>>
>> This can just trivially happen if someone writes a single 0 byte into a
>> PMD area, or of course, when data ends up being zero later.
>>
>> So let's remove that early exit.
>>
>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Surely this is worth having?
Alrighty, let me cc stable.
>
>>
>> Note that, as default, the THP shrinker is active
>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>> THPs are added to the deferred split lists. However, with the
>> max_ptes_none default we would never scan them. We would not do that. If
>
> Nit but 'we would not do that' is kinda duplicative here :)
Yeah, fixed it already while rewriting: this was meant to be "would now".
Thanks!
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:28 ` David Hildenbrand
@ 2025-09-05 15:53 ` Usama Arif
2025-09-05 15:57 ` Usama Arif
2025-09-05 15:58 ` David Hildenbrand
0 siblings, 2 replies; 31+ messages in thread
From: Usama Arif @ 2025-09-05 15:53 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 16:28, David Hildenbrand wrote:
> On 05.09.25 17:16, Usama Arif wrote:
>>
>>
>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>
>>>>
>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>
>>>>>>
>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>
>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>> so that its considered for splitting.
>>>>>>>
>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>
>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>
>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>
>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>
>>>>> I don't really see why we should do that.
>>>>>
>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>
>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>
>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>
>>>>
>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>
>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>
>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>
>>> What's your take?
>>>
>>
>> hmm I probably didnt understand what you meant to say here:
>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>
> This is what I mean:
>
> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
> Author: David Hildenbrand <david@redhat.com>
> Date: Fri Sep 5 17:22:01 2025 +0200
>
> mm/huge_memory: always add THPs to the deferred split list
> When disabling the shrinker and then re-enabling it, any anon THPs
> allocated in the meantime.
> That also means that we cannot disable the shrinker as default during
> boot, because we would miss some THPs later when enabling it.
> So always add them to the deferred split list, and only skip the
> scanning if the shrinker is disabled.
> This is effectively what we do on all systems out there already, unless
> they disable the shrinker.
> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index aa3ed7a86435b..3ee857c1d3754 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> if (folio_order(folio) <= 1)
> return;
>
> - if (!partially_mapped && !split_underused_thp)
> - return;
> -
> /*
> * Exclude swapcache: originally to avoid a corrupt deferred split
> * queue. Nowadays that is fully prevented by memcg1_swapout();
> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> bool underused = false;
>
> if (!folio_test_partially_mapped(folio)) {
> + if (!split_underused_thp)
> + goto next;
> underused = thp_underused(folio);
> if (!underused)
> goto next;
>
>
Thanks for sending the diff! Now I know what you meant lol.
In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
very ineffective?
I am making up numbers, but lets there are 128 THPs in the system, only 2 of them are partially mapped
and sc->nr_to_scan is 32.
In the current code, with shrinker disabled, only the 2 partially mapped THPs will be on the deferred list, so
we will reclaim them in the first go.
With your patch, the worst case scenario is that the partially mapped THPs are at the end of the deferred_list
and we would need 4 calls for the shrinker to split them.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:53 ` Usama Arif
@ 2025-09-05 15:57 ` Usama Arif
2025-09-05 15:58 ` David Hildenbrand
1 sibling, 0 replies; 31+ messages in thread
From: Usama Arif @ 2025-09-05 15:57 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 16:53, Usama Arif wrote:
>
>
> On 05/09/2025 16:28, David Hildenbrand wrote:
>> On 05.09.25 17:16, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>> so that its considered for splitting.
>>>>>>>>
>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>
>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>
>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>
>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>
>>>>>> I don't really see why we should do that.
>>>>>>
>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>
>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>
>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>
>>>>>
>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>
>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>
>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>
>>>> What's your take?
>>>>
>>>
>>> hmm I probably didnt understand what you meant to say here:
>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>
>> This is what I mean:
>>
>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>> Author: David Hildenbrand <david@redhat.com>
>> Date: Fri Sep 5 17:22:01 2025 +0200
>>
>> mm/huge_memory: always add THPs to the deferred split list
>> When disabling the shrinker and then re-enabling it, any anon THPs
>> allocated in the meantime.
>> That also means that we cannot disable the shrinker as default during
>> boot, because we would miss some THPs later when enabling it.
>> So always add them to the deferred split list, and only skip the
>> scanning if the shrinker is disabled.
>> This is effectively what we do on all systems out there already, unless
>> they disable the shrinker.
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index aa3ed7a86435b..3ee857c1d3754 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>> if (folio_order(folio) <= 1)
>> return;
>>
>> - if (!partially_mapped && !split_underused_thp)
>> - return;
>> -
>> /*
>> * Exclude swapcache: originally to avoid a corrupt deferred split
>> * queue. Nowadays that is fully prevented by memcg1_swapout();
>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> bool underused = false;
>>
>> if (!folio_test_partially_mapped(folio)) {
>> + if (!split_underused_thp)
>> + goto next;
>> underused = thp_underused(folio);
>> if (!underused)
>> goto next;
>>
>>
>
>
> Thanks for sending the diff! Now I know what you meant lol.
>
> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
> very ineffective?
>
> I am making up numbers, but lets there are 128 THPs in the system, only 2 of them are partially mapped
> and sc->nr_to_scan is 32.
>
> In the current code, with shrinker disabled, only the 2 partially mapped THPs will be on the deferred list, so
> we will reclaim them in the first go.
>
> With your patch, the worst case scenario is that the partially mapped THPs are at the end of the deferred_list
> and we would need 4 calls for the shrinker to split them.
And I am hoping people are not dynamically enabling/disabling THP shrinker :)
I have ideas about dynamically changing max_ptes_none, maybe based on system metrics like memory pressure,
but not enabling/disabling shrinker.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:53 ` Usama Arif
2025-09-05 15:57 ` Usama Arif
@ 2025-09-05 15:58 ` David Hildenbrand
2025-09-05 16:47 ` Usama Arif
1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 15:58 UTC (permalink / raw)
To: Usama Arif, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05.09.25 17:53, Usama Arif wrote:
>
>
> On 05/09/2025 16:28, David Hildenbrand wrote:
>> On 05.09.25 17:16, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>> so that its considered for splitting.
>>>>>>>>
>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>
>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>
>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>
>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>
>>>>>> I don't really see why we should do that.
>>>>>>
>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>
>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>
>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>
>>>>>
>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>
>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>
>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>
>>>> What's your take?
>>>>
>>>
>>> hmm I probably didnt understand what you meant to say here:
>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>
>> This is what I mean:
>>
>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>> Author: David Hildenbrand <david@redhat.com>
>> Date: Fri Sep 5 17:22:01 2025 +0200
>>
>> mm/huge_memory: always add THPs to the deferred split list
>> When disabling the shrinker and then re-enabling it, any anon THPs
>> allocated in the meantime.
>> That also means that we cannot disable the shrinker as default during
>> boot, because we would miss some THPs later when enabling it.
>> So always add them to the deferred split list, and only skip the
>> scanning if the shrinker is disabled.
>> This is effectively what we do on all systems out there already, unless
>> they disable the shrinker.
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index aa3ed7a86435b..3ee857c1d3754 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>> if (folio_order(folio) <= 1)
>> return;
>>
>> - if (!partially_mapped && !split_underused_thp)
>> - return;
>> -
>> /*
>> * Exclude swapcache: originally to avoid a corrupt deferred split
>> * queue. Nowadays that is fully prevented by memcg1_swapout();
>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> bool underused = false;
>>
>> if (!folio_test_partially_mapped(folio)) {
>> + if (!split_underused_thp)
>> + goto next;
>> underused = thp_underused(folio);
>> if (!underused)
>> goto next;
>>
>>
>
>
> Thanks for sending the diff! Now I know what you meant lol.
>
> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
> very ineffective?
I hope you realize that that's the default on each and every system out
there that ships this feature :)
And don't ask me how many people even know about disabling the shrinker
or would do it, when the default setting is mostly not splitting many
THPs ever.
>
> I am making up numbers, but lets there are 128 THPs in the system, only 2 of them are partially mapped
> and sc->nr_to_scan is 32.
>
> In the current code, with shrinker disabled, only the 2 partially mapped THPs will be on the deferred list, so
> we will reclaim them in the first go.
>
> With your patch, the worst case scenario is that the partially mapped THPs are at the end of the deferred_list
> and we would need 4 calls for the shrinker to split them.
Probably at some point we would want split lists as well, not sure how
feasible that is.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:58 ` David Hildenbrand
@ 2025-09-05 16:47 ` Usama Arif
2025-09-05 16:55 ` David Hildenbrand
0 siblings, 1 reply; 31+ messages in thread
From: Usama Arif @ 2025-09-05 16:47 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 16:58, David Hildenbrand wrote:
> On 05.09.25 17:53, Usama Arif wrote:
>>
>>
>> On 05/09/2025 16:28, David Hildenbrand wrote:
>>> On 05.09.25 17:16, Usama Arif wrote:
>>>>
>>>>
>>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>>
>>>>>>
>>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>>> so that its considered for splitting.
>>>>>>>>>
>>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>>
>>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>>
>>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>>
>>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>>
>>>>>>> I don't really see why we should do that.
>>>>>>>
>>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>>
>>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>>
>>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>>
>>>>>>
>>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>>
>>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>>
>>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>>
>>>>> What's your take?
>>>>>
>>>>
>>>> hmm I probably didnt understand what you meant to say here:
>>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>>
>>> This is what I mean:
>>>
>>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>>> Author: David Hildenbrand <david@redhat.com>
>>> Date: Fri Sep 5 17:22:01 2025 +0200
>>>
>>> mm/huge_memory: always add THPs to the deferred split list
>>> When disabling the shrinker and then re-enabling it, any anon THPs
>>> allocated in the meantime.
>>> That also means that we cannot disable the shrinker as default during
>>> boot, because we would miss some THPs later when enabling it.
>>> So always add them to the deferred split list, and only skip the
>>> scanning if the shrinker is disabled.
>>> This is effectively what we do on all systems out there already, unless
>>> they disable the shrinker.
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index aa3ed7a86435b..3ee857c1d3754 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>> if (folio_order(folio) <= 1)
>>> return;
>>> - if (!partially_mapped && !split_underused_thp)
>>> - return;
>>> -
>>> /*
>>> * Exclude swapcache: originally to avoid a corrupt deferred split
>>> * queue. Nowadays that is fully prevented by memcg1_swapout();
>>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>> bool underused = false;
>>> if (!folio_test_partially_mapped(folio)) {
>>> + if (!split_underused_thp)
>>> + goto next;
>>> underused = thp_underused(folio);
>>> if (!underused)
>>> goto next;
>>>
>>>
>>
>>
>> Thanks for sending the diff! Now I know what you meant lol.
>>
>> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
>> very ineffective?
>
> I hope you realize that that's the default on each and every system out there that ships this feature :)
>
Yes, I made it default :)
I am assuming people either keep shrinker enabled (which is an extremely large majority as its default), or disable shrinker
and they dont flip flop between the 2 settings.
There are 2 scenarios for the above patch:
- shrinker is enabled (default): the above patch wont make a difference.
- shrinker is disabled: the above patch makes splitting partially mapped folios inefficient.
I didnt talk about the shrinker enabled case as it was a no-op and just talked about the shrinker disabled
case.
> And don't ask me how many people even know about disabling the shrinker or would do it, when the default setting is mostly not splitting many THPs ever.
>
>>
>> I am making up numbers, but lets there are 128 THPs in the system, only 2 of them are partially mapped
>> and sc->nr_to_scan is 32.
>>
>> In the current code, with shrinker disabled, only the 2 partially mapped THPs will be on the deferred list, so
>> we will reclaim them in the first go.
>>
>> With your patch, the worst case scenario is that the partially mapped THPs are at the end of the deferred_list
>> and we would need 4 calls for the shrinker to split them.
>
> Probably at some point we would want split lists as well, not sure how feasible that is.
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 16:47 ` Usama Arif
@ 2025-09-05 16:55 ` David Hildenbrand
2025-09-05 17:26 ` Usama Arif
0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2025-09-05 16:55 UTC (permalink / raw)
To: Usama Arif, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05.09.25 18:47, Usama Arif wrote:
>
>
> On 05/09/2025 16:58, David Hildenbrand wrote:
>> On 05.09.25 17:53, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 16:28, David Hildenbrand wrote:
>>>> On 05.09.25 17:16, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>>>> [...]
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>>>> so that its considered for splitting.
>>>>>>>>>>
>>>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>>>
>>>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>>>
>>>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>>>
>>>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>>>
>>>>>>>> I don't really see why we should do that.
>>>>>>>>
>>>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>>>
>>>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>>>
>>>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>>>
>>>>>>>
>>>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>>>
>>>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>>>
>>>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>>>
>>>>>> What's your take?
>>>>>>
>>>>>
>>>>> hmm I probably didnt understand what you meant to say here:
>>>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>>>
>>>> This is what I mean:
>>>>
>>>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>>>> Author: David Hildenbrand <david@redhat.com>
>>>> Date: Fri Sep 5 17:22:01 2025 +0200
>>>>
>>>> mm/huge_memory: always add THPs to the deferred split list
>>>> When disabling the shrinker and then re-enabling it, any anon THPs
>>>> allocated in the meantime.
>>>> That also means that we cannot disable the shrinker as default during
>>>> boot, because we would miss some THPs later when enabling it.
>>>> So always add them to the deferred split list, and only skip the
>>>> scanning if the shrinker is disabled.
>>>> This is effectively what we do on all systems out there already, unless
>>>> they disable the shrinker.
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index aa3ed7a86435b..3ee857c1d3754 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>> if (folio_order(folio) <= 1)
>>>> return;
>>>> - if (!partially_mapped && !split_underused_thp)
>>>> - return;
>>>> -
>>>> /*
>>>> * Exclude swapcache: originally to avoid a corrupt deferred split
>>>> * queue. Nowadays that is fully prevented by memcg1_swapout();
>>>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>> bool underused = false;
>>>> if (!folio_test_partially_mapped(folio)) {
>>>> + if (!split_underused_thp)
>>>> + goto next;
>>>> underused = thp_underused(folio);
>>>> if (!underused)
>>>> goto next;
>>>>
>>>>
>>>
>>>
>>> Thanks for sending the diff! Now I know what you meant lol.
>>>
>>> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
>>> very ineffective?
>>
>> I hope you realize that that's the default on each and every system out there that ships this feature :)
>>
>
> Yes, I made it default :)
>
> I am assuming people either keep shrinker enabled (which is an extremely large majority as its default), or disable shrinker
> and they dont flip flop between the 2 settings.
> There are 2 scenarios for the above patch:
>
> - shrinker is enabled (default): the above patch wont make a difference.
> - shrinker is disabled: the above patch makes splitting partially mapped folios inefficient.
>
> I didnt talk about the shrinker enabled case as it was a no-op and just talked about the shrinker disabled
> case.
Yeah, and I am saying that all you raised as a concern would be a
problem already today in all default setups (-> 99.999999%). :)
Probably we should not just disable the shrinker during boot, and once
enabled, it would only split THPs created afterwards.
With this patch it would also split ones created previously.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 16:55 ` David Hildenbrand
@ 2025-09-05 17:26 ` Usama Arif
2025-09-08 9:14 ` David Hildenbrand
0 siblings, 1 reply; 31+ messages in thread
From: Usama Arif @ 2025-09-05 17:26 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05/09/2025 17:55, David Hildenbrand wrote:
> On 05.09.25 18:47, Usama Arif wrote:
>>
>>
>> On 05/09/2025 16:58, David Hildenbrand wrote:
>>> On 05.09.25 17:53, Usama Arif wrote:
>>>>
>>>>
>>>> On 05/09/2025 16:28, David Hildenbrand wrote:
>>>>> On 05.09.25 17:16, Usama Arif wrote:
>>>>>>
>>>>>>
>>>>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>>>>> so that its considered for splitting.
>>>>>>>>>>>
>>>>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>>>>
>>>>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>>>>
>>>>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>>>>
>>>>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>>>>
>>>>>>>>> I don't really see why we should do that.
>>>>>>>>>
>>>>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>>>>
>>>>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>>>>
>>>>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>>>>
>>>>>>>>
>>>>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>>>>
>>>>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>>>>
>>>>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>>>>
>>>>>>> What's your take?
>>>>>>>
>>>>>>
>>>>>> hmm I probably didnt understand what you meant to say here:
>>>>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>>>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>>>>
>>>>> This is what I mean:
>>>>>
>>>>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>>>>> Author: David Hildenbrand <david@redhat.com>
>>>>> Date: Fri Sep 5 17:22:01 2025 +0200
>>>>>
>>>>> mm/huge_memory: always add THPs to the deferred split list
>>>>> When disabling the shrinker and then re-enabling it, any anon THPs
>>>>> allocated in the meantime.
>>>>> That also means that we cannot disable the shrinker as default during
>>>>> boot, because we would miss some THPs later when enabling it.
>>>>> So always add them to the deferred split list, and only skip the
>>>>> scanning if the shrinker is disabled.
>>>>> This is effectively what we do on all systems out there already, unless
>>>>> they disable the shrinker.
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index aa3ed7a86435b..3ee857c1d3754 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>>> if (folio_order(folio) <= 1)
>>>>> return;
>>>>> - if (!partially_mapped && !split_underused_thp)
>>>>> - return;
>>>>> -
>>>>> /*
>>>>> * Exclude swapcache: originally to avoid a corrupt deferred split
>>>>> * queue. Nowadays that is fully prevented by memcg1_swapout();
>>>>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>> bool underused = false;
>>>>> if (!folio_test_partially_mapped(folio)) {
>>>>> + if (!split_underused_thp)
>>>>> + goto next;
>>>>> underused = thp_underused(folio);
>>>>> if (!underused)
>>>>> goto next;
>>>>>
>>>>>
>>>>
>>>>
>>>> Thanks for sending the diff! Now I know what you meant lol.
>>>>
>>>> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
>>>> very ineffective?
>>>
>>> I hope you realize that that's the default on each and every system out there that ships this feature :)
>>>
>>
>> Yes, I made it default :)
>>
>> I am assuming people either keep shrinker enabled (which is an extremely large majority as its default), or disable shrinker
>> and they dont flip flop between the 2 settings.
>> There are 2 scenarios for the above patch:
>>
>> - shrinker is enabled (default): the above patch wont make a difference.
>> - shrinker is disabled: the above patch makes splitting partially mapped folios inefficient.
>>
>> I didnt talk about the shrinker enabled case as it was a no-op and just talked about the shrinker disabled
>> case.
>
>
> Yeah, and I am saying that all you raised as a concern would be a problem already today in all default setups (-> 99.999999%). :)
>
> Probably we should not just disable the shrinker during boot, and once enabled, it would only split THPs created afterwards.
>
I probably didnt understand this again lol. Sorry its friday evening :)
split_underused_thp is true at boot time [1]. You are saying we should not disable shrinker during boot, but it is already not
disabled during boot, right?
If someone goes with system default, which is THP shrinker enabled (from boot and runtime), the above patch is a no-op, right?
> With this patch it would also split ones created previously.
>
yes, if someone changes from shrinker being disabled to shrinker being enabled before memory pressure.
[1] https://elixir.bootlin.com/linux/v6.16.4/source/mm/huge_memory.c#L76
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:39 ` David Hildenbrand
@ 2025-09-06 6:35 ` Lance Yang
0 siblings, 0 replies; 31+ messages in thread
From: Lance Yang @ 2025-09-06 6:35 UTC (permalink / raw)
To: David Hildenbrand
Cc: Zi Yan, linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Usama Arif
On Fri, Sep 5, 2025 at 11:17 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.09.25 16:37, Zi Yan wrote:
> > On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
> >
> >> We added an early exit in thp_underused(), probably to avoid scanning
> >> pages when there is no chance for success.
> >>
> >> However, assume we have max_ptes_none = 511 (default).
> >>
> >> Nothing should stop us from freeing all pages part of a THP that
> >> is completely zero (512) and khugepaged will for sure not try to
> >> instantiate a THP in that case (512 shared zeropages).
> >>
> >> This can just trivially happen if someone writes a single 0 byte into a
> >> PMD area, or of course, when data ends up being zero later.
> >>
> >> So let's remove that early exit.
> >>
> >> Do we want to CC stable? Hm, not sure. Probably not urgent.
> >>
> >> Note that, as default, the THP shrinker is active
> >> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> >> THPs are added to the deferred split lists. However, with the
> >> max_ptes_none default we would never scan them. We would not do that. If
> >> that's not desirable, we should just disable the shrinker as default,
> >> also not adding all THPs to the deferred split lists.
> >>
> >> Easy to reproduce:
> >>
> >> 1) Allocate some THPs filled with 0s
> >>
> >> <prog.c>
> >> #include <string.h>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <unistd.h>
> >> #include <sys/mman.h>
> >>
> >> const size_t size = 1024*1024*1024;
> >>
> >> int main(void)
> >> {
> >> size_t offs;
> >> char *area;
> >>
> >> area = mmap(0, size, PROT_READ | PROT_WRITE,
> >> MAP_ANON | MAP_PRIVATE, -1, 0);
> >> if (area == MAP_FAILED) {
> >> printf("mmap failed\n");
> >> exit(-1);
> >> }
> >> madvise(area, size, MADV_HUGEPAGE);
> >>
> >> for (offs = 0; offs < size; offs += getpagesize())
> >> area[offs] = 0;
> >> pause();
> >> }
> >> <\prog.c>
> >>
> >> 2) Trigger the shrinker
> >>
> >> E.g., memory pressure through memhog
> >>
> >> 3) Observe that THPs are not getting reclaimed
> >>
> >> $ cat /proc/`pgrep prog`/smaps_rollup
> >>
> >> Would list ~1GiB of AnonHugePages. With this fix, they would get
> >> reclaimed as expected.
> >>
> >> Fixes: dafff3f4c850 ("mm: split underused THPs")
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >> Cc: Zi Yan <ziy@nvidia.com>
> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> >> Cc: Nico Pache <npache@redhat.com>
> >> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >> Cc: Dev Jain <dev.jain@arm.com>
> >> Cc: Barry Song <baohua@kernel.org>
> >> Cc: Usama Arif <usamaarif642@gmail.com>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> ---
> >> mm/huge_memory.c | 3 ---
> >> 1 file changed, 3 deletions(-)
> >>
> > LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
> >
> > I also notice that thp_underused() checks num_zero_pages directly
> > against khugepaged_max_ptes_none. This means mTHPs will never be regarded
> > as underused. A similar issue you are discussing in Nico’s khugepaged
> > mTHP support. Maybe checks against these khugepaged_max* variables
> > should be calculated based on nr_pages of a large folio, like
> > making these variables a ratio in other discussion.
>
> Yeah, factoring that out and cleaning it up is my next step.
>
> But note that mTHPs are not a candidate for the shrinker right now. (see
> my explanation in reply to Nicos patch)
Right. IIUC, the logic in deferred_split_scan() is gated by
!folio_test_partially_mapped(folio). This creates two paths:
1) Splits for mTHPs that are partially mapped. These set the flag and
corectly bypass thp_underused().
2) Optional shrinker splits for new or collapsed THPs. These don't set the
flag and must go through thp_underused().
So, thp_underused() is only for whole, PMD-sized THPs right now. Please
correct me if my understanding is wrong here ;)
Cheers,
Lance
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:11 [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default David Hildenbrand
` (4 preceding siblings ...)
2025-09-05 15:30 ` Lorenzo Stoakes
@ 2025-09-06 6:39 ` Lance Yang
2025-09-08 2:16 ` Baolin Wang
6 siblings, 0 replies; 31+ messages in thread
From: Lance Yang @ 2025-09-06 6:39 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan,
Baolin Wang, Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Usama Arif
On Fri, Sep 5, 2025 at 10:14 PM David Hildenbrand <david@redhat.com> wrote:
>
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
> #include <string.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> const size_t size = 1024*1024*1024;
>
> int main(void)
> {
> size_t offs;
> char *area;
>
> area = mmap(0, size, PROT_READ | PROT_WRITE,
> MAP_ANON | MAP_PRIVATE, -1, 0);
> if (area == MAP_FAILED) {
> printf("mmap failed\n");
> exit(-1);
> }
> madvise(area, size, MADV_HUGEPAGE);
>
> for (offs = 0; offs < size; offs += getpagesize())
> area[offs] = 0;
> pause();
> }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Good catch! Feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cheers,
Lance
> ---
> mm/huge_memory.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd74189..aa3ed7a86435b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
> void *kaddr;
> int i;
>
> - if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> - return false;
> -
> for (i = 0; i < folio_nr_pages(folio); i++) {
> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
> --
> 2.50.1
>
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:11 [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default David Hildenbrand
` (5 preceding siblings ...)
2025-09-06 6:39 ` Lance Yang
@ 2025-09-08 2:16 ` Baolin Wang
6 siblings, 0 replies; 31+ messages in thread
From: Baolin Wang @ 2025-09-08 2:16 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Liam R. Howlett,
Nico Pache, Ryan Roberts, Dev Jain, Barry Song, Usama Arif
On 2025/9/5 22:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
> #include <string.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/mman.h>
>
> const size_t size = 1024*1024*1024;
>
> int main(void)
> {
> size_t offs;
> char *area;
>
> area = mmap(0, size, PROT_READ | PROT_WRITE,
> MAP_ANON | MAP_PRIVATE, -1, 0);
> if (area == MAP_FAILED) {
> printf("mmap failed\n");
> exit(-1);
> }
> madvise(area, size, MADV_HUGEPAGE);
>
> for (offs = 0; offs < size; offs += getpagesize())
> area[offs] = 0;
> pause();
> }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
Thanks for sorting this out. Make sense to me.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> mm/huge_memory.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd74189..aa3ed7a86435b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
> void *kaddr;
> int i;
>
> - if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> - return false;
> -
> for (i = 0; i < folio_nr_pages(folio); i++) {
> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 17:26 ` Usama Arif
@ 2025-09-08 9:14 ` David Hildenbrand
0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-09-08 9:14 UTC (permalink / raw)
To: Usama Arif, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song
On 05.09.25 19:26, Usama Arif wrote:
>
>
> On 05/09/2025 17:55, David Hildenbrand wrote:
>> On 05.09.25 18:47, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 16:58, David Hildenbrand wrote:
>>>> On 05.09.25 17:53, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 16:28, David Hildenbrand wrote:
>>>>>> On 05.09.25 17:16, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>>>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>>>>>> [...]
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>>>>>> so that its considered for splitting.
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>>>>>
>>>>>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>>>>>
>>>>>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>>>>>
>>>>>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>>>>>
>>>>>>>>>> I don't really see why we should do that.
>>>>>>>>>>
>>>>>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>>>>>
>>>>>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>>>>>
>>>>>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>>>>>
>>>>>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>>>>>
>>>>>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>>>>>
>>>>>>>> What's your take?
>>>>>>>>
>>>>>>>
>>>>>>> hmm I probably didnt understand what you meant to say here:
>>>>>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>>>>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>>>>>
>>>>>> This is what I mean:
>>>>>>
>>>>>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>>>>>> Author: David Hildenbrand <david@redhat.com>
>>>>>> Date: Fri Sep 5 17:22:01 2025 +0200
>>>>>>
>>>>>> mm/huge_memory: always add THPs to the deferred split list
>>>>>> When disabling the shrinker and then re-enabling it, any anon THPs
>>>>>> allocated in the meantime.
>>>>>> That also means that we cannot disable the shrinker as default during
>>>>>> boot, because we would miss some THPs later when enabling it.
>>>>>> So always add them to the deferred split list, and only skip the
>>>>>> scanning if the shrinker is disabled.
>>>>>> This is effectively what we do on all systems out there already, unless
>>>>>> they disable the shrinker.
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index aa3ed7a86435b..3ee857c1d3754 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>>>> if (folio_order(folio) <= 1)
>>>>>> return;
>>>>>> - if (!partially_mapped && !split_underused_thp)
>>>>>> - return;
>>>>>> -
>>>>>> /*
>>>>>> * Exclude swapcache: originally to avoid a corrupt deferred split
>>>>>> * queue. Nowadays that is fully prevented by memcg1_swapout();
>>>>>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>> bool underused = false;
>>>>>> if (!folio_test_partially_mapped(folio)) {
>>>>>> + if (!split_underused_thp)
>>>>>> + goto next;
>>>>>> underused = thp_underused(folio);
>>>>>> if (!underused)
>>>>>> goto next;
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> Thanks for sending the diff! Now I know what you meant lol.
>>>>>
>>>>> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
>>>>> very ineffective?
>>>>
>>>> I hope you realize that that's the default on each and every system out there that ships this feature :)
>>>>
>>>
>>> Yes, I made it default :)
>>>
>>> I am assuming people either keep shrinker enabled (which is an extremely large majority as its default), or disable shrinker
>>> and they dont flip flop between the 2 settings.
>>> There are 2 scenarios for the above patch:
>>>
>>> - shrinker is enabled (default): the above patch wont make a difference.
>>> - shrinker is disabled: the above patch makes splitting partially mapped folios inefficient.
>>>
>>> I didnt talk about the shrinker enabled case as it was a no-op and just talked about the shrinker disabled
>>> case.
>>
>>
>> Yeah, and I am saying that all you raised as a concern would be a problem already today in all default setups (-> 99.999999%). :)
>>
>> Probably we should not just disable the shrinker during boot, and once enabled, it would only split THPs created afterwards.
>>
>
> I probably didnt understand this again lol. Sorry its friday evening :)
>
> split_underused_thp is true at boot time [1]. You are saying we should not disable shrinker during boot, but it is already not
> disabled during boot, right?
It's all a mess, sorry, currently it's "enabled" as default but actually
"disabled". Let me see if I can clean all that up.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 15:36 ` David Hildenbrand
@ 2025-09-08 11:32 ` Lorenzo Stoakes
0 siblings, 0 replies; 31+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Andrew Morton, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Usama Arif
On Fri, Sep 05, 2025 at 05:36:01PM +0200, David Hildenbrand wrote:
> On 05.09.25 17:30, Lorenzo Stoakes wrote:
> > On Fri, Sep 05, 2025 at 04:11:37PM +0200, David Hildenbrand wrote:
> > > We added an early exit in thp_underused(), probably to avoid scanning
> > > pages when there is no chance for success.
> > >
> > > However, assume we have max_ptes_none = 511 (default).
> > >
> > > Nothing should stop us from freeing all pages part of a THP that
> >
> > Freeing 'all pages which are part of a THP' rather?
>
> I'm German, I don't know what I'm doing. :D
Whereas I have no excuse :P
>
> >
> > > is completely zero (512) and khugepaged will for sure not try to
> >
> > that is -> that are?
>
> the THP is zero?
I mean "all pages part of a THP that is completely zero' -> "all pages part of a
THP that are completely zero', I'm referring to the 'all pages' bit, but I guess
you mean the THP is entirely zero.
So maybe rephrase to 'all pages which are part of a zero THP' or similar? :)
>
> >
> > > instantiate a THP in that case (512 shared zeropages).
> >
> > But if you write faulted they're not the zero page? And how are they shared? I
> > mean be being dumb here.
>
> The shrinker will replace zeroed pages by the shared zeropages.
Ah thanks, I was being dumb :) too stuck in vanilla mm land...
>
> >
> > >
> > > This can just trivially happen if someone writes a single 0 byte into a
> > > PMD area, or of course, when data ends up being zero later.
> > >
> > > So let's remove that early exit.
> > >
> > > Do we want to CC stable? Hm, not sure. Probably not urgent.
> >
> > Surely this is worth having?
>
> Alrighty, let me cc stable.
Thanks!
>
> >
> > >
> > > Note that, as default, the THP shrinker is active
> > > (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> > > THPs are added to the deferred split lists. However, with the
> > > max_ptes_none default we would never scan them. We would not do that. If
> >
> > Nit but 'we would not do that' is kinda duplicative here :)
>
> Yeah, fixed it already while rewriting: this was meant to be "would now".
Cheers!
>
>
> Thanks!
>
>
> --
> Cheers
>
> David / dhildenb
>
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-05 14:40 ` Usama Arif
2025-09-05 14:46 ` David Hildenbrand
@ 2025-09-14 14:04 ` Dev Jain
2025-09-15 8:51 ` David Hildenbrand
1 sibling, 1 reply; 31+ messages in thread
From: Dev Jain @ 2025-09-14 14:04 UTC (permalink / raw)
To: Usama Arif, David Hildenbrand, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Barry Song
>> ---
>> mm/huge_memory.c | 3 ---
>> 1 file changed, 3 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 26cedfcd74189..aa3ed7a86435b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>> void *kaddr;
>> int i;
>>
>> - if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>> - return false;
>> -
> I do agree with your usecase, but I am really worried about the amount of
> work and cpu time the THP shrinker will consume when max_ptes_none is 511
> (I dont have any numbers to back up my worry :)), and its less likely that
> we will have these completely zeroed out THPs (again no numbers to back up
> this statement). We have the huge_zero_folio as well which is installed on read.
How about just doing a memcmp() between huge_zero_folio and folio? We know
exactly how this folio looks like, in case of max_ptes_none == 511, if it
is to be eligible for shrinking.
>
>> for (i = 0; i < folio_nr_pages(folio); i++) {
>> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
2025-09-14 14:04 ` Dev Jain
@ 2025-09-15 8:51 ` David Hildenbrand
0 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2025-09-15 8:51 UTC (permalink / raw)
To: Dev Jain, Usama Arif, linux-kernel
Cc: linux-mm, Andrew Morton, Lorenzo Stoakes, Zi Yan, Baolin Wang,
Liam R. Howlett, Nico Pache, Ryan Roberts, Barry Song
On 14.09.25 16:04, Dev Jain wrote:
>>> ---
>>> mm/huge_memory.c | 3 ---
>>> 1 file changed, 3 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 26cedfcd74189..aa3ed7a86435b 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>>> void *kaddr;
>>> int i;
>>>
>>> - if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>>> - return false;
>>> -
>> I do agree with your usecase, but I am really worried about the amount of
>> work and cpu time the THP shrinker will consume when max_ptes_none is 511
>> (I dont have any numbers to back up my worry :)), and its less likely that
>> we will have these completely zeroed out THPs (again no numbers to back up
>> this statement). We have the huge_zero_folio as well which is installed on read.
>
FWIW, I am still hesitant on this patch because I think the whole
deferred split lists + interaction with the shrinker should be cleaned
up along this patch.
> How about just doing a memcmp() between huge_zero_folio and folio? We know
> exactly how this folio looks like, in case of max_ptes_none == 511, if it
> is to be eligible for shrinking.
I wouldn't really want to optimize for this case, especially given that
(a) I want to remove all direct dependencies on max_ptes_none and any
special values.
(b) The huge zero folio is not always around
(c) The kmap local prevents us from doing that (in particular on 32bit).
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2025-09-15 8:51 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-05 14:11 [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default David Hildenbrand
2025-09-05 14:37 ` Zi Yan
2025-09-05 14:39 ` David Hildenbrand
2025-09-06 6:35 ` Lance Yang
2025-09-05 14:43 ` Usama Arif
2025-09-05 14:47 ` David Hildenbrand
2025-09-05 14:58 ` Usama Arif
2025-09-05 14:40 ` Usama Arif
2025-09-05 14:46 ` David Hildenbrand
2025-09-05 14:53 ` Usama Arif
2025-09-05 14:58 ` David Hildenbrand
2025-09-05 15:01 ` Usama Arif
2025-09-05 15:04 ` David Hildenbrand
2025-09-05 15:16 ` Usama Arif
2025-09-05 15:28 ` David Hildenbrand
2025-09-05 15:53 ` Usama Arif
2025-09-05 15:57 ` Usama Arif
2025-09-05 15:58 ` David Hildenbrand
2025-09-05 16:47 ` Usama Arif
2025-09-05 16:55 ` David Hildenbrand
2025-09-05 17:26 ` Usama Arif
2025-09-08 9:14 ` David Hildenbrand
2025-09-14 14:04 ` Dev Jain
2025-09-15 8:51 ` David Hildenbrand
2025-09-05 15:02 ` Usama Arif
2025-09-05 15:11 ` David Hildenbrand
2025-09-05 15:30 ` Lorenzo Stoakes
2025-09-05 15:36 ` David Hildenbrand
2025-09-08 11:32 ` Lorenzo Stoakes
2025-09-06 6:39 ` Lance Yang
2025-09-08 2:16 ` Baolin Wang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).