Linux-mm Archive on lore.kernel.org
* [PATCH 0/5] mm: Support selecting doing direct COW for anonymous pmd entry
@ 2026-05-01  5:55 Luka Bai
  2026-05-01  5:55 ` [PATCH 1/5] mm: add basic madvise helpers and branch for THP setup Luka Bai
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Luka Bai @ 2026-05-01  5:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Jonathan Corbet, Shuah Khan, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Zi Yan, Baolin Wang, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Lance Yang, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Arnd Bergmann, Kairui Song, linux-kernel, linux-arch, linux-doc,
	Luka Bai

Copy-on-write support for anonymous pmd-level THP is simple right now:
first we check whether the folio can be used exclusively by the
faulting process. If it can (the folio's refcount is 1 after trying to
free the swapcache, or the AnonExclusive page flag is set), we reuse it
directly with little further handling. If it cannot, we split the pmd
into 512 4K ptes and do copy-on-write only for the specific 4K page
that was faulted on.

This logic is memory efficient: for most workloads we do not want to
allocate 2M of new memory on a small write. However, it also means the
original 2M mapping for the process is suddenly split on a write, which
can cause performance thrashing. For example, if process A and process
B share an anonymous 2M pmd and process B does a write, B's page table
mapping changes from 1 pmd entry into 512 4K pte entries at once, so
the TLB benefit suddenly "vanishes" for process B, which can sometimes
cause an observable performance degradation. After that, we can only
wait for khugepaged to collapse this area and merge the pmd back, which
does not happen easily.

In addition to the problem above, this logic also creates a deficiency
for THP itself. Currently THP is just a "best-effort" choice with no
"certainty": THP is easily split into multiple small pages on common
paths like reclaim and COW. A transparent split can cause throughput
fluctuation for some workloads. For these workloads, we may want to
give THP some "certainty", just like hugetlbfs. The effect we want is:
after some customized setup, as long as the system has a usable folio
and the virtual memory alignment permits (or we configure it to), we
always use THP for the range, and the system never splits it unless
the user asks it to.

This patchset addresses both points above. First, we add pmd-level THP
COW support by revising the code in do_huge_pmd_wp_page. We add a
switch for it because different workloads need different resources:
for some, memory saving matters more than the 2M TLB gain. The switch
is very similar to "enabled" and "shmem_enabled" in the
transparent_hugepage sysfs path; THP COW is only enabled when THP
itself is enabled globally or by madvise. We also add basic THP setup
helpers and a branch in the madvise path, and add the THP COW choice
to them for more fine-grained setup. For now the helpers only support
copy-on-write, but in the future we may be able to add more types of
THP configuration, such as swapping.

Patch Details:
========
* Patch 1 adds the basic THP setup helpers and branch in the madvise
  path, and adds the THP COW parameter to them.
* Patch 2 adds the THP COW sysfs interface; the logic is very similar
  to THP's "enabled" and "shmem_enabled".
* Patch 3 adds the helpers used in the actual COW path to decide
  whether to do pmd-level THP COW.
* Patch 4 reworks map_anon_folio_pmd_nopf and map_anon_folio_pmd_pf so
  they can map the newly copied folio when the fault flags include
  FAULT_FLAG_UNSHARE.
* Patch 5 adds the actual support for pmd-level THP COW, using all the
  switches and helpers from the above 4 patches for strategy control.

Thanks for reading. Comments and suggestions are very welcome!

Signed-off-by: Luka Bai <lukabai@tencent.com>
---
Luka Bai (5):
      mm: add basic madvise helpers and branch for THP setup
      mm: add pmd level THP COW parameter in sysfs
      mm: add pmd level THP COW judgement helpers
      mm: enable map_anon_folio_pmd_nopf to handle unshare
      mm: support choosing to do THP COW for anonymous pmd entry

 .../testing/sysfs-kernel-mm-transparent-hugepage   |   1 +
 Documentation/admin-guide/mm/transhuge.rst         |  27 +++
 include/linux/huge_mm.h                            |  45 ++++-
 include/linux/mm.h                                 |  19 ++
 include/uapi/asm-generic/mman-common.h             |   9 +
 mm/huge_memory.c                                   | 198 ++++++++++++++++++---
 mm/khugepaged.c                                    |   8 +-
 mm/madvise.c                                       |  25 +++
 8 files changed, 308 insertions(+), 24 deletions(-)
---
base-commit: 41cd9e3d23b8fd9e6c3c0311e9cb0304442c6141
change-id: 20260501-thp_cow-94873ed30793

Best regards,
--  
Luka Bai <lukabai@tencent.com>



* Re: [PATCH 0/5] mm: Support selecting doing direct COW for anonymous pmd entry
@ 2026-05-01  8:33 lukafocus
  0 siblings, 0 replies; 14+ messages in thread
From: lukafocus @ 2026-05-01  8:33 UTC (permalink / raw)
  To: david
  Cc: akpm, arnd, baohua, baolin.wang, corbet, dev.jain, jannh, kasong,
	lance.yang, liam, linux-arch, linux-doc, linux-kernel, linux-mm,
	ljs, lukabai, lukafocus, mhocko, npache, rppt, ryan.roberts,
	skhan, surenb, vbabka, ziy


Hi David,

Thanks for your review and opinion :) I really appreciate it!

> > Copy-on-write support for anonymous pmd-level THP is simple right now:
> > first we check whether the folio can be used exclusively by the
> > faulting process. If it can (the folio's refcount is 1 after trying to
> > free the swapcache, or the AnonExclusive page flag is set), we reuse it
> > directly with little further handling. If it cannot, we split the pmd
> > into 512 4K ptes and do copy-on-write only for the specific 4K page
> > that was faulted on.
> > 
> > This logic is memory efficient: for most workloads we do not want to
> > allocate 2M of new memory on a small write. However, it also means the
> > original 2M mapping for the process is suddenly split on a write, which
> > can cause performance thrashing. For example, if process A and process
> > B share an anonymous 2M pmd and process B does a write, B's page table
> > mapping changes from 1 pmd entry into 512 4K pte entries at once, so
> > the TLB benefit suddenly "vanishes" for process B, which can sometimes
> > cause an observable performance degradation. After that, we can only
> > wait for khugepaged to collapse this area and merge the pmd back, which
> > does not happen easily.

> You probably know that, historically, we did exactly what you describe in this
> patch set. It was rather bad regarding memory waste and COW latency, so we
> switched to the current model.

> Note that there was a recent related discussion for executables, which was rejected:

> https://lore.kernel.org/r/20251226100337.4171191-1-zhangqilong3@huawei.com
Yes, I know this history, and I know that it will cost some memory or latency;
that's why I want to add a switch to make it configurable :) But I didn't know
about the discussion in
https://lore.kernel.org/r/20251226100337.4171191-1-zhangqilong3@huawei.com,
so I'll check it out, thanks for informing me :).

> > 
> > In addition to the problem above, this logic also creates a deficiency
> > for THP itself. Currently THP is just a "best-effort" choice with no
> > "certainty": THP is easily split into multiple small pages on common
> > paths like reclaim and COW. A transparent split can cause throughput
> > fluctuation for some workloads. For these workloads, we may want to
> > give THP some "certainty", just like hugetlbfs,

> There are no such guarantees, though. And we wouldn't want to commit to any such
> guarantees today. For example, simple page migration can split the folio.
> Allocation failures will fall back to small pages, etc.

> If you need guarantees, use hugetlb for now.
The reason I want to use THP over hugetlb is that I actually need reclamation
for my workload :). Many processes in my workload need 2M-aligned memory, and
we want to reclaim it automatically when a process no longer needs it. But as
far as I know, hugetlbfs cannot do passive reclamation (only active madvise by
the processes themselves), while with THP the hugepages can easily get split.
That's why I would like to add some certainty to THP and use it as the backend
for these processes: THP is very well integrated with swapping and the other
subsystems. From what I checked, the most common cases for splitting a THP are
COW and swapping, so I am trying to handle these two scenarios (coincidentally,
PMD swapping was committed in
https://lore.kernel.org/all/D3F08F85-76E0-4C5A-ABA1-537C68E038B8@nvidia.com/
a few days earlier, which is a great implementation :) ).

> > The effect
> > we want is: after some customized setup, as long as the system has a
> > usable folio and the virtual memory alignment permits (or we configure
> > it to), we always use THP for the range, and the system never splits
> > it unless the user asks it to.
> > 
> > This patchset addresses both points above. First, we add pmd-level
> > THP COW support by revising the code in do_huge_pmd_wp_page. We add
> > a switch for it because different workloads need different resources,

> The switch is bad, and we won't accept any toggle like that. A system-wide
> setting does not make sense for such behavior.

> A per-VMA flag? Maybe, but I expect pushback as well, as it is way too specific.
> So we'd have to find some concept that abstracts these semantics. But I expect
> pushback as well.

> We messed up enough with toggles in THP space, unfortunately.

> Also, anything that only works for PMD-sized THPs is a warning sign in 2026 :)

> You don't really raise any concrete use cases or performance numbers for these
> use cases. Some details about applications that use fork() and rely on such
> behavior would be helpful.
Oh, the reason I added a global switch is also the scenario I mentioned above:
I want those processes to always use PMD-sized pages as the backend to
guarantee performance. COW is truly not as common as swap out/in, but it can
happen sometimes and can hurt the performance of these processes, and setting
it globally for the system is more convenient for my situation :).
I don't actually have a workload-level performance test set right now, since
COW doesn't necessarily happen in my workload; it just can happen sometimes,
and the workload itself isn't finished yet. A pure performance comparison
between 512 pte COWs and 1 pmd COW is easy, but I guess that doesn't mean too
much. Still, I think PMD COW may become a more common case in the future, for
example with pmd-level KSM support? :)
As for PMD-sized THPs: I'm also considering adding more support for mTHP if
upstream considers it useful, and for pud-sized THPs as well. :)
MADV_DONTFORK and MADV_COLLAPSE are nice options :), but the former seems a
little wasteful :), and while the latter can largely solve the fork situation,
it seems it cannot solve situations like pmd swap-in for pmd pages mapped by
two processes. :)

Look forward to your further opinion, thanks!
Best,
Luka



end of thread, other threads:[~2026-05-03  7:03 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-01  5:55 [PATCH 0/5] mm: Support selecting doing direct COW for anonymous pmd entry Luka Bai
2026-05-01  5:55 ` [PATCH 1/5] mm: add basic madvise helpers and branch for THP setup Luka Bai
2026-05-01  5:55 ` [PATCH 2/5] mm: add pmd level THP COW parameter in sysfs Luka Bai
2026-05-01  5:55 ` [PATCH 3/5] mm: add pmd level THP COW judgement helpers Luka Bai
2026-05-01  5:55 ` [PATCH 4/5] mm: enable map_anon_folio_pmd_nopf to handle unshare Luka Bai
2026-05-01  5:55 ` [PATCH 5/5] mm: support choosing to do THP COW for anonymous pmd entry Luka Bai
2026-05-01  7:11   ` David Hildenbrand (Arm)
2026-05-01 15:01     ` Luka Bai
2026-05-01  7:07 ` [PATCH 0/5] mm: Support selecting doing direct " David Hildenbrand (Arm)
2026-05-01 16:16   ` Luka Bai
2026-05-01 18:30     ` David Hildenbrand (Arm)
2026-05-02  5:06       ` Luka Bai
2026-05-03  7:03 ` [syzbot ci] " syzbot ci
