From: "Zach O'Keefe" <zokeefe@google.com>
To: Alex Shi <alex.shi@linux.alibaba.com>,
David Hildenbrand <david@redhat.com>,
David Rientjes <rientjes@google.com>,
Matthew Wilcox <willy@infradead.org>,
Michal Hocko <mhocko@suse.com>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Peter Xu <peterx@redhat.com>, SeongJae Park <sj@kernel.org>,
Song Liu <songliubraving@fb.com>,
Vlastimil Babka <vbabka@suse.cz>, Yang Shi <shy828301@gmail.com>,
Zi Yan <ziy@nvidia.com>,
linux-mm@kvack.org
Cc: Andrea Arcangeli <aarcange@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
Arnd Bergmann <arnd@arndb.de>,
Axel Rasmussen <axelrasmussen@google.com>,
Chris Kennelly <ckennelly@google.com>,
Chris Zankel <chris@zankel.net>, Helge Deller <deller@gmx.de>,
Hugh Dickins <hughd@google.com>,
Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
"James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>,
Jens Axboe <axboe@kernel.dk>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Matt Turner <mattst88@gmail.com>,
Max Filippov <jcmvbkbc@gmail.com>,
Miaohe Lin <linmiaohe@huawei.com>,
Minchan Kim <minchan@kernel.org>,
Patrick Xia <patrickx@google.com>,
Pavel Begunkov <asml.silence@gmail.com>,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
"Zach O'Keefe" <zokeefe@google.com>
Subject: [PATCH v3 00/12] mm: userspace hugepage collapse
Date: Tue, 26 Apr 2022 07:44:00 -0700
Message-ID: <20220426144412.742113-1-zokeefe@google.com>
Introduction
--------------------------------
This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.
This idea was introduced by David Rientjes[1], and the semantics and
implementation were introduced and discussed in a previous PATCH RFC[2].
Interface
--------------------------------
The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.
process_madvise(2)
Performs a synchronous collapse of the native pages
mapped by the list of iovecs into transparent hugepages.
The system-wide THP sysfs settings as well as the VMA flags of the
memory range being collapsed are both respected for the purposes
of determining THP eligibility.
THP allocation may enter direct reclaim and/or compaction.
When a range spans multiple VMAs, the semantics of the collapse
over each VMA are independent of the others.
Caller must have CAP_SYS_ADMIN if not acting on self.
Return value follows existing process_madvise(2) conventions. A
“success” indicates that all hugepage-sized/aligned regions
covered by the provided range were either successfully
collapsed, or were already pmd-mapped THPs.
madvise(2)
Equivalent to process_madvise(2) on self, with 0 returned on
“success”.
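A minimal userspace sketch of the madvise(2) form described above. It
rests on several assumptions that are not part of this cover letter: the
MADV_COLLAPSE value is the one found in the asm-generic uapi header as
later merged upstream (alpha, mips, parisc and xtensa carry their own
definitions), pmd-sized hugepages are 2 MiB, and the helper names
hpage_inner_span() and collapse_self() are hypothetical. The first helper
only makes explicit which hugepage-sized/aligned regions a range covers;
the kernel performs its own range handling.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: asm-generic value from later upstream headers. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2 MiB pmd-sized hugepages (e.g. x86-64, 4 KiB base pages). */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Shrink [addr, addr + len) to the hugepage-aligned span it fully covers.
 * Stores the span start in *out and returns its length (0 if the range
 * does not cover any whole hugepage-aligned region).
 */
static size_t hpage_inner_span(uintptr_t addr, size_t len, uintptr_t *out)
{
	uintptr_t mask = ~(uintptr_t)(HPAGE_SIZE - 1);
	uintptr_t start = (addr + HPAGE_SIZE - 1) & mask; /* round up */
	uintptr_t end = (addr + len) & mask;              /* round down */

	assert(out != NULL);
	*out = start;
	return end > start ? (size_t)(end - start) : 0;
}

/*
 * Hypothetical wrapper: request a synchronous collapse of the covered
 * hugepage regions in the calling process. Returns 0 on "success" and
 * -errno otherwise (-EINVAL on kernels that lack MADV_COLLAPSE).
 */
static int collapse_self(void *addr, size_t len)
{
	uintptr_t start;
	size_t span = hpage_inner_span((uintptr_t)addr, len, &start);

	if (span == 0)
		return -EINVAL;
	if (madvise((void *)start, span, MADV_COLLAPSE) != 0)
		return -errno;
	return 0;
}
```

A caller would pass any range it wants backed by THPs, for example
collapse_self(buf, buf_len), and treat a nonzero return as "not all
covered regions are pmd-mapped", falling back to native pages.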
Current Use-Cases
--------------------------------
(1) Immediately back executable text by THPs. Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system which might impair services from serving at their full rated
load after (re)starting. Tricks like mremap(2)'ing text onto
anonymous memory to immediately realize iTLB performance prevent
page sharing and demand paging, and losing either increases steady
state memory footprint. With MADV_COLLAPSE, we get the best of both
worlds: Peak upfront performance and lower RAM footprints. Note
that subsequent support for file-backed memory is required here.
(2) malloc() implementations that manage memory in hugepage-sized
chunks, but sometimes subrelease memory back to the system in
native-sized chunks via MADV_DONTNEED, zapping the pmd. Later, when
the memory is hot, the implementation could madvise(MADV_COLLAPSE)
to re-back the memory by THPs to regain hugepage coverage and dTLB
performance. TCMalloc is such an implementation that could benefit
from this[3]. A prior study of Google internal workloads during
evaluation of Temeraire, a hugepage-aware enhancement to TCMalloc,
showed that nearly 20% of all cpu cycles were spent in dTLB stalls,
and that increasing hugepage coverage by even a small amount can help
with that[4].
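The allocator pattern in (2) can be sketched roughly as follows. This is
an illustration under stated assumptions, not TCMalloc's implementation:
struct hp_chunk, chunk_subrelease() and chunk_reback() are hypothetical
names, the hugepage and native page sizes are assumed to be 2 MiB and
4 KiB, and the MADV_COLLAPSE value is taken from later upstream
asm-generic headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: asm-generic value from later upstream headers. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

#define HPAGE_SIZE  ((size_t)2 << 20) /* assumed pmd-sized hugepage */
#define NATIVE_SIZE ((size_t)4096)    /* assumed native page size */

/* Hypothetical per-chunk bookkeeping for a hugepage-aware allocator. */
struct hp_chunk {
	void *base;  /* hugepage-aligned start of the chunk */
	bool broken; /* true once any native page was subreleased */
};

/* Return native pages to the system; the range loses its pmd mapping. */
static int chunk_subrelease(struct hp_chunk *c, size_t off, size_t len)
{
	assert(c != NULL);
	if (off % NATIVE_SIZE || len == 0 || len % NATIVE_SIZE ||
	    off + len > HPAGE_SIZE)
		return -1;
	if (madvise((char *)c->base + off, len, MADV_DONTNEED) != 0)
		return -1;
	c->broken = true;
	return 0;
}

/* Once the chunk is hot again, try to re-back it by a THP. */
static int chunk_reback(struct hp_chunk *c)
{
	assert(c != NULL);
	if (!c->broken)
		return 0; /* never subreleased, nothing to do */
	if (madvise(c->base, HPAGE_SIZE, MADV_COLLAPSE) != 0)
		return -1; /* leave the chunk on native pages */
	c->broken = false;
	return 0;
}
```

A failed chunk_reback() is not fatal in this sketch; the chunk simply
stays marked as broken so that a later attempt remains possible.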
Future work
--------------------------------
Only private anonymous memory is supported by this series. File and
shmem memory support will be added later.
One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps. For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[5].
Sequence of Patches
--------------------------------
Patches 1-4 perform refactoring of collapse logic within khugepaged.c
and introduce the notion of a collapse context.
Patches 5-9 introduce MADV_COLLAPSE, do some renaming, add support
so that the MADV_COLLAPSE context has the eligibility and allocation
semantics referenced above, and add process_madvise(2) support.
Patches 10-12 add selftests to test the new functionality.
Applies against next-20220426.
Changelog
--------------------------------
v2 -> v3:
* Collapse semantics have changed: the gfp flags used for hugepage allocation
now are independent of khugepaged.
* Cover-letter: add primary use-cases and update description of collapse
semantics.
* 'mm/khugepaged: make hugepage allocation context-specific'
-> Added .gfp operation to struct collapse_control
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
-> Added madvise context .gfp implementation.
-> Set scan_result appropriately on early exit due to mm exit or vma
revalidation.
-> Reword patch description
* Rebased onto next-20220426
v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixes issues reported by kernel test robot <lkp@intel.com>
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
-> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
-> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
-> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
-> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
-> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
-> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse functions'
-> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414
RFC -> v1:
* The series was significantly reworked from RFC and most patches are entirely
new or reworked.
* Collapse eligibility criteria have changed: MADV_COLLAPSE now respects
VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage allocation
now match that of khugepaged for the same VMA, instead of the gfp flags used
at-fault for calling process for the VMA.
* Collapse semantics have changed: The collapse semantics for multiple VMAs
spanning a single MADV_COLLAPSE call are now independent, whereas before the
idea was to allow direct reclaim/compaction if any spanned VMA permitted so.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
MADV_F_COLLAPSE_DEFRAG have been removed.
* Implementation change: the RFC implemented collapse over a range of hugepages
in a batched-fashion with the aim of doing multiple page table updates inside
a single mmap_lock write. This has been changed, and the implementation now
collapses each hugepage-aligned/sized region iteratively. This was motivated
by an experiment which showed that, when multiple threads were concurrently
faulting during a MADV_COLLAPSE operation, mean and tail latency to acquire
mmap_lock in read for threads in the fault path was improved by using a batch
size of 1 (batch sizes of 1, 8, 16, 32 were tested)[6].
* Added: If a collapse operation fails because a page isn't found on the LRU, do
a lru_add_drain_all() and retry.
* Added: selftests
[1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
[2] https://lore.kernel.org/linux-mm/20220308213417.1407042-1-zokeefe@google.com/
[3] https://github.com/google/tcmalloc/tree/master/tcmalloc
[4] https://research.google/pubs/pub50370/
[5] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[6] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/
Zach O'Keefe (13):
mm/khugepaged: separate hugepage preallocation and cleanup
mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
mm/khugepaged: add struct collapse_control
mm/khugepaged: make hugepage allocation context-specific
mm/khugepaged: add struct collapse_result
mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
mm/khugepaged: remove khugepaged prefix from shared collapse functions
mm/khugepaged: add flag to ignore khugepaged_max_ptes_*
mm/khugepaged: add flag to ignore page young/referenced requirement
mm/madvise: add MADV_COLLAPSE to process_madvise()
selftests/vm: modularize collapse selftests
selftests/vm: add MADV_COLLAPSE collapse context to selftests
selftests/vm: add test to verify recollapse of THPs
include/linux/huge_mm.h | 12 +
include/trace/events/huge_memory.h | 5 +-
include/uapi/asm-generic/mman-common.h | 2 +
mm/internal.h | 1 +
mm/khugepaged.c | 598 ++++++++++++++++--------
mm/madvise.c | 11 +-
mm/rmap.c | 15 +-
tools/testing/selftests/vm/khugepaged.c | 417 +++++++++++------
8 files changed, 702 insertions(+), 359 deletions(-)
--
2.35.1.1178.g4f1659d476-goog
Thread overview: 31+ messages
2022-04-26 14:44 Zach O'Keefe [this message]
2022-04-26 14:44 ` [PATCH v3 01/12] mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP Zach O'Keefe
2022-04-27 0:26 ` Peter Xu
2022-04-27 15:48 ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 02/12] mm/khugepaged: add struct collapse_control Zach O'Keefe
2022-04-27 19:49 ` Peter Xu
2022-04-28 0:19 ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific Zach O'Keefe
2022-04-27 20:04 ` Peter Xu
2022-04-28 0:51 ` Zach O'Keefe
2022-04-28 14:51 ` Peter Xu
2022-04-28 15:37 ` Zach O'Keefe
2022-04-28 15:52 ` Peter Xu
2022-04-26 14:44 ` [PATCH v3 04/12] mm/khugepaged: add struct collapse_result Zach O'Keefe
2022-04-27 20:47 ` Peter Xu
2022-04-28 21:59 ` Zach O'Keefe
2022-04-28 23:21 ` Peter Xu
2022-04-29 16:01 ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 05/12] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 06/12] mm/khugepaged: remove khugepaged prefix from shared collapse functions Zach O'Keefe
2022-04-27 21:10 ` Peter Xu
2022-04-28 22:51 ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 07/12] mm/khugepaged: add flag to ignore khugepaged_max_ptes_* Zach O'Keefe
2022-04-27 21:12 ` Peter Xu
2022-04-29 14:26 ` Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 08/12] mm/khugepaged: add flag to ignore page young/referenced requirement Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 09/12] mm/madvise: add MADV_COLLAPSE to process_madvise() Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 10/12] selftests/vm: modularize collapse selftests Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 11/12] selftests/vm: add MADV_COLLAPSE collapse context to selftests Zach O'Keefe
2022-04-26 14:44 ` [PATCH v3 12/12] selftests/vm: add test to verify recollapse of THPs Zach O'Keefe
2022-04-26 20:23 ` [PATCH v3 00/12] mm: userspace hugepage collapse Andrew Morton