From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
	linux-fsdevel@vger.kernel.org,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	"Tejun Heo" <tj@kernel.org>, "Zefan Li" <lizefan.x@bytedance.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Jonathan Corbet" <corbet@lwn.net>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>
Subject: Re: [PATCH v1 00/17] mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT
Date: Wed, 23 Oct 2024 11:10:40 +0200
Message-ID: <f5d70d2c-b7e6-483a-bc07-48947203e832@redhat.com>
In-Reply-To: <20240829165627.2256514-1-david@redhat.com>

On 29.08.24 18:56, David Hildenbrand wrote:
> RMAP overhaul and optimizations, PTE batching, large mapcount,
> folio_likely_mapped_shared() introduction and optimizations, page_mapcount
> cleanups and preparations ... it's been quite some work to get to this
> point.
> 
> Next up is being able to identify -- without false positives, without
> page-mapcounts and without page table/rmap scanning -- whether a
> large folio is "mapped exclusively" into a single MM, and using that
> information to implement Copy-on-Write reuse and to improve
> folio_likely_mapped_shared() for large folios.
> 
> ... and based on that, finally introducing a kernel config option that
> lets us not use+maintain per-page mapcounts in large folios, improving
> performance of (un)map operations today, taking one step towards
> supporting large folios > PMD_SIZE, and preparing for the bright future
> where we might no longer have a mapcount per page at all.
> 
> The bigger picture was presented at LSF/MM [1].
> 
> This series is effectively a follow-up on my early work from last
> year [2], which proposed a precise way to identify whether a large folio is
> "mapped shared" into multiple MMs or "mapped exclusively" into a single MM.
> 
> While that advanced approach has been simplified and optimized in the
> meantime, let's start with something simpler first -- "certainly mapped
> exclusive" vs. "maybe mapped shared" -- so we can start learning about
> the effects and TODOs that some of the implied changes of losing
> per-page mapcounts have.
> 
> I have plans to exchange the simple approach used in this series for the
> advanced approach at some point, but one important thing to learn first
> is whether the imprecision in the simple approach is relevant in practice.
> 
> 64BIT only; unless enabled in Kconfig, this series should for now not
> have any impact.
> 
> 
> 1) Patch Organization
> =====================
> 
> Patch #1 -> #4: make more room on 64BIT in order-1 folios
> 
> Patch #5 -> #7: prepare for MM owner tracking of large folios
> 
> Patch #8: implement a simple MM owner tracking approach for large folios
> 
> Patch #9: a simple optimization
> 
> Patch #10: COW reuse for PTE-mapped anon THP
> 
> Patch #11 -> #17: introduce and implement CONFIG_NO_PAGE_MAPCOUNT
> 
> 
> 2) MM owner tracking
> ====================
> 
> Similar to my advanced approach [2], we assign each MM a unique 20-bit ID
> ("MM ID"), to be able to squeeze more information in our folios.
> 
> Each large folio can store two MM-ID + mapcount combinations:
> * mm0_id + mm0_mapcount
> * mm1_id + mm1_mapcount
> 
> Combined with the large mapcount, we can reliably identify whether one
> of these MMs is the current owner (-> owns all mappings) or even holds
> all folio references (-> owns all mappings, and all references are from
> mappings).
> 
> Stored MM IDs can only change if the corresponding mapcount is logically
> 0, and if the folio is currently "mapped exclusively".
> 
> As long as only two MMs map folio pages at a time, we can reliably identify
> whether a large folio is "mapped shared" or "mapped exclusively". The
> approach is precise.
> 
> Any MM mapping the folio while two other MMs are already mapping it
> will lead to a "mapped shared" detection, even after all other MMs stopped
> mapping the folio and it is actually "mapped exclusively": we can have
> false positives but never false negatives when detecting "mapped shared".
> 
> So that's where the approach gets imprecise.
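
The two-slot scheme described above can be modeled in plain C. This is a
hypothetical userspace sketch, not the series' implementation: the names
(`folio_model`, `model_map`, etc.) are invented for illustration, MM IDs are
assumed to be >= 1, and the real series additionally restricts when a stored
MM ID may be reassigned. It only mirrors the described logic: track up to two
MMs, detect "certainly mapped exclusive" via the large mapcount, and let a
third concurrent MM degrade detection to "maybe mapped shared":

```c
#include <assert.h>
#include <stdbool.h>

struct folio_model {
	int large_mapcount;   /* total mappings of all folio pages */
	int mm_id[2];         /* the two tracked MM IDs (0 = unused) */
	int mm_mapcount[2];   /* per-MM mapping counts */
};

static int slot_of(struct folio_model *f, int mm_id)
{
	for (int i = 0; i < 2; i++)
		if (f->mm_id[i] == mm_id)
			return i;
	return -1;
}

/* Account one new mapping of a folio page by the given MM. */
static void model_map(struct folio_model *f, int mm_id)
{
	int s = slot_of(f, mm_id);

	f->large_mapcount++;
	if (s < 0) {
		/*
		 * Try to claim a slot whose mapcount is logically 0. The real
		 * series additionally requires the folio to currently be
		 * "mapped exclusively" before changing a stored MM ID; that
		 * constraint is omitted in this toy model.
		 */
		for (int i = 0; i < 2 && s < 0; i++)
			if (f->mm_mapcount[i] == 0)
				s = i;
		if (s < 0)
			return;  /* third concurrent MM: mappings untracked */
		f->mm_id[s] = mm_id;
	}
	f->mm_mapcount[s]++;
}

static void model_unmap(struct folio_model *f, int mm_id)
{
	int s = slot_of(f, mm_id);

	f->large_mapcount--;
	if (s >= 0 && f->mm_mapcount[s] > 0)
		f->mm_mapcount[s]--;
}

/* "Certainly mapped exclusively": this MM accounts for all mappings. */
static bool model_mapped_exclusively(struct folio_model *f, int mm_id)
{
	int s = slot_of(f, mm_id);

	return s >= 0 && f->mm_mapcount[s] == f->large_mapcount;
}
```

Note how the imprecision emerges naturally: once a third MM maps while both
slots are occupied, its mappings are untracked, so the exclusivity check keeps
failing for it (a false positive) even after the other MMs fully unmap, while
a false "exclusive" result can never be produced.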
> 
> For now, we use a bit-spinlock to sync the large mapcount + MM IDs + MM
> mapcounts, and make sure to keep the machinery fast so as not to degrade
> (un)map performance too much: for example, we only use a single atomic
> operation (when grabbing the bit-spinlock), just like we already do when
> updating the large mapcount.
> 
> In the future, we might be able to use an arch_spin_lock(), but that's
> future work.
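
The "single atomic" point can be illustrated with a userspace bit-spinlock
sketch using C11 atomics. This is not the kernel's bit_spin_lock and the names
are invented for illustration; it just shows the idea that acquiring the lock
is one atomic read-modify-write on a word whose remaining bits can keep
holding other state:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Try to take the lock bit; one atomic RMW, old value tells us if it was held. */
static bool bit_trylock(atomic_ulong *word, unsigned int bit)
{
	unsigned long mask = 1UL << bit;

	return !(atomic_fetch_or_explicit(word, mask,
					  memory_order_acquire) & mask);
}

static void bit_lock(atomic_ulong *word, unsigned int bit)
{
	while (!bit_trylock(word, bit))
		;  /* spin; a real implementation would add a CPU relax hint */
}

static void bit_unlock(atomic_ulong *word, unsigned int bit)
{
	atomic_fetch_and_explicit(word, ~(1UL << bit), memory_order_release);
}
```

Because the lock lives in a bit of an existing word, taking it costs no more
atomics than the plain atomic update of that word would have.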
> 
> 
> 3) CONFIG_NO_PAGE_MAPCOUNT
> ==========================
> 
> Patch #11 -> #17 spell out and document exactly what is affected when
> not maintaining the per-page mapcounts in large folios anymore.
> 
> For example, as we cannot maintain folio->_nr_pages_mapped anymore when
> (un)mapping pages, we'll account a complete folio as mapped if a
> single page is mapped.
> 
> As another example, we might now under-estimate the USS (Unique Set Size)
> of a process, but never over-estimate it.
> 
> With a more elaborate approach for MM-owner tracking, like the advanced
> one [2], some things could be improved (e.g., USS to some degree), but
> some things just cannot be handled like we used to without these per-page
> mapcounts (e.g., folio->_nr_pages_mapped).
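
The accounting difference can be sketched in plain C. These helpers are
hypothetical, invented purely for illustration: with per-page mapcounts
(folio->_nr_pages_mapped), accounting can follow individual pages; without
them, mapping a single page has to account the whole folio as mapped:

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096L

/* With per-page mapcounts: account exactly the pages that are mapped. */
static long mapped_bytes_precise(int nr_pages_mapped)
{
	return nr_pages_mapped * MODEL_PAGE_SIZE;
}

/*
 * Without per-page mapcounts: as soon as any page of the folio is mapped,
 * the complete folio is accounted as mapped.
 */
static long mapped_bytes_no_mapcount(int nr_pages_mapped, int folio_nr_pages)
{
	return nr_pages_mapped ? folio_nr_pages * MODEL_PAGE_SIZE : 0;
}
```

For an 8-page folio with one page mapped, the precise variant reports one
page while the no-mapcount variant reports all eight; stats derived this way
can over-account mapped memory, which is why USS can only be under-estimated,
never over-estimated.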
> 
> 
> 4) Performance
> ==============
> 
> The following kernel config combinations are possible:
> 
> * Base: CONFIG_PAGE_MAPCOUNT
>    -> (existing) page-mapcount tracking
> * MM-ID: CONFIG_MM_ID && CONFIG_PAGE_MAPCOUNT
>    -> page-mapcount + MM-ID tracking
> * No-Mapcount: CONFIG_MM_ID && CONFIG_NO_PAGE_MAPCOUNT
>    -> MM-ID tracking
> 
> 
> I ran my PTE-mapped-THP microbenchmarks [3] and vm-scalability on a machine
> with two NUMA nodes, with a 10-core Intel(R) Xeon(R) Silver 4210R CPU @
> 2.40GHz and 16 GiB of memory each.
> 
> 4.1) PTE-mapped-THP microbenchmarks
> -----------------------------------
> 
> All benchmarks allocate 1 GiB of THPs of a given size and then fork()/
> munmap()/... that memory. PMD-sized THPs are mapped by PTEs first.
> 
> Numbers are increase (+) / reduction (-) in runtime. Reduction (-) is
> good. "Base" is the baseline.
> 
> munmap: munmap() the allocated memory.
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |    2 % |        -8 %
>      32 KiB |    3 % |        -9 %
>      64 KiB |    4 % |       -16 %
>     128 KiB |    3 % |       -17 %
>     256 KiB |    1 % |       -23 %
>     512 KiB |    1 % |       -26 %
>    1024 KiB |    0 % |       -29 %
>    2048 KiB |    0 % |       -31 %
> 
> -> 32-128 KiB with MM-ID are a bit unexpected: we would expect to see the
>     worst case with the smallest size (16 KiB). But for these sizes the
>     STDEV is also between 1 % and 2 %, in contrast to the others (< 1 %).
>     Maybe some weird interaction with PCP/buddy.
> 
> fork: fork()
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |    4 % |       -9 %
>      32 KiB |    1 % |      -12 %
>      64 KiB |    0 % |      -15 %
>     128 KiB |    0 % |      -15 %
>     256 KiB |    0 % |      -16 %
>     512 KiB |    0 % |      -16 %
>    1024 KiB |    0 % |      -17 %
>    2048 KiB |   -1 % |      -21 %
> 
> -> Slight slowdown with MM-ID for the smallest folio size (more in line
>     with what we expect, in contrast to munmap()).
> 
> cow-byte: fork() and keep the child running. Write one byte to each
>    individual page, measuring the duration of all writes.
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |    0 % |        0 %
>      32 KiB |    0 % |        0 %
>      64 KiB |    0 % |        0 %
>     128 KiB |    0 % |        0 %
>     256 KiB |    0 % |        0 %
>     512 KiB |    0 % |        0 %
>    1024 KiB |    0 % |        0 %
>    2048 KiB |    0 % |        0 %
> 
> -> All other overhead dominates even when effectively unmapping
>     single pages of large folios when replacing them by a copy during write
>     faults. No change, which is great!
> 
> reuse-byte: fork() and wait until the child quit. Write one byte to each
>    individual page, measuring the duration of all writes.
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |  -66 % |      -66 %
>      32 KiB |  -65 % |      -65 %
>      64 KiB |  -64 % |      -64 %
>     128 KiB |  -64 % |      -64 %
>     256 KiB |  -64 % |      -64 %
>     512 KiB |  -64 % |      -64 %
>    1024 KiB |  -64 % |      -64 %
>    2048 KiB |  -64 % |      -64 %
> 
> -> No surprise, we reuse all pages instead of copying them.
> 
> child-reuse-byte: fork() and unmap the memory in the parent. Write one byte
>    to each individual page in the child, measuring the duration of all writes.
> 
> Folio Size |  MM-ID | No-Mapcount
> --------------------------------
>      16 KiB |  -66 % |      -66 %
>      32 KiB |  -65 % |      -65 %
>      64 KiB |  -64 % |      -64 %
>     128 KiB |  -64 % |      -64 %
>     256 KiB |  -64 % |      -64 %
>     512 KiB |  -64 % |      -64 %
>    1024 KiB |  -64 % |      -64 %
>    2048 KiB |  -64 % |      -64 %
> 
> -> Same thing, we reuse all pages instead of copying them.
> 
> 
> For 4 KiB, there is no change in any benchmark, as expected.
> 
> 
> 4.2) vm-scalability
> -------------------
> 
> For now I only ran anon COW tests. I use 1 GiB per child process and
> one child per core (-> 20).
> 
> case-anon-cow-rand: random writes
> 
> There is effectively no change (<0.6% throughput difference).
> 
> case-anon-cow-seq: sequential writes
> 
> MM-ID has up to 2% *lower* throughput than Base, not really correlating
> with folio size. The difference is almost as large as the STDEV (1% - 2%),
> though. It looks like there is a very slight effective slowdown.
> 
> No-Mapcount has up to 3% *higher* throughput than Base, not really
> correlating to the folio size. However, also here the difference is almost
> as large as the STDEV (up to 2%). It looks like there is a very slight
> effective speedup.
> 
> In summary, no earth-shattering slowdown with MM-ID (and we just recently
> optimized folio->_nr_pages_mapped to give us some speedup :) ), and
> another nice improvement with No-Mapcount.
> 
> 
> I did a bunch of cross-compiles, and the build bots proved very helpful
> over the last months. I did quite some testing with LTP and selftests,
> but on x86-64 only.

Gentle ping. I might soon have capacity to continue working on this. If 
there is no further feedback I'll rebase and resend.

-- 
Cheers,

David / dhildenb


