* + mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch added to mm-unstable branch
@ 2023-09-29 19:08 Andrew Morton
From: Andrew Morton @ 2023-09-29 19:08 UTC (permalink / raw)
  To: mm-commits, ziy, yuzhao, ying.huang, willy, vbabka, shy828301,
	rientjes, mcgrof, kirill.shutemov, jhubbard, itaru.kitayama,
	hughd, fengwei.yin, david, catalin.marinas, anshuman.khandual,
	ryan.roberts, akpm


The patch titled
     Subject: mm: allow deferred splitting of arbitrary anon large folios
has been added to the -mm mm-unstable branch.  Its filename is
     mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Ryan Roberts <ryan.roberts@arm.com>
Subject: mm: allow deferred splitting of arbitrary anon large folios
Date: Fri, 29 Sep 2023 12:44:12 +0100

Patch series "variable-order, large folios for anonymous memory", v6.

A series to implement variable-order, large folios for anonymous memory
(previously called "ANON_LARGE_FOLIO", "LARGE_ANON_FOLIO" and "FLEXIBLE_THP",
but now exposed as an extension to THP: "small-order THP").  The objective
is to improve performance by allocating larger chunks of memory during
anonymous page faults:

1) Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had: fewer page faults, batched PTE
   and RMAP manipulation, shorter LRU lists, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
2) Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
   TLB entries: "the contiguous bit" (architectural) and HPA (uarch).

Testing
=======

This version adds patches to mm selftests so that the cow tests explicitly test
small-order THP, in the same way that PMD-order THP is tested. The new tests all
pass, and no regressions are observed in the mm selftest suite.


Performance
===========

The tables below show performance and memory data for selected workloads, with
different small-order THPs enabled. All configs are compared to a 4k page kernel
with small-order THP disabled. 16k and 64k page kernels (with small-order THP
disabled) are included to aid the comparison. All kernels were built from the
same source: mm-hotfixes-unstable (f9911db48293) + this series.

4k-page-16k-folio: 16k (order-2) THP enabled
4k-page-32k-folio: 32k+16k (order-3, order-2) THP enabled
4k-page-64k-folio: 64k+32k+16k (order-4, order-3, order-2) THP enabled
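
The order/size pairs above follow from size = base page size << order.  A
minimal sketch of that arithmetic, assuming the 4k base page used by these
configs; folio_size_bytes() and folio_nr_base_pages() are names made up for
this illustration, not kernel helpers:

```c
#include <assert.h>

/* Assumption: 4k base pages, as in the 4k-page-* configs listed above. */
#define BASE_PAGE_SIZE 4096UL

/* Size in bytes of an order-N folio: base page size shifted left by N. */
static unsigned long folio_size_bytes(unsigned int order)
{
	assert(order < 20);	/* arbitrary sanity bound for this sketch */
	return BASE_PAGE_SIZE << order;
}

/* Number of base pages covered by an order-N folio. */
static unsigned long folio_nr_base_pages(unsigned int order)
{
	return 1UL << order;
}
```

With these definitions, orders 2, 3 and 4 give 16k, 32k and 64k folios
respectively, matching the config names.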

Run on Ampere Altra with 1 NUMA node enabled, Ubuntu 22.04, XFS filesystem.
20 repeats across 5 reboots (with 1 warmup run after each reboot).
Each workload ran in its own cgroup; memory.peak was read after completion.


Kernel Compilation with 8 jobs: (make defconfig && make -s -j8 Image)
(smaller is better):

| kernel            |   real-time |   kern-time |   user-time | memory |
|:------------------|------------:|------------:|------------:|-------:|
| baseline-4k-page  |        0.0% |        0.0% |        0.0% |   0.0% |
| 16k-page          |       -9.0% |      -49.7% |       -4.0% |   6.2% |
| 64k-page          |      -11.9% |      -66.5% |       -5.0% |  28.3% |
| 4k-page-16k-folio |       -2.8% |      -23.0% |       -0.3% |   0.0% |
| 4k-page-32k-folio |       -4.0% |      -32.0% |       -0.6% |   0.1% |
| 4k-page-64k-folio |       -4.6% |      -37.9% |       -0.5% |   0.1% |
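
The message does not state how the percentages are derived.  A plausible
reading, and only an assumption here, is the relative change against the
baseline-4k-page kernel.  pct_change() is a hypothetical helper written for
this sketch:

```c
/*
 * Relative change of `value` against `baseline`, in percent.
 * Negative results mean `value` is smaller than the baseline.
 * Assumption: this is how the table entries were computed; the
 * source does not give the formula.
 */
static double pct_change(double value, double baseline)
{
	return (value - baseline) / baseline * 100.0;
}
```

Under that reading, a config that takes half the baseline's kernel time would
show -50.0% in the kern-time column, and the baseline row is 0.0% by
construction.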


Kernel Compilation with 80 jobs: (make defconfig && make -s -j80 Image)
(smaller is better):

| kernel            |   real-time |   kern-time |   user-time | memory |
|:------------------|------------:|------------:|------------:|-------:|
| baseline-4k-page  |        0.0% |        0.0% |        0.0% |   0.0% |
| 16k-page          |       -9.2% |      -52.1% |       -3.6% |   4.6% |
| 64k-page          |      -11.4% |      -66.4% |       -3.0% |  12.6% |
| 4k-page-16k-folio |       -3.2% |      -22.8% |       -0.3% |   2.7% |
| 4k-page-32k-folio |       -4.8% |      -37.1% |       -0.5% |   2.9% |
| 4k-page-64k-folio |       -5.0% |      -42.1% |       -0.3% |   3.4% |


Speedometer 2.0: Running on Chromium automated with Selenium
(bigger is better for runs_per_min, smaller is better for memory):

| kernel            |   runs_per_min | memory |
|:------------------|---------------:|-------:|
| baseline-4k-page  |           0.0% |   0.0% |
| 16k-page          |           5.9% |  10.6% |
| 4k-page-16k-folio |           1.0% |  -0.6% |
| 4k-page-32k-folio |           1.3% |   3.5% |
| 4k-page-64k-folio |           1.3% |   6.4% |


This patch (of 6):

In preparation for the introduction of large folios for anonymous memory,
we would like to be able to split them when they have unmapped subpages,
in order to free those unused pages under memory pressure.  So remove the
artificial requirement that the large folio needed to be at least
PMD-sized.

Link: https://lkml.kernel.org/r/20230929114421.3761121-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20230929114421.3761121-2-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Itaru Kitayama <itaru.kitayama@gmail.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/rmap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/rmap.c~mm-allow-deferred-splitting-of-arbitrary-anon-large-folios
+++ a/mm/rmap.c
@@ -1504,11 +1504,11 @@ void page_remove_rmap(struct page *page,
 		__lruvec_stat_mod_folio(folio, idx, -nr);
 
 		/*
-		 * Queue anon THP for deferred split if at least one
+		 * Queue anon large folio for deferred split if at least one
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}
_
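
As a rough illustration of what the one-line change means, the sketch below
models only the outer condition in userspace C.  struct folio_model,
PMD_ORDER_MODEL and both helper names are hypothetical; the value 9 assumes
4k base pages with a 2M PMD.  The inner "!compound || nr < nr_pmdmapped"
check from the hunk is not modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two folio properties the check looks at. */
struct folio_model {
	unsigned int order;	/* 0 means a single base page */
	bool anon;		/* anonymous (not file-backed) memory */
};

/* Assumption: 4k base pages and a 2M PMD, i.e. PMD order 9. */
#define PMD_ORDER_MODEL 9

/* Before the patch: only PMD-mappable anon folios were candidates. */
static bool deferred_split_candidate_old(const struct folio_model *f)
{
	return f->order >= PMD_ORDER_MODEL && f->anon;
}

/* After the patch: any large (order > 0) anon folio is a candidate. */
static bool deferred_split_candidate_new(const struct folio_model *f)
{
	return f->order > 0 && f->anon;
}
```

In this model an order-4 (64k) anonymous folio is rejected by the old check
and accepted by the new one, while order-0 pages and non-anonymous folios
are rejected by both.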

Patches currently in -mm which might be from ryan.roberts@arm.com are

mm-hugetlb-add-huge-page-size-param-to-set_huge_pte_at.patch
arm64-hugetlb-fix-set_huge_pte_at-to-work-with-all-swap-entries.patch
mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch
mm-non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap.patch
mm-thp-account-pte-mapped-anonymous-thp-usage.patch
mm-thp-introduce-anon_orders-and-anon_always_mask-sysfs-files.patch
mm-thp-extend-thp-to-allocate-anonymous-large-folios.patch
mm-thp-add-recommend-option-for-anon_orders.patch
arm64-mm-override-arch_wants_pte_order.patch
selftests-mm-cow-generalize-do_run_with_thp-helper.patch
selftests-mm-cow-add-tests-for-small-order-anon-thp.patch


* + mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch added to mm-unstable branch
@ 2023-12-07 22:04 Andrew Morton
From: Andrew Morton @ 2023-12-07 22:04 UTC (permalink / raw)
  To: mm-commits, ziy, yuzhao, ying.huang, willy, wangkefeng.wang,
	v-songbaohua, vbabka, shy828301, rientjes, mcgrof,
	kirill.shutemov, jhubbard, itaru.kitayama, hughd, fengwei.yin,
	david, catalin.marinas, apopple, anshuman.khandual, ryan.roberts,
	akpm


The patch titled
     Subject: mm: allow deferred splitting of arbitrary anon large folios
has been added to the -mm mm-unstable branch.  Its filename is
     mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Ryan Roberts <ryan.roberts@arm.com>
Subject: mm: allow deferred splitting of arbitrary anon large folios
Date: Thu, 7 Dec 2023 16:12:02 +0000

Patch series "Multi-size THP for anonymous memory", v9.

A series to implement multi-size THP (mTHP) for anonymous memory
(previously called "small-sized THP" and "large anonymous folios").

The objective of this is to improve performance by allocating larger
chunks of memory during anonymous page faults:

1) Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had: fewer page faults, batched PTE
   and RMAP manipulation, shorter LRU lists, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
2) Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
   TLB entries: "the contiguous bit" (architectural) and HPA (uarch).

This version incorporates David's feedback on the core patches (#3, #4)
and adds some RB and TB tags (see change log for details).

By default, the existing behaviour (and performance) is maintained.  The
user must explicitly enable multi-size THP to see the performance benefit.
This is done via a new sysfs interface (as recommended by David
Hildenbrand - thanks to David for the suggestion)!  This interface is
inspired by the existing per-hugepage-size sysfs interface used by
hugetlb, provides full backwards compatibility with the existing PMD-size
THP interface, and provides a base for future extensibility.  See [9] for
detailed discussion of the interface.
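
This message does not spell out the sysfs layout.  The sketch below assumes
the per-size layout described in the kernel's transhuge admin guide, where
each size has its own "enabled" file under a hugepages-<size>kB directory;
treat both the directory and the file name as assumptions here.
mthp_enabled_path() is a hypothetical helper that only formats the path:

```c
#include <stdio.h>

/* Assumed location of the THP controls; not stated in this message. */
#define THP_SYSFS_DIR "/sys/kernel/mm/transparent_hugepage"

/*
 * Format the assumed per-size "enabled" path for a THP of size_kb
 * kilobytes into buf.  Returns 0 on success, -1 if buf is too small
 * or formatting fails.
 */
static int mthp_enabled_path(char *buf, size_t len, unsigned long size_kb)
{
	int n = snprintf(buf, len, "%s/hugepages-%lukB/enabled",
			 THP_SYSFS_DIR, size_kb);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Keeping the helper to string formatting means it can be exercised on any
machine; actually opening and writing the file requires a kernel that
provides the interface and sufficient privileges.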

This series is based on mm-unstable (715b67adf4c8).


Prerequisites
=============

I'm removing this section on the basis that I don't believe what we were
previously calling prerequisites are really prerequisites anymore.  We
originally defined them when mTHP was a compile-time feature.  There is
now a runtime control to opt in to mTHP; when disabled, correctness and
performance are as before.  When enabled, the code is still
correct/robust, but in the absence of the one remaining item (compaction)
there may be a performance impact in some corners.  See the old list in
the v8 cover letter at [8], and a longer explanation of my thinking at
[10].

SUMMARY: I don't think we should hold this series up, waiting for the
items on the prerequisites list.  I believe this series should be ready
now so hopefully can be added to mm-unstable for some testing, then
fingers crossed for v6.8.


Testing
=======

The series includes patches for mm selftests to enlighten the cow and
khugepaged tests to explicitly test with multi-size THP, in the same way
that PMD-sized THP is tested.  The new tests all pass, and no regressions
are observed in the mm selftest suite.  I've also run my usual kernel
compilation and JavaScript benchmarks without any issues.

Refer to my performance numbers posted with v6 [6].  (These are for
multi-size THP only - they do not include the arm64 contpte follow-on
series).

John Hubbard at Nvidia has indicated dramatic 10x performance improvements
for some workloads at [11].  (Observed using v6 of this series as well as
the arm64 contpte series).

Kefeng Wang at Huawei has also reported improvements at [12], along with
some latency regressions.

Using a microbenchmark, I've also checked that there is no regression in
the write fault path when mTHP is disabled.  I ran it for a baseline kernel,
as well as v8 and v9, on both Ampere Altra (bare metal) and Apple M2 (VM):

|              |        m2 vm        |        altra        |
|--------------|---------------------|---------------------|
| kernel       |     mean |  std_rel |     mean |  std_rel |
|--------------|----------|----------|----------|----------|
| baseline     |   0.000% |   0.341% |   0.000% |   3.581% |
| anonfolio-v8 |   0.005% |   0.272% |   5.068% |   1.128% |
| anonfolio-v9 |  -0.013% |   0.442% |   0.107% |   1.788% |

There is no measurable difference on M2, but Altra has a slowdown in v8
which is fixed in v9 by moving the THP order check to be inline within
thp_vma_allowable_orders(), as suggested by David.


This patch (of 10):

In preparation for the introduction of anonymous multi-size THP, we would
like to be able to split them when they have unmapped subpages, in order
to free those unused pages under memory pressure.  So remove the
artificial requirement that the large folio needed to be at least
PMD-sized.

Link: https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
Link: https://lkml.kernel.org/r/20231207161211.2374093-2-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Barry Song <v-songbaohua@oppo.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Tested-by: John Hubbard <jhubbard@nvidia.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Itaru Kitayama <itaru.kitayama@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/rmap.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/rmap.c~mm-allow-deferred-splitting-of-arbitrary-anon-large-folios
+++ a/mm/rmap.c
@@ -1535,11 +1535,11 @@ void page_remove_rmap(struct page *page,
 		__lruvec_stat_mod_folio(folio, idx, -nr);
 
 		/*
-		 * Queue anon THP for deferred split if at least one
+		 * Queue anon large folio for deferred split if at least one
 		 * page of the folio is unmapped and at least one page
 		 * is still mapped.
 		 */
-		if (folio_test_pmd_mappable(folio) && folio_test_anon(folio))
+		if (folio_test_large(folio) && folio_test_anon(folio))
 			if (!compound || nr < nr_pmdmapped)
 				deferred_split_folio(folio);
 	}
_

Patches currently in -mm which might be from ryan.roberts@arm.com are

mm-readahead-do-not-allow-order-1-folio.patch
mm-allow-deferred-splitting-of-arbitrary-anon-large-folios.patch
mm-non-pmd-mappable-large-folios-for-folio_add_new_anon_rmap.patch
mm-thp-introduce-multi-size-thp-sysfs-interface.patch
mm-thp-support-allocation-of-anonymous-multi-size-thp.patch
selftests-mm-kugepaged-restore-thp-settings-at-exit.patch
selftests-mm-factor-out-thp-settings-management.patch
selftests-mm-support-multi-size-thp-interface-in-thp_settings.patch
selftests-mm-khugepaged-enlighten-for-multi-size-thp.patch
selftests-mm-cow-generalize-do_run_with_thp-helper.patch
selftests-mm-cow-add-tests-for-anonymous-multi-size-thp.patch

