Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.

From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	nehagholkar@meta.com,  abhishekd@meta.com,  kernel-team@meta.com,
	david@redhat.com,  nphamcs@gmail.com,  akpm@linux-foundation.org,
	hannes@cmpxchg.org,  kbusch@meta.com,
	 Feng Tang <feng.tang@intel.com>
Subject: Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
Date: Sat, 21 Dec 2024 13:18:04 +0800
Message-ID: <87o715r4vn.fsf@DESKTOP-5N7EMDA>
In-Reply-To: <20241210213744.2968-1-gourry@gourry.net> (Gregory Price's message of "Tue, 10 Dec 2024 16:37:39 -0500")

Hi, Gregory,

Thanks for working on this!

Gregory Price <gourry@gourry.net> writes:

> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
>     1) The page is fully swapped out and re-faulted
>     2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
> Patches 1-3
> 	allow NULL as valid input to migration prep interfaces
> 	for vmf/vma - which is not present in unmapped folios.
> Patch 4
> 	adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
> 	adds the promotion mechanism, along with a sysfs
> 	extension which defaults the behavior to off.
> 	/sys/kernel/mm/numa/pagecache_promotion_enabled
>
> Functional test showed that we are able to reclaim some performance
> in canned scenarios (a file gets demoted and becomes hot with 
> relatively little contention).  See test/overhead section below.
>
> v2
> - cleanup first commit to be accurate and take Ying's feedback
> - cleanup NUMA_HINT_ define usage
> - add NUMA_HINT_ type selection macro to keep code clean
> - mild comment updates
>
> Open Questions:
> ======
>    1) Should we also add a limit to how much can be forced onto
>       a single task's promotion list at any one time? This might
>       piggy-back on the existing TPP promotion limit (256MB?) and
>       would simply add something like task->promo_count.
>
>       Technically we are limited by the batch read-rate before a
>       TASK_RESUME occurs.
>
>    2) Should we exempt certain forms of folios, or add additional
>       knobs/levers in to deal with things like large folios?
>
>    3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
>       so we could validate the behavior works as intended. Should
>       we just call this a NUMA_HINT_FAULT and not add a new hint?
>
>    4) Benchmark suggestions that can pressure 1TB memory. This is
>       not my typical wheelhouse, so if folks know of a useful
>       benchmark that can pressure my 1TB (768 DRAM / 256 CXL) setup,
>       I'd like to add additional measurements here.
>
> Development Notes
> =================
>
> During development, we explored the following proposals:
>
> 1) directly promoting within folio_mark_accessed (FMA)
>    Originally suggested by Johannes Weiner
>    https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
>
>    This caused deadlocks due to the fact that the PTL was held
>    in a variety of cases - but in particular during task exit.
>    It also is incredibly inflexible and causes promotion-on-fault.
>    It was discussed that a deferral mechanism was preferred.
>
>
> 2) promoting in filemap.c locations (calls of FMA)
>    Originally proposed by Feng Tang and Ying Huang
>    https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>
>    First, we saw this as less problematic than directly hooking FMA,
>    but we realized this has the potential to miss data in a variety of
>    locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.
>
>    Second, we discovered that the lock state of pages is very subtle,
>    and that these locations in filemap.c can be called in an atomic
>    context.  Prototypes led to a variety of stalls and lockups.
>
>
> 3) a new LRU - originally proposed by Keith Busch
>    https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
>
>    There are two issues with this approach: PG_promotable and reclaim.
>
>    First - PG_promotable has generally been discouraged.
>
>    Second - Attaching this mechanism to an LRU is both backwards and
>    counter-intuitive.  A promotable list is better served by a MOST
>    recently used list, and since LRUs are generally only shrunk when
>    exposed to pressure it would require implementing a new promotion
>    list shrinker that runs separate from the existing reclaim logic.
>
>
> 4) Adding a separate kthread - suggested by many
>
>    This is - to an extent - a more general version of the LRU proposal.
>    We still have to track the folios - which likely requires the
>    addition of a page flag.  Additionally, this method would actually
>    contend pretty heavily with LRU behavior - i.e. we'd want to
>    throttle addition to the promotion candidate list in some scenarios.
>
>
> 5) Doing it in task work
>
>    This seemed to be the most realistic after considering the above.
>
>    We observe the following:
>     - FMA is an ideal hook for this and isolation is safe here
>     - the new promotion_candidate function is an ideal hook for new
>       filter logic (throttling, fairness, etc).
>     - isolated folios are either promoted or putback on task resume,
>       there are no additional concurrency mechanics to worry about
>     - The mechanic can be made optional via a sysfs hook to avoid
>       overhead in degenerate scenarios (thrashing).
>
>    We also piggy-backed on the numa_hint_fault_latency timestamp to
>    further throttle promotions to help avoid promotions on one or
>    two time accesses to a particular page.
>
>
> Test:
> ======
>
> Environment:
>     1.5-3.7GHz CPU, ~4000 BogoMIPS, 
>     1TB Machine with 768GB DRAM and 256GB CXL
>     A 64GB file being linearly read by 6-7 Python processes
>
> Goal:
>    Generate promotions. Demonstrate stability and measure overhead.
>
> System Settings:
>    echo 1 > /sys/kernel/mm/numa/demotion_enabled
>    echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
>    echo 2 > /proc/sys/kernel/numa_balancing
>    
> Each process took up ~128GB, with anonymous memory growing and
> shrinking as python filled and released buffers with the 64GB data.
> This causes DRAM pressure to generate demotions, and file pages to
> "become hot" - and therefore be selected for promotion.
>
> First we ran with promotion disabled to show consistent overhead as
> a result of forcing a file out to CXL memory. We first ran a single
> reader to see uncontended performance, launched many readers to force
> demotions, then dropped back to a single reader to observe.
>
> Single-reader DRAM: ~16.0-16.4s
> Single-reader CXL (after demotion):  ~16.8-17s

The difference is trivial.  This makes me wonder why we need this
patchset.
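
For scale, the relative slowdown implied by the quoted numbers can be
estimated with a few lines of Python.  This is only a back-of-the-envelope
sketch: the helper name is made up for illustration, and the inputs are
the endpoints of the two quoted ranges.

```python
def slowdown_pct(dram_s, cxl_s):
    """Percent increase in loop time when the file sits on CXL instead of DRAM."""
    return (cxl_s - dram_s) / dram_s * 100.0

# Endpoints of the quoted ranges: DRAM ~16.0-16.4s, CXL ~16.8-17s.
best_case = slowdown_pct(16.4, 16.8)    # smallest gap between the ranges
worst_case = slowdown_pct(16.0, 17.0)   # largest gap between the ranges

print(f"slowdown: {best_case:.2f}% to {worst_case:.2f}%")
```

That is a slowdown of roughly 2% to 6% for this workload.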

> Next we turned promotion on with only a single reader running.
>
> Before promotions:
>     Node 0 MemFree:        636478112 kB
>     Node 0 FilePages:      59009156 kB
>     Node 1 MemFree:        250336004 kB
>     Node 1 FilePages:      14979628 kB

Why are there so many file pages on node 1 even though there are a lot
of free pages on node 0?  Did you move some file pages from node 0 to
node 1?
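
To put numbers on that, here is a small sketch using only the "Before
promotions" values quoted above.  It assumes that nodes 0 and 1 are the
only nodes holding file pages; the variable names are illustrative.

```python
KB_PER_GIB = 1024 * 1024

# "Before promotions" values quoted above, in kB.
node0_free_kb = 636478112
node0_file_kb = 59009156
node1_file_kb = 14979628

# Fraction of all file pages that sit on node 1 (assumes two nodes only).
node1_share = node1_file_kb / (node0_file_kb + node1_file_kb)
node0_free_gib = node0_free_kb / KB_PER_GIB

print(f"node 1 holds {node1_share:.1%} of file pages")
print(f"node 0 has {node0_free_gib:.0f} GiB free")
```

About a fifth of the file pages are on node 1 while node 0 still has
around 600 GiB free.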

> After promotions:
>     Node 0 MemFree:        632267268 kB
>     Node 0 FilePages:      72204968 kB
>     Node 1 MemFree:        262567056 kB
>     Node 1 FilePages:       2918768 kB
>
> Single-reader (after_promotion): ~16.5s
>
> Turning the promotion mechanism on when nothing had been demoted
> produced no appreciable overhead (memory allocation noise overpowers it)
>
> Read time did not change after turning promotion off after promotion
> occurred, which implies that the additional overhead is not coming from
> the promotion system itself - but likely other pages still trapped on
> the low tier.  Either way, this at least demonstrates the mechanism is
> not particularly harmful when there are no pages to promote - and the
> mechanism is valuable when a file actually is quite hot.
>
> Notably, it takes some time for the average read loop to come back
> down, and there still remain unpromoted file pages trapped in pagecache.
> This isn't entirely unexpected, there are many files which may have been
> demoted, and they may not be very hot.
>
>
> Overhead
> ======
> When promotion was turned on we saw a loop-runtime increase temporarily
>
> before: 16.8s
> during:
>   17.606216192245483
>   17.375206470489502
>   17.722095489501953
>   18.230552434921265
>   18.20712447166443
>   18.008254528045654
>   17.008427381515503
>   16.851454257965088
>   16.715774059295654
> stable: ~16.5s
>
> We measured overhead with a separate patch that simply measured the
> rdtsc value before/after calls in promotion_candidate and task work.
>
> e.g.:
> +       start = rdtsc();
>         list_for_each_entry_safe(folio, tmp, promo_list, lru) {
>                 list_del_init(&folio->lru);
>                 migrate_misplaced_folio(folio, NULL, nid);
> +               count++;
>         }
> +       atomic_long_add(rdtsc()-start, &promo_time);
> +       atomic_long_add(count, &promo_count);
>
> numa_migrate_prep: 93 - time(3969867917) count(42576860)
> migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
> migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
>
> Thoughts on a good throttling heuristic would be appreciated here.
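
The leading number on each of the three measurement lines above looks
like the average TSC cycles per call, that is, time divided by count.
That reading is an assumption, not something stated in the cover
letter.  A quick Python check, with the values copied from the quoted
lines:

```python
# (label, total TSC cycles, number of calls), copied from the quoted lines.
samples = [
    ("numa_migrate_prep", 3969867917, 42576860),
    ("migrate_misplaced_folio_prepare", 3433174319, 6985523),
    ("migrate_misplaced_folio", 11426529980, 6985523),
]

def cycles_per_call(total_cycles, calls):
    """Average cycles per call, truncated like integer division in C."""
    return total_cycles // calls

averages = {name: cycles_per_call(total, calls) for name, total, calls in samples}
for name, avg in averages.items():
    print(f"{name}: {avg} cycles/call")
```

The truncated quotients match the reported 93, 491 and 1635, which is
consistent with that interpretation.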

We already have a throttle mechanism.  For example, you can use

$ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps

to limit the promotion throughput to under 100 MB/s for each DRAM
node.
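
If it is more convenient to change the limit from a test harness, the
same write can be done from Python.  This is a minimal sketch: the
helper name and its "path" parameter are made up for illustration, and
it assumes the kernel exposes the sysctl file and that the caller has
permission to write it.

```python
from pathlib import Path

# Sysctl file named in the command above.
SYSCTL = Path("/proc/sys/kernel/numa_balancing_promote_rate_limit_MBps")

def set_promote_rate_limit(mbps, path=SYSCTL):
    """Write a new limit in MB/s and return the previous value as an int.

    'path' is overridable so the helper can be tried against a scratch file.
    """
    if mbps < 0:
        raise ValueError("rate limit must be non-negative")
    previous = int(path.read_text().strip())
    path.write_text(f"{mbps}\n")
    return previous
```

Calling set_promote_rate_limit(100) as root has the same effect as the
echo command, and the returned value can be written back afterwards to
restore the old setting.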

> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Suggested-by: Keith Busch <kbusch@meta.com>
> Suggested-by: Feng Tang <feng.tang@intel.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
>
> Gregory Price (5):
>   migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
>   memory: move conditionally defined enums use inside ifdef tags
>   memory: allow non-fault migration in numa_migrate_check path
>   vmstat: add page-cache numa hints
>   migrate,sysfs: add pagecache promotion
>
>  .../ABI/testing/sysfs-kernel-mm-numa          | 20 ++++++
>  include/linux/memory-tiers.h                  |  2 +
>  include/linux/migrate.h                       |  2 +
>  include/linux/sched.h                         |  3 +
>  include/linux/sched/numa_balancing.h          |  5 ++
>  include/linux/vm_event_item.h                 |  8 +++
>  init/init_task.c                              |  1 +
>  kernel/sched/fair.c                           | 26 +++++++-
>  mm/memory-tiers.c                             | 27 ++++++++
>  mm/memory.c                                   | 32 +++++-----
>  mm/mempolicy.c                                | 25 +++++---
>  mm/migrate.c                                  | 61 ++++++++++++++++++-
>  mm/swap.c                                     |  3 +
>  mm/vmstat.c                                   |  2 +
>  14 files changed, 193 insertions(+), 24 deletions(-)

---
Best Regards,
Huang, Ying
