From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Gregory Price
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, nehagholkar@meta.com,
	abhishekd@meta.com, kernel-team@meta.com, david@redhat.com,
	nphamcs@gmail.com, akpm@linux-foundation.org, hannes@cmpxchg.org,
	kbusch@meta.com, Feng Tang
Subject: Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
In-Reply-To: <20241210213744.2968-1-gourry@gourry.net> (Gregory Price's
	message of "Tue, 10 Dec 2024 16:37:39 -0500")
References: <20241210213744.2968-1-gourry@gourry.net>
Date: Sat, 21 Dec 2024 13:18:04 +0800
Message-ID: <87o715r4vn.fsf@DESKTOP-5N7EMDA>

Hi, Gregory,

Thanks for working on this!

Gregory Price writes:

> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
>   1) The page is fully swapped out and re-faulted
>   2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
> Patches 1-3
>   allow NULL as valid input to migration prep interfaces
>   for vmf/vma - which are not present for unmapped folios.
> Patch 4
>   adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
>   adds the promotion mechanism, along with a sysfs
>   extension which defaults the behavior to off.
>   /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> Functional testing showed that we are able to reclaim some performance
> in canned scenarios (a file gets demoted and becomes hot with
> relatively little contention). See the test/overhead sections below.
>
> v2
> - clean up the first commit to be accurate and take Ying's feedback
> - clean up NUMA_HINT_ define usage
> - add a NUMA_HINT_ type selection macro to keep the code clean
> - mild comment updates
>
> Open Questions:
> ======
>
> 1) Should we also add a limit to how much can be forced onto
>    a single task's promotion list at any one time? This might
>    piggy-back on the existing TPP promotion limit (256MB?) and
>    would simply add something like task->promo_count.
>
>    Technically we are limited by the batch read-rate before a
>    TASK_RESUME occurs.
>
> 2) Should we exempt certain forms of folios, or add additional
>    knobs/levers to deal with things like large folios?
>
> 3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
>    so we could validate that the behavior works as intended. Should
>    we just call this a NUMA_HINT_FAULT and not add a new hint?
>
> 4) Benchmark suggestions that can pressure 1TB of memory. This is
>    not my typical wheelhouse, so if folks know of a useful
>    benchmark that can pressure my 1TB (768GB DRAM / 256GB CXL) setup,
>    I'd like to add additional measurements here.
>
> Development Notes
> =================
>
> During development, we explored the following proposals:
>
> 1) Directly promoting within folio_mark_accessed (FMA)
>    Originally suggested by Johannes Weiner
>    https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
>
>    This caused deadlocks because the PTL was held in a variety of
>    cases - in particular during task exit. It is also incredibly
>    inflexible and causes promotion-on-fault. It was discussed that
>    a deferral mechanism was preferred.
>
>
> 2) Promoting in filemap.c locations (the callers of FMA)
>    Originally proposed by Feng Tang and Ying Huang
>    https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>
>    First, we saw this as less problematic than directly hooking FMA,
>    but we realized this has the potential to miss data in a variety
>    of locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.
>
>    Second, we discovered that the lock state of pages is very subtle,
>    and that these locations in filemap.c can be called in an atomic
>    context. Prototypes led to a variety of stalls and lockups.
>
> 3) A new LRU - originally proposed by Keith Busch
>    https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
>
>    There are two issues with this approach: PG_promotable and reclaim.
>
>    First, PG_promotable has generally been discouraged.
>
>    Second, attaching this mechanism to an LRU is both backwards and
>    counter-intuitive. A promotable list is better served by a MOST
>    recently used list, and since LRUs are generally only shrunk when
>    exposed to pressure, it would require implementing a new promotion
>    list shrinker that runs separately from the existing reclaim logic.
>
> 4) Adding a separate kthread - suggested by many
>
>    This is - to an extent - a more general version of the LRU proposal.
>    We still have to track the folios - which likely requires the
>    addition of a page flag. Additionally, this method would contend
>    pretty heavily with LRU behavior - i.e. we'd want to throttle
>    additions to the promotion candidate list in some scenarios.
>
> 5) Doing it in task work
>
>    This seemed the most realistic after considering the above.
>
>    We observe the following:
>    - FMA is an ideal hook for this, and isolation is safe here
>    - the new promotion_candidate function is an ideal hook for new
>      filter logic (throttling, fairness, etc.)
>    - isolated folios are either promoted or put back on task resume;
>      there are no additional concurrency mechanics to worry about
>    - the mechanism can be made optional via a sysfs hook to avoid
>      overhead in degenerate scenarios (thrashing)
>
> We also piggy-backed on the numa_hint_fault_latency timestamp to
> further throttle promotions, to help avoid promoting pages that are
> only accessed once or twice.
>
>
> Test:
> ======
>
> Environment:
>   1.5-3.7GHz CPU, ~4000 BogoMIPS,
>   1TB machine with 768GB DRAM and 256GB CXL
>   A 64GB file being linearly read by 6-7 Python processes
>
> Goal:
>   Generate promotions. Demonstrate stability and measure overhead.
>
> System Settings:
>   echo 1 > /sys/kernel/mm/numa/demotion_enabled
>   echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
>   echo 2 > /proc/sys/kernel/numa_balancing
>
> Each process took up ~128GB, with anonymous memory growing and
> shrinking as Python filled and released buffers of the 64GB of data.
> This causes DRAM pressure to generate demotions, and file pages to
> "become hot" - and therefore be selected for promotion.
>
> First we ran with promotion disabled to show the consistent overhead
> that results from forcing a file out to CXL memory. We first ran a
> single reader to see uncontended performance, launched many readers
> to force demotions, then dropped back to a single reader to observe.
>
> Single-reader DRAM:                 ~16.0-16.4s
> Single-reader CXL (after demotion): ~16.8-17s

The difference is trivial. This makes me wonder whether we need this
patchset at all.

> Next we turned promotion on with only a single reader running.
>
> Before promotions:
>   Node 0 MemFree:   636478112 kB
>   Node 0 FilePages:  59009156 kB
>   Node 1 MemFree:   250336004 kB
>   Node 1 FilePages:  14979628 kB

Why are there so many file pages on node 1 even though there are a lot
of free pages on node 0? Did you move some file pages from node 0 to
node 1?
> After promotions:
>   Node 0 MemFree:   632267268 kB
>   Node 0 FilePages:  72204968 kB
>   Node 1 MemFree:   262567056 kB
>   Node 1 FilePages:   2918768 kB
>
> Single-reader (after promotion): ~16.5s
>
> Turning the promotion mechanism on when nothing had been demoted
> produced no appreciable overhead (memory allocation noise overpowers
> it).
>
> Read time did not change after turning promotion off once promotion
> had occurred, which implies that the additional overhead is not coming
> from the promotion system itself - but likely from other pages still
> trapped in the low tier. Either way, this at least demonstrates that
> the mechanism is not particularly harmful when there are no pages to
> promote - and that it is valuable when a file actually is quite hot.
>
> Notably, it takes some time for the average read loop to come back
> down, and some unpromoted file pages remain trapped in the page cache.
> This isn't entirely unexpected; many files may have been demoted, and
> they may not be very hot.
>
>
> Overhead
> ======
>
> When promotion was turned on, we saw a temporary increase in loop
> runtime:
>
> before: 16.8s
> during:
>   17.606216192245483
>   17.375206470489502
>   17.722095489501953
>   18.230552434921265
>   18.20712447166443
>   18.008254528045654
>   17.008427381515503
>   16.851454257965088
>   16.715774059295654
> stable: ~16.5s
>
> We measured overhead with a separate patch that simply measured the
> rdtsc value before/after calls in promotion_candidate and task work.
>
> e.g.:
> +	start = rdtsc();
> 	list_for_each_entry_safe(folio, tmp, promo_list, lru) {
> 		list_del_init(&folio->lru);
> 		migrate_misplaced_folio(folio, NULL, nid);
> +		count++;
> 	}
> +	atomic_long_add(rdtsc() - start, &promo_time);
> +	atomic_long_add(count, &promo_count);
>
> numa_migrate_prep: 93 - time(3969867917) count(42576860)
> migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
> migrate_misplaced_folio: 1635 - time(11426529980) count(6985523)
>
> Thoughts on a good throttling heuristic would be appreciated here.

We already have a throttling mechanism. For example, you can use

  $ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps

to limit the promotion throughput to under 100 MB/s for each DRAM node.

> Suggested-by: Huang Ying
> Suggested-by: Johannes Weiner
> Suggested-by: Keith Busch
> Suggested-by: Feng Tang
> Signed-off-by: Gregory Price
>
> Gregory Price (5):
>   migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
>   memory: move conditionally defined enums use inside ifdef tags
>   memory: allow non-fault migration in numa_migrate_check path
>   vmstat: add page-cache numa hints
>   migrate,sysfs: add pagecache promotion
>
>  .../ABI/testing/sysfs-kernel-mm-numa  | 20 ++++++
>  include/linux/memory-tiers.h          |  2 +
>  include/linux/migrate.h               |  2 +
>  include/linux/sched.h                 |  3 +
>  include/linux/sched/numa_balancing.h  |  5 ++
>  include/linux/vm_event_item.h         |  8 +++
>  init/init_task.c                      |  1 +
>  kernel/sched/fair.c                   | 26 +++++++-
>  mm/memory-tiers.c                     | 27 ++++++++
>  mm/memory.c                           | 32 +++++-----
>  mm/mempolicy.c                        | 25 +++++---
>  mm/migrate.c                          | 61 ++++++++++++++++++-
>  mm/swap.c                             |  3 +
>  mm/vmstat.c                           |  2 +
>  14 files changed, 193 insertions(+), 24 deletions(-)

---
Best Regards,
Huang, Ying