From: Gregory Price <gourry@gourry.net>
To: Bharata B Rao <bharata@amd.com>
Cc: Matthew Wilcox <willy@infradead.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Jonathan.Cameron@huawei.com, dave.hansen@intel.com,
mgorman@techsingularity.net, mingo@redhat.com,
peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com,
rientjes@google.com, sj@kernel.org, weixugc@google.com,
ying.huang@linux.alibaba.com, ziy@nvidia.com, dave@stgolabs.net,
nifan.cxl@gmail.com, xuezhengchu@huawei.com, yiannis@zptcorp.com,
akpm@linux-foundation.org, david@kernel.org, byungchul@sk.com,
kinseyho@google.com, joshua.hahnjy@gmail.com, yuanchu@google.com,
balbirs@nvidia.com, alok.rathore@samsung.com, shivankg@amd.com,
donettom@linux.ibm.com
Subject: Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Date: Mon, 11 May 2026 10:27:46 -0400
Message-ID: <agHnYo_yvyXHBtTJ@gourry-fedora-PF4VCD3F>
In-Reply-To: <c2b63544-d5da-4fc8-88cc-487de0a9a71e@amd.com>
On Mon, May 11, 2026 at 03:32:20PM +0530, Bharata B Rao wrote:
>
>
> On 06-May-26 8:52 PM, Gregory Price wrote:
> > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
> >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
> >>
> >> I continue to think we should not do this.
> >
> > My only pushback on the general "we should not do this" is that we need
> > something to counter-balance the demotion bit in vmscan.c, and the
> > current implementation (prot_none faults) is rather :[
>
> So you are saying pghot subsystem currently does hot page detection and
> promotion only, which is fine. But the current implementation of demotion is not
> very optimal and hence we should spend effort in fine-tuning demotion first?
>
I'm saying that because of demotion and fallbacks, we need a mechanism
to handle promotions. I'm not convinced a hotness tracker will extend to
coldness tracking - at least not any better than LRU/MGLRU does.
> In this series itself I have shown via benchmark numbers that for over-committed
> cases (involving both demotion and promotion), the workload isn't really showing
> real benefit due to demotion and promotion. Are you specifically referring to
> this problem?
>
If over-committed means an over-subscribed hot tier (more hot memory
than available top-tier memory), then yeah, that result is intuitive. I
haven't pointed to any specific issue as of yet; I'm still taking time
to consider some of the results.
>
> Can you provide more context about the LRU inversion problem?
>
I've been tracking some data around shrink_folio_list and
alloc_migrate_folio behavior when a low tier node is full.
The result is that we end up skipping demotion and pushing memory from
the high tier straight to swap, resulting in a bunch of file and anon
refaults.
Hardware: Single Socket, 768GB DRAM, 256GB CXL Expander
In this workload, swap usage starts after the full 1TB of memory is
utilized, and from that point we see swap spillage.
second_chance = second alloc attempt in alloc_migrate_folio succeeds
swap_fallback = second chance fails, we swap directly from top tier
Sample data:
  pgdemote_kswapd            333052779
  pgdemote_direct           3181480482
  pgdemote_second_chance      31017629
  pgdemote_swap_fallback     335759535
  workingset_refault_anon     30106868
  workingset_refault_file   2343035341
(note here: swap fallback is the number of occurrences, while the others
are numbers of pages. As a result, the actual number of swapped pages is
likely much closer to the pgdemote_direct number)
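For anyone wanting to pull the same counters out of a /proc/vmstat-style
dump, here is a minimal userspace sketch. The helper names vmstat_lookup()
and demoted_pages() are hypothetical, and pgdemote_second_chance and
pgdemote_swap_fallback only exist with the debug diff at the end of this
mail applied. It sums only the two pgdemote counters shown in the sample.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a "name value" pair in vmstat-style text (one counter per line).
 * Returns 1 and stores the value on success, 0 if the name is absent.
 */
int vmstat_lookup(const char *text, const char *name,
		  unsigned long long *value)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Require the full name followed by a space, not a prefix. */
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return sscanf(line + len, "%llu", value) == 1;
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}

/* Pages demoted by kswapd plus direct reclaim; missing counters count as 0. */
unsigned long long demoted_pages(const char *text)
{
	unsigned long long kswapd = 0, direct = 0;

	vmstat_lookup(text, "pgdemote_kswapd", &kswapd);
	vmstat_lookup(text, "pgdemote_direct", &direct);
	return kswapd + direct;
}
```

Feed it the contents of /proc/vmstat read into a NUL-terminated buffer.
Because pgdemote_swap_fallback counts events rather than pages, it should
not be divided directly by the page totals.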
As a result: LRU is just broken on CXL systems, LRU inverts by design.
In a sane world we would just see the second tier as an extension of the
LRU, but that doesn't necessarily mean we can glean hotness data from it
(it's still largely a coldness tracking mechanism).
I have patches I haven't RFC'd yet that try to address this, but I need
more time to test them.
I don't think this is something to address with PGHot.
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 112983b42559..ccdd698c5937 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1043,7 +1043,10 @@ struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
 	mtc->gfp_mask &= ~__GFP_THISNODE;
 	mtc->nmask = allowed_mask;
 
-	return alloc_migration_target(src, (unsigned long)mtc);
+	dst = alloc_migration_target(src, (unsigned long)mtc);
+	if (dst)
+		count_vm_events(PGDEMOTE_SECOND_CHANCE, folio_nr_pages(src));
+	return dst;
 }
 
 /*
@@ -1616,6 +1619,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
 		/* Folios which weren't demoted go back on @folio_list */
+		if (!sc->proactive)
+			count_vm_event(PGDEMOTE_SWAP_FALLBACK);
 		list_splice_init(&demote_folios, folio_list);
 
 		/*