The Linux Kernel Mailing List
From: Gregory Price <gourry@gourry.net>
To: Bharata B Rao <bharata@amd.com>
Cc: Matthew Wilcox <willy@infradead.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Jonathan.Cameron@huawei.com, dave.hansen@intel.com,
	mgorman@techsingularity.net, mingo@redhat.com,
	peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com,
	rientjes@google.com, sj@kernel.org, weixugc@google.com,
	ying.huang@linux.alibaba.com, ziy@nvidia.com, dave@stgolabs.net,
	nifan.cxl@gmail.com, xuezhengchu@huawei.com, yiannis@zptcorp.com,
	akpm@linux-foundation.org, david@kernel.org, byungchul@sk.com,
	kinseyho@google.com, joshua.hahnjy@gmail.com, yuanchu@google.com,
	balbirs@nvidia.com, alok.rathore@samsung.com, shivankg@amd.com,
	donettom@linux.ibm.com
Subject: Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
Date: Mon, 11 May 2026 10:27:46 -0400
Message-ID: <agHnYo_yvyXHBtTJ@gourry-fedora-PF4VCD3F>
In-Reply-To: <c2b63544-d5da-4fc8-88cc-487de0a9a71e@amd.com>

On Mon, May 11, 2026 at 03:32:20PM +0530, Bharata B Rao wrote:
> 
> 
> On 06-May-26 8:52 PM, Gregory Price wrote:
> > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote:
> >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote:
> >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The
> >>
> >> I continue to think we should not do this.
> > 
> > My only pushback on the general "we should not do this" is that we need
> > something to counter-balance the demotion bit in vmscan.c, and the
> > current implementation (prot_none faults) is rather :[
> 
> So you are saying pghot subsystem currently does hot page detection and
> promotion only, which is fine. But the current implementation of demotion is not
> very optimal and hence we should spend effort in fine-tuning demotion first?
>

I'm saying that because of demotion and fallbacks, we need a mechanism
to handle promotions.  I'm not convinced a hotness tracker will extend
to coldness tracking - at least not any better than LRU/MGLRU.

> In this series itself I have shown via benchmark numbers that for over-committed
> cases (involving both demotion and promotion), the workload isn't really showing
> real benefit due to demotion and promotion. Are you specifically referring to
> this problem?
> 

If over-committed means an over-subscribed hot tier (more hot memory
than available top-tier memory), then yeah, that result is intuitive.
I haven't pointed to any specific issue as of yet; I'm still taking
time to consider some of the results.

> 
> Can you provide more context about the LRU inversion problem?
> 

I've been tracking some data around shrink_folio_list and
alloc_migrate_folio behavior when a low tier node is full.

The result is that we end up skipping demotion and sending memory from
the top tier straight to swap, resulting in a bunch of file and anon
refaults.

Hardware: Single Socket, 768GB DRAM, 256GB CXL Expander

In this workload, the full 1TB of memory (DRAM + CXL) gets utilized,
and from that point on we see spillage into swap.

second_chance = second alloc attempt in alloc_migrate_folio succeeds
swap_fallback = second chance fails, we swap directly from top tier
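
To make the two definitions above concrete, here is a rough userspace
sketch of the decision they describe.  It is a toy model, not the kernel
code: struct toy_node, toy_alloc() and toy_demote() are hypothetical
names, a node is reduced to a free-page counter, and the second attempt
simply walks the other allowed nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of one demotion attempt in this toy model. */
enum demote_result {
	DEMOTE_FIRST_TRY,     /* preferred lower-tier node had room */
	DEMOTE_SECOND_CHANCE, /* another allowed lower-tier node had room */
	DEMOTE_SWAP_FALLBACK, /* no room anywhere: reclaim from the top tier */
};

/* Hypothetical stand-in for a lower-tier node: just a free-page count. */
struct toy_node {
	unsigned long free_pages;
};

/* Take nr pages from a node if it has them; returns 1 on success. */
static int toy_alloc(struct toy_node *node, unsigned long nr)
{
	if (node->free_pages < nr)
		return 0;
	node->free_pages -= nr;
	return 1;
}

/*
 * Try the preferred node first, then every other allowed node.
 * If both attempts fail, the caller is left to reclaim the folio
 * from where it currently sits.
 */
static enum demote_result toy_demote(struct toy_node *nodes, size_t nr_nodes,
				     size_t preferred, unsigned long nr_pages)
{
	size_t i;

	assert(preferred < nr_nodes);

	if (toy_alloc(&nodes[preferred], nr_pages))
		return DEMOTE_FIRST_TRY;

	for (i = 0; i < nr_nodes; i++) {
		if (i != preferred && toy_alloc(&nodes[i], nr_pages))
			return DEMOTE_SECOND_CHANCE;
	}

	return DEMOTE_SWAP_FALLBACK;
}
```

With two nodes holding 1 and 4 free pages and node 0 preferred,
demoting 1 page, then 4 pages, then 1 page walks through the three
outcomes in that order.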

Sample data:

pgdemote_kswapd           333052779
pgdemote_direct          3181480482
pgdemote_second_chance     31017629
pgdemote_swap_fallback    335759535
workingset_refault_anon    30106868
workingset_refault_file  2343035341
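
Counters in this "name value" layout can be picked out of
/proc/vmstat-style text with a small helper.  This is only a sketch:
vmstat_lookup() is a hypothetical name, and pgdemote_second_chance and
pgdemote_swap_fallback would only show up on a kernel carrying a debug
patch that also exports them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" line in /proc/vmstat-style text.
 * Returns 1 and stores the value on success, 0 if the name is absent.
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long long *value)
{
	const char *line = text;
	size_t len;

	assert(text && name && value);
	len = strlen(name);

	while (*line) {
		const char *eol = strchr(line, '\n');

		/* Match the whole name, followed by a space or tab. */
		if (!strncmp(line, name, len) &&
		    (line[len] == ' ' || line[len] == '\t') &&
		    sscanf(line + len, "%llu", value) == 1)
			return 1;

		if (!eol)
			break;
		line = eol + 1;
	}
	return 0;
}
```

The values are read as unsigned long long because several of the
counters above are well past what fits in 32 bits.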

(note here: swap fallback is a number of occurrences, while the others
 are numbers of pages.  As a result, the actual number of swapped pages
 is likely much closer to the pgdemote_direct number)
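
For a rough sense of scale, the page-based counters above can be put in
relation to each other.  The sketch below is plain arithmetic on the
quoted sample; basis_points() and the helper names are made up for
illustration, and treating second-chance pages as a share of all demoted
pages is an assumption, since that counter is bumped at allocation time.
pgdemote_swap_fallback is left out because it counts events, not pages.

```c
#include <assert.h>
#include <stdint.h>

/* Sample counters quoted above (pages); most do not fit in 32 bits. */
static const uint64_t pgdemote_kswapd         = 333052779ULL;
static const uint64_t pgdemote_direct         = 3181480482ULL;
static const uint64_t pgdemote_second_chance  = 31017629ULL;
static const uint64_t workingset_refault_anon = 30106868ULL;
static const uint64_t workingset_refault_file = 2343035341ULL;

/* part/whole in units of 1/10000 (100 basis points = 1%), rounded down. */
static uint64_t basis_points(uint64_t part, uint64_t whole)
{
	assert(whole != 0);
	return part * 10000 / whole;
}

/* All demoted pages, kswapd plus direct reclaim. */
static uint64_t demoted_total(void)
{
	return pgdemote_kswapd + pgdemote_direct;
}

/* All refaulted pages, anon plus file. */
static uint64_t refault_total(void)
{
	return workingset_refault_anon + workingset_refault_file;
}

/* Share of demotions that came from direct reclaim. */
static uint64_t direct_share_bp(void)
{
	return basis_points(pgdemote_direct, demoted_total());
}

/* Second-chance pages relative to all demoted pages (assumed comparable). */
static uint64_t second_chance_bp(void)
{
	return basis_points(pgdemote_second_chance, demoted_total());
}

/* Refaulted pages relative to demoted pages. */
static uint64_t refault_vs_demote_bp(void)
{
	return basis_points(refault_total(), demoted_total());
}
```

On this sample that works out to roughly 90% of demotions coming from
direct reclaim, second-chance pages under 1% of demoted pages, and
refaults at about two thirds of the demoted page count.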

As a result:  LRU is just broken on CXL systems; the LRU inverts by design.

In a sane world we would just see the second tier as an extension of the
LRU, but that doesn't necessarily mean we can glean hotness data from it
(it's still largely a coldness tracking mechanism).

I have patches I haven't RFC'd yet that try to address this, but I need
more time to test them.

I don't think this is something to address with pghot.

---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 112983b42559..ccdd698c5937 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1043,7 +1043,10 @@ struct folio *alloc_migrate_folio(struct folio *src, unsigned long private)
        mtc->gfp_mask &= ~__GFP_THISNODE;
        mtc->nmask = allowed_mask;

-       return alloc_migration_target(src, (unsigned long)mtc);
+       dst = alloc_migration_target(src, (unsigned long)mtc);
+       if (dst)
+               count_vm_events(PGDEMOTE_SECOND_CHANCE, folio_nr_pages(src));
+       return dst;
 }

 /*
@@ -1616,6 +1619,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
        /* Folios that could not be demoted are still in @demote_folios */
        if (!list_empty(&demote_folios)) {
                /* Folios which weren't demoted go back on @folio_list */
+               if (!sc->proactive)
+                       count_vm_event(PGDEMOTE_SWAP_FALLBACK);
                list_splice_init(&demote_folios, folio_list);

                /*


