* Re: [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() [not found] ` <20260504060924.344313-3-bharata@amd.com> @ 2026-05-04 18:14 ` Donet Tom 2026-05-06 6:15 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Donet Tom @ 2026-05-04 18:14 UTC (permalink / raw) To: Bharata B Rao, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg Hi Bharata On 5/4/26 11:39 AM, Bharata B Rao wrote: > +int promote_misplaced_memcg_folios(struct list_head *folio_list, int node) > +{ > + struct mem_cgroup *memcg = NULL; > + unsigned int nr_succeeded = 0; > + struct folio *first; > + int nr_remaining; > + > + if (list_empty(folio_list)) > + return 0; > + > + first = list_first_entry(folio_list, struct folio, lru); > +#ifdef CONFIG_DEBUG_VM > + { > + struct folio *f; > + list_for_each_entry(f, folio_list, lru) > + VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first)); It looks like the indentation might be off here. > + } > +#endif > + memcg = get_mem_cgroup_from_folio(first); > + > + nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio, > + NULL, node, MIGRATE_ASYNC, > + MR_NUMA_MISPLACED, &nr_succeeded); > + if (nr_remaining) > + putback_movable_pages(folio_list); > + > + if (nr_succeeded) { > + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); > + count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); > + mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)), > + PGPROMOTE_SUCCESS, nr_succeeded); > + } > + > + mem_cgroup_put(memcg); > + WARN_ON(!list_empty(folio_list)); > + return nr_remaining ? -EAGAIN : 0; > +} > #endif /* CONFIG_NUMA_BALANCING */ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() 2026-05-04 18:14 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Donet Tom @ 2026-05-06 6:15 ` Bharata B Rao 0 siblings, 0 replies; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 6:15 UTC (permalink / raw) To: Donet Tom, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg On 04-May-26 11:44 PM, Donet Tom wrote: > Hi Bharata > > On 5/4/26 11:39 AM, Bharata B Rao wrote: >> +int promote_misplaced_memcg_folios(struct list_head *folio_list, int node) >> +{ >> + struct mem_cgroup *memcg = NULL; >> + unsigned int nr_succeeded = 0; >> + struct folio *first; >> + int nr_remaining; >> + >> + if (list_empty(folio_list)) >> + return 0; >> + >> + first = list_first_entry(folio_list, struct folio, lru); >> +#ifdef CONFIG_DEBUG_VM >> + { >> + struct folio *f; >> + list_for_each_entry(f, folio_list, lru) >> + VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first)); > > > It looks like the indentation might be off here. Yeah looks like. Will fix. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <20260504060924.344313-5-bharata@amd.com>]
* Re: [PATCH v7 4/7] mm: pghot: Precision mode for pghot [not found] ` <20260504060924.344313-5-bharata@amd.com> @ 2026-05-04 18:41 ` Donet Tom 2026-05-06 6:17 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Donet Tom @ 2026-05-04 18:41 UTC (permalink / raw) To: Bharata B Rao, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg Hi Bharata On 5/4/26 11:39 AM, Bharata B Rao wrote: > +#include <linux/pghot.h> > +#include <linux/jiffies.h> > +#include <linux/memory-tiers.h> > + > +bool pghot_nid_valid(int nid) I might be missing something, but since pghot_nid_valid() exists in both pghot-default.c and pghot-precise.c, would it make sense to move it to a header file as a static inline function? -Donet > +{ > + if (nid != NUMA_NO_NODE && > + (!numa_valid_node(nid) || nid > PGHOT_NID_MAX || > + !node_online(nid) || !node_is_toptier(nid))) > + return false; > + > + return true; > +} > + > +unsigned long pghot_access_latency(unsigned long old_time, unsigned long time) > +{ > + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK); > +} > + > +bool pghot_update_record(phi_t *phi, int nid, unsigned long now) > +{ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 4/7] mm: pghot: Precision mode for pghot 2026-05-04 18:41 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Donet Tom @ 2026-05-06 6:17 ` Bharata B Rao 0 siblings, 0 replies; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 6:17 UTC (permalink / raw) To: Donet Tom, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg On 05-May-26 12:11 AM, Donet Tom wrote: > Hi Bharata > > On 5/4/26 11:39 AM, Bharata B Rao wrote: >> +#include <linux/pghot.h> >> +#include <linux/jiffies.h> >> +#include <linux/memory-tiers.h> >> + >> +bool pghot_nid_valid(int nid) > > I might be missing something, but since pghot_nid_valid() exists in both pghot- > default.c and pghot-precise.c, would it make sense to move it to a header file > as a static inline function? It exists in both modes of pghot but the implementations differ. Hence it can't reside as static inline function in pghot.h. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure [not found] <20260504060924.344313-1-bharata@amd.com> [not found] ` <20260504060924.344313-3-bharata@amd.com> [not found] ` <20260504060924.344313-5-bharata@amd.com> @ 2026-05-04 20:36 ` Matthew Wilcox 2026-05-05 22:17 ` Balbir Singh 2026-05-06 15:22 ` Gregory Price [not found] ` <20260504060924.344313-6-bharata@amd.com> ` (2 subsequent siblings) 5 siblings, 2 replies; 19+ messages in thread From: Matthew Wilcox @ 2026-05-04 20:36 UTC (permalink / raw) To: Bharata B Rao Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: > This is v7 of pghot, a hot-page tracking and promotion subsystem. The I continue to think we should not do this. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-04 20:36 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Matthew Wilcox @ 2026-05-05 22:17 ` Balbir Singh 2026-05-06 3:43 ` Bharata B Rao 2026-05-06 15:22 ` Gregory Price 1 sibling, 1 reply; 19+ messages in thread From: Balbir Singh @ 2026-05-05 22:17 UTC (permalink / raw) To: Matthew Wilcox, Bharata B Rao Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, alok.rathore, shivankg, donettom On 5/5/26 06:36, Matthew Wilcox wrote: > On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: >> This is v7 of pghot, a hot-page tracking and promotion subsystem. The > > I continue to think we should not do this. > I am unclear about the benefits of the patchset, I have not tested it or reviewed the latest revision. My big concern was that top-tier might not always be suitable. I see that there are some numbers posted, but I find this weird "After the graph creation, the processes are stopped and data is migrated to CXL node 2 before continuing so that BFS phase starts accessing lower tier memory." Why not allocate everything on CXL node 2? Balbir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-05 22:17 ` Balbir Singh @ 2026-05-06 3:43 ` Bharata B Rao 2026-05-06 4:02 ` Balbir Singh 0 siblings, 1 reply; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 3:43 UTC (permalink / raw) To: Balbir Singh, Matthew Wilcox Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, alok.rathore, shivankg, donettom On 06-May-26 3:47 AM, Balbir Singh wrote: > On 5/5/26 06:36, Matthew Wilcox wrote: >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The >> >> I continue to think we should not do this. >> > > I am unclear about the benefits of the patchset, I have not tested > it or reviewed the latest revision. My big concern was that top-tier > might not always be suitable. So you are saying that we should have a capability to promote accessed pages from the lower tier to another tier that is not classified as top tier? Is that non-top-tier node the one which generates the accesses? > > I see that there are some numbers posted, but I find this weird > "After the graph creation, the processes are stopped and data is migrated > to CXL node 2 before continuing so that BFS phase starts accessing lower > tier memory." Why not allocate everything on CXL node 2? In the ideal scenario, the benefit is to see if any pages that end up on the lower tier get identified as hot and get promoted. That means we need to create an over-committed scenario where the pages get demoted first. I have provided 
The problem with this case is that the base hot page promotion (NUMAB2) hasn't shown any benefit at all with my micro-benchmark - Ref: https://lore.kernel.org/linux-mm/868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com/ Same has been observed with redis-memtier benchmark - https://lore.kernel.org/linux-mm/957f2242-56d4-4bf0-8aeb-9d60fbea8c8c@amd.com/ Instead what I am doing here is to take out demotion from the scenario but still retain the access pattern of the benchmark by pushing out the data to lower tier when the benchmark reaches steady allocation state. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-06 3:43 ` Bharata B Rao @ 2026-05-06 4:02 ` Balbir Singh 2026-05-06 5:00 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Balbir Singh @ 2026-05-06 4:02 UTC (permalink / raw) To: Bharata B Rao, Matthew Wilcox Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, alok.rathore, shivankg, donettom On 5/6/26 13:43, Bharata B Rao wrote: > On 06-May-26 3:47 AM, Balbir Singh wrote: >> On 5/5/26 06:36, Matthew Wilcox wrote: >>> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: >>>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The >>> >>> I continue to think we should not do this. >>> >> >> I am unclear about the benefits of the patchset, I have not tested >> it or reviewed the latest revision. My big concern was that top-tier >> might not always be suitable. > > So you are saying that we should have a capability to promote accessed pages > from lower tier to an other tier that is not classified as top tier? Is that > non-top tier node the one which generates accesses? > Yes, a top tier node could be CPU less for example. >> >> I see that there are some numbers posted, but I find this weird >> "After the graph creation, the processes are stopped and data is migrated >> to CXL node 2 before continuing so that BFS phase starts accessing lower >> tier memory." Why not allocate everything on CXL node 2? > > In the ideal scenario, the benefit is to see if any pages that land up on lower > tier get identified as hot and get promoted. That means we need to create an > over-committed scenario where the pages get demoted first. I have provided Why do the pages need to get demoted? 
Why not allocate them from the lower tier to show that promotion upwards is helpful > numbers from such cases in my previous versions. The problem with this case is > that the base hot page promotion (NUMAB2) hasn't shown any benefit at all with > my micro-benchmark - Ref: > https://lore.kernel.org/linux-mm/868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com/ > > Same has been observed with redis-memtier benchmark - > https://lore.kernel.org/linux-mm/957f2242-56d4-4bf0-8aeb-9d60fbea8c8c@amd.com/ > > Instead what I am doing here is to take out demotion from the scenario but still > retain the access pattern of the benchmark by pushing out the data to lower tier > when the benchmark reaches steady allocation state. > Balbir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-06 4:02 ` Balbir Singh @ 2026-05-06 5:00 ` Bharata B Rao 0 siblings, 0 replies; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 5:00 UTC (permalink / raw) To: Balbir Singh, Matthew Wilcox Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, alok.rathore, shivankg, donettom On 06-May-26 9:32 AM, Balbir Singh wrote: >>> I am unclear about the benefits of the patchset, I have not tested >>> it or reviewed the latest revision. My big concern was that top-tier >>> might not always be suitable. >> >> So you are saying that we should have a capability to promote accessed pages >> from lower tier to an other tier that is not classified as top tier? Is that >> non-top tier node the one which generates accesses? >> > > Yes, a top tier node could be CPU less for example. Currently kmigrated thread in pghot doesn't explicitly prevent promotion to non-toptier nodes. Here is how this works for the two modes of operation in pghot: pghot-default: In this mode, the target NID isn't explicitly tracked and hence kmigrated relies on the user-configurable pghot_target_nid. Though there is a !node_is_toptier(nid) check in the helper routine that populates pghot_target_nid, that can be relaxed if required. pghot-precise: In this mode, the accessing CPU's node is tracked as the target nid and promotion is done to that node. Note that pghot_target_nid isn't used here. Hence I don't see any major issues in this patchset to cover your use case. Let me know if I miss anything here. BTW, does the existing hot page promotion cover the use case you are targeting? 
> >>> >>> I see that there are some numbers posted, but I find this weird >>> "After the graph creation, the processes are stopped and data is migrated >>> to CXL node 2 before continuing so that BFS phase starts accessing lower >>> tier memory." Why not allocate everything on CXL node 2? >> >> In the ideal scenario, the benefit is to see if any pages that land up on lower >> tier get identified as hot and get promoted. That means we need to create an >> over-committed scenario where the pages get demoted first. I have provided > > Why do the pages need to get demoted? Why not allocate them from the lower tier > to show that promotion upwards is helpful As you can see, these are controlled experiments to measure the effectiveness of hot page detection and promotion and the benefits from promotion. It can be done in the way you are suggesting; just that I found it a bit simpler to pause the benchmark, migrate all pages to lower tier memory before the benchmark starts accessing them rather than relying on setting memory policies to achieve the same effect. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-04 20:36 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Matthew Wilcox 2026-05-05 22:17 ` Balbir Singh @ 2026-05-06 15:22 ` Gregory Price 2026-05-11 10:02 ` Bharata B Rao 1 sibling, 1 reply; 19+ messages in thread From: Gregory Price @ 2026-05-06 15:22 UTC (permalink / raw) To: Matthew Wilcox Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote: > On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: > > This is v7 of pghot, a hot-page tracking and promotion subsystem. The > > I continue to think we should not do this. My only pushback on the general "we should not do this" is that we need something to counter-balance the demotion bit in vmscan.c, and the current implementation (prot_none faults) is rather :[ I think this series needs to greatly limit its complexity and provide some gentle correction for LRU inversions, and I think they're making a decent attempt at that. But then I think local memory expansion on CXL is going pretty swimmingly in our datacenters :], others may not feel the same. ~Gregory ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-06 15:22 ` Gregory Price @ 2026-05-11 10:02 ` Bharata B Rao 2026-05-11 14:27 ` Gregory Price 0 siblings, 1 reply; 19+ messages in thread From: Bharata B Rao @ 2026-05-11 10:02 UTC (permalink / raw) To: Gregory Price, Matthew Wilcox Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On 06-May-26 8:52 PM, Gregory Price wrote: > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote: >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The >> >> I continue to think we should not do this. > > My only pushback on the general "we should not do this" is that we need > something to counter-balance the demotion bit in vmscan.c, and the > current implementation (prot_none faults) is rather :[ So you are saying pghot subsystem currently does hot page detection and promotion only, which is fine. But the current implementation of demotion is not very optimal and hence we should spend effort in fine-tuning demotion first? In this series itself I have shown via benchmark numbers that for over-committed cases (involving both demotion and promotion), the workload isn't really showing real benefit due to demotion and promotion. Are you specifically referring to this problem? > > I think this series needs to greatly limit its complexity and provide > some gentle correction for LRU inversions, and I think they're making a > decent attempt at that. Regarding complexity, I agree that the initial version of this patchset was quite complicated in the way it maintained hot page information. 
But the later versions including this one have greatly reduced the complexity with one byte of hot page information per PFN, atomic updates to hotness data without any locks, per-lowertier kmigrated threads for promotion and reuse of existing hot page promotion engine. Did you have anything else in mind wrt complexity? Can you provide more context about the LRU inversion problem? Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-11 10:02 ` Bharata B Rao @ 2026-05-11 14:27 ` Gregory Price 0 siblings, 0 replies; 19+ messages in thread From: Gregory Price @ 2026-05-11 14:27 UTC (permalink / raw) To: Bharata B Rao Cc: Matthew Wilcox, linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On Mon, May 11, 2026 at 03:32:20PM +0530, Bharata B Rao wrote: > > > On 06-May-26 8:52 PM, Gregory Price wrote: > > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote: > >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: > >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The > >> > >> I continue to think we should not do this. > > > > My only pushback on the general "we should not do this" is that we need > > something to counter-balance the demotion bit in vmscan.c, and the > > current implementation (prot_none faults) is rather :[ > > So you are saying pghot subsystem currently does hot page detection and > promotion only, which is fine. But the current implementation of demotion is not > very optimal and hence we should spend effort in fine-tuning demotion first? > I'm saying because of demotion and fallbacks, we need a mechanism to handle promotions. I'm not convinced a hotness will extend to coldness - at least any better than LRU/MGLRU. > In this series itself I have shown via benchmark numbers that for over-committed > cases (involving both demotion and promotion), the workload isn't really showing > real benefit due to demotion and promotion. Are you specifically referring to > this problem? > If over-committed means over-subscribed hot-tier (more hot memory than available top tier memory), then yeah that result is intuitive. 
I haven't pointed to any specific issue, as of yet, still taking time to consider some of the results. > > Can you provide more context about the LRU inversion problem? > I've been tracking some data around shrink_folio_list and alloc_migrate_folio behavior when a low tier node is full. The result is we end up pushing memory from the high tier straight to swap, skipping demotion, resulting in a bunch of file and anon refaults. Hardware: Single Socket, 768GB DRAM, 256GB CXL Expander In this workload, we see swap usage after the full 1TB of memory is utilized, and as a result we see swap spillage. second_chance = second alloc attempt in alloc_migrate_folio succeeds swap_fallback = second chance fails, we swap directly from top tier Sample data: pgdemote_kswapd 333052779 pgdemote_direct 3181480482 pgdemote_second_chance 31017629 pgdemote_swap_fallback 335759535 workingset_refault_anon 30106868 workingset_refault_file 2343035341 (note here: swap fallback is the number of occurrences, while the others are numbers of pages. As a result, the actual number of swapped pages is likely much closer to the pgdemote_direct number) As a result: LRU is just broken on CXL systems, LRU inverts by design. In a sane world we would just see the second tier as an extension of the LRU, but that doesn't necessarily mean we can glean hotness data from it (it's still largely a coldness tracking mechanism). I have patches I haven't RFC'd yet that try to address this, but I need more time to test it. I don't think this is something to address with PGHot. 
--- diff --git a/mm/vmscan.c b/mm/vmscan.c index 112983b42559..ccdd698c5937 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1043,7 +1043,10 @@ struct folio *alloc_migrate_folio(struct folio *src, unsigned long private) mtc->gfp_mask &= ~__GFP_THISNODE; mtc->nmask = allowed_mask; - return alloc_migration_target(src, (unsigned long)mtc); + dst = alloc_migration_target(src, (unsigned long)mtc); + if (dst) + count_vm_events(PGDEMOTE_SECOND_CHANCE, folio_nr_pages(src)); + return dst; } /* @@ -1616,6 +1619,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, /* Folios that could not be demoted are still in @demote_folios */ if (!list_empty(&demote_folios)) { /* Folios which weren't demoted go back on @folio_list */ + if (!sc->proactive) + count_vm_event(PGDEMOTE_SWAP_FALLBACK); list_splice_init(&demote_folios, folio_list); /* ^ permalink raw reply related [flat|nested] 19+ messages in thread
[parent not found: <20260504060924.344313-6-bharata@amd.com>]
* Re: [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot [not found] ` <20260504060924.344313-6-bharata@amd.com> @ 2026-05-05 4:44 ` Donet Tom 2026-05-06 6:20 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Donet Tom @ 2026-05-05 4:44 UTC (permalink / raw) To: Bharata B Rao, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg Hi Bharata On 5/4/26 11:39 AM, Bharata B Rao wrote: > > +/* > + * For memory tiering mode, if there are enough free pages (more than > + * enough watermark defined here) in fast memory node, to take full > + * advantage of fast memory capacity, all recently accessed slow > + * memory pages will be migrated to fast memory node without > + * considering hot threshold. > + */ > +static bool pgdat_free_space_enough(struct pglist_data *pgdat) > +{ > + int z; > + unsigned long enough_wmark; > + > + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, Just a thought—would it be better to use #define for these hardcoded values? -Donet > + pgdat->node_present_pages >> 4); > + for (z = pgdat->nr_zones - 1; z >= 0; z--) { ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot 2026-05-05 4:44 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Donet Tom @ 2026-05-06 6:20 ` Bharata B Rao 0 siblings, 0 replies; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 6:20 UTC (permalink / raw) To: Donet Tom, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg On 05-May-26 10:14 AM, Donet Tom wrote: > Hi Bharata > > On 5/4/26 11:39 AM, Bharata B Rao wrote: >> +/* >> + * For memory tiering mode, if there are enough free pages (more than >> + * enough watermark defined here) in fast memory node, to take full >> + * advantage of fast memory capacity, all recently accessed slow >> + * memory pages will be migrated to fast memory node without >> + * considering hot threshold. >> + */ >> +static bool pgdat_free_space_enough(struct pglist_data *pgdat) >> +{ >> + int z; >> + unsigned long enough_wmark; >> + >> + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, > > Just a thought—would it be better to use #define for these hardcoded values? We could. It was a code movement, hence left it untouched. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure [not found] <20260504060924.344313-1-bharata@amd.com> ` (3 preceding siblings ...) [not found] ` <20260504060924.344313-6-bharata@amd.com> @ 2026-05-05 10:41 ` Bharata B Rao 2026-05-09 1:18 ` Andrew Morton 2026-05-05 13:42 ` Bharata B Rao 5 siblings, 1 reply; 19+ messages in thread From: Bharata B Rao @ 2026-05-05 10:41 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On 04-May-26 11:39 AM, Bharata B Rao wrote: > Results > ======= > Posted as replies to this mail thread. Graph500 benchmark results: Test system details ------------------- 3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2) $ numactl -H available: 3 nodes (0-2) node 0 cpus: 0-95,192-287 node 0 size: 128460 MB node 1 cpus: 96-191,288-383 node 1 size: 128893 MB node 2 cpus: node 2 size: 257993 MB node distances: node 0 1 2 0: 10 32 50 1: 32 10 60 2: 255 255 10 Hotness sources --------------- NUMAB0 - Without NUMA Balancing in base case and with no source enabled in the pghot case. No migrations occur. NUMAB2 - Existing hot page promotion for the base case and use of hint faults as source in the pghot case. NUMAB3 - Enabled both regular and tiering mode of NUMA Balancing (kernel.numa_balancing=3) Pghot by default promotes after two accesses but for NUMAB2 source, promotion is done after one access to match the base behaviour. 
(/sys/kernel/debug/pghot/freq_threshold=1) Graph500 details ---------------- Command: mpirun -n 128 --bind-to core --map-by core graph500/src/graph500_reference_bfs 28 16 After the graph creation, the processes are stopped and data is migrated to CXL node 2 before continuing so that BFS phase starts accessing lower tier memory. Total memory usage is slightly over 100GB and will fit within Node 0 and 1. Hence there is no memory pressure to induce demotions. harmonic_mean_TEPS - Higher is better ===================================================================================== Base Base pghot-default pghot-precise NUMAB0 NUMAB2 NUMAB2 NUMAB2 ===================================================================================== harmonic_mean_TEPS 5.08026e+08 7.48633e+08 5.46257e+08 7.45101e+08 mean_time 8.45413 5.73702 7.86245 5.76421 median_TEPS 5.09236e+08 7.25058e+08 5.40525e+08 7.63752e+08 max_TEPS 5.15244e+08 1.03391e+09 8.51317e+08 9.7552e+08 pgpromote_success 0 13809474 13763582 13763155 numa_pte_updates 0 26746117 39502157 36368086 numa_hint_faults 0 13811769 24248272 21172314 ===================================================================================== pghot-default NUMAB3 ===================================================================================== harmonic_mean_TEPS 7.00515e+08 mean_time 6.13109 median_TEPS 7.06813e+08 max_TEPS 7.63164e+08 pgpromote_success 13762087 numa_pte_updates 93632490 numa_hint_faults 70566306 ===================================================================================== - The base case shows a good improvement with NUMAB2 in harmonic_mean_TEPS. - The same improvement gets maintained with pghot-precise too. - pghot-default mode doesn't show benefit even when achieving similar page promotion numbers. This mode doesn't track accessing NID and by default promotes to NID=0 which probably isn't all that beneficial as processes are running on both Node 0 and Node 1. 
- pghot-default recovers the performance when balancing between toptier nodes 0 and 1 is enabled in addition to hot page promotion. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-05 10:41 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao @ 2026-05-09 1:18 ` Andrew Morton 2026-05-11 10:37 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Andrew Morton @ 2026-05-09 1:18 UTC (permalink / raw) To: Bharata B Rao Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On Tue, 5 May 2026 16:11:43 +0530 Bharata B Rao <bharata@amd.com> wrote: > On 04-May-26 11:39 AM, Bharata B Rao wrote: > > Results > > ======= > > Posted as replies to this mail thread. > > Graph500 benchmark results: Please include (and maintain) the testing results in the formal changelogs (perhaps in the [0/N], in a condensed summary form). I mean, the entire point of the whole patchset is to improve performance (yes?), so this contribution lives or dies by its performance testing results. The first thing your audience will want to know is "how good is this for our users". So tell us! Up front, within the first paragraphs! The better the results, the more motivated people will be to help get your work upstream. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
  2026-05-09  1:18   ` Andrew Morton
@ 2026-05-11 10:37     ` Bharata B Rao
  2026-05-11 14:38       ` Gregory Price
  0 siblings, 1 reply; 19+ messages in thread
From: Bharata B Rao @ 2026-05-11 10:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj,
	weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu,
	yiannis, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
	balbirs, alok.rathore, shivankg, donettom

On 09-May-26 6:48 AM, Andrew Morton wrote:
> On Tue, 5 May 2026 16:11:43 +0530 Bharata B Rao <bharata@amd.com> wrote:
>
>> On 04-May-26 11:39 AM, Bharata B Rao wrote:
>>> Results
>>> =======
>>> Posted as replies to this mail thread.
>>
>> Graph500 benchmark results:
>
> Please include (and maintain) the testing results in the formal
> changelogs (perhaps in the [0/N], in a condensed summary form).

The results and associated description were getting too long and hence
I was hesitant to make them part of 0/N. But then, as you say, I shall
include a condensed summary from next time.

> I mean, the entire point of the whole patchset is to improve
> performance (yes?), so this contribution lives or dies by its
> performance testing results.

The entire point of this patchset is not just to improve performance.
It is mainly about adding a new dedicated infrastructure for detecting
and promoting hot pages - about having a subsystem that can act as a
single source of truth for page hotness in the kernel. Though we aren't
there yet, we have started with a minimal infrastructure that
centralizes the hot page promotion and associated heuristics that
currently sit in the scheduler, so that the same can be used with other
page hotness sources as well.

The first source is hint-fault based hot page promotion. Here the
address space scanning and the introduction of hint faults remain as
before, but the promotion engine is part of pghot. Hence the comparison
against base for this source is about matching the current level of
performance and ensuring that workloads don't suffer due to batched
migration.

There are other sources as well, the primary one being the IBS Memory
Profiler, which provides memory access information directly from the
hardware. I have some numbers for this source too. Initial results look
encouraging and more tests can tell us whether this source can stand on
its own or should complement the existing one.

Earlier versions of this patchset had another source - PTE A-bit based
scanning - where the idea was to completely replace the hint fault
mechanism with PTE A-bit based access detection, thereby taking both
the detection and promotion parts out of the process context. I have
temporarily removed this from the patchset for two reasons:

a) to simplify the patchset so that we can get some consensus on the
   infrastructure part first.
b) to explore the commonality with another PTE A-bit scanning approach
   (called klruscand) that used MGLRU's scanning mechanism.

Also on the horizon is using the hot page information that the CXL
Hotness Monitoring Unit (CHMU) can provide.

> The first thing your audience will want to know is "how good is this
> for our users". So tell us! Up front, within the first paragraphs!
>
> The better the results, the more motivated people will be to help get
> your work upstream.

So currently it is a multi-step approach: the first step builds a
common hotness infrastructure and moves the existing mechanism onto it
without any regression, and follow-ups add more sources.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 19+ messages in thread
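[Editorial aside: the "many sources, one promotion engine" idea described above can be sketched in a few lines of userspace Python. This is a toy model only, not the kernel implementation; the name `freq_threshold` mirrors the debugfs knob mentioned elsewhere in the thread, and everything else (class and method names, pfn handling) is hypothetical.]

```python
from collections import defaultdict

class HotPageTracker:
    """Toy model of a centralized hotness tracker: any number of
    sources (hint faults, IBS, PTE A-bit scans, CHMU, ...) report
    accesses, and promotion triggers once a page has been seen
    freq_threshold times, regardless of which source saw it."""

    def __init__(self, freq_threshold=2):
        # mirrors /sys/kernel/debug/pghot/freq_threshold
        self.freq_threshold = freq_threshold
        self.access_count = defaultdict(int)   # pfn -> recorded accesses
        self.promoted = []

    def record_access(self, pfn, nid):
        self.access_count[pfn] += 1
        if self.access_count[pfn] >= self.freq_threshold:
            # in the kernel this would queue the folio for batched
            # migration toward the accessing node (or NID=0 in
            # default mode, per the Graph500 discussion above)
            self.promoted.append((pfn, nid))
            del self.access_count[pfn]

t = HotPageTracker(freq_threshold=2)
t.record_access(0x1000, nid=1)   # first access: below threshold
t.record_access(0x1000, nid=1)   # second access: promoted
print(t.promoted)                # [(4096, 1)]
```

The point of the model is only that the per-source detection mechanisms and the promotion decision are decoupled, which is what lets the hint-fault path, IBS, and future sources share one engine.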
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
  2026-05-11 10:37     ` Bharata B Rao
@ 2026-05-11 14:38       ` Gregory Price
  0 siblings, 0 replies; 19+ messages in thread
From: Gregory Price @ 2026-05-11 14:38 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Andrew Morton, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel,
	rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl,
	xuezhengchu, yiannis, david, byungchul, kinseyho, joshua.hahnjy,
	yuanchu, balbirs, alok.rathore, shivankg, donettom

On Mon, May 11, 2026 at 04:07:16PM +0530, Bharata B Rao wrote:
>
> The entire point of this patchset is not just about improving the performance.
> It is mainly about adding a new dedicated infrastructure for detecting and
> promoting hot pages. It is about having a subsystem that can act as a single
> source of truth page hotness in the kernel. Though we aren't there yet, we have
> started by having a minimal infrastructure that centralizes the hot page
> promotion and associated heuristics that currently sits in scheduler so that the
> same can be used with other page hotness sources as well.
>

The goal of hotness tracking in general is to improve performance. The
goal of PGHot should be a reasonable baseline for the kernel to
course-correct LRU inversions across tiers over time, because LRU
threads only scan individual nodes and don't compare across nodes.

I would caution against trying to wholesale state it "shall be the
single source of truth", as we will inevitably discover some condition
which is not covered / cannot be captured / we will simply get it
wrong.

Plus, intuitively, counter-balancing LRU/MGLRU aging is probably as
good as we can get without having to inject per-workload information
into the system - at which point users should use DAMON.

~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
  [not found] <20260504060924.344313-1-bharata@amd.com>
  ` (4 preceding siblings ...)
  2026-05-05 10:41 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
@ 2026-05-05 13:42 ` Bharata B Rao
  5 siblings, 0 replies; 19+ messages in thread
From: Bharata B Rao @ 2026-05-05 13:42 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom

On 04-May-26 11:39 AM, Bharata B Rao wrote:
> Results
> =======
> Posted as replies to this mail thread.

Initial Graph500 benchmark numbers for the IBS Memory Profiler source:

Test system details
-------------------
3 node AMD system with 2 regular NUMA nodes (0, 1) in NPS2 mode and a
CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-63,128-191
node 0 size: 257715 MB
node 1 cpus: 64-127,192-255
node 2 cpus:
node 1 size: 257845 MB
node 2 size: 258032 MB
node distances:
node    0    1    2
  0:   10   12   50
  1:   12   10   50
  2:  255  255   10

Hotness sources
---------------
NUMAB0  - Without NUMA Balancing in the base case and with no source
          enabled in the pghot case. No migrations occur.
NUMAB2  - Existing hot page promotion for the base case and use of
          hint faults as source in the pghot case.
HWHINTS - IBS Memory Profiler as source for pghot.

pghot by default promotes after two accesses, but for the NUMAB2 and
HWHINTS sources promotion is done after one access to match the base
behaviour. (/sys/kernel/debug/pghot/freq_threshold=1)

Graph500 details
----------------
Command: mpirun -n 128 --bind-to core --map-by core graph500/src/graph500_reference_bfs 28 16

After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that the BFS phase starts accessing
lower tier memory. Total memory usage is slightly over 100GB and fits
within Nodes 0 and 1. Hence there is no memory pressure to induce
demotions.

harmonic_mean_TEPS - Higher is better
=============================================================================
                            Base          Base          pghot-default
                            NUMAB0        NUMAB2        NUMAB2
=============================================================================
harmonic_mean_TEPS          4.09614e+08   1.28401e+09   1.47926e+09
mean_time                   10.4853       3.34492       2.90342
median_TEPS                 4.10086e+08   1.44584e+09   1.85957e+09
max_TEPS                    4.1661e+08    1.79773e+09   1.99242e+09
pgpromote_success           0             13746029      13412213
numa_hint_faults            0             13753808      26669823
pghot_recorded_accesses     NA            NA            26669551
pghot_recorded_hintfaults   NA            NA            26669823
pghot_recorded_hwhints      NA            NA            0
hwhint_total_events         NA            NA            0
=============================================================================

                            pghot-default
                            HWHINTS
=============================================================================
harmonic_mean_TEPS          1.52334e+09
mean_time                   2.81941
median_TEPS                 1.57446e+09
max_TEPS                    1.72014e+09
pgpromote_success           3415599
numa_hint_faults            0
pghot_recorded_accesses     3440912
pghot_recorded_hintfaults   0
pghot_recorded_hwhints      24475210
hwhint_total_events         24475244
=============================================================================

While no migration at all (NUMAB0) hurts Graph500, HWHINTS with pghot is
able to provide similar benchmark numbers even while not migrating as
aggressively as base NUMAB2.

^ permalink raw reply	[flat|nested] 19+ messages in thread
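[Editorial aside: the closing claim about "not migrating as aggressively" can be quantified directly from the `pgpromote_success` rows above. A quick back-of-envelope, using the posted numbers:]

```python
# pgpromote_success, taken from the tables above.
numab2_promoted  = 13412213   # pghot-default NUMAB2
hwhints_promoted = 3415599    # pghot-default HWHINTS

ratio = hwhints_promoted / numab2_promoted
print(f"HWHINTS promoted {ratio:.1%} as many pages as NUMAB2")
# ... while harmonic_mean_TEPS is 1.52e9 vs 1.48e9, i.e. slightly
# better, so roughly a quarter of the migrations bought the same
# benchmark result.
```

This prints a ratio of roughly 25%, which is the sense in which the hardware-hint source achieves comparable TEPS with far fewer page movements.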
end of thread, other threads:[~2026-05-11 14:38 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20260504060924.344313-1-bharata@amd.com>
[not found] ` <20260504060924.344313-3-bharata@amd.com>
2026-05-04 18:14 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Donet Tom
2026-05-06 6:15 ` Bharata B Rao
[not found] ` <20260504060924.344313-5-bharata@amd.com>
2026-05-04 18:41 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Donet Tom
2026-05-06 6:17 ` Bharata B Rao
2026-05-04 20:36 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
2026-05-05 22:17 ` Balbir Singh
2026-05-06 3:43 ` Bharata B Rao
2026-05-06 4:02 ` Balbir Singh
2026-05-06 5:00 ` Bharata B Rao
2026-05-06 15:22 ` Gregory Price
2026-05-11 10:02 ` Bharata B Rao
2026-05-11 14:27 ` Gregory Price
[not found] ` <20260504060924.344313-6-bharata@amd.com>
2026-05-05 4:44 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Donet Tom
2026-05-06 6:20 ` Bharata B Rao
2026-05-05 10:41 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-09 1:18 ` Andrew Morton
2026-05-11 10:37 ` Bharata B Rao
2026-05-11 14:38 ` Gregory Price
2026-05-05 13:42 ` Bharata B Rao