* Re: [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() [not found] ` <20260504060924.344313-3-bharata@amd.com> @ 2026-05-04 18:14 ` Donet Tom 2026-05-06 6:15 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Donet Tom @ 2026-05-04 18:14 UTC (permalink / raw) To: Bharata B Rao, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg Hi Bharata On 5/4/26 11:39 AM, Bharata B Rao wrote: > +int promote_misplaced_memcg_folios(struct list_head *folio_list, int node) > +{ > + struct mem_cgroup *memcg = NULL; > + unsigned int nr_succeeded = 0; > + struct folio *first; > + int nr_remaining; > + > + if (list_empty(folio_list)) > + return 0; > + > + first = list_first_entry(folio_list, struct folio, lru); > +#ifdef CONFIG_DEBUG_VM > + { > + struct folio *f; > + list_for_each_entry(f, folio_list, lru) > + VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first)); It looks like the indentation might be off here. > + } > +#endif > + memcg = get_mem_cgroup_from_folio(first); > + > + nr_remaining = migrate_pages(folio_list, alloc_misplaced_dst_folio, > + NULL, node, MIGRATE_ASYNC, > + MR_NUMA_MISPLACED, &nr_succeeded); > + if (nr_remaining) > + putback_movable_pages(folio_list); > + > + if (nr_succeeded) { > + count_vm_numa_events(NUMA_PAGE_MIGRATE, nr_succeeded); > + count_memcg_events(memcg, NUMA_PAGE_MIGRATE, nr_succeeded); > + mod_lruvec_state(mem_cgroup_lruvec(memcg, NODE_DATA(node)), > + PGPROMOTE_SUCCESS, nr_succeeded); > + } > + > + mem_cgroup_put(memcg); > + WARN_ON(!list_empty(folio_list)); > + return nr_remaining ? -EAGAIN : 0; > +} > #endif /* CONFIG_NUMA_BALANCING */ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() 2026-05-04 18:14 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Donet Tom @ 2026-05-06 6:15 ` Bharata B Rao 0 siblings, 0 replies; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 6:15 UTC (permalink / raw) To: Donet Tom, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg On 04-May-26 11:44 PM, Donet Tom wrote: > Hi Bharata > > On 5/4/26 11:39 AM, Bharata B Rao wrote: >> +int promote_misplaced_memcg_folios(struct list_head *folio_list, int node) >> +{ >> + struct mem_cgroup *memcg = NULL; >> + unsigned int nr_succeeded = 0; >> + struct folio *first; >> + int nr_remaining; >> + >> + if (list_empty(folio_list)) >> + return 0; >> + >> + first = list_first_entry(folio_list, struct folio, lru); >> +#ifdef CONFIG_DEBUG_VM >> + { >> + struct folio *f; >> + list_for_each_entry(f, folio_list, lru) >> + VM_WARN_ON_ONCE(folio_memcg(f) != folio_memcg(first)); > > > It looks like the indentation might be off here. Yeah looks like. Will fix. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
[parent not found: <20260504060924.344313-5-bharata@amd.com>]
* Re: [PATCH v7 4/7] mm: pghot: Precision mode for pghot [not found] ` <20260504060924.344313-5-bharata@amd.com> @ 2026-05-04 18:41 ` Donet Tom 2026-05-06 6:17 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Donet Tom @ 2026-05-04 18:41 UTC (permalink / raw) To: Bharata B Rao, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg Hi Bharata On 5/4/26 11:39 AM, Bharata B Rao wrote: > +#include <linux/pghot.h> > +#include <linux/jiffies.h> > +#include <linux/memory-tiers.h> > + > +bool pghot_nid_valid(int nid) I might be missing something, but since pghot_nid_valid() exists in both pghot-default.c and pghot-precise.c, would it make sense to move it to a header file as a static inline function? -Donet > +{ > + if (nid != NUMA_NO_NODE && > + (!numa_valid_node(nid) || nid > PGHOT_NID_MAX || > + !node_online(nid) || !node_is_toptier(nid))) > + return false; > + > + return true; > +} > + > +unsigned long pghot_access_latency(unsigned long old_time, unsigned long time) > +{ > + return jiffies_to_msecs((time - old_time) & PGHOT_TIME_MASK); > +} > + > +bool pghot_update_record(phi_t *phi, int nid, unsigned long now) > +{ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 4/7] mm: pghot: Precision mode for pghot 2026-05-04 18:41 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Donet Tom @ 2026-05-06 6:17 ` Bharata B Rao 0 siblings, 0 replies; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 6:17 UTC (permalink / raw) To: Donet Tom, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg On 05-May-26 12:11 AM, Donet Tom wrote: > Hi Bharata > > On 5/4/26 11:39 AM, Bharata B Rao wrote: >> +#include <linux/pghot.h> >> +#include <linux/jiffies.h> >> +#include <linux/memory-tiers.h> >> + >> +bool pghot_nid_valid(int nid) > > I might be missing something, but since pghot_nid_valid() exists in both pghot- > default.c and pghot-precise.c, would it make sense to move it to a header file > as a static inline function? It exists in both modes of pghot but the implementations differ. Hence it can't reside as static inline function in pghot.h. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure [not found] <20260504060924.344313-1-bharata@amd.com> [not found] ` <20260504060924.344313-3-bharata@amd.com> [not found] ` <20260504060924.344313-5-bharata@amd.com> @ 2026-05-04 20:36 ` Matthew Wilcox 2026-05-05 22:17 ` Balbir Singh 2026-05-06 15:22 ` Gregory Price [not found] ` <20260504060924.344313-6-bharata@amd.com> ` (2 subsequent siblings) 5 siblings, 2 replies; 19+ messages in thread From: Matthew Wilcox @ 2026-05-04 20:36 UTC (permalink / raw) To: Bharata B Rao Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: > This is v7 of pghot, a hot-page tracking and promotion subsystem. The I continue to think we should not do this. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-04 20:36 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Matthew Wilcox @ 2026-05-05 22:17 ` Balbir Singh 2026-05-06 3:43 ` Bharata B Rao 2026-05-06 15:22 ` Gregory Price 1 sibling, 1 reply; 19+ messages in thread From: Balbir Singh @ 2026-05-05 22:17 UTC (permalink / raw) To: Matthew Wilcox, Bharata B Rao Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, alok.rathore, shivankg, donettom On 5/5/26 06:36, Matthew Wilcox wrote: > On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: >> This is v7 of pghot, a hot-page tracking and promotion subsystem. The > > I continue to think we should not do this. > I am unclear about the benefits of the patchset, I have not tested it or reviewed the latest revision. My big concern was that top-tier might not always be suitable. I see that there are some numbers posted, but I find this weird "After the graph creation, the processes are stopped and data is migrated to CXL node 2 before continuing so that BFS phase starts accessing lower tier memory." Why not allocate everything on CXL node 2? Balbir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-05 22:17 ` Balbir Singh @ 2026-05-06 3:43 ` Bharata B Rao 2026-05-06 4:02 ` Balbir Singh 0 siblings, 1 reply; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 3:43 UTC (permalink / raw) To: Balbir Singh, Matthew Wilcox Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, alok.rathore, shivankg, donettom On 06-May-26 3:47 AM, Balbir Singh wrote: > On 5/5/26 06:36, Matthew Wilcox wrote: >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The >> >> I continue to think we should not do this. >> > > I am unclear about the benefits of the patchset, I have not tested > it or reviewed the latest revision. My big concern was that top-tier > might not always be suitable. So you are saying that we should have a capability to promote accessed pages from the lower tier to another tier that is not classified as top tier? Is that non-top-tier node the one which generates the accesses? > > I see that there are some numbers posted, but I find this weird > "After the graph creation, the processes are stopped and data is migrated > to CXL node 2 before continuing so that BFS phase starts accessing lower > tier memory." Why not allocate everything on CXL node 2? In the ideal scenario, the benefit is to see if any pages that end up on the lower tier get identified as hot and get promoted. That means we need to create an over-committed scenario where the pages get demoted first. I have provided 
The problem with this case is that the base hot page promotion (NUMAB2) hasn't shown any benefit at all with my micro-benchmark - Ref: https://lore.kernel.org/linux-mm/868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com/ Same has been observed with redis-memtier benchmark - https://lore.kernel.org/linux-mm/957f2242-56d4-4bf0-8aeb-9d60fbea8c8c@amd.com/ Instead what I am doing here is to take out demotion from the scenario but still retain the access pattern of the benchmark by pushing out the data to lower tier when the benchmark reaches steady allocation state. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-06 3:43 ` Bharata B Rao @ 2026-05-06 4:02 ` Balbir Singh 2026-05-06 5:00 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Balbir Singh @ 2026-05-06 4:02 UTC (permalink / raw) To: Bharata B Rao, Matthew Wilcox Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, alok.rathore, shivankg, donettom On 5/6/26 13:43, Bharata B Rao wrote: > On 06-May-26 3:47 AM, Balbir Singh wrote: >> On 5/5/26 06:36, Matthew Wilcox wrote: >>> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: >>>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The >>> >>> I continue to think we should not do this. >>> >> >> I am unclear about the benefits of the patchset, I have not tested >> it or reviewed the latest revision. My big concern was that top-tier >> might not always be suitable. > > So you are saying that we should have a capability to promote accessed pages > from lower tier to an other tier that is not classified as top tier? Is that > non-top tier node the one which generates accesses? > Yes, a top tier node could be CPU less for example. >> >> I see that there are some numbers posted, but I find this weird >> "After the graph creation, the processes are stopped and data is migrated >> to CXL node 2 before continuing so that BFS phase starts accessing lower >> tier memory." Why not allocate everything on CXL node 2? > > In the ideal scenario, the benefit is to see if any pages that land up on lower > tier get identified as hot and get promoted. That means we need to create an > over-committed scenario where the pages get demoted first. I have provided Why do the pages need to get demoted? 
Why not allocate them from the lower tier to show that promotion upwards is helpful > numbers from such cases in my previous versions. The problem with this case is > that the base hot page promotion (NUMAB2) hasn't shown any benefit at all with > my micro-benchmark - Ref: > https://lore.kernel.org/linux-mm/868004d8-bb8e-4800-9fdd-ade48e95fe3b@amd.com/ > > Same has been observed with redis-memtier benchmark - > https://lore.kernel.org/linux-mm/957f2242-56d4-4bf0-8aeb-9d60fbea8c8c@amd.com/ > > Instead what I am doing here is to take out demotion from the scenario but still > retain the access pattern of the benchmark by pushing out the data to lower tier > when the benchmark reaches steady allocation state. > Balbir ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-06 4:02 ` Balbir Singh @ 2026-05-06 5:00 ` Bharata B Rao 0 siblings, 0 replies; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 5:00 UTC (permalink / raw) To: Balbir Singh, Matthew Wilcox Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, alok.rathore, shivankg, donettom On 06-May-26 9:32 AM, Balbir Singh wrote: >>> I am unclear about the benefits of the patchset, I have not tested >>> it or reviewed the latest revision. My big concern was that top-tier >>> might not always be suitable. >> >> So you are saying that we should have a capability to promote accessed pages >> from lower tier to an other tier that is not classified as top tier? Is that >> non-top tier node the one which generates accesses? >> > > Yes, a top tier node could be CPU less for example. Currently kmigrated thread in pghot doesn't explicitly prevent promotion to non-toptier nodes. Here is how this works for the two modes of operation in pghot: pghot-default: In this mode, the target NID isn't explicitly tracked and hence kmigrated relies on the user-configurable pghot_target_nid. Though there is a !node_is_toptier(nid) check in the helper routine that populates pghot_target_nid, that can be relaxed if required. pghot-precise: In this mode, the accessing CPU's node is tracked as the target nid and promotion is done to that node. Note that pghot_target_nid isn't used here. Hence I don't see any major issues in this patchset to cover your use case. Let me know if I miss anything here. BTW, does the existing hot page promotion cover the use case you are targeting? 
> >>> >>> I see that there are some numbers posted, but I find this weird >>> "After the graph creation, the processes are stopped and data is migrated >>> to CXL node 2 before continuing so that BFS phase starts accessing lower >>> tier memory." Why not allocate everything on CXL node 2? >> >> In the ideal scenario, the benefit is to see if any pages that land up on lower >> tier get identified as hot and get promoted. That means we need to create an >> over-committed scenario where the pages get demoted first. I have provided > > Why do the pages need to get demoted? Why not allocate them from the lower tier > to show that promotion upwards is helpful As you can see, these are controlled experiments to measure the effectiveness of hot page detection and promotion and the benefits from promotion. It can be done in the way you are suggesting; just that I found it a bit simpler to pause the benchmark, migrate all pages to lower tier memory before the benchmark starts accessing them rather than relying on setting memory policies to achieve the same effect. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-04 20:36 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Matthew Wilcox 2026-05-05 22:17 ` Balbir Singh @ 2026-05-06 15:22 ` Gregory Price 2026-05-11 10:02 ` Bharata B Rao 1 sibling, 1 reply; 19+ messages in thread From: Gregory Price @ 2026-05-06 15:22 UTC (permalink / raw) To: Matthew Wilcox Cc: Bharata B Rao, linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote: > On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: > > This is v7 of pghot, a hot-page tracking and promotion subsystem. The > > I continue to think we should not do this. My only pushback on the general "we should not do this" is that we need something to counter-balance the demotion bit in vmscan.c, and the current implementation (prot_none faults) is rather :[ I think this series needs to greatly limit its complexity and provide some gentle correction for LRU inversions, and I think they're making a decent attempt at that. But then I think local memory expansion on CXL is going pretty swimmingly in our datacenters :], others may not feel the same. ~Gregory ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-06 15:22 ` Gregory Price @ 2026-05-11 10:02 ` Bharata B Rao 2026-05-11 14:27 ` Gregory Price 0 siblings, 1 reply; 19+ messages in thread From: Bharata B Rao @ 2026-05-11 10:02 UTC (permalink / raw) To: Gregory Price, Matthew Wilcox Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On 06-May-26 8:52 PM, Gregory Price wrote: > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote: >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The >> >> I continue to think we should not do this. > > My only pushback on the general "we should not do this" is that we need > something to counter-balance the demotion bit in vmscan.c, and the > current implementation (prot_none faults) is rather :[ So you are saying pghot subsystem currently does hot page detection and promotion only, which is fine. But the current implementation of demotion is not very optimal and hence we should spend effort in fine-tuning demotion first? In this series itself I have shown via benchmark numbers that for over-committed cases (involving both demotion and promotion), the workload isn't really showing real benefit due to demotion and promotion. Are you specifically referring to this problem? > > I think this series needs to greatly limit its complexity and provide > some gentle correction for LRU inversions, and I think they're making a > decent attempt at that. Regarding complexity, I agree that the initial version of this patchset was quite complicated in the way it maintained hot page information. 
But the later versions including this one have greatly reduced the complexity with one byte of hot page information per PFN, atomic updates to hotness data without any locks, per-lowertier kmigrated threads for promotion and reuse of existing hot page promotion engine. Did you have anything else in mind wrt complexity? Can you provide more context about the LRU inversion problem? Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-11 10:02 ` Bharata B Rao @ 2026-05-11 14:27 ` Gregory Price 0 siblings, 0 replies; 19+ messages in thread From: Gregory Price @ 2026-05-11 14:27 UTC (permalink / raw) To: Bharata B Rao Cc: Matthew Wilcox, linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On Mon, May 11, 2026 at 03:32:20PM +0530, Bharata B Rao wrote: > > > On 06-May-26 8:52 PM, Gregory Price wrote: > > On Mon, May 04, 2026 at 09:36:05PM +0100, Matthew Wilcox wrote: > >> On Mon, May 04, 2026 at 11:39:17AM +0530, Bharata B Rao wrote: > >>> This is v7 of pghot, a hot-page tracking and promotion subsystem. The > >> > >> I continue to think we should not do this. > > > > My only pushback on the general "we should not do this" is that we need > > something to counter-balance the demotion bit in vmscan.c, and the > > current implementation (prot_none faults) is rather :[ > > So you are saying pghot subsystem currently does hot page detection and > promotion only, which is fine. But the current implementation of demotion is not > very optimal and hence we should spend effort in fine-tuning demotion first? > I'm saying because of demotion and fallbacks, we need a mechanism to handle promotions. I'm not convinced a hotness will extend to coldness - at least any better than LRU/MGLRU. > In this series itself I have shown via benchmark numbers that for over-committed > cases (involving both demotion and promotion), the workload isn't really showing > real benefit due to demotion and promotion. Are you specifically referring to > this problem? > If over-committed means over-subscribed hot-tier (more hot memory than available top tier memory), then yeah that result is intuitive. 
I haven't pointed to any specific issue, as of yet, still taking time to consider some of the results. > > Can you provide more context about the LRU inversion problem? > I've been tracking some data around shrink_folio_list and alloc_migrate_folio behavior when a low tier node is full. The result is we end up pushing memory from the high tier straight to swap, skipping demotion, resulting in a bunch of file and anon refaults. Hardware: Single Socket, 768GB DRAM, 256GB CXL Expander In this workload, we see swap usage after the full 1TB of memory is utilized, and as a result we see swap spillage. second_chance = second alloc attempt in alloc_migrate_folio succeeds swap_fallback = second chance fails, we swap directly from top tier Sample data: pgdemote_kswapd 333052779 pgdemote_direct 3181480482 pgdemote_second_chance 31017629 pgdemote_swap_fallback 335759535 workingset_refault_anon 30106868 workingset_refault_file 2343035341 (note here: swap fallback is the number of occurrences, while the others are numbers of pages. As a result, the actual number of swapped pages is likely much closer to the pgdemote_direct number) As a result: LRU is just broken on CXL systems, LRU inverts by design. In a sane world we would just see the second tier as an extension of the LRU, but that doesn't necessarily mean we can glean hotness data from it (it's still largely a coldness tracking mechanism). I have patches I haven't RFC'd yet that try to address this, but I need more time to test it. I don't think this is something to address with PGHot. 
--- diff --git a/mm/vmscan.c b/mm/vmscan.c index 112983b42559..ccdd698c5937 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1043,7 +1043,10 @@ struct folio *alloc_migrate_folio(struct folio *src, unsigned long private) mtc->gfp_mask &= ~__GFP_THISNODE; mtc->nmask = allowed_mask; - return alloc_migration_target(src, (unsigned long)mtc); + dst = alloc_migration_target(src, (unsigned long)mtc); + if (dst) + count_vm_events(PGDEMOTE_SECOND_CHANCE, folio_nr_pages(src)); + return dst; } /* @@ -1616,6 +1619,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, /* Folios that could not be demoted are still in @demote_folios */ if (!list_empty(&demote_folios)) { /* Folios which weren't demoted go back on @folio_list */ + if (!sc->proactive) + count_vm_event(PGDEMOTE_SWAP_FALLBACK); list_splice_init(&demote_folios, folio_list); /* ^ permalink raw reply related [flat|nested] 19+ messages in thread
[parent not found: <20260504060924.344313-6-bharata@amd.com>]
* Re: [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot [not found] ` <20260504060924.344313-6-bharata@amd.com> @ 2026-05-05 4:44 ` Donet Tom 2026-05-06 6:20 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Donet Tom @ 2026-05-05 4:44 UTC (permalink / raw) To: Bharata B Rao, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg Hi Bharata On 5/4/26 11:39 AM, Bharata B Rao wrote: > > +/* > + * For memory tiering mode, if there are enough free pages (more than > + * enough watermark defined here) in fast memory node, to take full > + * advantage of fast memory capacity, all recently accessed slow > + * memory pages will be migrated to fast memory node without > + * considering hot threshold. > + */ > +static bool pgdat_free_space_enough(struct pglist_data *pgdat) > +{ > + int z; > + unsigned long enough_wmark; > + > + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, Just a thought—would it be better to use #define for these hardcoded values? -Donet > + pgdat->node_present_pages >> 4); > + for (z = pgdat->nr_zones - 1; z >= 0; z--) { ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot 2026-05-05 4:44 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Donet Tom @ 2026-05-06 6:20 ` Bharata B Rao 0 siblings, 0 replies; 19+ messages in thread From: Bharata B Rao @ 2026-05-06 6:20 UTC (permalink / raw) To: Donet Tom, linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg On 05-May-26 10:14 AM, Donet Tom wrote: > Hi Bharata > > On 5/4/26 11:39 AM, Bharata B Rao wrote: >> +/* >> + * For memory tiering mode, if there are enough free pages (more than >> + * enough watermark defined here) in fast memory node, to take full >> + * advantage of fast memory capacity, all recently accessed slow >> + * memory pages will be migrated to fast memory node without >> + * considering hot threshold. >> + */ >> +static bool pgdat_free_space_enough(struct pglist_data *pgdat) >> +{ >> + int z; >> + unsigned long enough_wmark; >> + >> + enough_wmark = max(1UL * 1024 * 1024 * 1024 >> PAGE_SHIFT, > > Just a thought—would it be better to use #define for these hardcoded values? We could. It was a code movement, hence left it untouched. Regards, Bharata. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure [not found] <20260504060924.344313-1-bharata@amd.com> ` (3 preceding siblings ...) [not found] ` <20260504060924.344313-6-bharata@amd.com> @ 2026-05-05 10:41 ` Bharata B Rao 2026-05-09 1:18 ` Andrew Morton 2026-05-05 13:42 ` Bharata B Rao 5 siblings, 1 reply; 19+ messages in thread From: Bharata B Rao @ 2026-05-05 10:41 UTC (permalink / raw) To: linux-kernel, linux-mm Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On 04-May-26 11:39 AM, Bharata B Rao wrote: > Results > ======= > Posted as replies to this mail thread. Graph500 benchmark results: Test system details ------------------- 3 node AMD Zen5 system with 2 regular NUMA nodes (0, 1) and a CXL node (2) $ numactl -H available: 3 nodes (0-2) node 0 cpus: 0-95,192-287 node 0 size: 128460 MB node 1 cpus: 96-191,288-383 node 1 size: 128893 MB node 2 cpus: node 2 size: 257993 MB node distances: node 0 1 2 0: 10 32 50 1: 32 10 60 2: 255 255 10 Hotness sources --------------- NUMAB0 - Without NUMA Balancing in base case and with no source enabled in the pghot case. No migrations occur. NUMAB2 - Existing hot page promotion for the base case and use of hint faults as source in the pghot case. NUMAB3 - Enabled both regular and tiering mode of NUMA Balancing (kernel.numa_balancing=3) Pghot by default promotes after two accesses but for NUMAB2 source, promotion is done after one access to match the base behaviour. 
(/sys/kernel/debug/pghot/freq_threshold=1) Graph500 details ---------------- Command: mpirun -n 128 --bind-to core --map-by core graph500/src/graph500_reference_bfs 28 16 After the graph creation, the processes are stopped and data is migrated to CXL node 2 before continuing so that BFS phase starts accessing lower tier memory. Total memory usage is slightly over 100GB and will fit within Node 0 and 1. Hence there is no memory pressure to induce demotions. harmonic_mean_TEPS - Higher is better ===================================================================================== Base Base pghot-default pghot-precise NUMAB0 NUMAB2 NUMAB2 NUMAB2 ===================================================================================== harmonic_mean_TEPS 5.08026e+08 7.48633e+08 5.46257e+08 7.45101e+08 mean_time 8.45413 5.73702 7.86245 5.76421 median_TEPS 5.09236e+08 7.25058e+08 5.40525e+08 7.63752e+08 max_TEPS 5.15244e+08 1.03391e+09 8.51317e+08 9.7552e+08 pgpromote_success 0 13809474 13763582 13763155 numa_pte_updates 0 26746117 39502157 36368086 numa_hint_faults 0 13811769 24248272 21172314 ===================================================================================== pghot-default NUMAB3 ===================================================================================== harmonic_mean_TEPS 7.00515e+08 mean_time 6.13109 median_TEPS 7.06813e+08 max_TEPS 7.63164e+08 pgpromote_success 13762087 numa_pte_updates 93632490 numa_hint_faults 70566306 ===================================================================================== - The base case shows a good improvement with NUMAB2 in harmonic_mean_TEPS. - The same improvement gets maintained with pghot-precise too. - pghot-default mode doesn't show benefit even when achieving similar page promotion numbers. This mode doesn't track accessing NID and by default promotes to NID=0 which probably isn't all that beneficial as processes are running on both Node 0 and Node 1. 
- pghot-default recovers the performance when balancing between toptier nodes 0 and 1 is enabled in addition to hot page promotion. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure 2026-05-05 10:41 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao @ 2026-05-09 1:18 ` Andrew Morton 2026-05-11 10:37 ` Bharata B Rao 0 siblings, 1 reply; 19+ messages in thread From: Andrew Morton @ 2026-05-09 1:18 UTC (permalink / raw) To: Bharata B Rao Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu, yiannis, david, byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs, alok.rathore, shivankg, donettom On Tue, 5 May 2026 16:11:43 +0530 Bharata B Rao <bharata@amd.com> wrote: > On 04-May-26 11:39 AM, Bharata B Rao wrote: > > Results > > ======= > > Posted as replies to this mail thread. > > Graph500 benchmark results: Please include (and maintain) the testing results in the formal changelogs (perhaps in the [0/N], in a condensed summary form). I mean, the entire point of the whole patchset is to improve performance (yes?), so this contribution lives or dies by its performance testing results. The first thing your audience will want to know is "how good is this for our users". So tell us! Up front, within the first paragraphs! The better the results, the more motivated people will be to help get your work upstream. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
  2026-05-09  1:18   ` Andrew Morton
@ 2026-05-11 10:37     ` Bharata B Rao
  2026-05-11 14:38       ` Gregory Price
  0 siblings, 1 reply; 19+ messages in thread
From: Bharata B Rao @ 2026-05-11 10:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Jonathan.Cameron, dave.hansen, gourry,
	mgorman, mingo, peterz, raghavendra.kt, riel, rientjes, sj,
	weixugc, willy, ying.huang, ziy, dave, nifan.cxl, xuezhengchu,
	yiannis, david, byungchul, kinseyho, joshua.hahnjy, yuanchu,
	balbirs, alok.rathore, shivankg, donettom

On 09-May-26 6:48 AM, Andrew Morton wrote:
> On Tue, 5 May 2026 16:11:43 +0530 Bharata B Rao <bharata@amd.com> wrote:
>
>> On 04-May-26 11:39 AM, Bharata B Rao wrote:
>>> Results
>>> =======
>>> Posted as replies to this mail thread.
>>
>> Graph500 benchmark results:
>
> Please include (and maintain) the testing results in the formal
> changelogs (perhaps in the [0/N], in a condensed summary form).

The results and associated description were getting too long and hence
I was hesitant to make them part of 0/N. But then, as you say, I shall
include a condensed summary from next time.

> I mean, the entire point of the whole patchset is to improve
> performance (yes?), so this contribution lives or dies by its
> performance testing results.

The entire point of this patchset is not just to improve performance.
It is mainly about adding a new dedicated infrastructure for detecting
and promoting hot pages - about having a subsystem that can act as a
single source of truth for page hotness in the kernel. Though we aren't
there yet, we have started with a minimal infrastructure that
centralizes the hot page promotion and associated heuristics that
currently sit in the scheduler, so that the same can be used with other
page hotness sources as well.

The first source is hint-fault based hot page promotion. Here the
address space scanning and the introduction of hint faults remain as
before, but the promotion engine is part of pghot. Hence the comparison
against base for this source is about matching the current level of
performance and ensuring that workloads don't suffer due to batched
migration.

There are other sources as well, the primary one being the IBS Memory
Profiler, which provides memory access information directly from the
hardware. I have some numbers for this source too. Initial results look
encouraging and more tests can tell us whether this source can stand on
its own or should complement the existing one.

Earlier versions of this patchset had another source - PTE A-bit based
scanning - where the idea was to completely replace the hint fault
mechanism with PTE A-bit based access detection, thereby taking both
the detection and promotion parts out of the process context. I have
temporarily removed this from the patchset for two reasons:

a) to simplify the patchset so that we can get some consensus on the
   infrastructure part first.
b) to explore the commonality with another PTE A-bit scanning approach
   (called klruscand) that used MGLRU's scanning mechanism.

Also on the horizon is using the hot page information that the CXL
Hotness Monitoring Unit (CHMU) can provide.

> The first thing your audience will want to know is "how good is this
> for our users". So tell us! Up front, within the first paragraphs!
>
> The better the results, the more motivated people will be to help get
> your work upstream.

So currently it is a multi-step approach: the first step builds a
common hotness infrastructure and moves the existing mechanism onto it
without any regression, and follow-ups add more sources.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 19+ messages in thread
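[Editorial aside: the "many sources, one promotion engine" idea described above can be sketched in a few lines of userspace Python. This is a toy model only, not the kernel implementation; the name `freq_threshold` mirrors the debugfs knob mentioned elsewhere in the thread, and everything else (class and method names, pfn handling) is hypothetical.]

```python
from collections import defaultdict

class HotPageTracker:
    """Toy model of a centralized hotness tracker: any number of
    sources (hint faults, IBS, PTE A-bit scans, CHMU, ...) report
    accesses, and promotion triggers once a page has been seen
    freq_threshold times, regardless of which source saw it."""

    def __init__(self, freq_threshold=2):
        # mirrors /sys/kernel/debug/pghot/freq_threshold
        self.freq_threshold = freq_threshold
        self.access_count = defaultdict(int)   # pfn -> recorded accesses
        self.promoted = []

    def record_access(self, pfn, nid):
        self.access_count[pfn] += 1
        if self.access_count[pfn] >= self.freq_threshold:
            # in the kernel this would queue the folio for batched
            # migration toward the accessing node (or NID=0 in
            # default mode, per the Graph500 discussion above)
            self.promoted.append((pfn, nid))
            del self.access_count[pfn]

t = HotPageTracker(freq_threshold=2)
t.record_access(0x1000, nid=1)   # first access: below threshold
t.record_access(0x1000, nid=1)   # second access: promoted
print(t.promoted)                # [(4096, 1)]
```

The point of the model is only that the per-source detection mechanisms and the promotion decision are decoupled, which is what lets the hint-fault path, IBS, and future sources share one engine.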
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
  2026-05-11 10:37     ` Bharata B Rao
@ 2026-05-11 14:38       ` Gregory Price
  0 siblings, 0 replies; 19+ messages in thread
From: Gregory Price @ 2026-05-11 14:38 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Andrew Morton, linux-kernel, linux-mm, Jonathan.Cameron,
	dave.hansen, mgorman, mingo, peterz, raghavendra.kt, riel,
	rientjes, sj, weixugc, willy, ying.huang, ziy, dave, nifan.cxl,
	xuezhengchu, yiannis, david, byungchul, kinseyho, joshua.hahnjy,
	yuanchu, balbirs, alok.rathore, shivankg, donettom

On Mon, May 11, 2026 at 04:07:16PM +0530, Bharata B Rao wrote:
>
> The entire point of this patchset is not just about improving the performance.
> It is mainly about adding a new dedicated infrastructure for detecting and
> promoting hot pages. It is about having a subsystem that can act as a single
> source of truth page hotness in the kernel. Though we aren't there yet, we have
> started by having a minimal infrastructure that centralizes the hot page
> promotion and associated heuristics that currently sits in scheduler so that the
> same can be used with other page hotness sources as well.
>

The goal of hotness tracking in general is to improve performance. The
goal of PGHot should be a reasonable baseline for the kernel to
course-correct LRU inversions across tiers over time, because LRU
threads only scan individual nodes and don't compare across nodes.

I would caution against trying to wholesale state it "shall be the
single source of truth", as we will inevitably discover some condition
which is not covered / cannot be captured / we will simply get it
wrong.

Plus, intuitively, counter-balancing LRU/MGLRU aging is probably as
good as we can get without having to inject per-workload information
into the system - at which point users should use DAMON.

~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure
  [not found] <20260504060924.344313-1-bharata@amd.com>
  ` (4 preceding siblings ...)
  2026-05-05 10:41 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
@ 2026-05-05 13:42 ` Bharata B Rao
  5 siblings, 0 replies; 19+ messages in thread
From: Bharata B Rao @ 2026-05-05 13:42 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jonathan.Cameron, dave.hansen, gourry, mgorman, mingo, peterz,
	raghavendra.kt, riel, rientjes, sj, weixugc, willy, ying.huang,
	ziy, dave, nifan.cxl, xuezhengchu, yiannis, akpm, david,
	byungchul, kinseyho, joshua.hahnjy, yuanchu, balbirs,
	alok.rathore, shivankg, donettom

On 04-May-26 11:39 AM, Bharata B Rao wrote:
> Results
> =======
> Posted as replies to this mail thread.

Initial Graph500 benchmark numbers for the IBS Memory Profiler source:

Test system details
-------------------
3 node AMD system with 2 regular NUMA nodes (0, 1) in NPS2 mode and a
CXL node (2)

$ numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0-63,128-191
node 0 size: 257715 MB
node 1 cpus: 64-127,192-255
node 2 cpus:
node 1 size: 257845 MB
node 2 size: 258032 MB
node distances:
node    0    1    2
  0:   10   12   50
  1:   12   10   50
  2:  255  255   10

Hotness sources
---------------
NUMAB0  - Without NUMA Balancing in the base case and with no source
          enabled in the pghot case. No migrations occur.
NUMAB2  - Existing hot page promotion for the base case and use of
          hint faults as source in the pghot case.
HWHINTS - IBS Memory Profiler as source for pghot.

pghot by default promotes after two accesses, but for the NUMAB2 and
HWHINTS sources promotion is done after one access to match the base
behaviour. (/sys/kernel/debug/pghot/freq_threshold=1)

Graph500 details
----------------
Command: mpirun -n 128 --bind-to core --map-by core graph500/src/graph500_reference_bfs 28 16

After the graph creation, the processes are stopped and data is migrated
to CXL node 2 before continuing so that the BFS phase starts accessing
lower tier memory. Total memory usage is slightly over 100GB and fits
within Nodes 0 and 1. Hence there is no memory pressure to induce
demotions.

harmonic_mean_TEPS - Higher is better
=============================================================================
                            Base          Base          pghot-default
                            NUMAB0        NUMAB2        NUMAB2
=============================================================================
harmonic_mean_TEPS          4.09614e+08   1.28401e+09   1.47926e+09
mean_time                   10.4853       3.34492       2.90342
median_TEPS                 4.10086e+08   1.44584e+09   1.85957e+09
max_TEPS                    4.1661e+08    1.79773e+09   1.99242e+09
pgpromote_success           0             13746029      13412213
numa_hint_faults            0             13753808      26669823
pghot_recorded_accesses     NA            NA            26669551
pghot_recorded_hintfaults   NA            NA            26669823
pghot_recorded_hwhints      NA            NA            0
hwhint_total_events         NA            NA            0
=============================================================================

                            pghot-default
                            HWHINTS
=============================================================================
harmonic_mean_TEPS          1.52334e+09
mean_time                   2.81941
median_TEPS                 1.57446e+09
max_TEPS                    1.72014e+09
pgpromote_success           3415599
numa_hint_faults            0
pghot_recorded_accesses     3440912
pghot_recorded_hintfaults   0
pghot_recorded_hwhints      24475210
hwhint_total_events         24475244
=============================================================================

While no migration at all (NUMAB0) hurts Graph500, HWHINTS with pghot is
able to provide similar benchmark numbers even while not migrating as
aggressively as base NUMAB2.

^ permalink raw reply	[flat|nested] 19+ messages in thread
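[Editorial aside: the closing claim about "not migrating as aggressively" can be quantified directly from the `pgpromote_success` rows above. A quick back-of-envelope, using the posted numbers:]

```python
# pgpromote_success, taken from the tables above.
numab2_promoted  = 13412213   # pghot-default NUMAB2
hwhints_promoted = 3415599    # pghot-default HWHINTS

ratio = hwhints_promoted / numab2_promoted
print(f"HWHINTS promoted {ratio:.1%} as many pages as NUMAB2")
# ... while harmonic_mean_TEPS is 1.52e9 vs 1.48e9, i.e. slightly
# better, so roughly a quarter of the migrations bought the same
# benchmark result.
```

This prints a ratio of roughly 25%, which is the sense in which the hardware-hint source achieves comparable TEPS with far fewer page movements.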
end of thread, other threads:[~2026-05-11 14:38 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20260504060924.344313-1-bharata@amd.com>
[not found] ` <20260504060924.344313-3-bharata@amd.com>
2026-05-04 18:14 ` [PATCH v7 2/7] mm: migrate: Add promote_misplaced_memcg_folios() Donet Tom
2026-05-06 6:15 ` Bharata B Rao
[not found] ` <20260504060924.344313-5-bharata@amd.com>
2026-05-04 18:41 ` [PATCH v7 4/7] mm: pghot: Precision mode for pghot Donet Tom
2026-05-06 6:17 ` Bharata B Rao
2026-05-04 20:36 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Matthew Wilcox
2026-05-05 22:17 ` Balbir Singh
2026-05-06 3:43 ` Bharata B Rao
2026-05-06 4:02 ` Balbir Singh
2026-05-06 5:00 ` Bharata B Rao
2026-05-06 15:22 ` Gregory Price
2026-05-11 10:02 ` Bharata B Rao
2026-05-11 14:27 ` Gregory Price
[not found] ` <20260504060924.344313-6-bharata@amd.com>
2026-05-05 4:44 ` [PATCH v7 5/7] mm: sched: move NUMA balancing tiering promotion to pghot Donet Tom
2026-05-06 6:20 ` Bharata B Rao
2026-05-05 10:41 ` [PATCH v7 0/7] mm: Hot page tracking and promotion infrastructure Bharata B Rao
2026-05-09 1:18 ` Andrew Morton
2026-05-11 10:37 ` Bharata B Rao
2026-05-11 14:38 ` Gregory Price
2026-05-05 13:42 ` Bharata B Rao