* [Linux Memory Hotness and Promotion] Notes from April 23, 2026
From: David Rientjes @ 2026-04-25 22:10 UTC (permalink / raw)
To: Davidlohr Bueso, Fan Ni, Gregory Price, Jonathan Cameron,
Joshua Hahn, Raghavendra K T, Rao, Bharata Bhasker, SeongJae Park,
Wei Xu, Xuezheng Chu, Yiannis Nikolakopoulos, Zi Yan
Cc: linux-mm
Hi everybody,
Here are the notes from the last Linux Memory Hotness and Promotion call
that happened on Thursday, April 23. Thanks to everybody who was
involved!
These notes are intended to bring people up to speed who could not attend
the call as well as keep the conversation going in between meetings.
----->o-----
Bharata updated offline that he has a working version of the IBS memory
profiler driver, which acts as a page hotness source for pghot. It is
currently going through review. He should be able to post v7 of pghot,
with NUMAB2 hint faults and the IBS memory profiler as sources of hotness
information, by the end of April.
----->o-----
Shivank updated on the status of his patch series for page migration
hardware assist. He will be posting v5 of that series on Monday.
Functionally this is working and he also tested this with memory
compaction. His slides are attached to the cover letter of the meeting.
For an example, consider a fully fragmented 250GB node that is 50% free
(every other 4KB page is in use), so hugepage allocation is blocked. He
pinned compaction and a cpu-hog to the same core so that they compete for
cpu cycles. Allocating 16384 hugepages triggered ~4.6M page migrations
through compaction:
                Time    Pages migrated   Hog iters   Sys%    User%
  Baseline      33.3s   4.6M             62.3B       49.8%   49.9%
  DMA offload   32.5s   4.6M             66.0B       43.0%   56.3%
DMA offload frees up cpu time during compaction, so ~6% more work was done
by the application. This is also discussed in the upstream patch
series[1].
I asked if this is direct compaction coming from the page allocator
slowpath and Shivank clarified that is correct. This example was using DMA
offload on Zen3 with a migration batch size of 32 pages. Shivank noted
that the workload is still stalled in the page allocator while this
migration is happening, so the benefit here is purely the speed-up
achieved by DMA offload.
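For reference, the migration and compaction activity behind numbers like the
~4.6M page migrations above is visible in /proc/vmstat. This is a minimal
read-only sketch for sampling those counters, not Shivank's actual test
harness; diffing a sample taken before and after an allocation run gives
per-run figures.

```shell
# Sketch (assumes Linux with /proc/vmstat): dump the compaction- and
# migration-related counters. Sample before and after an allocation run
# and subtract to get per-run numbers.
grep -E '^(pgmigrate_success|pgmigrate_fail|compact_stall|compact_success|compact_fail|compact_migrate_scanned|compact_free_scanned)' /proc/vmstat
```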
Jonathan asked if there were results for more real-world scenarios
involving memory fragmentation. Shivank noted this is the extent of the
deterministic data that he has. Jonathan opined that it may actually show
better results with more realistic scenarios.
----->o-----
Joshua discussed his latest update for tier-aware memcg limits. The
graphs of data that he presented at the meeting demonstrate throughput
differences between three noisy neighbor memory hogs and a victim
workload. On a 1TB machine (750GB DRAM, 250GB CXL), each workload takes
up 220GB of memory. The three hogs are launched first and allocate all of
their memory; only once they are done allocating does the victim workload
start allocating. Once the victim's memory is allocated, all workloads
start accessing their memory and we measure how many reads each can
perform. The three setups presented are (1) random access, (2) 60-40
hot/cold region accesses, and (3) 90-10 hot/cold region accesses. He
tested with both NUMAB2 and NUMAB0.
In all of the experiments, tiered memcg limits provide a tighter band of
throughput. Monitoring memory.numa_stat and looking at anonymous memory
usage, in the non-tiered setup only the victim workload uses CXL memory;
in a tiered setup, everybody uses the same amount of DRAM and CXL. Joshua
noted that the difference between NUMAB2 and NUMAB0 is also interesting:
it seems NUMAB2 is actively harmful to the system under these scenarios,
since it fights against the promotion/demotion caused by tiered limits.
He's planning on sending out a new RFC later today. His slides are
attached to the cover letter for the meeting.
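The memory.numa_stat monitoring described above can be done per cgroup; the
file breaks each memory stat down per NUMA node, which is how DRAM vs CXL
placement can be compared per workload. A minimal sketch, assuming a cgroup
v2 mount at /sys/fs/cgroup and a hypothetical cgroup named "victim":

```shell
# Sketch only: "victim" is a hypothetical cgroup name, and the path
# assumes cgroup v2. The "anon" line reports anonymous memory per NUMA
# node, e.g. "anon N0=... N1=...".
stat=/sys/fs/cgroup/victim/memory.numa_stat
if [ -r "$stat" ]; then
    grep '^anon ' "$stat"
else
    echo "no such cgroup: $stat"
fi
```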
Yiannis asked if all the demotions in this scenario come from the LRU and
whether there were no promotions. Joshua confirmed this is the case: reads
are served directly from CXL without promotions.
We discussed the design and implementation of NUMAB2 and Joshua made the
observation that it is unaware of memcg so it is trying to do what is in
the best interest of the system overall, which may be why it is fighting
with the memcg tier aware limits.
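The NUMAB0/NUMAB2 modes discussed here correspond to values of the
kernel.numa_balancing sysctl. A minimal sketch for checking and switching
modes between runs (the mode meanings follow the kernel's sysctl
documentation; kernels built without CONFIG_NUMA_BALANCING lack the knob):

```shell
# kernel.numa_balancing values:
#   0 = NUMAB0, balancing disabled
#   1 = normal NUMA balancing
#   2 = NUMAB2, NUMA_BALANCING_MEMORY_TIERING (hot-page promotion
#       from slow memory tiers)
knob=/proc/sys/kernel/numa_balancing
if [ -r "$knob" ]; then
    echo "current mode: $(cat "$knob")"
else
    echo "kernel built without NUMA balancing"
fi
# To switch modes (needs root):  echo 2 > /proc/sys/kernel/numa_balancing
```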
----->o-----
NOTE!!! The next meeting will be canceled due to LSF/MM/BPF 2026.
Next meeting will be on Thursday, May 21 at 8:30am PDT (UTC-7), everybody
is welcome: https://meet.google.com/jak-ytdx-hnm
Topics for the next meeting:
- debrief discussions from LSF/MM/BPF 2026
- v7 of Bharata's patch series, including new IBS hotness information
and NUMAB2 hint faults
- v5 of Shivank's series for enlightening migrate_pages() for hardware
assists and how this work will be charged to userspace, including for
memory compaction
- v2 of tier-aware memcg limits, including new page counters and rework
to pass folios into the charge path
- Yiannis's patch series for non-temporal stores support
- discuss generalized subsystem for providing bandwidth information
independent of the underlying platform, ideally through resctrl,
otherwise utilizing bandwidth information will be challenging
+ preferably this bandwidth monitoring is not per NUMA node but rather
per slow and fast memory tier
- later: testing of tier-aware memcg limits with Bharata's changes once
tier-aware memcg limits are stable and further along
Please let me know if you'd like to propose additional topics for
discussion, thank you!
[1]
https://lore.kernel.org/linux-mm/a69f463c-0ee3-492c-8505-710d757a1f21@amd.com/