From: David Hildenbrand <david@redhat.com>
To: Zhangrenze <zhang.renze@h3c.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>
Cc: "arnd@arndb.de" <arnd@arndb.de>,
"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
"chris@zankel.net" <chris@zankel.net>,
"jcmvbkbc@gmail.com" <jcmvbkbc@gmail.com>,
"James.Bottomley@HansenPartnership.com"
<James.Bottomley@HansenPartnership.com>,
"deller@gmx.de" <deller@gmx.de>,
"linux-parisc@vger.kernel.org" <linux-parisc@vger.kernel.org>,
"tsbogend@alpha.franken.de" <tsbogend@alpha.franken.de>,
"rdunlap@infradead.org" <rdunlap@infradead.org>,
"bhelgaas@google.com" <bhelgaas@google.com>,
"linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>,
"richard.henderson@linaro.org" <richard.henderson@linaro.org>,
"ink@jurassic.park.msu.ru" <ink@jurassic.park.msu.ru>,
"mattst88@gmail.com" <mattst88@gmail.com>,
"linux-alpha@vger.kernel.org" <linux-alpha@vger.kernel.org>,
Jiaoxupo <jiaoxupo@h3c.com>, Zhouhaofan <zhou.haofan@h3c.com>
Subject: Re: [PATCH v2 0/1] mm: introduce MADV_DEMOTE/MADV_PROMOTE
Date: Thu, 1 Aug 2024 14:53:15 +0200
Message-ID: <bffe178c-bd97-4945-898e-97ba203f503e@redhat.com>
In-Reply-To: <3a5785661e1b4f3381046aa5e808854c@h3c.com>

On 01.08.24 11:57, Zhangrenze wrote:
>>> Sure, here's the Scalable Tiered Memory Control (STMC)
>>>
>>> **Background**
>>>
>>> In the era when artificial intelligence, big data analytics, and
>>> machine learning have become mainstream research topics and
>>> application scenarios, the demand for high-capacity and high-
>>> bandwidth memory in computers has become increasingly important.
>>> The emergence of CXL (Compute Express Link) provides the
>>> possibility of high-capacity memory. Although CXL TYPE3 devices
>>> can provide large memory capacities, their access speed is lower
>>> than traditional DRAM due to hardware architecture limitations.
>>>
>>> To enjoy the large capacity brought by CXL memory while minimizing
>>> the impact of high latency, Linux has introduced the Tiered Memory
>>> architecture. In the Tiered Memory architecture, CXL memory is
>>> treated as an independent, slower NUMA NODE, while DRAM is
>>> considered as a relatively faster NUMA NODE. Applications allocate
>>> memory from the local node, and Tiered Memory, leveraging memory
>>> reclamation and NUMA Balancing mechanisms, can transparently demote
>>> physical pages not recently accessed by user processes to the slower
>>> CXL NUMA NODE. However, when user processes re-access the demoted
>>> memory, the Tiered Memory mechanism will, based on certain logic,
>>> decide whether to promote the demoted physical pages back to the
>>> fast NUMA NODE. If the promotion is successful, the memory accessed
>>> by the user process will reside in DRAM; otherwise, it will reside in
>>> the CXL NODE. Through the Tiered Memory mechanism, Linux balances
>>> between large memory capacity and latency, striving to maintain an
>>> equilibrium for applications.
>>>
>>> **Problem**
>>> Although Tiered Memory strives to balance between large capacity and
>>> latency, specific scenarios can lead to the following issues:
>>>
>>> 1. In scenarios requiring massive computations, if data is heavily
>>> stored in CXL slow memory and Tiered Memory cannot promptly
>>> promote this memory to fast DRAM, it will significantly impact
>>> program performance.
>>> 2. Similar to the scenario described in point 1, Tiered Memory may
>>> decide to promote these physical pages to the fast DRAM NODE but,
>>> due to limitations in the DRAM NODE promote ratio, be unable to
>>> do so. Consequently, the program will keep running in slow
>>> memory.
>>> 3. After an application finishes computing on a large block of fast
>>> memory, it may not immediately re-access it. Hence, this memory
>>> can only wait for the memory reclamation mechanism to demote it.
>>> 4. Similar to the scenario described in point 3, if the demotion
>>> speed is slow, these cold pages will occupy the promotion
>>> resources, preventing some eligible slow pages from being
>>> immediately promoted, severely affecting application efficiency.
>>>
>>> **Solution**
>>> We propose the **Scalable Tiered Memory Control (STMC)** mechanism,
>>> which delegates the authority of promoting and demoting memory to the
>>> application. The principle is simple, as follows:
>>>
>>> 1. When an application is preparing for computation, it can promote
>>> the memory it needs to use or ensure the memory resides on a fast
>>> NODE.
>>> 2. When an application will not use the memory shortly, it can
>>> immediately demote the memory to slow memory, freeing up valuable
>>> promotion resources.
>>>
>>> The STMC mechanism is implemented through the madvise system call,
>>> providing two new advice options: MADV_DEMOTE and MADV_PROMOTE.
>>> MADV_DEMOTE advises demoting the physical memory to the node where
>>> slow memory resides; this advice only fails if there is no free
>>> physical memory on the slow memory node. MADV_PROMOTE advises
>>> retaining the physical memory in the fast memory; this advice only
>>> fails if there are no promotion slots available on the fast memory
>>> node. Benefits brought by STMC include:
>>>
>>> 1. The STMC mechanism is a variant of on-demand memory management
>>> designed to let applications enjoy fast memory as much as possible,
>>> while actively demoting to slow memory when not in use, thus
>>> freeing up promotion slots for the NODE and allowing it to run in
>>> an optimized Tiered Memory environment.
>>> 2. The STMC mechanism better balances large capacity and latency.
>>>
>>> **Shortcomings of STMC**
>>> The STMC mechanism requires the caller to manage memory demotion and
>>> promotion. If the memory is not promptly demoted after a promotion,
>>> it may cause issues similar to memory leaks.
>> Ehm, that sounds scary. Can you elaborate what's happening here and why
>> it is "similar to memory leaks"?
>>
>>
>> Can you also point out why migrate_pages() is not suitable? I would
>> assume demote/promote is in essence simply migrating memory between nodes.
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>
> Thank you for the response. Below are my points of view. If there are any
> mistakes, I appreciate your understanding:
>
> 1. In a tiered memory system, fast nodes and slow nodes act as two common
> memory pools. The system has a certain ratio limit for promotion. For
> example, a NODE may stipulate that when the available memory is less
> than 1GB or 1/4 of the node's memory, promotions are prohibited. If we
> use migrate_pages at this point, it will unrestrictedly promote slow
> pages to fast memory, which may prevent other processes’ pages that
> should have been promoted from being promoted. This is what I mean by
> occupying promotion resources.
> 2. As described in point 1, if we use MADV_PROMOTE to temporarily promote
> a batch of pages and do not demote them immediately after usage, it
> will occupy many promotion resources. Other hot pages that need promotion
> will not be promoted, which will impact the performance of
> certain processes.

So, you mean, applications can actively consume "fast memory" and
"steal" it from other applications? I assume that's what you meant by
"memory leak".

I would really suggest *not* calling this "similar to memory leaks", in
your own favor ;)

> 3. MADV_DEMOTE and MADV_PROMOTE only rely on madvise, while migrate_pages
> depends on libnuma.

Well, you can trivially call that system call also without libnuma ;) So
that shouldn't really make a difference and is rather something that can
be solved in user space.

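To make the libnuma point concrete, here is a minimal sketch of invoking migrate_pages(2) directly through syscall(2), assuming Linux and a libc that defines SYS_migrate_pages. The helper names (node_mask_set, migrate_between_nodes) and the 128-node cap are made up for illustration and are not part of any existing library; moving pages of another process may also require extra privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Number of bits in one unsigned long word of a node mask. */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
/* Hypothetical cap for this sketch: node IDs 0..127. */
#define SKETCH_MAX_NODES 128
#define MASK_WORDS (SKETCH_MAX_NODES / BITS_PER_WORD)

/* Set the bit for `node` in a node mask of MASK_WORDS words. */
void node_mask_set(unsigned long *mask, unsigned node)
{
    assert(mask != NULL && node < SKETCH_MAX_NODES);
    mask[node / BITS_PER_WORD] |= 1UL << (node % BITS_PER_WORD);
}

/*
 * Ask the kernel to move the pages of `pid` (0 = calling process) that
 * currently sit on `from_node` over to `to_node`, without libnuma.
 * Returns the number of pages that could not be moved, or -1 with
 * errno set (for example when the call is not permitted or not built in).
 */
long migrate_between_nodes(pid_t pid, unsigned from_node, unsigned to_node)
{
    unsigned long old_nodes[MASK_WORDS] = {0};
    unsigned long new_nodes[MASK_WORDS] = {0};

    node_mask_set(old_nodes, from_node);
    node_mask_set(new_nodes, to_node);

    /* maxnode is passed as "number of bits + 1", as libnuma does. */
    return syscall(SYS_migrate_pages, pid,
                   (unsigned long)SKETCH_MAX_NODES + 1,
                   old_nodes, new_nodes);
}
```

A caller would use something like migrate_between_nodes(0, 1, 0) to pull its own pages from node 1 to node 0; which node is "fast" or "slow" is system-specific and has to come from elsewhere.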
> 4. MADV_DEMOTE and MADV_PROMOTE provide a better balance between capacity
> and latency. They allow hot pages that need promoting to be promoted
> smoothly and pages that need demoting to be demoted immediately. This
> helps tiered memory systems to operate more rationally.

Can you summarize why something similar could not be provided by a
library that builds upon existing functionality, such as migrate_pages?
It could easily take a look at memory stats to reason whether a
promotion/demotion makes sense (your example above with the memory
distribution).

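As a rough sketch of what such a user-space check could look like: the code below reads MemTotal and MemFree from the per-node sysfs file /sys/devices/system/node/node<N>/meminfo and applies the example rule quoted above (no promotion when free memory is below 1GB or below 1/4 of the node). The function names and the thresholds are assumptions for illustration only; they are not existing kernel or library policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative threshold from the example in this thread: 1 GB, in kB. */
#define MIN_FREE_KB (1024ULL * 1024ULL)

/* Parse "Node <n> MemTotal: <x> kB" and "Node <n> MemFree: <x> kB" lines. */
bool parse_node_meminfo(const char *text, unsigned long long *total_kb,
                        unsigned long long *free_kb)
{
    bool have_total = false, have_free = false;
    const char *line = text;

    assert(text != NULL && total_kb != NULL && free_kb != NULL);
    while (line != NULL && *line != '\0') {
        char key[64];
        unsigned long long val;

        if (sscanf(line, "Node %*d %63[^:]: %llu", key, &val) == 2) {
            if (strcmp(key, "MemTotal") == 0) {
                *total_kb = val;
                have_total = true;
            } else if (strcmp(key, "MemFree") == 0) {
                *free_kb = val;
                have_free = true;
            }
        }
        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (line != NULL)
            line++;
    }
    return have_total && have_free;
}

/* Read the per-node statistics for `node` from sysfs. */
bool read_node_meminfo(int node, unsigned long long *total_kb,
                       unsigned long long *free_kb)
{
    char path[64], buf[8192];
    size_t n;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/meminfo", node);
    f = fopen(path, "r");
    if (f == NULL)
        return false;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return parse_node_meminfo(buf, total_kb, free_kb);
}

/* Example rule: allow promotion only with >= 1 GB and >= 1/4 of the node free. */
bool promotion_allowed(unsigned long long total_kb, unsigned long long free_kb)
{
    return free_kb >= MIN_FREE_KB && free_kb >= total_kb / 4;
}
```

A library could call read_node_meminfo() for the fast node and only issue the migration when promotion_allowed() says so, which keeps the policy entirely in user space.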

From the patch itself I read

"MADV_DEMOTE can mark a range of memory pages as cold
pages and immediately demote them to slow memory. MADV_PROMOTE can mark
a range of memory pages as hot pages and immediately promote them to
fast memory"

which sounds to me like migrate_pages / MADV_COLD might be able to
achieve something similar.

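For reference, the MADV_COLD half of that is already a plain madvise(2) call today. Below is a minimal sketch, assuming Linux 5.4 or newer; mark_range_cold and demo_cold_hint are hypothetical helper names. The hint is non-destructive, and whether and when the hinted pages are actually demoted depends on the kernel's reclaim and demotion configuration, which this sketch does not control.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20 /* value from the generic Linux uapi header */
#endif

/*
 * Hint that [addr, addr + len) is not expected to be accessed soon.
 * Returns 0 on success, -1 with errno set (e.g. EINVAL on kernels
 * that do not know MADV_COLD or for unaligned addresses).
 */
int mark_range_cold(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    return madvise(addr, len, MADV_COLD);
}

/*
 * Demo: map four anonymous pages, fill them, hint them cold, and check
 * that the contents are still there. Returns 0 if the data is intact.
 */
int demo_cold_hint(void)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    size_t len;
    int intact;

    if (page <= 0)
        return -1;
    len = 4 * (size_t)page;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5a, len);
    (void)mark_range_cold(p, len);   /* failure is tolerated in this demo */
    intact = (p[0] == 0x5a && p[len - 1] == 0x5a);
    munmap(p, len);
    return intact ? 0 : -1;
}
```

The point of the demo is only that the existing hint can be issued per range from the application, the same granularity the proposed MADV_DEMOTE would operate on.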

What's the biggest difference, i.e., what can MADV_DEMOTE|MADV_PROMOTE
do better?

--
Cheers,
David / dhildenb