Re: [PATCH v2 0/1] mm: introduce MADV_DEMOTE/MADV_PROMOTE

From: David Hildenbrand <david@redhat.com>
To: Zhangrenze <zhang.renze@h3c.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>
Cc: "arnd@arndb.de" <arnd@arndb.de>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"chris@zankel.net" <chris@zankel.net>,
	"jcmvbkbc@gmail.com" <jcmvbkbc@gmail.com>,
	"James.Bottomley@HansenPartnership.com"
	<James.Bottomley@HansenPartnership.com>,
	"deller@gmx.de" <deller@gmx.de>,
	"linux-parisc@vger.kernel.org" <linux-parisc@vger.kernel.org>,
	"tsbogend@alpha.franken.de" <tsbogend@alpha.franken.de>,
	"rdunlap@infradead.org" <rdunlap@infradead.org>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	"linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>,
	"richard.henderson@linaro.org" <richard.henderson@linaro.org>,
	"ink@jurassic.park.msu.ru" <ink@jurassic.park.msu.ru>,
	"mattst88@gmail.com" <mattst88@gmail.com>,
	"linux-alpha@vger.kernel.org" <linux-alpha@vger.kernel.org>,
	Jiaoxupo <jiaoxupo@h3c.com>, Zhouhaofan <zhou.haofan@h3c.com>
Subject: Re: [PATCH v2 0/1] mm: introduce MADV_DEMOTE/MADV_PROMOTE
Date: Thu, 1 Aug 2024 14:53:15 +0200
Message-ID: <bffe178c-bd97-4945-898e-97ba203f503e@redhat.com>
In-Reply-To: <3a5785661e1b4f3381046aa5e808854c@h3c.com>

On 01.08.24 11:57, Zhangrenze wrote:
>>> Sure, here's the Scalable Tiered Memory Control (STMC)
>>>
>>> **Background**
>>>
>>> In the era when artificial intelligence, big data analytics, and
>>> machine learning have become mainstream research topics and
>>> application scenarios, the demand for high-capacity and high-
>>> bandwidth memory in computers has become increasingly important.
>>> The emergence of CXL (Compute Express Link) provides the
>>> possibility of high-capacity memory. Although CXL TYPE3 devices
>>> can provide large memory capacities, their access speed is lower
>>> than traditional DRAM due to hardware architecture limitations.
>>>
>>> To enjoy the large capacity brought by CXL memory while minimizing
>>> the impact of high latency, Linux has introduced the Tiered Memory
>>> architecture. In the Tiered Memory architecture, CXL memory is
>>> treated as an independent, slower NUMA NODE, while DRAM is
>>> considered as a relatively faster NUMA NODE. Applications allocate
>>> memory from the local node, and Tiered Memory, leveraging memory
>>> reclamation and NUMA Balancing mechanisms, can transparently demote
>>> physical pages not recently accessed by user processes to the slower
>>> CXL NUMA NODE. However, when user processes re-access the demoted
>>> memory, the Tiered Memory mechanism will, based on certain logic,
>>> decide whether to promote the demoted physical pages back to the
>>> fast NUMA NODE. If the promotion is successful, the memory accessed
>>> by the user process will reside in DRAM; otherwise, it will reside in
>>> the CXL NODE. Through the Tiered Memory mechanism, Linux balances
>>> between large memory capacity and latency, striving to maintain an
>>> equilibrium for applications.
>>>
>>> **Problem**
>>> Although Tiered Memory strives to balance between large capacity and
>>> latency, specific scenarios can lead to the following issues:
>>>
>>>     1. In scenarios requiring massive computations, if data is heavily
>>>        stored in CXL slow memory and Tiered Memory cannot promptly
>>>        promote this memory to fast DRAM, it will significantly impact
>>>        program performance.
>>>     2. Similar to the scenario described in point 1, if Tiered Memory
>>>        decides to promote these physical pages to fast DRAM NODE, but
>>>        due to limitations in the DRAM NODE promote ratio, these physical
>>>        pages cannot be promoted. Consequently, the program will keep
>>>        running in slow memory.
>>>     3. After an application finishes computing on a large block of fast
>>>        memory, it may not immediately re-access it. Hence, this memory
>>>        can only wait for the memory reclamation mechanism to demote it.
>>>     4. Similar to the scenario described in point 3, if the demotion
>>>        speed is slow, these cold pages will occupy the promotion
>>>        resources, preventing some eligible slow pages from being
>>>        immediately promoted, severely affecting application efficiency.
>>>
>>> **Solution**
>>> We propose the **Scalable Tiered Memory Control (STMC)** mechanism,
>>> which delegates the authority of promoting and demoting memory to the
>>> application. The principle is simple, as follows:
>>>
>>>     1. When an application is preparing for computation, it can promote
>>>        the memory it needs to use or ensure the memory resides on a fast
>>>        NODE.
>>>     2. When an application will not use the memory shortly, it can
>>>        immediately demote the memory to slow memory, freeing up valuable
>>>        promotion resources.
>>>
>>> The STMC mechanism is implemented through the madvise system call,
>>> providing two new advice options: MADV_DEMOTE and MADV_PROMOTE.
>>> MADV_DEMOTE advises demoting the physical memory to the node where
>>> slow memory resides; this advice only fails if there is no free
>>> physical memory on the slow memory node. MADV_PROMOTE advises
>>> retaining the physical memory in fast memory; this advice only fails
>>> if there are no promotion slots available on the fast memory node.
>>> Benefits brought by STMC include:
>>>
>>>     1. The STMC mechanism is a variant of on-demand memory management
>>>        designed to let applications enjoy fast memory as much as possible,
>>>        while actively demoting to slow memory when not in use, thus
>>>        freeing up promotion slots for the NODE and allowing it to run in
>>>        an optimized Tiered Memory environment.
>>>     2. The STMC mechanism better balances large capacity and latency.
>>>
>>> **Shortcomings of STMC**
>>> The STMC mechanism requires the caller to manage memory demotion and
>>> promotion. If the memory is not promptly demoted after a promotion,
>>> it may cause issues similar to memory leaks.
>> Ehm, that sounds scary. Can you elaborate what's happening here and why
>> it is "similar to memory leaks"?
>>
>>
>> Can you also point out why migrate_pages() is not suitable? I would
>> assume demote/promote is in essence simply migrating memory between nodes.
>>
>> -- 
>> Cheers,
>>
>> David / dhildenb
>>
> 
> Thank you for the response. Below are my points of view. If there are any
> mistakes, I appreciate your understanding:
> 
> 1. In a tiered memory system, fast nodes and slow nodes act as two common
>     memory pools. The system has a certain ratio limit for promotion. For
>     example, a NODE may stipulate that when the available memory is less
>     than 1GB or 1/4 of the node's memory, promotion is prohibited. If we
>     use migrate_pages at this point, it will unrestrictedly promote slow
>     pages to fast memory, which may prevent other processes’ pages that
>     should have been promoted from being promoted. This is what I mean by
>     occupying promotion resources.
> 2. As described in point 1, if we use MADV_PROMOTE to temporarily promote
>     a batch of pages and do not demote them immediately after usage, it
>     will occupy many promotion resources. Other hot pages that need
>     promotion will not be able to get promoted, which will impact the
>     performance of certain processes.

So, you mean, applications can actively consume "fast memory" and 
"steal" it from other applications? I assume that's what you meant with 
"memory leak".

I would really suggest to *not* call this "similar to memory leaks", in 
your own favor ;)

> 3. MADV_DEMOTE and MADV_PROMOTE only rely on madvise, while migrate_pages
>     depends on libnuma.

Well, you can trivially call that system call without libnuma as well ;)
So that shouldn't really make a difference and is rather something that
can be solved in user space.
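
As a rough illustration of that point, the sketch below invokes
migrate_pages(2) through syscall(2) instead of going through libnuma.
The helper names (raw_migrate_pages, migrate_self) are hypothetical and
not taken from the patch or from any library; the node numbers are
whatever the caller considers "slow" and "fast". It assumes a Linux
system whose headers define SYS_migrate_pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw migrate_pages(2) system call. */
static long raw_migrate_pages(pid_t pid, unsigned long maxnode,
                              const unsigned long *old_nodes,
                              const unsigned long *new_nodes)
{
    return syscall(SYS_migrate_pages, pid, maxnode, old_nodes, new_nodes);
}

/*
 * Ask the kernel to move the calling process's pages from node `from`
 * to node `to`. Only nodes that fit into one unsigned long are handled
 * here; anything larger is rejected with EINVAL before the syscall.
 */
static long migrate_self(unsigned int from, unsigned int to)
{
    const unsigned int bits = 8 * sizeof(unsigned long);
    unsigned long old_mask, new_mask;

    if (from >= bits || to >= bits) {
        errno = EINVAL;
        return -1;
    }
    old_mask = 1UL << from;
    new_mask = 1UL << to;

    /*
     * Pass one more than the number of mask bits: the kernel's nodemask
     * parsing uses maxnode - 1 bits, so `bits + 1` covers the whole word
     * without reading past it.
     */
    return raw_migrate_pages(0, bits + 1, &old_mask, &new_mask);
}
```

A caller would use something like migrate_self(1, 0) to pull its pages
from node 1 to node 0, and check errno on a negative return. Note that
migrate_pages(2) acts on a whole process, not on an address range.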

> 4. MADV_DEMOTE and MADV_PROMOTE provide a better balance between capacity
>     and latency. They allow hot pages that need promoting to be promoted
>     smoothly and pages that need demoting to be demoted immediately. This
>     helps tiered memory systems to operate more rationally.

Can you summarize why something similar could not be provided by a
library that builds on existing functionality, such as migrate_pages?
It could easily look at memory stats to decide whether a
promotion/demotion makes sense (your example above with the memory
distribution).
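
A minimal sketch of what such a user-space check might look like,
assuming the per-node statistics exported under
/sys/devices/system/node/node<N>/meminfo. The function names and the
policy itself are hypothetical: the policy is one reading of the
"less than 1GB or 1/4 of the node's memory" example quoted above, not
the kernel's actual promotion logic.

```c
#include <stdio.h>
#include <string.h>

#define KB_PER_GB (1024ULL * 1024ULL)

/* Extract the value (in kB) that follows `key` in per-node meminfo text. */
static int node_meminfo_kb(const char *text, const char *key,
                           unsigned long long *kb)
{
    const char *p = strstr(text, key);

    if (!p)
        return -1;
    p += strlen(key);
    return sscanf(p, "%llu", kb) == 1 ? 0 : -1;
}

/*
 * Hypothetical policy: refuse to promote when the fast node has less
 * than 1 GB free or less than a quarter of its total memory free.
 */
static int promotion_allowed(unsigned long long free_kb,
                             unsigned long long total_kb)
{
    return free_kb >= KB_PER_GB && free_kb >= total_kb / 4;
}

/* Read a node's meminfo into buf; returns bytes read, or -1 on error. */
static long read_node_meminfo(int node, char *buf, size_t len)
{
    char path[64];
    FILE *f;
    size_t n;

    if (len == 0)
        return -1;
    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/meminfo", node);
    f = fopen(path, "r");
    if (!f)
        return -1;
    n = fread(buf, 1, len - 1, f);
    fclose(f);
    buf[n] = '\0';
    return (long)n;
}
```

A library could read the fast node's meminfo, parse "MemTotal:" and
"MemFree:" with node_meminfo_kb(), and only migrate pages when
promotion_allowed() returns non-zero.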

From the patch itself I read

"MADV_DEMOTE can mark a range of memory pages as cold
pages and immediately demote them to slow memory. MADV_PROMOTE can mark
a range of memory pages as hot pages and immediately promote them to
fast memory"

which sounds to me like migrate_pages / MADV_COLD might be able to 
achieve something similar.

What is the most significant thing MADV_DEMOTE|MADV_PROMOTE can do better?

-- 
Cheers,

David / dhildenb
