From: Tim Chen <tim.c.chen@linux.intel.com>
To: Shakeel Butt <shakeelb@google.com>, Yang Shi <shy828301@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org, Linux MM <linux-mm@kvack.org>,
	Michal Hocko <mhocko@suse.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	David Rientjes <rientjes@google.com>, Wei Xu <weixugc@google.com>,
	Greg Thelen <gthelen@google.com>
Subject: Re: [LSF/MM TOPIC] Tiered memory accounting and management
Date: Fri, 18 Jun 2021 15:11:44 -0700	[thread overview]
Message-ID: <82ffac56-e3fb-2d2d-1601-64130310bfc1@linux.intel.com> (raw)
In-Reply-To: <CALvZod5+dCgUwfs3sUt6tPCETMe7jF1++B7AQSOGG4+hOpBXLQ@mail.gmail.com>



On 6/17/21 11:48 AM, Shakeel Butt wrote:
> Thanks Yang for the CC.
> 
> On Tue, Jun 15, 2021 at 5:17 PM Yang Shi <shy828301@gmail.com> wrote:
>>
>> On Mon, Jun 14, 2021 at 2:51 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>>
>>>
>>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>>
>>> Tiered memory accounting and management
>>> ------------------------------------------------------------
>>> Traditionally, all RAM is DRAM.  Some DRAM might be closer/faster
>>> than others, but a byte of media has about the same cost whether it
>>> is close or far.  With new memory tiers such as High-Bandwidth
>>> Memory or Persistent Memory, however, there is a choice between
>>> fast/expensive and slow/cheap.  But the current memory cgroups still
>>> live in the old model: there is only one set of limits, and it
>>> implies that all memory has the same cost.  We would like to extend
>>> memory cgroups to comprehend different memory tiers, to give users a
>>> way to choose a mix between fast/expensive and slow/cheap.
>>>
>>> To manage such memory, we will need to account memory usage and
>>> impose limits for each kind of memory.
>>>
>>> A couple of approaches to partitioning the memory between the
>>> cgroups, listed below, have been discussed previously.  We would
>>> like to use the LSF/MM session to come to a consensus on the
>>> approach to take.
>>>
>>> 1.      Per NUMA node limit and accounting for each cgroup.
>>> We can assign higher limits on the better performing memory nodes to higher priority cgroups.
>>>
>>> There are some loose ends here that warrant further discussions:
>>> (1) A user friendly interface for such limits.  Would a proportional
>>> weight for the cgroup that translates to an actual absolute limit be more suitable?
>>> (2) Memory mis-configurations can occur more easily, as the admin
>>> has a much larger number of limits spread among the
>>> cgroups to manage.  Over-restrictive limits can lead to underutilized,
>>> wasted memory and hurt performance.
>>> (3) OOM behavior when a cgroup hits its limit.
>>>
> 
> This (NUMA based limits) is something I was pushing for, but after
> discussing this internally with userspace controller devs, I have to
> back off from this position.
> 
> The main feedback I got was that setting one memory limit is already
> complicated, and having to set/adjust this many limits would be
> horrifying.
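
To make the number of knobs concrete, here is a minimal user space
sketch of what configuring per node limits could look like.  The
memory.nodeN.max cgroup files are purely hypothetical, invented here
for illustration; no such interface exists in today's kernel.

/* Hypothetical sketch: set per-NUMA-node limits for a cgroup by
 * writing to made-up memory.nodeN.max files.  No such cgroup
 * interface exists today; this only illustrates the shape of the
 * configuration burden.
 */
#include <stdio.h>

static int set_node_limit(const char *cgroup, int node, unsigned long bytes)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/%s/memory.node%d.max", cgroup, node);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lu\n", bytes);
	return fclose(f);
}

int main(void)
{
	/* High priority job: generous DRAM (node 0), modest PMEM (node 1). */
	set_node_limit("high_prio_job", 0, 16UL << 30);
	set_node_limit("high_prio_job", 1, 4UL << 30);

	/* Low priority job: the reverse. */
	set_node_limit("low_prio_job", 0, 2UL << 30);
	set_node_limit("low_prio_job", 1, 16UL << 30);
	return 0;
}

Even with just two nodes and two job classes this is already four
limits to keep consistent, which illustrates the feedback above.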
> 
>>> 2.      Per memory tier limit and accounting for each cgroup.
>>> We can assign higher limits on memory in the better performing
>>> memory tiers to higher priority cgroups.  I previously
>>> prototyped a soft limit based implementation to demonstrate the
>>> tiered limit idea.
>>>
>>> There are also a number of issues here:
>>> (1)     The advantage is that we have fewer limits to deal with, simplifying
>>> configuration.  However, a number of people have raised doubts
>>> about whether we can really properly classify the NUMA
>>> nodes into memory tiers.  There could still be significant performance
>>> differences between NUMA nodes even for the same kind of memory.
>>> We would also not have the fine-grained control and flexibility that come
>>> with a per NUMA node limit.
>>> (2)     Would a memory hierarchy defined by promotion/demotion relationships between
>>> memory nodes be a viable approach for defining memory tiers?
>>>
>>> These issues related to the management of systems with multiple kinds of memory
>>> can be ironed out in this session.
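
On question (2), if the kernel exposed each node's demotion target,
tiers could be derived mechanically by walking the demotion chains:
nodes that nobody demotes into form the top tier, their targets the
next tier, and so on.  A sketch under that assumption follows; the
demotion_target[] table is made up for illustration and stands in for
whatever topology information the kernel would expose.

/* Sketch: derive a tier number per node from demotion relationships.
 * The demotion_target[] table is invented for illustration.
 * Assumes the demotion graph is acyclic.
 */
#include <stdio.h>

#define NR_NODES  4
#define NO_TARGET -1

/* Example: DRAM nodes 0 and 1 demote to PMEM nodes 2 and 3. */
static const int demotion_target[NR_NODES] = { 2, 3, NO_TARGET, NO_TARGET };

int main(void)
{
	int tier[NR_NODES] = { 0 };	/* everyone starts in the top tier */
	int changed = 1;

	/* Push each demotion target at least one tier below its source. */
	while (changed) {
		changed = 0;
		for (int n = 0; n < NR_NODES; n++) {
			int t = demotion_target[n];

			if (t != NO_TARGET && tier[t] < tier[n] + 1) {
				tier[t] = tier[n] + 1;
				changed = 1;
			}
		}
	}
	for (int n = 0; n < NR_NODES; n++)
		printf("node %d -> tier %d\n", n, tier[n]);
	return 0;
}

Note this also exposes the concern in (1): two nodes landing in the
same tier can still differ widely in performance.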
>>
>> Thanks for suggesting this topic. I'm interested in the topic and
>> would like to attend.
>>
>> Other than the above points, I'm wondering whether we should discuss
>> "Migrate Pages in lieu of discard" as well.  Dave Hansen is driving the
>> development and I have been involved in the early development and
>> review, but it seems there are still some open questions according to
>> the latest review feedback.
>>
>> Some other folks may be interested in this topic as well, so I have
>> CC'ed them on the thread.
>>
> 
> At the moment, "personally" I am more inclined towards a passive
> approach to the memcg accounting of memory tiers.  By that I mean:
> let's start by providing a 'usage' interface and get more
> production/real-world data to motivate the 'limit' interfaces.  (One
> minor reason is that defining the 'limit' interface will force us to
> make the decision on defining tiers, i.e. a NUMA node, a set of NUMA
> nodes, or something else.)

We could probably start by accounting the memory used in each
NUMA node for a cgroup and exposing this information to user space.
I think that is useful regardless.
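
cgroup v2 in fact already exposes a per node breakdown of a cgroup's
usage in memory.numa_stat, which could serve as the starting point.
A trivial sketch that dumps it; the cgroup name below is just an
example path.

/* Sketch: dump the existing per-NUMA-node breakdown from cgroup v2's
 * memory.numa_stat.  The cgroup name below is an example.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/high_prio_job/memory.numa_stat", "r");
	char line[1024];

	if (!f) {
		perror("memory.numa_stat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* e.g. "anon N0=1048576 N1=0" */
	fclose(f);
	return 0;
}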

There is still a question of whether we want to define a set of
NUMA nodes as a tier and extend the accounting and management to that
memory tier abstraction level.
 
> 
> IMHO we should focus more on the "aging" of the application memory and
> "migration/balance" between the tiers.  I don't think the memory
> reclaim infrastructure is the right place for these operations
> (unevictable pages are ignored, and the ages are not accurate).  What
> we need is proactive, continuous aging and balancing.  We need
> something like multi-gen LRUs, DAMON, or page idle tracking (with
> additions) for aging, and a new mechanism for balancing which takes
> ages into account.

Multi-gen LRUs will be pretty useful for exposing the page warmth in a
NUMA node and for targeting the right pages to reclaim for a memcg.  We
will also need some way to determine how many pages to target in each
memcg for a reclaim.
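
For the aging piece, the existing page idle tracking interface already
lets user space do a crude scan: write 1s into
/sys/kernel/mm/page_idle/bitmap to mark pages idle, wait, and read the
bits back; pages whose bit is still set were not accessed in the
interval.  A rough sketch (needs root; the PFN is an arbitrary example
and error handling is trimmed):

/* Rough sketch of user space aging via the existing page idle
 * tracking interface (Documentation/admin-guide/mm/idle_page_tracking.rst).
 * Each u64 in the bitmap covers 64 PFNs; a bit still set after the
 * interval means the page was not accessed.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
	uint64_t mask = ~0ULL;		/* mark 64 pages idle */
	uint64_t pfn = 0x100000;	/* arbitrary example PFN */
	off_t off = pfn / 64 * 8;	/* byte offset of that word */

	if (fd < 0) {
		perror("page_idle/bitmap");
		return 1;
	}
	pwrite(fd, &mask, sizeof(mask), off);
	sleep(10);			/* the aging interval */
	pread(fd, &mask, sizeof(mask), off);
	printf("still-idle mask for pfn %#llx: %#llx\n",
	       (unsigned long long)pfn, (unsigned long long)mask);
	close(fd);
	return 0;
}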

> 
> To give a more concrete example: let's say we have a system with two
> memory tiers and multiple low and high priority jobs.  For high
> priority jobs, set the allocation try list from high to low tier, and
> for low priority jobs, the reverse of that (I am not sure if we can do
> that out of the box with today's kernel).  In the background we migrate
> cold memory down the tiers and hot memory in the reverse direction.
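
Today's mempolicy API gets partway there: MPOL_PREFERRED allocates
from a preferred node and falls back to the others under pressure,
though it cannot express a full high-to-low try list across tiers,
which matches the "not out of the box" caveat above.  A sketch using
set_mempolicy(2); treating node 0 as the fast tier is an assumption.

/* Sketch: bias a high priority task toward the fast tier with
 * MPOL_PREFERRED.  The kernel falls back to other nodes when the
 * preferred node is full, but this is a single preferred node, not
 * a full high-to-low try list.  Build with -lnuma.
 */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* prefer node 0 */

	if (set_mempolicy(MPOL_PREFERRED, &nodemask, 8 * sizeof(nodemask)))
		perror("set_mempolicy");
	/* Allocations by this task now try node 0 first. */
	return 0;
}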
> 
> In this background mechanism we can enforce all the different limiting
> policies, like Yang's original high and low tier percentages, or
> something like "X% of the accesses of high priority jobs should be
> from the high tier."

If I understand correctly, you would like the kernel to provide
an interface that exposes performance information like
"X% of the accesses of high priority jobs are from the high tier",
plus knobs for user space to tell the kernel to re-balance pages on
a per job class (or cgroup) basis using this information.
The page re-balancing would be initiated by user space rather than
by the kernel, similar to what Wei proposed.
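
The kernel already has a primitive such a user space balancer could
build on: move_pages(2) migrates chosen pages of a target process to
chosen nodes.  A minimal sketch; node 1 as the slow tier is an
assumption, and the freshly touched page below merely stands in for
one that an aging policy classified as cold.

/* Minimal sketch of user-space-initiated demotion with move_pages(2).
 * In a real balancer the page list would come from an aging
 * mechanism.  Build with -lnuma; assumes 4 KiB pages.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	void *page = aligned_alloc(4096, 4096);
	void *pages[1] = { page };
	int nodes[1] = { 1 };		/* demote to the slow tier */
	int status[1];

	((char *)page)[0] = 1;		/* fault the page in first */
	if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");
	else
		printf("page now on node %d\n", status[0]);
	free(page);
	return 0;
}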
 

> Basically, I am saying that until we find from production data
> that this background mechanism is not strong enough to enforce passive
> limits, we should delay the decision on limit interfaces.
>

Implementing hard limits on a per node basis does have a number of
rough edges.  We should probably first start with doing the proper
accounting and exposing the right performance information.


Tim


