Subject: Re: [LSF/MM TOPIC] Tiered memory accounting and management
To: Shakeel Butt, Yang Shi
Cc: lsf-pc@lists.linux-foundation.org, Linux MM, Michal Hocko,
 Dan Williams, Dave Hansen, David Rientjes, Wei Xu, Greg Thelen
References: <475cbc62-a430-2c60-34cc-72ea8baebf2c@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Message-ID: <82ffac56-e3fb-2d2d-1601-64130310bfc1@linux.intel.com>
Date: Fri, 18 Jun 2021 15:11:44 -0700
On 6/17/21 11:48 AM, Shakeel Butt wrote:
> Thanks Yang for the CC.
>
> On Tue, Jun 15, 2021 at 5:17 PM Yang Shi wrote:
>>
>> On Mon, Jun 14, 2021 at 2:51 PM Tim Chen wrote:
>>>
>>> From: Tim Chen
>>>
>>> Tiered memory accounting and management
>>> ------------------------------------------------------------
>>> Traditionally, all RAM is DRAM. Some DRAM might be closer/faster
>>> than others, but a byte of media has about the same cost whether it
>>> is close or far. But, with new memory tiers such as High-Bandwidth
>>> Memory or Persistent Memory, there is a choice between
>>> fast/expensive and slow/cheap. However, current memory cgroups
>>> still live in the old model: there is only one set of limits, which
>>> implies that all memory has the same cost. We would like to extend
>>> memory cgroups to comprehend different memory tiers, to give users
>>> a way to choose a mix between fast/expensive and slow/cheap.
>>>
>>> To manage such memory, we will need to account memory usage and
>>> impose limits for each kind of memory.
>>>
>>> A couple of approaches to partitioning memory between the cgroups
>>> have been discussed previously; they are listed below. We would
>>> like to use the LSF/MM session to come to a consensus on the
>>> approach to take.
>>>
>>> 1. Per NUMA node limit and accounting for each cgroup.
>>> We can assign higher limits on better performing memory nodes for
>>> higher priority cgroups.
>>>
>>> There are some loose ends here that warrant further discussion:
>>> (1) A user friendly interface for such limits. Would a
>>> proportional weight for the cgroup that translates to an actual
>>> absolute limit be more suitable?
>>> (2) Memory mis-configurations can occur more easily, as the admin
>>> has a much larger number of limits spread between the cgroups to
>>> manage. Over-restrictive limits can lead to under-utilized, wasted
>>> memory and hurt performance.
>>> (3) OOM behavior when a cgroup hits its limit.
>>>
>
> This (numa based limits) is something I was pushing for but after
> discussing this internally with userspace controller devs, I have to
> back off from this position.
>
> The main feedback I got was that setting one memory limit is already
> complicated and having to set/adjust these many limits would be
> horrifying.
>
>>> 2. Per memory tier limit and accounting for each cgroup.
>>> We can assign higher limits on memories in a better performing
>>> memory tier for higher priority cgroups. I previously
>>> prototyped a soft limit based implementation to demonstrate the
>>> tiered limit idea.
>>>
>>> There are also a number of issues here:
>>> (1) The advantage is we have fewer limits to deal with, simplifying
>>> configuration. However, doubts have been raised by a number of
>>> people on whether we can really properly classify the NUMA nodes
>>> into memory tiers. There could still be significant performance
>>> differences between NUMA nodes even for the same kind of memory.
>>> We will also not have the fine-grained control and flexibility that
>>> comes with a per NUMA node limit.
>>> (2) Will a memory hierarchy defined by promotion/demotion
>>> relationships between memory nodes be a viable approach for
>>> defining memory tiers?
>>>
>>> These issues related to the management of systems with multiple
>>> kinds of memory can be ironed out in this session.
>>
>> Thanks for suggesting this topic. I'm interested in the topic and
>> would like to attend.
>>
>> Other than the above points, I'm wondering whether we shall discuss
>> "Migrate Pages in lieu of discard" as well? Dave Hansen is driving
>> the development and I have been involved in the early development
>> and review, but it seems there are still some open questions
>> according to the latest review feedback.
>>
>> Some other folks may be interested in this topic too, so I CC'ed
>> them on the thread.
>>
>
> At the moment, "personally" I am more inclined towards a passive
> approach to the memcg accounting of memory tiers. By that I mean,
> let's start by providing a 'usage' interface and get more
> production/real-world data to motivate the 'limit' interfaces. (One
> minor reason is that defining the 'limit' interface will force us to
> make the decision on defining tiers, i.e. a NUMA node, a set of NUMA
> nodes, or something else.)

Probably we could first start with accounting the memory used in each
NUMA node for a cgroup and exposing this information to user space. I
think that is useful regardless. There is still the question of
whether we want to define a set of NUMA nodes as a tier and extend the
accounting and management at that memory tier abstraction level.
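
As a concrete strawman for the 'usage' side: cgroup v2 already has a
memory.numa_stat file that breaks a memcg's stats down per NUMA node,
so a first step could build on that. A minimal sketch of a user space
reader (the cgroup name "job1" and the v2 mount point are placeholders
for illustration, not a proposal):

/* Dump per-NUMA-node memcg usage for one cgroup.
 * Assumes cgroup v2 mounted at /sys/fs/cgroup; "job1" is a
 * made-up cgroup name for illustration.
 */
#include <stdio.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/sys/fs/cgroup/job1/memory.numa_stat", "r");

	if (!f) {
		perror("memory.numa_stat");
		return 1;
	}
	/* Each line looks like: "anon N0=<bytes> N1=<bytes> ..." */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}

A per-tier view could then simply aggregate the N<node>= columns over
whatever set of nodes we decide constitutes a tier.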
> IMHO we should focus more on the "aging" of the application memory
> and "migration/balance" between the tiers. I don't think the memory
> reclaim infrastructure is the right place for these operations
> (unevictable pages are ignored, and the ages are not accurate). What
> we need is proactive continuous aging and balancing. We need
> something like, with additions, multi-gen LRUs or DAMON or page idle
> tracking for aging, and a new mechanism for balancing which takes
> ages into account.

Multi-gen LRUs will be pretty useful to expose the page warmth in a
NUMA node and to target the right pages to reclaim for a memcg. We
will also need some way to determine how many pages to target in each
memcg for a reclaim.

> To give a more concrete example: Let's say we have a system with two
> memory tiers and multiple low and high priority jobs. For high
> priority jobs, set the allocation try list from high to low tier and
> for low priority jobs the reverse of that (I am not sure if we can do
> that out of the box with today's kernel). In the background we
> migrate cold memory down the tiers and hot memory in the reverse
> direction.
>
> In this background mechanism we can enforce all the different
> limiting policies, like Yang's original high and low tier
> percentages, or something like "X% of the accesses of high priority
> jobs should be from the high tier".

If I understand correctly, you want the kernel to provide an interface
that exposes performance information like "X% of the accesses of high
priority jobs are from the high tier", plus knobs for user space to
tell the kernel to re-balance pages on a per job class (or cgroup)
basis using this information. The page re-balancing would then be
initiated by user space rather than by the kernel, similar to what Wei
proposed.
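
The mechanism for such a user-space-initiated move half-exists today:
migrate_pages(2) can shift a job's pages between nodes wholesale. What
is missing is the warmth information to decide which pages, and how
many, should move. A rough sketch of the crude version possible now
(the node numbers and the use of libnuma are purely illustrative;
build with -lnuma):

/* Demote one job's pages from a fast node to a slow node.
 * Node 0 (DRAM) and node 2 (PMEM) are made-up placeholders.
 */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(int argc, char **argv)
{
	struct bitmask *from, *to;
	int pid = argc > 1 ? atoi(argv[1]) : 0;

	if (numa_available() < 0)
		return 1;
	from = numa_allocate_nodemask();
	to = numa_allocate_nodemask();
	numa_bitmask_setbit(from, 0);	/* fast tier (placeholder) */
	numa_bitmask_setbit(to, 2);	/* slow tier (placeholder) */

	/* Moves pages regardless of hot/cold; a real policy would
	 * consult age/idle information first. */
	if (numa_migrate_pages(pid, from, to) < 0)
		perror("migrate_pages");

	numa_free_nodemask(from);
	numa_free_nodemask(to);
	return 0;
}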
> Basically I am saying that until we find from production data that
> this background mechanism is not strong enough to enforce passive
> limits, we should delay the decision on limit interfaces.

Implementing hard limits does have a number of rough edges on a
per-node basis. Probably we should first start with doing the proper
accounting and exposing the right performance information.

Tim