Subject: Re: [LSF/MM TOPIC] Tiered memory accounting and management
To: Shakeel Butt, Yang Shi
Cc: lsf-pc@lists.linux-foundation.org, Linux MM, Michal Hocko,
 Dan Williams, Dave Hansen, David Rientjes, Wei Xu, Greg Thelen
References: <475cbc62-a430-2c60-34cc-72ea8baebf2c@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Message-ID: <82ffac56-e3fb-2d2d-1601-64130310bfc1@linux.intel.com>
Date: Fri, 18 Jun 2021 15:11:44 -0700
On 6/17/21 11:48 AM, Shakeel Butt wrote:
> Thanks Yang for the CC.
>
> On Tue, Jun 15, 2021 at 5:17 PM Yang Shi wrote:
>>
>> On Mon, Jun 14, 2021 at 2:51 PM Tim Chen wrote:
>>>
>>> From: Tim Chen
>>>
>>> Tiered memory accounting and management
>>> ------------------------------------------------------------
>>> Traditionally, all RAM is DRAM. Some DRAM might be closer/faster
>>> than others, but a byte of media has about the same cost whether it
>>> is close or far. But, with new memory tiers such as High-Bandwidth
>>> Memory or Persistent Memory, there is a choice between
>>> fast/expensive and slow/cheap. However, current memory cgroups
>>> still live in the old model: there is only one set of limits, which
>>> implies that all memory has the same cost. We would like to extend
>>> memory cgroups to comprehend different memory tiers, to give users
>>> a way to choose a mix between fast/expensive and slow/cheap.
>>>
>>> To manage such memory, we will need to account memory usage and
>>> impose limits for each kind of memory.
>>>
>>> A couple of approaches to partitioning memory between the cgroups
>>> have been discussed previously; they are listed below. We would
>>> like to use the LSF/MM session to come to a consensus on the
>>> approach to take.
>>>
>>> 1. Per NUMA node limit and accounting for each cgroup.
>>> We can assign higher limits on better performing memory nodes for
>>> higher priority cgroups.
>>>
>>> There are some loose ends here that warrant further discussion:
>>> (1) A user friendly interface for such limits. Would a
>>> proportional weight for the cgroup that translates to an actual
>>> absolute limit be more suitable?
>>> (2) Memory mis-configurations can occur more easily, as the admin
>>> has a much larger number of limits spread between the cgroups to
>>> manage. Over-restrictive limits can lead to under-utilized, wasted
>>> memory and hurt performance.
>>> (3) OOM behavior when a cgroup hits its limit.
>>>
>
> This (numa based limits) is something I was pushing for but after
> discussing this internally with userspace controller devs, I have to
> back off from this position.
>
> The main feedback I got was that setting one memory limit is already
> complicated and having to set/adjust these many limits would be
> horrifying.
>
>>> 2. Per memory tier limit and accounting for each cgroup.
>>> We can assign higher limits on memories in a better performing
>>> memory tier for higher priority cgroups. I previously
>>> prototyped a soft limit based implementation to demonstrate the
>>> tiered limit idea.
>>>
>>> There are also a number of issues here:
>>> (1) The advantage is we have fewer limits to deal with, simplifying
>>> configuration. However, doubts have been raised by a number of
>>> people on whether we can really properly classify the NUMA nodes
>>> into memory tiers. There could still be significant performance
>>> differences between NUMA nodes even for the same kind of memory.
>>> We will also not have the fine-grained control and flexibility that
>>> comes with a per NUMA node limit.
>>> (2) Will a memory hierarchy defined by promotion/demotion
>>> relationships between memory nodes be a viable approach for
>>> defining memory tiers?
>>>
>>> These issues related to the management of systems with multiple
>>> kinds of memory can be ironed out in this session.
>>
>> Thanks for suggesting this topic. I'm interested in the topic and
>> would like to attend.
>>
>> Other than the above points, I'm wondering whether we shall discuss
>> "Migrate Pages in lieu of discard" as well? Dave Hansen is driving
>> the development and I have been involved in the early development
>> and review, but it seems there are still some open questions
>> according to the latest review feedback.
>>
>> Some other folks may be interested in this topic too, so I CC'ed
>> them on the thread.
>>
>
> At the moment, "personally" I am more inclined towards a passive
> approach to the memcg accounting of memory tiers. By that I mean,
> let's start by providing a 'usage' interface and get more
> production/real-world data to motivate the 'limit' interfaces. (One
> minor reason is that defining the 'limit' interface will force us to
> make the decision on defining tiers, i.e. a NUMA node, a set of NUMA
> nodes, or something else.)

Probably we could first start with accounting the memory used in each
NUMA node for a cgroup and exposing this information to user space. I
think that is useful regardless. There is still the question of
whether we want to define a set of NUMA nodes as a tier and extend the
accounting and management at that memory tier abstraction level.
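
As a concrete strawman for the 'usage' side: cgroup v2 already has a
memory.numa_stat file that breaks a memcg's stats down per NUMA node,
so a first step could build on that. A minimal sketch of a user space
reader (the cgroup name "job1" and the v2 mount point are placeholders
for illustration, not a proposal):

/* Dump per-NUMA-node memcg usage for one cgroup.
 * Assumes cgroup v2 mounted at /sys/fs/cgroup; "job1" is a
 * made-up cgroup name for illustration.
 */
#include <stdio.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/sys/fs/cgroup/job1/memory.numa_stat", "r");

	if (!f) {
		perror("memory.numa_stat");
		return 1;
	}
	/* Each line looks like: "anon N0=<bytes> N1=<bytes> ..." */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}

A per-tier view could then simply aggregate the N<node>= columns over
whatever set of nodes we decide constitutes a tier.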
> IMHO we should focus more on the "aging" of the application memory
> and "migration/balance" between the tiers. I don't think the memory
> reclaim infrastructure is the right place for these operations
> (unevictable pages are ignored, and the ages are not accurate). What
> we need is proactive continuous aging and balancing. We need
> something like, with additions, multi-gen LRUs or DAMON or page idle
> tracking for aging, and a new mechanism for balancing which takes
> ages into account.

Multi-gen LRUs will be pretty useful to expose the page warmth in a
NUMA node and to target the right pages to reclaim for a memcg. We
will also need some way to determine how many pages to target in each
memcg for a reclaim.

> To give a more concrete example: Let's say we have a system with two
> memory tiers and multiple low and high priority jobs. For high
> priority jobs, set the allocation try list from high to low tier and
> for low priority jobs the reverse of that (I am not sure if we can do
> that out of the box with today's kernel). In the background we
> migrate cold memory down the tiers and hot memory in the reverse
> direction.
>
> In this background mechanism we can enforce all the different
> limiting policies, like Yang's original high and low tier
> percentages, or something like "X% of the accesses of high priority
> jobs should be from the high tier".

If I understand correctly, you want the kernel to provide an interface
that exposes performance information like "X% of the accesses of high
priority jobs are from the high tier", plus knobs for user space to
tell the kernel to re-balance pages on a per job class (or cgroup)
basis using this information. The page re-balancing would then be
initiated by user space rather than by the kernel, similar to what Wei
proposed.
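
The mechanism for such a user-space-initiated move half-exists today:
migrate_pages(2) can shift a job's pages between nodes wholesale. What
is missing is the warmth information to decide which pages, and how
many, should move. A rough sketch of the crude version possible now
(the node numbers and the use of libnuma are purely illustrative;
build with -lnuma):

/* Demote one job's pages from a fast node to a slow node.
 * Node 0 (DRAM) and node 2 (PMEM) are made-up placeholders.
 */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(int argc, char **argv)
{
	struct bitmask *from, *to;
	int pid = argc > 1 ? atoi(argv[1]) : 0;

	if (numa_available() < 0)
		return 1;
	from = numa_allocate_nodemask();
	to = numa_allocate_nodemask();
	numa_bitmask_setbit(from, 0);	/* fast tier (placeholder) */
	numa_bitmask_setbit(to, 2);	/* slow tier (placeholder) */

	/* Moves pages regardless of hot/cold; a real policy would
	 * consult age/idle information first. */
	if (numa_migrate_pages(pid, from, to) < 0)
		perror("migrate_pages");

	numa_free_nodemask(from);
	numa_free_nodemask(to);
	return 0;
}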
> Basically I am saying that until we find from production data that
> this background mechanism is not strong enough to enforce passive
> limits, we should delay the decision on limit interfaces.

Implementing hard limits does have a number of rough edges on a
per-node basis. Probably we should first start with doing the proper
accounting and exposing the right performance information.

Tim