From: Jonathan Cameron <Jonathan.Cameron-aYUidmrrA3LQT0dZR+AlfA@public.gmane.org>
To: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Tim Chen <tim.c.chen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>,
Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>,
Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>,
Andrew Morton
<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
Dave Hansen <dave.hansen-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
Ying Huang <ying.huang-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
Dan Williams
<dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
Linux MM <linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>,
Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
Wei Xu <weixugc-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory
Date: Wed, 14 Apr 2021 09:59:58 +0100
Message-ID: <20210414095958.000008c4@Huawei.com>
In-Reply-To: <CALvZod4zXB6-3Mshu_TnTsQaDErfYkPTw9REYNRptSvPSRmKVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
On Mon, 12 Apr 2021 12:20:22 -0700
Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Apr 9, 2021 at 4:26 PM Tim Chen <tim.c.chen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> wrote:
> >
> >
> > On 4/8/21 4:52 AM, Michal Hocko wrote:
> >
> > >> The top tier memory used is reported in
> > >>
> > >> memory.toptier_usage_in_bytes
> > >>
> > >> The amount of top tier memory usable by each cgroup without
> > >> triggering page reclaim is controlled by the
> > >>
> > >> memory.toptier_soft_limit_in_bytes
> > >
> >
> > Michal,
> >
> > Thanks for your comments. I would like to take a step back and
> > look at the eventual goal we envision: a mechanism to partition the
> > tiered memory between the cgroups.
> >
> > A typical use case may be a system with two sets of tasks.
> > One set of tasks is very latency sensitive and we desire instantaneous
> > response from them. Another set of tasks will be running batch jobs
> > where latency and performance are not critical. In this case,
> > we want to carve out enough top tier memory such that the working set
> > of the latency sensitive tasks can fit entirely in the top tier memory.
> > The rest of the top tier memory can be assigned to the background tasks.
> >
> > To achieve such cgroup based tiered memory management, we probably want
> > something like the following.
> >
> > For generalization, let's say that there are N tiers of memory t_0, t_1 ... t_N-1,
> > where tier t_0 sits at the top and demotes to the lower tiers.
> > We envision for this top tier memory t_0 the following knobs and counters
> > in the cgroup memory controller:
> >
> > memory_t0.current   Current usage of tier 0 memory by the cgroup.
> >
> > memory_t0.min       If tier 0 memory used by the cgroup falls below this low
> >                     boundary, the memory will not be subjected to demotion
> >                     to lower tiers to free up memory at tier 0.
> >
> > memory_t0.low       Above this boundary, the tier 0 memory will be subjected
> >                     to demotion. The demotion pressure will be proportional
> >                     to the overage.
> >
> > memory_t0.high      If tier 0 memory used by the cgroup exceeds this high
> >                     boundary, allocation of tier 0 memory by the cgroup will
> >                     be throttled. The tier 0 memory used by this cgroup
> >                     will also be subjected to heavy demotion.
> >
> > memory_t0.max       This will be a hard usage limit of tier 0 memory on the cgroup.
> >
> > If needed, memory_t[12...].current/min/low/high for additional tiers can be added.
> > This closely follows the design of the general memory controller interface.
> >
> > Would such an interface look sane and acceptable to everyone?
> >
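To make the proposed boundaries concrete, here is a minimal Python sketch of
how a cgroup's tier 0 usage might be classified against them. The class and
function names are hypothetical and exist only for illustration; they are not
part of the patch series. The proposal does not spell out what happens between
min and low, so the sketch only labels that range.

```python
from dataclasses import dataclass

GiB = 1 << 30


@dataclass
class Tier0Limits:
    """Hypothetical per-cgroup tier 0 boundaries in bytes (min <= low <= high <= max)."""
    min: int
    low: int
    high: int
    max: int


def tier0_state(usage: int, lim: Tier0Limits) -> str:
    """Classify tier 0 usage against the proposed memory_t0.* boundaries."""
    if usage >= lim.max:
        return "at hard limit"
    if usage > lim.high:
        return "throttled, heavy demotion"
    if usage > lim.low:
        return "demotion proportional to overage"
    if usage < lim.min:
        return "protected from demotion"
    return "between min and low (not spelled out in the proposal)"


def demotion_overage(usage: int, lim: Tier0Limits) -> int:
    """Bytes above memory_t0.low; the proposal scales demotion pressure with this."""
    return max(0, usage - lim.low)


# Example limits, chosen arbitrarily for illustration.
limits = Tier0Limits(min=1 * GiB, low=2 * GiB, high=4 * GiB, max=6 * GiB)
for usage in (GiB // 2, 3 * GiB, 5 * GiB, 6 * GiB):
    print(usage // (1 << 20), "MiB:", tier0_state(usage, limits))
```

How the overage would translate into actual demotion pressure is left open
here, as it is in the proposal.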
>
> I have a couple of questions. Let's suppose we have a two socket
> system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
> 0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
> Based on the tier definition of this patch series, tier_0: {node_0,
> node_1} and tier_1: {node_2, node_3}.
>
> My questions are:
>
> 1) Can we assume that the cost of access within a tier will always be
> less than the cost of access across tiers? (node_0 <-> node_1 vs
> node_0 <-> node_2)
No, not in large systems, even if we can make this assumption in 2-socket ones.
> 2) If yes to (1), is that assumption future proof? Will the future
> systems with DRAM over CXL support have the same characteristics?
> 3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
> <-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
> might be third tier and similarly for jobs running on node_1, node_2
> might be third tier.
>
> The reason I am asking these questions is that statically
> partitioning memory nodes into tiers will inherently add
> platform-specific assumptions to the user API.
Absolutely agree.
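
As a rough illustration of the questions above, the sketch below checks the
two assumptions against a distance table. All of the numbers are made up
(SLIT-style, smaller is cheaper), and the helper names are hypothetical. The
first table models the two-socket example; the second assumes a larger system
where the two DRAM nodes are several hops apart.

```python
# Static tiers as defined for the example: DRAM nodes on top, PMEM below.
TIERS = {0: [0, 1], 1: [2, 3]}

# Hypothetical distances from each CPU node (0, 1) to every memory node.
TWO_SOCKET = {
    0: {0: 10, 1: 20, 2: 30, 3: 40},
    1: {0: 20, 1: 10, 2: 40, 3: 30},
}
# Hypothetical larger system: remote DRAM is further away than local PMEM.
LARGE_SYSTEM = {
    0: {0: 10, 1: 35, 2: 30, 3: 55},
    1: {0: 35, 1: 10, 2: 55, 3: 30},
}


def within_tier_always_cheaper(distance, tiers, upper=0, lower=1):
    """Assumption 1: from every CPU node, each upper-tier node beats each lower-tier node."""
    return all(
        distance[src][a] < distance[src][b]
        for src in distance
        for a in tiers[upper]
        for b in tiers[lower]
    )


def cross_tier_cost_uniform(distance, tiers, lower=1):
    """Assumption 2: from each CPU node, all lower-tier nodes cost the same."""
    return all(
        len({distance[src][b] for b in tiers[lower]}) == 1
        for src in distance
    )


def per_node_ranking(distance, src):
    """Memory nodes ordered from cheapest to most expensive as seen from one CPU node."""
    return sorted(distance[src], key=lambda node: distance[src][node])


print(within_tier_always_cheaper(TWO_SOCKET, TIERS))    # holds for this table
print(cross_tier_cost_uniform(TWO_SOCKET, TIERS))       # already fails here
print(per_node_ranking(TWO_SOCKET, 0))                  # node 3 ranks last from node 0
print(within_tier_always_cheaper(LARGE_SYSTEM, TIERS))  # fails once DRAM is far away
```

With these made-up numbers the per-node ranking from node 0 puts node 3 last,
which matches the "third tier" observation in question 3.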
>
> Assumptions like:
> 1) Access within tier is always cheaper than across tier.
> 2) Access from tier_i to tier_i+1 has uniform cost.
>
> The reason I am more inclined towards having NUMA-centric control is
> that we don't have to make these assumptions, though usability
> will be harder. Greg (CCed) has some ideas on making it better
> and we will share our proposal after polishing it a bit more.
>
Sounds good, will look out for that.
Jonathan