From mboxrd@z Thu Jan  1 00:00:00 1970
From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
Date: Thu, 8 Oct 2020 10:53:36 -0400
Message-ID: <20201008145336.GA163830@cmpxchg.org>
References: <20200909215752.1725525-1-shakeelb@google.com>
 <20200928210216.GA378894@cmpxchg.org>
 <20200929150444.GG2277@dhcp22.suse.cz>
 <20200929215341.GA408059@cmpxchg.org>
 <CALvZod5eN0PDtKo8SEp1n-xGvgCX9k6-OBGYLT3RmzhA+Q-2hw@mail.gmail.com>
 <20201001143149.GA493631@cmpxchg.org>
 <CALvZod59cU40A3nbQtkP50Ae3g6T2MQSt+q1=O2=Gy9QUzNkbg@mail.gmail.com>
Mime-Version: 1.0
Return-path: <cgroups-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=cmpxchg-org.20150623.gappssmtp.com; s=20150623;
        h=date:from:to:cc:subject:message-id:references:mime-version
         :content-disposition:in-reply-to;
        bh=uyKHtfoarsj+UYs2PP4XCdlBcu5iL8M8TqmjexyJtjo=;
        b=gSpOReNk6UlU6+gBPtDU3TmLons5StnRVP+CSGLouttWxPrXlWzOUzwAitmC+kccP2
         0zmeEv8fvogdspFm1WObJnfz37SM5ZQXGJVkNbRt3U9rIh7Pyw4kD0vw0G9kikekdRdl
         C2gz/68+9dh12edgZNkL+yu9ynPSKifE62fxva7Ow56HS1u8kUWSsGm2yoI9rtaJKJiN
         nJA/QAp95vu+Fs1ILvQcjFYZDG+JxrV4BqbTRG3oG/QYOK4UZCI4Cc9geDfeNj3qsVsq
         MFI8DWNCMjRkARF4wR2mrx1S+YWbkkW2KhHKaCXXeQQUOCxJQtRVVlpJiKio3PpqqbwF
         TK5A==
Content-Disposition: inline
In-Reply-To: <CALvZod59cU40A3nbQtkP50Ae3g6T2MQSt+q1=O2=Gy9QUzNkbg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
List-ID: <cgroups.vger.kernel.org>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>, Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>, Yang Shi <yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>, Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>, David Rientjes <rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>, Michal =?iso-8859-1?Q?Koutn=FD?= <mkoutny-IBi9RG/b67k@public.gmane.org>, Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>, Linux MM <linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>, Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>, Andrea Righi <andrea.righi-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org>, SeongJae Park <sjpark-vV1OtcyAfmbQT0dZR+AlfA@public.gmane.org>

On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote:
> On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
> >
> [snip]
> > > >    So instead of asking users for a target size whose suitability
> > > >    heavily depends on the kernel's LRU implementation, the readahead
> > > >    code, the IO device's capability and general load, why not directly
> > > >    ask the user for a pressure level that the workload is comfortable
> > > >    with and which captures all of the above factors implicitly? Then
> > > >    let the kernel do this feedback loop from a per-cgroup worker.
> > >
> > > I am assuming here that by pressure level you are referring to a
> > > PSI-like interface, e.g. allowing users to say that for their jobs
> > > X amount of stall in a fixed time window is tolerable.
> >
> > Right, essentially the same parameters that psi poll() would take.
> 
> I thought a bit more on the semantics of the psi usage for the
> proactive reclaim.
> 
> Suppose I have a top level cgroup A on which I want to enable
> proactive reclaim. Which memory psi events should the proactive
> reclaim consider?
> 
> The simplest would be the memory.psi at 'A'. However memory.psi is
> hierarchical and I would not really want the pressure due to limits
> in children of 'A' to impact the proactive reclaim.

I don't think pressure from limits down the tree can be separated out,
generally. All events are accounted recursively as well. Of course, we
remember the reclaim level for evicted entries - but if there is
reclaim triggered at A and A/B concurrently, the distribution of who
ends up reclaiming the physical pages in A/B is pretty arbitrary/racy.
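
To illustrate the recursive accounting, here is a toy model. It is not
kernel code: charge_stall() and the stall numbers are made up, and the
only assumption is that a stall is charged to the cgroup where it
happened and to every ancestor.

```python
from collections import defaultdict


def charge_stall(totals, cgroup_path, stall_us):
    """Charge a stall to the cgroup and to every ancestor (toy model)."""
    parts = cgroup_path.split("/")
    for depth in range(1, len(parts) + 1):
        totals["/".join(parts[:depth])] += stall_us


totals = defaultdict(int)
charge_stall(totals, "A/B", 300)  # e.g. a stall from B hitting its own limit
charge_stall(totals, "A/C", 100)  # e.g. a stall from reclaim targeting A
print(dict(totals))
```

In this model A ends up with 400us and no record of which part came
from the sublimit, which is the situation described above.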

If A/B decides to do its own proactive reclaim with the sublimit, and
ends up consuming the pressure budget assigned to proactive reclaim in
A, there isn't much that can be done.

It's also possible that proactive reclaim in A keeps A/B from hitting
its limit in the first place.

I have to say, the configuration doesn't really strike me as sensible,
though. Limits make sense for doing fixed partitioning: A gets 4G, A/B
gets 2G out of that. But if you do proactive reclaim on A you're
essentially saying A as a whole is auto-sizing dynamically based on
its memory access pattern. I'm not sure what it means to then start
doing fixed partitions in the sublevel.

> PSI due to refaults and slow IO should be included or maybe only
> those which are caused by the proactive reclaim itself. I am
> undecided on the PSI due to compaction. PSI due to global reclaim
> for 'A' is even more complicated. This is a stall due to reclaiming
> from the system including self. It might not really cause more
> refaults and IOs for 'A'. Should proactive reclaim ignore the
> pressure due to global reclaim when tuning its aggressiveness?

Yeah, I think they should all be included, because ultimately what
matters is what the workload can tolerate without sacrificing
performance.
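
As a rough sketch of what userspace sees today: the psi files (the
per-cgroup one is memory.pressure in cgroup2) report one aggregate
"some" line and one "full" line, with no breakdown by cause. The
parser and the sample values below are illustrative only.

```python
def parse_pressure(text):
    """Parse psi file contents into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        # Each remaining field looks like "avg10=0.42" or "total=123456".
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


# Made-up sample in the format the kernel uses for pressure files.
sample = (
    "some avg10=0.42 avg60=0.10 avg300=0.02 total=123456\n"
    "full avg10=0.05 avg60=0.01 avg300=0.00 total=23456\n"
)
pressure = parse_pressure(sample)
print(pressure["some"]["avg10"], pressure["full"]["total"])
```

Refaults, IO waits, compaction and reclaim from any level all land in
the same numbers, so a controller that reads them gets everything
included by default.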

Proactive reclaim can destroy THPs, so the cost of recreating them,
compaction stalls included, should be reflected. Otherwise you can
easily overpressurize.

For global reclaim, if you say you want a workload pressurized to X
percent in order to drive the LRUs and chop off all cold pages the
workload can live without, it doesn't matter who does the work. If
there is an abundance of physical memory, it's going to be proactive
reclaim. If physical memory is already tight enough that global
reclaim does it for you, there is nothing to be done in addition, and
proactive reclaim should hang back. Otherwise you can again easily
overpressurize the workload.
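
A minimal sketch of the "hang back" behavior, assuming a hypothetical
controller that runs once per interval. reclaim_step(), its parameters
and the linear scaling are invented for illustration; this is not a
proposed kernel interface or an existing tool's algorithm.

```python
def reclaim_step(observed_pct, target_pct, max_bytes):
    """Return how many bytes to reclaim proactively this interval.

    observed_pct: pressure the workload currently sees, from any source
    target_pct:   pressure the user said the workload can tolerate
    max_bytes:    upper bound on reclaim per interval (hypothetical knob)
    """
    if target_pct <= 0:
        raise ValueError("target_pct must be positive")
    if observed_pct >= target_pct:
        # Global or limit reclaim already pressurizes the workload
        # enough: do nothing rather than add to it.
        return 0
    # Scale the request by how far we are below the target.
    headroom = (target_pct - observed_pct) / target_pct
    return int(max_bytes * headroom)


print(reclaim_step(0.0, 0.1, 1 << 20))  # plenty of memory: reclaim at full rate
print(reclaim_step(0.5, 0.1, 1 << 20))  # already over target: hang back
```

Because observed_pct counts stalls regardless of who caused them, the
controller backs off automatically when global reclaim is already
doing the work.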
