From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 29 Sep 2020 17:53:41 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.com>
Cc: Shakeel Butt <shakeelb@google.com>, Roman Gushchin <guro@fb.com>,
	Yang Shi <yang.shi@linux.alibaba.com>,
	Greg Thelen <gthelen@google.com>,
	David Rientjes <rientjes@google.com>,
	Michal =?iso-8859-1?Q?Koutn=FD?= <mkoutny@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
Message-ID: <20200929215341.GA408059@cmpxchg.org>
References: <20200909215752.1725525-1-shakeelb@google.com>
 <20200928210216.GA378894@cmpxchg.org>
 <20200929150444.GG2277@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20200929150444.GG2277@dhcp22.suse.cz>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000009, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote:
> On Mon 28-09-20 17:02:16, Johannes Weiner wrote:
> [...]
> > My take is that a proactive reclaim feature, whose goal is never to
> > thrash or punish but to keep the LRUs warm and the workingset trimmed,
> > would ideally have:
> > 
> > - a pressure or size target specified by userspace but with
> >   enforcement driven inside the kernel from the allocation path
> > 
> > - the enforcement work NOT be done synchronously by the workload
> >   (something I'd argue we want for *all* memory limits)
> > 
> > - the enforcement work ACCOUNTED to the cgroup, though, since it's the
> >   cgroup's memory allocations causing the work (again something I'd
> >   argue we want in general)
> > 
> > - a delegatable knob that is independent of setting the maximum size
> >   of a container, as that expresses a different type of policy
> > 
> > - if size target, self-limiting (ha) enforcement on a pressure
> >   threshold or stop enforcement when the userspace component dies
> > 
> > Thoughts?
> 
> Agreed with above points. What do you think about
> http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz.

I definitely agree with what you wrote in this email for background
reclaim. Indeed, your description sounds like what I proposed in
https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/
- what's missing from that patch is proper work attribution.

> I assume that you do not want to override memory.high to implement
> this because that tends to be tricky from the configuration POV as
> you mentioned above. But a new limit (memory.middle for a lack of a
> better name) to define the background reclaim sounds like a good fit
> with above points.

I can see that with a new memory.middle you could kind of sort of do
both - background reclaim and proactive reclaim.

That said, I do see advantages in keeping them separate:

1. Background reclaim is essentially an allocation optimization that
   we may want to provide by default, just like kswapd.

   Kswapd is tweakable of course, but I think few users actually
   tweak it, and it works pretty well out of the box. It would be nice
   to provide the same thing on a per-cgroup basis by default and not
   ask users to make decisions that we are generally better at making.

2. Proactive reclaim may actually be better configured through a
   pressure threshold rather than a size target.

   As per above, the goal is not to be punitive or containing. The
   goal is to keep the LRUs warm and move the colder pages to disk.

   But how aggressively do you run reclaim for this purpose? What
   target value should a user write to such a memory.middle file?

   For one, it depends on the job. A batch job, or a less important
   background job, may tolerate higher paging overhead than an
   interactive job. That means more of its pages could be trimmed from
   RAM and reloaded on-demand from disk.

   But also, it depends on the storage device. If you move a workload
   from a machine with a slow disk to a machine with a fast disk, you
   can page more data in the same amount of time. That means while
   your workload's tolerances stay the same, the faster the disk, the
   more aggressively you can do reclaim and offload memory.

   So again, what should a user write to such a control file?
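
   To put rough numbers on the storage argument, here is a minimal
   sketch. It assumes a fixed per-second stall budget and a constant
   per-fault latency; the function name and all figures are hypothetical
   illustrations, not measurements.

```python
def tolerable_faults_per_sec(stall_budget_us: int, fault_latency_us: int) -> float:
    """Major faults per second that fit into a fixed paging-stall budget.

    Both arguments are hypothetical inputs: the budget is how many
    microseconds per second the workload may spend waiting on paging,
    the latency is the assumed cost of one major fault on the device.
    """
    if fault_latency_us <= 0:
        raise ValueError("fault latency must be positive")
    return stall_budget_us / fault_latency_us


# Same workload tolerance on both machines: 10 ms of stall per second.
budget_us = 10_000
slow_disk = tolerable_faults_per_sec(budget_us, 5_000)  # assumed 5 ms per fault
fast_disk = tolerable_faults_per_sec(budget_us, 100)    # assumed 100 us per fault
```

   With an unchanged tolerance, the faster device in this sketch admits
   50 times as many refaults per second, so a size target tuned on one
   machine does not carry over to the other.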

   Of course, you can approximate an optimal target size for the
   workload. You can run a manual workingset analysis with page_idle,
   damon, or similar, determine a hot/cold cutoff based on what you
   know about the storage characteristics, then echo a number of pages
   or a size target into a cgroup file and let the kernel do the reclaim
   accordingly. The drawbacks are that the kernel LRU may do a
   different hot/cold classification than you did and evict the wrong
   pages, the storage device latencies may vary based on overall IO
   pattern, and two equally warm pages may have very different paging
   overhead depending on whether readahead can avert a major fault or
   not. So it's easy to overshoot the tolerance target and disrupt the
   workload, or undershoot and have stale LRU data, waste memory etc.

   You can also do a feedback loop, where you guess an optimal size,
   then adjust based on how much paging overhead the workload is
   experiencing, i.e. memory pressure. The drawbacks are that you have
   to monitor pressure closely and react quickly when the workload is
   expanding, as it can be potentially sensitive to latencies in the
   usec range. This can be tricky to do from userspace.

   So instead of asking users for a target size whose suitability
   heavily depends on the kernel's LRU implementation, the readahead
   code, the IO device's capability and general load, why not directly
   ask the user for a pressure level that the workload is comfortable
   with and which captures all of the above factors implicitly? Then
   let the kernel do this feedback loop from a per-cgroup worker.
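
To illustrate the feedback loop discussed above, here is a minimal
sketch of one controller iteration: take a size target, read the current
pressure, and move the target toward the pressure level the user asked
for. It is not the kernel implementation and not an existing interface.
The function names, the step sizes, and the choice to back off faster
than to trim are all assumptions made for illustration. The only real
format used is the PSI line layout ("some avg10=... avg60=... avg300=...
total=...") as exposed in the cgroup2 memory.pressure file.

```python
def parse_psi_some_avg10(text: str) -> float:
    """Extract the 'some' avg10 percentage from PSI-formatted text."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some avg10' field found")


def next_size_target(current: int, pressure: float, pressure_target: float,
                     shrink_div: int = 100, grow_div: int = 10) -> int:
    """One step of a hypothetical pressure-driven feedback loop.

    Above the target, the workload is paging more than it tolerates, so
    the size target is raised quickly (by 1/grow_div). At or below the
    target, memory is trimmed slowly (by 1/shrink_div) to probe for cold
    pages. Integer arithmetic keeps the result exact.
    """
    if pressure > pressure_target:
        return current + current // grow_div
    return current - current // shrink_div


# Example input in the memory.pressure format; the values are made up.
sample = (
    "some avg10=0.25 avg60=0.10 avg300=0.02 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
pressure = parse_psi_some_avg10(sample)
backed_off = next_size_target(1_000_000, pressure, pressure_target=0.10)
trimmed = next_size_target(1_000_000, 0.0, pressure_target=0.10)
```

In this sketch the user only supplies pressure_target. The size target
becomes an internal variable of the loop, which is the point of asking
for a pressure level rather than a size.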