Re: [PATCH] memcg: introduce per-memcg reclaim interface

From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
To: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Cc: "Michal Hocko" <mhocko-IBi9RG/b67k@public.gmane.org>,
	"Roman Gushchin" <guro-b10kYP2dOMg@public.gmane.org>,
	"Yang Shi"
	<yang.shi-KPsoFbNs7GizrGE5bRqYAgC/G2K4zDHf@public.gmane.org>,
	"Greg Thelen" <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	"David Rientjes"
	<rientjes-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	"Michal Koutný" <mkoutny-IBi9RG/b67k@public.gmane.org>,
	"Andrew Morton"
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	"Linux MM" <linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>,
	Cgroups <cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	LKML <linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: [PATCH] memcg: introduce per-memcg reclaim interface
Date: Thu, 1 Oct 2020 10:31:49 -0400
Message-ID: <20201001143149.GA493631@cmpxchg.org>
In-Reply-To: <CALvZod5eN0PDtKo8SEp1n-xGvgCX9k6-OBGYLT3RmzhA+Q-2hw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Wed, Sep 30, 2020 at 08:45:17AM -0700, Shakeel Butt wrote:
> On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
> >
> > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote:
> > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote:
> > > [...]
> > > > My take is that a proactive reclaim feature, whose goal is never to
> > > > thrash or punish but to keep the LRUs warm and the workingset trimmed,
> > > > would ideally have:
> > > >
> > > > - a pressure or size target specified by userspace but with
> > > >   enforcement driven inside the kernel from the allocation path
> > > >
> > > > - the enforcement work NOT be done synchronously by the workload
> > > >   (something I'd argue we want for *all* memory limits)
> > > >
> > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the
> > > >   cgroup's memory allocations causing the work (again something I'd
> > > >   argue we want in general)
> > > >
> > > > - a delegatable knob that is independent of setting the maximum size
> > > >   of a container, as that expresses a different type of policy
> > > >
> > > > - if size target, self-limiting (ha) enforcement on a pressure
> > > >   threshold or stop enforcement when the userspace component dies
> > > >
> > > > Thoughts?
> > >
> > > Agreed with above points. What do you think about
> > > http://lkml.kernel.org/r/20200922190859.GH12990-2MMpYkNvuYA6bu5BqYkRsg@public.gmane.org
> >
> > I definitely agree with what you wrote in this email for background
> > reclaim. Indeed, your description sounds like what I proposed in
> > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org/
> > - what's missing from that patch is proper work attribution.
> >
> > > I assume that you do not want to override memory.high to implement
> > > this because that tends to be tricky from the configuration POV as
> > > you mentioned above. But a new limit (memory.middle for lack of a
> > > better name) to define the background reclaim sounds like a good fit
> > > with above points.
> >
> > I can see that with a new memory.middle you could kind of sort of do
> > both - background reclaim and proactive reclaim.
> >
> > That said, I do see advantages in keeping them separate:
> >
> > 1. Background reclaim is essentially an allocation optimization that
> >    we may want to provide per default, just like kswapd.
> >
> >    Kswapd is tweakable of course, but I think few users actually
> >    tweak it, and it works pretty well out of the box. It would be nice to
> >    provide the same thing on a per-cgroup basis per default and not
> >    ask users to make decisions that we are generally better at making.
> >
> > 2. Proactive reclaim may actually be better configured through a
> >    pressure threshold rather than a size target.
> >
> >    As per above, the goal is not to be punitive or containing. The
> >    goal is to keep the LRUs warm and move the colder pages to disk.
> >
> >    But how aggressively do you run reclaim for this purpose? What
> >    target value should a user write to such a memory.middle file?
> >
> >    For one, it depends on the job. A batch job, or a less important
> >    background job, may tolerate higher paging overhead than an
> >    interactive job. That means more of its pages could be trimmed from
> >    RAM and reloaded on-demand from disk.
> >
> >    But also, it depends on the storage device. If you move a workload
> >    from a machine with a slow disk to a machine with a fast disk, you
> >    can page more data in the same amount of time. That means while
> >    your workload's tolerance stays the same, the faster the disk, the
> >    more aggressively you can do reclaim and offload memory.
> >
> >    So again, what should a user write to such a control file?
> >
> >    Of course, you can approximate an optimal target size for the
> >    workload. You can run a manual workingset analysis with page_idle,
> >    damon, or similar, determine a hot/cold cutoff based on what you
> >    know about the storage characteristics, then echo a number of pages
> >    or a size target into a cgroup file and let the kernel do the reclaim
> >    accordingly. The drawbacks are that the kernel LRU may do a
> >    different hot/cold classification than you did and evict the wrong
> >    pages, the storage device latencies may vary based on overall IO
> >    pattern, and two equally warm pages may have very different paging
> >    overhead depending on whether readahead can avert a major fault or
> >    not. So it's easy to overshoot the tolerance target and disrupt the
> >    workload, or undershoot and have stale LRU data, waste memory etc.
> >
> >    You can also do a feedback loop, where you guess an optimal size,
> >    then adjust based on how much paging overhead the workload is
> >    experiencing, i.e. memory pressure. The drawbacks are that you have
> >    to monitor pressure closely and react quickly when the workload is
> >    expanding, as it can be potentially sensitive to latencies in the
> >    usec range. This can be tricky to do from userspace.
> >
> 
> This is actually what we do in production, i.e. a feedback loop to
> adjust the next iteration of proactive reclaim.

That's what we do right now as well. It works reasonably well; the only
two pain points are/have been the reaction time under quick workload
expansion and inadvertently forcing the workload into direct reclaim.
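
To make the shape of such a userspace feedback loop concrete, here is a
minimal sketch of one iteration. It is not what either of our setups
runs. parse_psi_line() reads one line in the format of a PSI pressure
file; next_reclaim_bytes(), its halve/grow policy, and the byte bounds
are all hypothetical and only illustrate "guess a size, then adjust
based on observed pressure".

```python
def parse_psi_line(line):
    """Parse one PSI line, e.g.
    'some avg10=0.12 avg60=0.05 avg300=0.01 total=12345'.
    Returns the kind ('some' or 'full') and a dict of float values."""
    kind, *fields = line.split()
    values = dict(field.split("=", 1) for field in fields)
    return kind, {key: float(value) for key, value in values.items()}


def next_reclaim_bytes(current_bytes, observed_avg10, target_avg10,
                       min_bytes=0, max_bytes=1 << 30):
    """One step of a hypothetical feedback controller.

    Backs off quickly when pressure exceeds the tolerated level and
    probes slowly upwards otherwise. The policy is an assumption made
    for illustration, not a recommendation."""
    if observed_avg10 > target_avg10:
        # Over tolerance: halve the next reclaim request.
        proposed = current_bytes // 2
    else:
        # Under tolerance: grow by 10%, but by at least 1 MiB.
        proposed = current_bytes + max(current_bytes // 10, 1 << 20)
    return max(min_bytes, min(max_bytes, proposed))
```

A loop driven by the 10-second average can only react on that
timescale, which is exactly the reaction time problem under quick
workload expansion mentioned above.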

> We eliminated the IO or slow disk issues you mentioned by only
> focusing on anon memory and doing zswap.

Interesting, may I ask how the file cache is managed in this setup?

> >    So instead of asking users for a target size whose suitability
> >    heavily depends on the kernel's LRU implementation, the readahead
> >    code, the IO device's capability and general load, why not directly
> >    ask the user for a pressure level that the workload is comfortable
> >    with and which captures all of the above factors implicitly? Then
> >    let the kernel do this feedback loop from a per-cgroup worker.
> 
> I am assuming that by pressure level you are referring to a PSI-like
> interface, e.g. allowing users to state that, for their jobs, X amount
> of stall time in a fixed time window is tolerable.

Right, essentially the same parameters that psi poll() would take.
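
For reference, those parameters are a stall kind ("some" or "full"), a
stall amount in microseconds, and a time window in microseconds,
written as one string to a pressure file that is then polled for
POLLPRI. Below is a minimal sketch of that existing userspace side, not
of the proposed kernel worker. The helper names are hypothetical, and
the window bounds check (500ms to 10s) is an assumption based on the
psi documentation that may not hold on every kernel version.

```python
import os
import select


def psi_trigger(kind, stall_us, window_us):
    """Build a psi trigger string: '<some|full> <stall us> <window us>'."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    # Assumed limits: windows between 500ms and 10s.
    if not 500_000 <= window_us <= 10_000_000:
        raise ValueError("window must be between 500ms and 10s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall must be positive and fit in the window")
    return f"{kind} {stall_us} {window_us}"


def wait_for_pressure(path, trigger, timeout_ms=None):
    """Register the trigger on a pressure file and wait for one event.

    Returns True if the threshold was crossed, False on timeout.
    'path' would be e.g. /proc/pressure/memory or a cgroup's
    memory.pressure file."""
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        # The trigger is registered by writing it, NUL-terminated.
        os.write(fd, trigger.encode() + b"\0")
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _, event in poller.poll(timeout_ms):
            if event & select.POLLERR:
                raise OSError("pressure event source is gone")
            if event & select.POLLPRI:
                return True
        return False
    finally:
        os.close(fd)
```

For example, psi_trigger("some", 150000, 1000000) expresses "150ms of
partial stall within any 1s window is the limit of what this job
tolerates".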
