From: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
To: "Michal Koutný" <mkoutny-IBi9RG/b67k@public.gmane.org>
Cc: Andrew Morton
	<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>,
	Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>,
	Roman Gushchin <guro-b10kYP2dOMg@public.gmane.org>,
	Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>,
	Seth Jennings <sjenning-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Dan Streetman <ddstreet-EkmVulN54Sk@public.gmane.org>,
	Minchan Kim <minchan-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kernel-team-b10kYP2dOMg@public.gmane.org
Subject: Re: [PATCH v2 6/6] zswap: memcg accounting
Date: Fri, 13 May 2022 13:08:13 -0400
Message-ID: <Yn6QfdouzkcrygTR@cmpxchg.org>
In-Reply-To: <20220513151426.GC16096-9OudH3eul5jcvrawFnH+a6VXKuFTiq87@public.gmane.org>

Hello Michal,

On Fri, May 13, 2022 at 05:14:26PM +0200, Michal Koutný wrote:
> On Wed, May 11, 2022 at 03:06:56PM -0400, Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
> > Correct. After which the uncompressed page is reclaimed and uncharged.
> > So the zswapout process will reduce the charge bottom line.
> 
> A zswap object falling under memory.current was my first thinking, I was
> confused why it's exported as a separate counter memory.zswap.current
> (which IMO suggests disjoint counting) and it doubles a
> memory.stat:zswap entry.
> 
> Is the separate memory.zswap.current good for anything? (Except maybe
> avoiding global rstat flush on memory.stat read but that'd be an
> undesired precedent.)

Right, it's accounted as a subset rather than fully disjoint. But it
is a limitable counter of its own, so I exported it as such, with a
current and a max knob. This is comparable to the kmem counter in v1.

From an API POV it would be quite strange to have max for a counter
that has no current. Likewise it would be strange for a major memory
consumer to be missing from memory.stat.

> (Regarding the eventually reduced footprint, the transitional excursion above
> memcg's (or ancestor's) limit should be limited by number of parallel
> reclaims running (each one at most a page, right?), so it doesn't seem
> necessary to tackle (now).)

Correct.

> > memory.zswap.* are there to configure zswap policy, within the
> > boundaries of available memory - it's by definition a subset.
> 
> I see how the .max works when equal to 0 or "max". The intermediate
> values are more difficult to reason about.

It needs to be configured to the workload's access frequency curve,
which can be done with trial-and-error (reasonable balance between
zswpins and pswpins) or in a more targeted manner using tools such as
page_idle, damon etc.
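
As a rough illustration of the trial-and-error approach, the sketch below
computes what share of swap-ins was served from zswap, given text in the
"name value" format of /proc/vmstat. The helper names and the percentage
metric are assumptions made for this example; the thread does not define
what a "reasonable balance" is, and no target value is implied.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find one "name value" line in vmstat-style text.
 * Returns 0 and stores the value on success, -1 if the counter is absent.
 * (Hypothetical helper, not a kernel or libc interface.)
 */
int vmstat_lookup(const char *text, const char *name, unsigned long long *val)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* Require a space after the name so "pswpin" does not match "pswpin_x". */
		if (strncmp(line, name, len) == 0 && line[len] == ' ')
			return sscanf(line + len, "%llu", val) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Percentage of swap-ins that came from zswap (zswpin) rather than from
 * the swap device (pswpin). Returns -1 if a counter is missing or if
 * there were no swap-ins at all.
 */
int zswap_hit_percent(const char *vmstat_text)
{
	unsigned long long zswpin, pswpin;

	if (vmstat_lookup(vmstat_text, "zswpin", &zswpin) ||
	    vmstat_lookup(vmstat_text, "pswpin", &pswpin))
		return -1;
	if (zswpin + pswpin == 0)
		return -1;
	return (int)(zswpin * 100 / (zswpin + pswpin));
}
```

The counters are cumulative, so in practice one would sample them twice
and feed the differences into the same calculation while adjusting
memory.zswap.max between runs.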

> Also, I can see that on the global level, zswap is configured relatively
> (/sys/module/zswap/parameters/max_pool_percent).
> You wrote that the actual configured value is workload specific, would
> it be simpler to have also relative zswap limit per memcg?
>
> (Relative wrt memory.max, it'd be rather just a convenience with this
> simple ratio, however, it'd correspond to the top level limit. OTOH, the
> relatives would have counter-intuitive hierarchical behavior. I don't
> mean this should be changed, rather wondering why this variant was
> chosen.)

A percentage isn't a bad way to pick a global default limit for a
kernel feature. But it would have been preferable if zswap had used
the percentage internally and exposed the knob in bytes (like
min_free_kbytes for example).

Because for load tuning, bytes make much more sense. That's how you
measure the workingset, so a percentage is an awkward indirection. At
the cgroup level, it makes even less sense: all memcg tunables are in
bytes, it would be quite weird to introduce a "max" that is 0-100. Add
the confusion of how percentages would propagate down the hierarchy...

> > +bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > +{
> > +     struct mem_cgroup *memcg, *original_memcg;
> > +     bool ret = true;
> > +
> > +     original_memcg = get_mem_cgroup_from_objcg(objcg);
> > +     for (memcg = original_memcg; memcg != root_mem_cgroup;
> > +          memcg = parent_mem_cgroup(memcg)) {
> > +             unsigned long max = READ_ONCE(memcg->zswap_max);
> > +             unsigned long pages;
> > +
> > +             if (max == PAGE_COUNTER_MAX)
> > +                     continue;
> > +             if (max == 0) {
> > +                     ret = false;
> > +                     break;
> > +             }
> > +
> > +             cgroup_rstat_flush(memcg->css.cgroup);
> 
> Here, I think it'd be better not to bypass mem_cgroup_flush_stats() (the
> mechanism is approximate and you traverse all ancestors anyway), i.e.
> mem_cgroup_flush_stats() before the loop instead of this.

I don't traverse all ancestors, I bail on disabled groups and skip
unlimited ones. This saves a lot of flushes in practice right now: our
heaviest swapping cgroups have zswap disabled (max=0) because they're
lowpri and forced to disk. Likewise, the zswap users have their zswap
limit several levels down from the root, and I currently don't ever
flush the higher levels (max=PAGE_COUNTER_MAX).

Flushing unnecessary groups with a ratelimit doesn't sound like an
improvement to me.
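
To make the flush pattern concrete, here is a userspace model of the
ancestor walk that counts how often each level would be flushed. It is
only a sketch: struct group, may_zswap() and ZSWAP_MAX_UNLIMITED are
stand-ins invented for this example, and the usage comparison after the
flush is an assumption, since the quoted hunk ends at the flush call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for PAGE_COUNTER_MAX; not the kernel definition. */
#define ZSWAP_MAX_UNLIMITED ((unsigned long)-1)

/* Simplified stand-in for a memcg; not the kernel structure. */
struct group {
	struct group *parent;      /* NULL for the root */
	unsigned long zswap_max;   /* limit in pages */
	unsigned long zswap_pages; /* usage in pages, treated as exact after a flush */
	unsigned int flushes;      /* how many times this level was flushed */
};

/*
 * Walk from the leaf towards the root (the root itself is excluded):
 * skip unlimited levels, bail out on disabled ones, and flush only the
 * levels that carry a finite, non-zero limit.
 */
bool may_zswap(struct group *leaf)
{
	assert(leaf != NULL);

	for (struct group *g = leaf; g->parent != NULL; g = g->parent) {
		if (g->zswap_max == ZSWAP_MAX_UNLIMITED)
			continue;       /* no limit here, no flush needed */
		if (g->zswap_max == 0)
			return false;   /* zswap disabled, no flush needed */

		g->flushes++;           /* models the per-level flush */

		/* Assumption: the elided rest of the loop compares usage to the limit. */
		if (g->zswap_pages >= g->zswap_max)
			return false;
	}
	return true;
}
```

With a limit set a few levels below the root, only that one level ends up
with a non-zero flush count; a group with max=0 is rejected without any
flush at all.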

Thanks
