From: Roman Gushchin <roman.gushchin@linux.dev>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Michal Hocko <mhocko@kernel.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>
Subject: Re: [PATCH v2 3/5] mm: memcg: merge multiple page_counters into a single structure
Date: Thu, 25 Jul 2024 23:33:51 +0000 [thread overview]
Message-ID: <ZqLg30nOUVlerBh1@google.com> (raw)
In-Reply-To: <20240725214227.GC1702603@cmpxchg.org>
On Thu, Jul 25, 2024 at 05:42:27PM -0400, Johannes Weiner wrote:
> On Wed, Jul 24, 2024 at 08:21:01PM +0000, Roman Gushchin wrote:
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -5,14 +5,71 @@
> > #include <linux/atomic.h>
> > #include <linux/cache.h>
> > #include <linux/limits.h>
> > +#include <linux/mm_types.h>
> > #include <asm/page.h>
> >
> > +/*
> > + * Page counters are used by memory and hugetlb cgroups.
> > + * Memory cgroups are using up to 4 separate counters:
> > + * memory, swap (memory + swap on cgroup v1), kmem and tcpmem.
> > + * Hugetlb cgroups are using 2 arrays of counters with HUGE_MAX_HSTATE
> > + * in each to track the usage and reservations of each supported
> > + * hugepage size.
> > + *
> > + * Protection (min/low) is supported only for the first counter
> > + * with idx 0 and only if the counter was initialized with the protection
> > + * support.
> > + */
> > +
> > +enum mem_counter_type {
> > +#ifdef CONFIG_MEMCG
> > + /* Unified memory counter */
> > + MCT_MEM,
> > +
> > + /* Swap */
> > + MCT_SWAP,
> > +
> > + /* Memory + swap */
> > + MCT_MEMSW = MCT_SWAP,
> > +
> > +#ifdef CONFIG_MEMCG_V1
> > + /* Kernel memory */
> > + MCT_KMEM,
> > +
> > + /* Tcp memory */
> > + MCT_TCPMEM,
> > +#endif /* CONFIG_MEMCG_V1 */
> > +#endif /* CONFIG_MEMCG */
> > +
> > + /* Maximum number of memcg counters */
> > + MCT_NR_MEMCG_ITEMS,
> > +};
> > +
> > +#ifdef CONFIG_CGROUP_HUGETLB
> > +#ifdef HUGE_MAX_HSTATE
> > +#define MCT_NR_HUGETLB_ITEMS HUGE_MAX_HSTATE
> > +#else
> > +#define MCT_NR_HUGETLB_ITEMS 1
> > +#endif
> > +
> > +/*
> > + * max() can't be used here: even though __builtin_choose_expr() evaluates
> > + * to true, the false clause generates a compiler error:
> > + * error: braced-group within expression allowed only inside a function .
> > + */
> > +#define MCT_NR_ITEMS (__builtin_choose_expr(MCT_NR_MEMCG_ITEMS > MCT_NR_HUGETLB_ITEMS, \
> > + MCT_NR_MEMCG_ITEMS, MCT_NR_HUGETLB_ITEMS))
> > +
> > +#else /* CONFIG_CGROUP_HUGETLB */
> > +#define MCT_NR_ITEMS MCT_NR_MEMCG_ITEMS
> > +#endif /* CONFIG_CGROUP_HUGETLB */
> > +
> > struct page_counter {
> > /*
> > * Make sure 'usage' does not share cacheline with any other field. The
> > * memcg->memory.usage is a hot member of struct mem_cgroup.
> > */
> > - atomic_long_t usage;
> > + atomic_long_t usage[MCT_NR_ITEMS];
> > CACHELINE_PADDING(_pad1_);
> >
> > /* effective memory.min and memory.min usage tracking */
> > @@ -25,9 +82,9 @@ struct page_counter {
> > atomic_long_t low_usage;
> > atomic_long_t children_low_usage;
> >
> > - unsigned long watermark;
> > - unsigned long local_watermark; /* track min of fd-local resets */
> > - unsigned long failcnt;
> > + unsigned long watermark[MCT_NR_ITEMS];
> > + unsigned long local_watermark[MCT_NR_ITEMS]; /* track min of fd-local resets */
> > + unsigned long failcnt[MCT_NR_ITEMS];
> >
> > /* Keep all the read most fields in a separete cacheline. */
> > CACHELINE_PADDING(_pad2_);
> > @@ -35,8 +92,9 @@ struct page_counter {
> > bool protection_support;
> > unsigned long min;
> > unsigned long low;
> > - unsigned long high;
> > - unsigned long max;
> > + unsigned long high[MCT_NR_ITEMS];
> > + unsigned long max[MCT_NR_ITEMS];
> > +
> > struct page_counter *parent;
> > } ____cacheline_internodealigned_in_smp;
>
> This hardcodes way too much user specifics into what should be a
> self-contained piece of abstraction. Should anybody else try to use
> the 'struct page_counter' for a single resource elsewhere, they'd end
> up with up to 5 counters, watermarks, failcnts, highs and maxs etc.
>
> It's clear that the hierarchical aspect of the page_counter no longer
> works. It was okay-ish when all we had was a simple hard limit. Now we
> carry to much stuff in it that is only useful to a small set of users.
> Instead of doubling down and repeating the mistakes we made for struct
> page, it would be better to move user specific stuff out of it
> entirely. I think there are two patterns that would be more natural:
> a) pass a callback function and have a priv pointer in page_counter
> for user-specifics; or b) remove the hierarchy aspect from the page
> counter entirely, let the *caller* walk the tree, call the page
> counter code to trycharge (and log watermark) only that level, and
> handle everything caller specific on the caller side.
>
> All users already have parent linkage in the css, so this would
> eliminate the parent pointer in page_counter altogether.
>
> The protection stuff could be moved to memcg proper, eliminating yet
> more fields that aren't interesting to any other user.
Hm, Idk, I do agree with what you're saying about the self-contained
piece of abstraction (and I had very similar thoughts in the process),
but there are also some complications.
First, funny enough, the protection calculation code was just moved to
mm/page_counter.c by a8585ac68621 ("mm/page_counter: move calculating protection
values to page_counter"). The commit log says that it's gonna be used by the drm
cgroup controller, but the code is not (yet?) upstream apparently. I don't have
any insights here. If there will be the second user for the protection
functionality, moving it back to memcontrol.c is not feasible.
Second, I agree that it would be nice to get rid of the parent pointer in
struct page_cgroup entirely and use one in css. But Idk how to do it
without making the code way more messy or duplicate a lot of tree walks.
In C++ (or another language with generics) we could make struct page_counter
taking the number of counters and the set of features as template parameters.
I feel like with memcg1 code being factored out the benefit of this reorg
lowered, so how about me resending 2 first patches separately and spending
more time on thinking on the best long-term strategy? I have some ideas
here and I thought about this work as a preparatory step, but maybe you're
right and it makes sense to approach it more comprehensively.
Thanks!
next prev parent reply other threads:[~2024-07-25 23:34 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-24 20:20 [PATCH v2 0/5] This patchset reorganizes page_counter structures which helps to make Roman Gushchin
2024-07-24 20:20 ` [PATCH v2 1/5] mm: memcg: don't call propagate_protected_usage() needlessly Roman Gushchin
2024-07-24 20:52 ` Shakeel Butt
2024-07-24 23:13 ` Yosry Ahmed
2024-07-24 23:31 ` Roman Gushchin
2024-07-24 23:35 ` Yosry Ahmed
2024-07-24 20:21 ` [PATCH v2 2/5] mm: page_counters: put page_counter_calculate_protection() under CONFIG_MEMCG Roman Gushchin
2024-07-24 23:05 ` Shakeel Butt
2024-07-24 20:21 ` [PATCH v2 3/5] mm: memcg: merge multiple page_counters into a single structure Roman Gushchin
2024-07-24 23:43 ` Shakeel Butt
2024-07-25 21:42 ` Johannes Weiner
2024-07-25 23:33 ` Roman Gushchin [this message]
2024-07-26 23:31 ` Johannes Weiner
2024-07-24 20:21 ` [PATCH v2 4/5] mm: page_counters: initialize usage using ATOMIC_LONG_INIT() macro Roman Gushchin
2024-07-24 23:45 ` Shakeel Butt
2024-07-24 20:21 ` [PATCH v2 5/5] mm: memcg: convert enum res_type to mem_counter_type Roman Gushchin
2024-07-24 23:48 ` Shakeel Butt
2024-07-24 23:54 ` Roman Gushchin
2024-07-24 20:40 ` [PATCH v2 0/5] This patchset reorganizes page_counter structures which helps to make Roman Gushchin
2024-07-24 22:17 ` Shakeel Butt
2024-07-24 22:30 ` Roman Gushchin
2024-07-25 21:42 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZqLg30nOUVlerBh1@google.com \
--to=roman.gushchin@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=muchun.song@linux.dev \
--cc=shakeel.butt@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.