From: Roman Gushchin <guro@fb.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
Michal Hocko <mhocko@kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
Kernel Team <Kernel-team@fb.com>,
"Shakeel Butt" <shakeelb@google.com>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Waiman Long <longman@redhat.com>
Subject: Re: [PATCH RFC 01/14] mm: memcg: subpage charging API
Date: Tue, 17 Sep 2019 18:33:15 +0000 [thread overview]
Message-ID: <20190917183308.GA9776@castle> (raw)
In-Reply-To: <20190917085004.GA1486@cmpxchg.org>
On Tue, Sep 17, 2019 at 10:50:04AM +0200, Johannes Weiner wrote:
> On Tue, Sep 17, 2019 at 02:27:19AM +0000, Roman Gushchin wrote:
> > On Mon, Sep 16, 2019 at 02:56:11PM +0200, Johannes Weiner wrote:
> > > On Thu, Sep 05, 2019 at 02:45:45PM -0700, Roman Gushchin wrote:
> > > > Introduce an API to charge subpage objects to the memory cgroup.
> > > > The API will be used by the new slab memory controller. Later it
> > > > can also be used to implement percpu memory accounting.
> > > > In both cases, a single page can be shared between multiple cgroups
> > > > (and in percpu case a single allocation is split over multiple pages),
> > > > so it's not possible to use page-based accounting.
> > > >
> > > > The implementation is based on percpu stocks. Memory cgroups are still
> > > > charged in pages, and the residue is stored in perpcu stock, or on the
> > > > memcg itself, when it's necessary to flush the stock.
> > >
> > > Did you just implement a slab allocator for page_counter to track
> > > memory consumed by the slab allocator?
> >
> > :)
> >
> > >
> > > > @@ -2500,8 +2577,9 @@ void mem_cgroup_handle_over_high(void)
> > > > }
> > > >
> > > > static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > - unsigned int nr_pages)
> > > > + unsigned int amount, bool subpage)
> > > > {
> > > > + unsigned int nr_pages = subpage ? ((amount >> PAGE_SHIFT) + 1) : amount;
> > > > unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
> > > > int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > > > struct mem_cgroup *mem_over_limit;
> > > > @@ -2514,7 +2592,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > > if (mem_cgroup_is_root(memcg))
> > > > return 0;
> > > > retry:
> > > > - if (consume_stock(memcg, nr_pages))
> > > > + if (subpage && consume_subpage_stock(memcg, amount))
> > > > + return 0;
> > > > + else if (!subpage && consume_stock(memcg, nr_pages))
> > > > return 0;
> > >
> > > The layering here isn't clean. We have an existing per-cpu cache to
> > > batch-charge the page counter. Why does the new subpage allocator not
> > > sit on *top* of this, instead of wedged in between?
> > >
> > > I think what it should be is a try_charge_bytes() that simply gets one
> > > page from try_charge() and then does its byte tracking, regardless of
> > > how try_charge() chooses to implement its own page tracking.
> > >
> > > That would avoid the awkward @amount + @subpage multiplexing, as well
> > > as annotating all existing callsites of try_charge() with a
> > > non-descript "false" parameter.
> > >
> > > You can still reuse the stock data structures, use the lower bits of
> > > stock->nr_bytes for a different cgroup etc., but the charge API should
> > > really be separate.
> >
> > Hm, I kinda like the idea, however there is a complication: for the subpage
> > accounting the css reference management is done in a different way, so that
> > all existing code should avoid changing the css refcounter. So I'd need
> > to pass a boolean argument anyway.
>
> Can you elaborate on the refcounting scheme? I don't quite understand
> how there would be complications with that.
>
> Generally, references are held for each page that is allocated in the
> page_counter. try_charge() allocates a batch of css references,
> returns one and keeps the rest in stock.
>
> So couldn't the following work?
>
> When somebody allocates a subpage, the css reference returned by
> try_charge() is shared by the allocated subpage object and the
> remainder that is kept via stock->subpage_cache and stock->nr_bytes
> (or memcg->nr_stocked_bytes when the percpu cache is reset).
Because individual objects are a subject of reparenting and can outlive
the origin memory cgroup, they shouldn't hold a direct reference to the
memory cgroup. Instead they hold a reference to the mem_cgroup_ptr object,
and this objects holds a single reference to the memory cgroup.
Underlying pages shouldn't hold a reference too.
Btw, it's already true, just kmem_cache plays the role of such intermediate
object, and we do an explicit transfer of charge (look at memcg_charge_slab()).
So we initially associate a page with the memcg, and almost immediately
after break this association and insert kmem_cache in between.
But with subpage accounting it's not possible, as a page is shared between
multiple cgroups, and it can't be attributed to any specific cgroup at
any time.
>
> When the subpage objects are freed, you'll eventually have a full page
> again in stock->nr_bytes, at which point you page_counter_uncharge()
> paired with css_put(_many) as per usual.
>
> A remainder left in old->nr_stocked_bytes would continue to hold on to
> one css reference. (I don't quite understand who is protecting this
> remainder in your current version, actually. A bug?)
>
> Instead of doing your own batched page_counter uncharging in
> refill_subpage_stock() -> drain_subpage_stock(), you should be able to
> call refill_stock() when stock->nr_bytes adds up to a whole page again.
>
> Again, IMO this would be much cleaner architecture if there was a
> try_charge_bytes() byte allocator that would sit on top of a cleanly
> abstracted try_charge() page allocator, just like the slab allocator
> is sitting on top of the page allocator - instead of breaking through
> the abstraction layer of the underlying page allocator.
>
As I said, I like the idea to put it on top, but it can't be put on top
without changes in css refcounting (or I don't see how). I don't know
how to mix stocks which are holding css references and which are not,
so I might end up with two stocks as in current implementation. Then
the idea of having another layer of caching on top looks slightly less
appealing, but maybe still worth a try.
Thanks!
next prev parent reply other threads:[~2019-09-17 18:33 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-09-05 21:45 [PATCH RFC 00/14] The new slab memory controller Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 01/14] mm: memcg: subpage charging API Roman Gushchin
2019-09-16 12:56 ` Johannes Weiner
2019-09-17 2:27 ` Roman Gushchin
2019-09-17 8:50 ` Johannes Weiner
2019-09-17 18:33 ` Roman Gushchin [this message]
2019-09-05 21:45 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
2019-09-05 22:34 ` Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 03/14] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 04/14] mm: vmstat: convert slab vmstat counter to bytes Roman Gushchin
2019-09-16 12:38 ` Johannes Weiner
2019-09-17 2:08 ` Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 05/14] mm: memcg/slab: allocate space for memcg ownership data for non-root slabs Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 06/14] mm: slub: implement SLUB version of obj_to_index() Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 07/14] mm: memcg/slab: save memcg ownership data for non-root slab objects Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 08/14] mm: memcg: move memcg_kmem_bypass() to memcontrol.h Roman Gushchin
2019-09-05 21:45 ` [PATCH RFC 09/14] mm: memcg: introduce __mod_lruvec_memcg_state() Roman Gushchin
2019-09-05 22:37 ` [PATCH RFC 02/14] mm: memcg: introduce mem_cgroup_ptr Roman Gushchin
2019-09-17 19:48 ` [PATCH RFC 00/14] The new slab memory controller Waiman Long
2019-09-17 21:24 ` Roman Gushchin
2019-09-19 13:39 ` Suleiman Souhlal
2019-09-19 16:22 ` Roman Gushchin
2019-09-19 21:10 ` Suleiman Souhlal
2019-09-19 21:40 ` Roman Gushchin
2019-10-01 15:12 ` Michal Koutný
2019-10-02 2:09 ` Roman Gushchin
2019-10-02 13:00 ` Suleiman Souhlal
2019-10-03 10:47 ` Michal Koutný
2019-10-03 15:52 ` Roman Gushchin
2019-12-09 9:17 ` [PATCH 00/16] " Bharata B Rao
2019-12-09 11:56 ` Bharata B Rao
2019-12-09 18:04 ` Roman Gushchin
2019-12-10 6:23 ` Bharata B Rao
2019-12-10 18:05 ` Roman Gushchin
2020-01-13 8:47 ` Bharata B Rao
2020-01-13 15:31 ` Roman Gushchin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190917183308.GA9776@castle \
--to=guro@fb.com \
--cc=Kernel-team@fb.com \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=longman@redhat.com \
--cc=mhocko@kernel.org \
--cc=shakeelb@google.com \
--cc=vdavydov.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.