All of lore.kernel.org
 help / color / mirror / Atom feed
From: Roman Gushchin <roman.gushchin@linux.dev>
To: Muchun Song <songmuchun@bytedance.com>
Cc: Waiman Long <longman@redhat.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Shakeel Butt <shakeelb@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org,
	"Matthew Wilcox (Oracle)" <willy@infradead.org>,
	Yang Shi <shy828301@gmail.com>, Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH] mm/memcg: Free percpu stats memory of dying memcg's
Date: Fri, 22 Apr 2022 08:00:39 -0700	[thread overview]
Message-ID: <YmLDF58Mz3inl9Ev@carbon> (raw)
In-Reply-To: <YmKDCjJFYMmfa8sG@FVFYT0MHHV2J.usts.net>

On Fri, Apr 22, 2022 at 06:27:22PM +0800, Muchun Song wrote:
> On Thu, Apr 21, 2022 at 07:59:00PM -0700, Roman Gushchin wrote:
> > On Thu, Apr 21, 2022 at 02:46:00PM -0400, Waiman Long wrote:
> > > On 4/21/22 13:59, Roman Gushchin wrote:
> > > > On Thu, Apr 21, 2022 at 01:28:20PM -0400, Waiman Long wrote:
> > > > > On 4/21/22 12:33, Roman Gushchin wrote:
> > > > > > On Thu, Apr 21, 2022 at 10:58:45AM -0400, Waiman Long wrote:
> > > > > > > For systems with large number of CPUs, the majority of the memory
> > > > > > > consumed by the mem_cgroup structure is actually the percpu stats
> > > > > > > memory. When a large number of memory cgroups are continuously created
> > > > > > > and destroyed (like in a container host), it is possible that more
> > > > > > > and more mem_cgroup structures remained in the dying state holding up
> > > > > > > increasing amount of percpu memory.
> > > > > > > 
> > > > > > > We can't free up the memory of the dying mem_cgroup structure due to
> > > > > > > active references in some other places. However, the percpu stats memory
> > > > > > > allocated to that mem_cgroup is a different story.
> > > > > > > 
> > > > > > > This patch adds a new percpu_stats_disabled variable to keep track of
> > > > > > > the state of the percpu stats memory. If the variable is set, percpu
> > > > > > > stats update will be disabled for that particular memcg. All the stats
> > > > > > > update will be forward to its parent instead. Reading of the its percpu
> > > > > > > stats will return 0.
> > > > > > > 
> > > > > > > The flushing and freeing of the percpu stats memory is a multi-step
> > > > > > > process. The percpu_stats_disabled variable is set when the memcg is
> > > > > > > being set to offline state. After a grace period with the help of RCU,
> > > > > > > the percpu stats data are flushed and then freed.
> > > > > > > 
> > > > > > > This will greatly reduce the amount of memory held up by dying memory
> > > > > > > cgroups.
> > > > > > > 
> > > > > > > By running a simple management tool for container 2000 times per test
> > > > > > > run, below are the results of increases of percpu memory (as reported
> > > > > > > in /proc/meminfo) and nr_dying_descendants in root's cgroup.stat.
> > > > > > Hi Waiman!
> > > > > > 
> > > > > > I've been proposing the same idea some time ago:
> > > > > > https://lore.kernel.org/all/20190312223404.28665-7-guro@fb.com/T/ .
> > > > > > 
> > > > > > However I dropped it with the thinking that with many other fixes
> > > > > > preventing the accumulation of the dying cgroups it's not worth the added
> > > > > > complexity and a potential cpu overhead.
> > > > > > 
> > > > > > I think it ultimately comes to the number of dying cgroups. If it's low,
> > > > > > memory savings are not worth the cpu overhead. If it's high, they are.
> > > > > > I hope long-term to drive it down significantly (with lru-pages reparenting
> > > > > > being the first major milestone), but it might take a while.
> > > > > > 
> > > > > > I don't have a strong opinion either way, just want to dump my thoughts
> > > > > > on this.
> > > > > I have quite a number of customer cases complaining about increasing percpu
> > > > > memory usages. The number of dying memcg's can go to tens of thousands. From
> > > > > my own investigation, I believe that those dying memcg's are not freed
> > > > > because they are pinned down by references in the page structure. I am aware
> > > > > that we support the use of objcg in the page structure which will allow easy
> > > > > reparenting, but most pages don't do that and it is not easy to do this
> > > > > conversion and it may take quite a while to do that.
> > > > The big question is whether there is a memory pressure on those systems.
> > > > If yes, and the number of dying cgroups is growing, it's worth investigating.
> > > > It might be due to the sharing of pagecache pages and this will be ultimately
> > > > fixed with implementing of the pagecache reparenting. But it also might be due
> > > > to other bugs, which are fixable, so it would be great to understand.
> > > 
> > > 
> > > Pagecache reparenting will probably fix the problem that I have seen. Is
> > > someone working on this?
> > 
> > Some time ago Muchun posted patches based on the reusing of the obj_cgroup API.
> >
> 
> Yep. It is here:
> 
> https://lore.kernel.org/all/20220216115132.52602-1-songmuchun@bytedance.com/.
>  
> > I'm not strictly against this approach, but in my opinion it's not the best.
> > I suggested to use lru vectors as an intermediate objects. In theory, it might
> 
> I remember this.
> 
> > allow to avoid bumping reference counters for all charged pages at all: live
> > cgroups will be protected by being live, dying cgroups will only need
> > a temporarily protection while lru vectors and associated pages are reparenting.
> > 
> > There are pros and cons:
> > + cgroup reference counting becomes simpler and more debuggable
> > + potential perf wins from fewer operations with live cgroups css refcounters
> > = I hope to see code simplifications (but not guaranteed)
> > - deleting cgroups becomes more expensive, but the cost can be spread to
> >   asynchronous workers
> > 
> > Idk if Muchun tried to implement it. If not, I might try myself.
> >
> 
> Yep. I have implemented a initial version recently. I'll do some stability tests
> and send it out ASAP.

Great! Looking forward to see it! And thank you for working on it!

      reply	other threads:[~2022-04-22 15:00 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-21 14:58 [PATCH] mm/memcg: Free percpu stats memory of dying memcg's Waiman Long
2022-04-21 14:58 ` Waiman Long
     [not found] ` <20220421145845.1044652-1-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-04-21 16:33   ` Roman Gushchin
2022-04-21 16:33     ` Roman Gushchin
2022-04-21 17:28     ` Waiman Long
2022-04-21 17:28       ` Waiman Long
     [not found]       ` <112a4d7f-bc53-6e59-7bb8-6fecb65d045d-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-04-21 17:59         ` Roman Gushchin
2022-04-21 17:59           ` Roman Gushchin
2022-04-21 18:46           ` Waiman Long
2022-04-21 18:46             ` Waiman Long
     [not found]             ` <58c41f14-356e-88dd-54aa-dc6873bf80ff-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2022-04-22  2:29               ` Muchun Song
2022-04-22  2:29                 ` Muchun Song
     [not found]                 ` <YmITEEdEbaKCK3BN-t1y1lxtqHnnw2QbUemf3bixXY32XiHfO@public.gmane.org>
2022-04-25  1:01                   ` Waiman Long
2022-04-25  1:01                     ` Waiman Long
2022-04-22  2:59               ` Roman Gushchin
2022-04-22  2:59                 ` Roman Gushchin
2022-04-22 10:27                 ` Muchun Song
2022-04-22 15:00                   ` Roman Gushchin [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YmLDF58Mz3inl9Ev@carbon \
    --to=roman.gushchin@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longman@redhat.com \
    --cc=mhocko@kernel.org \
    --cc=shakeelb@google.com \
    --cc=shy828301@gmail.com \
    --cc=songmuchun@bytedance.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.