From mboxrd@z Thu Jan 1 00:00:00 1970
From: Waiman Long
Subject: Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
Date: Wed, 26 Apr 2023 16:15:47 -0400
Message-ID: <8ad74529-890a-8300-c2ad-ddaa679b9c87@redhat.com>
References: <27e15be8-d0eb-ed32-a0ec-5ec9b59f1f27@redhat.com>
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Language: en-US
Content-Type: text/plain; charset="utf-8"; format="flowed"
To: Yosry Ahmed
Cc: "T.J. Mercier",
 lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org,
 linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org,
 cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
 Tejun Heo, Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin,
 Alistair Popple, Jason Gunthorpe, Kalesh Singh, Yu Zhao, Matthew Wilcox,
 David Rientjes, Greg Thelen

On 4/25/23 14:53, Yosry Ahmed wrote:
> On Tue, Apr 25, 2023 at 11:42 AM Waiman Long wrote:
>> On 4/25/23 07:36, Yosry Ahmed wrote:
>>> +David Rientjes +Greg Thelen +Matthew Wilcox
>>>
>>> On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed wrote:
>>>> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier wrote:
>>>>> When a memcg is removed by userspace it gets offlined by the kernel.
>>>>> Offline memcgs are hidden from user space, but they still live in the
>>>>> kernel until their reference count drops to 0.
>>>>> New allocations cannot be charged to offline memcgs, but existing
>>>>> allocations charged to offline memcgs remain charged, and hold a
>>>>> reference to the memcg.
>>>>>
>>>>> As such, an offline memcg can remain in the kernel indefinitely,
>>>>> becoming a zombie memcg. The accumulation of a large number of zombie
>>>>> memcgs leads to increased system overhead (mainly percpu data in struct
>>>>> mem_cgroup). It also causes some kernel operations that scale with the
>>>>> number of memcgs to become less efficient (e.g. reclaim).
>>>>>
>>>>> There are currently out-of-tree solutions which attempt to
>>>>> periodically clean up zombie memcgs by reclaiming from them. However
>>>>> that is not effective for non-reclaimable memory, which it would be
>>>>> better to reparent or recharge to an online cgroup. There are also
>>>>> proposed changes that would benefit from recharging for shared
>>>>> resources like pinned pages, or DMA buffer pages.
>>>> I am very interested in attending this discussion, it's something that
>>>> I have been actively looking into -- specifically recharging pages of
>>>> offlined memcgs.
>>>>
>>>>> Suggested attendees:
>>>>> Yosry Ahmed
>>>>> Yu Zhao
>>>>> T.J. Mercier
>>>>> Tejun Heo
>>>>> Shakeel Butt
>>>>> Muchun Song
>>>>> Johannes Weiner
>>>>> Roman Gushchin
>>>>> Alistair Popple
>>>>> Jason Gunthorpe
>>>>> Kalesh Singh
>>> I was hoping I would bring a more complete idea to this thread, but
>>> here is what I have so far.
>>>
>>> The idea is to recharge the memory charged to memcgs when they are
>>> offlined. I like to think of the options we have to deal with memory
>>> charged to offline memcgs as a toolkit. This toolkit includes:
>>>
>>> (a) Evict memory.
>>>
>>> This is the simplest option, just evict the memory.
>>>
>>> For file-backed pages, this writes them back to their backing files,
>>> uncharging and freeing the page. The next access will read the page
>>> again and the faulting process’s memcg will be charged.
>>>
>>> For swap-backed pages (anon/shmem), this swaps them out. Swapping out
>>> a page charged to an offline memcg uncharges the page and charges the
>>> swap to its parent. The next access will swap in the page and the
>>> parent will be charged. This is effectively deferred recharging to the
>>> parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - Behavior is different for file-backed vs. swap-backed pages, for
>>> swap-backed pages, the memory is recharged to the parent (aka
>>> reparented), not charged to the "rightful" user.
>>> - Next access will incur higher latency, especially if the pages are active.
>>>
>>> (b) Direct recharge to the parent
>>>
>>> This can be done for any page and should be simple as the pages are
>>> already hierarchically charged to the parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - If a different memcg is using the memory, it will keep taxing the
>>> parent indefinitely. Same not the "rightful" user argument.
>> Muchun had actually posted a patch to do this last year. See
>>
>> https://lore.kernel.org/all/20220621125658.64935-10-songmuchun-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147
>>
>> I am wondering if he is going to post an updated version of that or not.
>> Anyway, I am looking forward to learning about the result of this
>> discussion even though I am not a conference invitee.
> There are a couple of problems that were brought up back then, mainly
> that memory will be reparented to the root memcg eventually,
> practically escaping accounting. Shared resources may end up being
> eventually unaccounted. Ideally, we can come up with a scheme where
> the memory is charged to the real user, instead of just to the parent.
>
> Consider the case where processes in memcg A and B are both using
> memory that is charged to memcg A. If memcg A goes offline, and we
> reparent the memory, memcg B keeps using the memory for free, taxing
> A's parent, or the entire system if that's root.
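
The A/B scenario just described can be sketched as a toy model. This is only an illustration under stated assumptions, not kernel code: `ToyMemcg`, `charge`, `offline_and_reparent` and `demo` are hypothetical names, and the two counters are a simplification of hierarchical charging rather than the kernel's actual accounting.

```python
class ToyMemcg:
    """Toy stand-in for a memory cgroup (hypothetical, not struct mem_cgroup)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.online = True
        self.local_pages = 0  # pages whose charge is owned by this memcg
        self.usage = 0        # hierarchical usage: this memcg plus descendants

    def charge(self, nr_pages):
        """Charge pages to this memcg and to every ancestor."""
        if not self.online:
            raise RuntimeError(f"{self.name} is offline; new charges are refused")
        self.local_pages += nr_pages
        node = self
        while node is not None:
            node.usage += nr_pages
            node = node.parent

    def offline_and_reparent(self):
        """Offline this memcg and hand its charges to the parent.

        Assumption of this sketch: the parent's hierarchical usage already
        includes these pages, so only ownership of the charge moves.
        """
        if self.parent is None:
            raise ValueError("the root memcg has no parent to reparent to")
        self.online = False
        moved = self.local_pages
        self.parent.local_pages += moved
        self.local_pages = 0
        self.usage -= moved
        return moved


def demo():
    root = ToyMemcg("root")
    parent = ToyMemcg("parent", root)
    a = ToyMemcg("A", parent)
    b = ToyMemcg("B", parent)
    # Shared pages are first touched in A, so A is charged; tasks in B use
    # the same pages without being charged for them.
    a.charge(100)
    a.offline_and_reparent()
    return {m.name: (m.local_pages, m.usage) for m in (root, parent, a, b)}
```

In this model, after A goes offline the 100 pages are owned by A's parent while B's counters stay at zero, which is the "taxing A's parent" effect. Repeating `offline_and_reparent()` up the tree would eventually leave the charge with the root.
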
>
> Also, if there is a kernel bug and a page is being pinned
> unnecessarily, those pages will never be reclaimed and will stick
> around and eventually be reparented to the root memcg. If being
> reparented to the root memcg is a legitimate action, you can't simply
> tell apart if pages are sticking around just because they are being
> used by someone or if there is a kernel bug.

This is certainly a valid concern. We are currently doing reparenting
for slab objects. However, physical pages have a higher probability of
being shared by different tasks. I do hope that we can come to an
agreement soon on how best to address this issue.

Thanks,
Longman