Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Johannes Weiner <hannes@cmpxchg.org>
To: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>,
	Rik van Riel <riel@surriel.com>,
	Balbir Singh <balbirs@nvidia.com>,
	Michal Hocko <mhocko@kernel.org>,
	hakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kernel-team@meta.com,
	Nhat Pham <nphamcs@gmail.com>
Subject: Re: [PATCH v2] memcg: allow exiting tasks to write back data to swap
Date: Thu, 12 Dec 2024 23:42:13 -0500	[thread overview]
Message-ID: <20241213044213.GA6910@cmpxchg.org> (raw)
In-Reply-To: <Z1uAi0syiPY7h6wt@google.com>

On Fri, Dec 13, 2024 at 12:32:11AM +0000, Roman Gushchin wrote:
> On Thu, Dec 12, 2024 at 01:30:12PM -0500, Johannes Weiner wrote:
> > On Thu, Dec 12, 2024 at 09:06:25AM -0800, Yosry Ahmed wrote:
> > > On Thu, Dec 12, 2024 at 8:58 AM Rik van Riel <riel@surriel.com> wrote:
> > > >
> > > > A task already in exit can get stuck trying to allocate pages, if its
> > > > cgroup is at the memory.max limit, the cgroup is using zswap, but
> > > > zswap writeback is enabled, and the remaining memory in the cgroup is
> > > > not compressible.
> > > >
> > > > This seems like an unlikely confluence of events, but it can happen
> > > > quite easily if a cgroup is OOM killed due to exceeding its memory.max
> > > > limit, and all the tasks in the cgroup are trying to exit simultaneously.
> > > >
> > > > When this happens, it can sometimes take hours for tasks to exit,
> > > > as they are all trying to squeeze things into zswap to bring the group's
> > > > memory consumption below memory.max.
> > > >
> > > > Allowing these exiting programs to push some memory from their own
> > > > cgroup into swap allows them to quickly bring the cgroup's memory
> > > > consumption below memory.max, and exit in seconds rather than hours.
> > > >
> > > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > > 
> > > Thanks for sending a v2.
> > > 
> > > I still think maybe this needs to be fixed on the memcg side, at least
> > > by not making exiting tasks try really hard to reclaim memory to the
> > > point where this becomes a problem. IIUC there could be other reasons
> > > why reclaim may take too long, but maybe not as pathological as this
> > > case to be fair. I will let the memcg maintainers chime in for this.
> > > 
> > > If there's a fundamental reason why this cannot be fixed on the memcg
> > > side, I don't object to this change.
> > > 
> > > Nhat, any objections on your end? I think your fleet workloads were
> > > the first users of this interface. Does this break their expectations?
> > 
> > Yes, I don't think we can do this, unfortunately :( There can be a
> > variety of reasons for why a user might want to prohibit disk swap for
> > a certain cgroup, and we can't assume it's okay to make exceptions.
> > 
> > There might also not *be* any disk swap to overflow into after Nhat's
> > virtual swap patches. Presumably zram would still have the issue too.
> > 
> > So I'm also inclined to think this needs a reclaim/memcg-side fix. We
> > have a somewhat tumultous history of policy in that space:
> > 
> > commit 7775face207922ea62a4e96b9cd45abfdc7b9840
> > Author: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date:   Tue Mar 5 15:46:47 2019 -0800
> > 
> >     memcg: killed threads should not invoke memcg OOM killer
> > 
> > allowed dying tasks to simply force all charges and move on. This
> > turned out to be too aggressive; there were instances of exiting,
> > uncontained memcg tasks causing global OOMs. This lead to that:
> > 
> > commit a4ebf1b6ca1e011289677239a2a361fde4a88076
> > Author: Vasily Averin <vasily.averin@linux.dev>
> > Date:   Fri Nov 5 13:38:09 2021 -0700
> > 
> >     memcg: prohibit unconditional exceeding the limit of dying tasks
> > 
> > which reverted the bypass rather thoroughly. Now NO dying tasks, *not
> > even OOM victims*, can force charges. I am not sure this is correct,
> > either:
> > 
> > If we return -ENOMEM to an OOM victim in a fault, the fault handler
> > will re-trigger OOM, which will find the existing OOM victim and do
> > nothing, then restart the fault. This is a memory deadlock. The page
> > allocator gives OOM victims access to reserves for that reason.
> > 
> > Actually, it looks even worse. For some reason we're not triggering
> > OOM from dying tasks:
> > 
> >         ret = task_is_dying() || out_of_memory(&oc);
> > 
> > Even though dying tasks are in no way privileged or allowed to exit
> > expediently. Why shouldn't they trigger the OOM killer like anybody
> > else trying to allocate memory?
> > 
> > As it stands, it seems we have dying tasks getting trapped in an
> > endless fault->reclaim cycle; with no access to the OOM killer and no
> > access to reserves. Presumably this is what's going on here?
> > 
> > I think we want something like this:
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 53db98d2c4a1..be6b6e72bde5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1596,11 +1596,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (mem_cgroup_margin(memcg) >= (1 << order))
> >  		goto unlock;
> >  
> > -	/*
> > -	 * A few threads which were not waiting at mutex_lock_killable() can
> > -	 * fail to bail out. Therefore, check again after holding oom_lock.
> > -	 */
> > -	ret = task_is_dying() || out_of_memory(&oc);
> > +	ret = out_of_memory(&oc);
> 
> I like the idea, but at first glance it might reintroduce the problem
> fixed by 7775face2079 ("memcg: killed threads should not invoke memcg OOM killer").

The race and warning pointed out in the changelog might have been
sufficiently mitigated by this more recent commit:

commit 1378b37d03e8147c67fde60caf0474ea879163d8
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Thu Aug 6 23:22:08 2020 -0700

    memcg, oom: check memcg margin for parallel oom

If not, another possibility would be this:

	ret = tsk_is_oom_victim(task) || out_of_memory(&oc);

But we should probably first restore reliable forward progress on
dying tasks, then worry about the spurious printk later.

next prev parent reply	other threads:[~2024-12-13  4:42 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-12 16:57 [PATCH v2] memcg: allow exiting tasks to write back data to swap Rik van Riel
2024-12-12 17:06 ` Yosry Ahmed
2024-12-12 17:51   ` Shakeel Butt
2024-12-12 18:02     ` Rik van Riel
2024-12-12 18:18       ` Nhat Pham
2024-12-12 18:11   ` Nhat Pham
2024-12-12 18:30   ` Johannes Weiner
2024-12-12 21:35     ` Shakeel Butt
2024-12-12 21:41       ` Yosry Ahmed
2024-12-13  0:32     ` Roman Gushchin
2024-12-13  4:42       ` Johannes Weiner [this message]
2024-12-16 15:39     ` Michal Hocko
2025-01-14 16:09       ` Johannes Weiner
2025-01-14 16:46         ` Michal Hocko
2025-01-14 16:51           ` Rik van Riel
2025-01-14 17:00             ` Michal Hocko
2025-01-14 17:11               ` Rik van Riel
2025-01-14 18:13                 ` Michal Hocko
2025-01-14 19:23                   ` Johannes Weiner
2025-01-14 19:42                     ` Michal Hocko
2025-01-15 17:35                       ` Rik van Riel
2025-01-15 19:41                         ` Michal Hocko
2025-01-14 16:54           ` Michal Hocko
2025-01-14 16:56             ` Rik van Riel
2025-01-14 16:56             ` Michal Hocko
2024-12-12 18:31 ` Roman Gushchin
2024-12-12 20:00   ` Rik van Riel
2024-12-13  0:49     ` Roman Gushchin
2024-12-13  2:54     ` Balbir Singh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241213044213.GA6910@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=balbirs@nvidia.com \
    --cc=cgroups@vger.kernel.org \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=nphamcs@gmail.com \
    --cc=riel@surriel.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.