From: Johannes Weiner <hannes@cmpxchg.org>
To: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: linux-mm@kvack.org, jiayuan.chen@shopee.com,
	yingfu.zhou@shopee.com, Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kairui Song <kasong@tencent.com>, Qi Zheng <qi.zheng@linux.dev>,
	Barry Song <baohua@kernel.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <ljs@kernel.org>,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH v2 0/4] memcg: bail out reclaim when memcg is dying
Date: Tue, 30 Jun 2026 16:05:04 -0400
Message-ID: <akQhcC60mufcxVHm@cmpxchg.org>
In-Reply-To: <20260630012909.144372-1-jiayuan.chen@linux.dev>

The series looks good to me. But please add

	/* cgroup_rmdir() waits for us with cgroup_mutex held. */

to these bailouts. It's a bit unfortunate that we need to have these
inside memcg. But decoupling this on the cgroup core/kernfs side looks
like a bigger project, and we should get this bug fixed.

With that, please feel free to include in your patches:

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

CCing Tejun as well, full quote follows.

On Tue, Jun 30, 2026 at 09:29:00AM +0800, Jiayuan Chen wrote:
> Hi,
> 
> This series mitigates a system-wide stall we hit when a cgroup is
> removed while one of its memory control files is doing synchronous
> reclaim.
> 
> Problem Description
> ===================
> 
> Writing to memory.high, memory.max or memory.reclaim runs reclaim
> synchronously in the writer's context, looping until the usage drops
> below the target (or, for memory.reclaim, until the requested amount has
> been reclaimed). On a large cgroup this can take a long time. The
> latency is especially bad when reclaim has to perform swap I/O, where it
> is bound by the swap device write bandwidth, and under thrashing it is
> effectively unbounded - each round reclaims a few pages that the
> workload immediately faults back in, so the loop keeps making "progress"
> and never converges.
> 
> The legacy (v1) reclaim loops in memory.limit_in_bytes,
> memory.memsw.limit_in_bytes and memory.force_empty share the same
> pattern.
> 
> These writes go through cgroup_file_write(), which does not take
> cgroup_mutex and does not pin the css. Instead, kernfs guarantees the
> node (and thus the css) stays alive for the duration of the operation by
> holding an active reference. So while the reclaim loop runs, the active
> reference on the file is held.
> 
> If another task removes the same cgroup in parallel, cgroup_rmdir()
> takes cgroup_mutex and then blocks in kernfs_drain() waiting for that
> active reference to drain. Because cgroup_mutex is held throughout the
> wait, every other task that needs it piles up behind the remover - in
> our case the whole machine ground to a halt, with hung_task reports for
> the remover and for unrelated tasks merely reading /proc/<pid>/cgroup:
> 
> INFO: task cgdelete:366634 blocked for more than 159 seconds.
>       Not tainted 6.6.102+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Call Trace:
>  <TASK>
>  __schedule+0x3da/0x1650
>  schedule+0x58/0x100
>  kernfs_drain+0xe6/0x150
>  __kernfs_remove.part.0+0xd0/0x200
>  kernfs_remove_by_name_ns+0x75/0xd0
>  cgroup_addrm_files+0x325/0x410
>  css_clear_dir+0x50/0xf0
>  cgroup_destroy_locked+0xdf/0x1e0
>  cgroup_rmdir+0x2d/0xd0
>  kernfs_iop_rmdir+0x53/0x90
>  vfs_rmdir+0x98/0x240
>  do_rmdir+0x172/0x1b0
>  __x64_sys_rmdir+0x42/0x70
>  x64_sys_call+0xeb0/0x2210
>  do_syscall_64+0x56/0x90
>  entry_SYSCALL_64_after_hwframe+0x78/0xe2
> 
> 
> INFO: task systemd-journal:2352 blocked for more than 182 seconds.
>       Not tainted 6.6.102+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Call Trace:
>  <TASK>
>  __schedule+0x3da/0x1650
>  schedule+0x58/0x100
>  schedule_preempt_disabled+0xe/0x20
>  __mutex_lock.constprop.0+0x3bb/0x640
>  __mutex_lock_slowpath+0x13/0x20
>  mutex_lock+0x3c/0x50
>  proc_cgroup_show+0x4d/0x380
>  proc_single_show+0x53/0xe0
>  seq_read_iter+0x12f/0x4b0
>  seq_read+0xcd/0x110
>  vfs_read+0xb1/0x360
>  ? __seccomp_filter+0x368/0x590
>  ksys_read+0x73/0x100
>  __x64_sys_read+0x19/0x30
>  x64_sys_call+0x18d3/0x2210
>  do_syscall_64+0x56/0x90
>  entry_SYSCALL_64_after_hwframe+0x78/0xe2
> 
> The system recovers only once the reclaim finally finishes and releases
> the active reference. The reclaim itself is pointless here: the cgroup
> is being torn down and its remaining pages will be reparented to the
> parent anyway.
> 
> The reclaim loop does check signal_pending(current), so a signal sent
> to the writer can interrupt it. But the typical symptom is an unrelated
> command such as cat /proc/<pid>/cgroup getting stuck, and by the time
> someone works out which task is actually stuck in reclaim, the hung
> task timeout has already been hit. This makes the problem particularly
> nasty to debug from a hung-task report alone, because the blocked tasks
> shown are often the victims, not the reclaim writer itself.
> 
> Our Mitigation
> ==============
> 
> cgroup destruction sets CSS_DYING in kill_css_sync() *before*
> css_clear_dir() triggers the kernfs_drain() that blocks the remover. The
> in-flight reclaim loop is therefore guaranteed to observe it before
> starting another reclaim iteration. This series checks memcg_is_dying()
> in the v2 reclaim loops (memory.high, memory.max and proactive reclaim)
> and the v1 reclaim loops (memory.limit_in_bytes,
> memory.memsw.limit_in_bytes and memory.force_empty), and bails out early,
> so the writer drops the active reference promptly and the remover can
> make progress.
> 
> Unlike the no-progress guard (MAX_RECLAIM_RETRIES), which only fires when
> reclaim makes zero progress, the dying check also covers the slow swap
> I/O and thrashing cases, where reclaim keeps succeeding a little and the
> loop would otherwise never converge.
> 
> For memory.reclaim, bailing out because the memcg is dying means the
> requested reclaim amount was not satisfied, so the write returns -EAGAIN.
> 
> This is orthogonal to commit c8e6002bd611 ("memcg: introduce
> non-blocking limit setting option"): O_NONBLOCK lets a caller avoid the
> synchronous reclaim up front, while this series handles the case where
> reclaim is already running when the cgroup starts being removed.
> 
> Changes since v1:
>   - Return -EAGAIN from memory.reclaim when the memcg is dying.
>   - Add the same bailout to the legacy v1 reclaim loops.
> 
> v1:
>   https://lore.kernel.org/linux-mm/20260623062800.298514-1-jiayuan.chen@linux.dev/
> 
> Jiayuan Chen (4):
>   memcg: bail out memory.high when memcg is dying
>   memcg: bail out memory.max when memcg is dying
>   memcg: bail out proactive reclaim when memcg is dying
>   memcg-v1: bail out reclaim when memcg is dying
> 
>  mm/memcontrol-v1.c | 6 ++++++
>  mm/memcontrol.c    | 6 ++++++
>  mm/vmscan.c        | 3 +++
>  3 files changed, 15 insertions(+)
> 
> -- 
> 2.43.0
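
For readers following the thread, the bailout pattern described in the
cover letter can be sketched in user space roughly as follows. This is
only a model under stated assumptions, not the code from the patches:
struct model_memcg, model_memcg_is_dying(), model_reclaim() and
model_reclaim_to_target() are hypothetical stand-ins, the batch size of
32 pages is arbitrary, and the rmdir_at_round parameter merely simulates
a concurrent cgroup_rmdir() setting CSS_DYING. The return value is a
plain bool and does not model what the real control files return.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: pass as rmdir_at_round when no removal should happen. */
#define MODEL_NO_RMDIR ULONG_MAX
/* Hypothetical: pages reclaimed per round in this model. */
#define MODEL_BATCH 32UL

/* User-space stand-in for a memcg; not the kernel structure. */
struct model_memcg {
	unsigned long usage;   /* pages currently charged */
	unsigned long refault; /* pages the workload faults back per round */
	bool dying;            /* models CSS_DYING having been set */
};

static bool model_memcg_is_dying(const struct model_memcg *memcg)
{
	return memcg->dying;
}

/* One reclaim round: free up to a batch, then let the workload refault. */
static unsigned long model_reclaim(struct model_memcg *memcg)
{
	unsigned long freed = memcg->usage < MODEL_BATCH ? memcg->usage
							 : MODEL_BATCH;

	memcg->usage -= freed;
	memcg->usage += memcg->refault;
	return freed;
}

/*
 * Reclaim until usage drops to target. Returns true if the target was
 * reached, false if the loop bailed out because the memcg is dying.
 * *rounds receives the number of reclaim rounds that were run.
 */
static bool model_reclaim_to_target(struct model_memcg *memcg,
				    unsigned long target,
				    unsigned long rmdir_at_round,
				    unsigned long *rounds)
{
	unsigned long round = 0;

	assert(memcg != NULL && rounds != NULL);

	while (memcg->usage > target) {
		/* Simulated remover: marks the memcg dying mid-loop. */
		if (round == rmdir_at_round)
			memcg->dying = true;

		/* cgroup_rmdir() waits for us with cgroup_mutex held. */
		if (model_memcg_is_dying(memcg)) {
			*rounds = round;
			return false;
		}

		model_reclaim(memcg);
		round++;
	}

	*rounds = round;
	return true;
}
```

With refault equal to the batch size the model thrashes: usage never
drops, so without the dying check the loop would not terminate, and a
no-progress guard would not fire either because every round reclaims
pages. Setting rmdir_at_round to a small value shows the loop returning
right after the simulated removal begins.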


