* [PATCH v2 1/4] memcg: bail out memory.high when memcg is dying
2026-06-30 1:29 [PATCH v2 0/4] memcg: bail out reclaim when memcg is dying Jiayuan Chen
@ 2026-06-30 1:29 ` Jiayuan Chen
2026-06-30 1:29 ` [PATCH v2 2/4] memcg: bail out memory.max " Jiayuan Chen
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Jiayuan Chen @ 2026-06-30 1:29 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Kairui Song, Qi Zheng, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, David Hildenbrand, Lorenzo Stoakes, cgroups,
linux-kernel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
memory.high reclaims synchronously in the writer's context, and the
latency can be very high - especially when reclaim performs swap I/O, or
under thrashing where the loop may not converge for a long time.
While this runs, the kernfs active reference on the file is held, so a
concurrent removal of the same cgroup blocks in kernfs_drain() under
cgroup_mutex until the reclaim finishes. Reclaiming a dying cgroup is
pointless, as its pages are reparented to the parent anyway.
Mitigate this by bailing out of the reclaim loop once memcg_is_dying().
Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
mm/memcontrol.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d20ffc827306..eca9f6091980 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4794,6 +4794,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		if (signal_pending(current))
 			break;
 
+		if (memcg_is_dying(memcg))
+			break;
+
 		if (!drained) {
 			drain_all_stock(memcg);
 			drained = true;
--
2.43.0
* [PATCH v2 2/4] memcg: bail out memory.max when memcg is dying
From: Jiayuan Chen @ 2026-06-30 1:29 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Kairui Song, Qi Zheng, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, David Hildenbrand, Lorenzo Stoakes, cgroups,
linux-kernel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
memory.max has the same high-latency reclaim loop as memory.high, and
may additionally invoke the OOM killer on a cgroup that is already going
away, further delaying its removal.
Mitigate this by bailing out of the loop once memcg_is_dying().
Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
mm/memcontrol.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index eca9f6091980..ad5f6dfdc021 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4848,6 +4848,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 		if (signal_pending(current))
 			break;
 
+		if (memcg_is_dying(memcg))
+			break;
+
 		if (!drained) {
 			drain_all_stock(memcg);
 			drained = true;
--
2.43.0
* [PATCH v2 3/4] memcg: bail out proactive reclaim when memcg is dying
From: Jiayuan Chen @ 2026-06-30 1:29 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
cgroups, linux-kernel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
Proactive reclaim via memory.reclaim can run for a long time - swap I/O
or thrashing again dominating the latency - and delays cgroup removal in
the same way.
Mitigate this by stopping the reclaim once memcg_is_dying().
Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
mm/vmscan.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 754c5f5d716a..091b609cf1b1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7912,6 +7912,9 @@ int user_proactive_reclaim(char *buf,
 		if (signal_pending(current))
 			return -EINTR;
 
+		if (memcg && memcg_is_dying(memcg))
+			return -EAGAIN;
+
 		/*
 		 * This is the final attempt, drain percpu lru caches in the
 		 * hope of introducing more evictable pages.
--
2.43.0
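From user space, the new return value means a writer to memory.reclaim may see EAGAIN not only when reclaim fell short, but also when the cgroup started being removed mid-write. The following is a minimal sketch of how a caller might tell the outcomes apart. The names `reclaim_result`, `classify_reclaim_errno()` and `request_reclaim()` are hypothetical helpers invented for this illustration, not part of any kernel or library API.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical outcome classes for a write to memory.reclaim. */
enum reclaim_result {
	RECLAIM_DONE,        /* the write succeeded */
	RECLAIM_PARTIAL,     /* EAGAIN: requested amount not reclaimed */
	RECLAIM_INTERRUPTED, /* EINTR: a signal was pending */
	RECLAIM_FAILED       /* any other error */
};

/* Map an errno value (0 for success) to an outcome class. */
static enum reclaim_result classify_reclaim_errno(int err)
{
	switch (err) {
	case 0:
		return RECLAIM_DONE;
	case EAGAIN:
		/*
		 * With this patch, EAGAIN can also mean the memcg is
		 * dying, so retrying blindly may never succeed.
		 */
		return RECLAIM_PARTIAL;
	case EINTR:
		return RECLAIM_INTERRUPTED;
	default:
		return RECLAIM_FAILED;
	}
}

/* Write an amount such as "1G" to the given memory.reclaim path. */
static enum reclaim_result request_reclaim(const char *path,
					   const char *amount)
{
	size_t len;
	ssize_t n;
	int fd, err = 0;

	assert(path != NULL && amount != NULL);
	len = strlen(amount);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return classify_reclaim_errno(errno);

	n = write(fd, amount, len);
	if (n < 0)
		err = errno;
	else if ((size_t)n != len)
		err = EIO; /* short write, treated as a generic failure */

	close(fd);
	return classify_reclaim_errno(err);
}
```

A caller that retries on RECLAIM_PARTIAL would probably want to bound the number of retries, since the cgroup may be gone by the next attempt.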
* [PATCH v2 4/4] memcg-v1: bail out reclaim when memcg is dying
From: Jiayuan Chen @ 2026-06-30 1:29 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, yingfu.zhou, Jiayuan Chen, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
cgroups, linux-kernel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
The legacy memory.limit_in_bytes and memory.memsw.limit_in_bytes writers
retry page_counter_set_max() by reclaiming synchronously in the writer
context. memory.force_empty similarly loops in synchronous reclaim until
the cgroup is empty or reclaim stops making progress.
These writes hold a kernfs active reference on the file. If cgroup removal
starts in parallel, the remover sets CSS_DYING and then waits in
kernfs_drain() under cgroup_mutex for the active reference to drain.
Continuing reclaim after the memcg is dying can therefore delay cgroup
removal and keep cgroup_mutex held for a long time.
Stop the v1 reclaim loops once the memcg is dying. For limit resizing,
keep the existing -EBUSY semantics when the new limit could not be
installed. For memory.force_empty, keep the existing best-effort success
semantics.
Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
mm/memcontrol-v1.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 765069211567..ad23de985d9a 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1513,6 +1513,9 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		if (!ret)
 			break;
 
+		if (memcg_is_dying(memcg))
+			break;
+
 		if (!drained) {
 			drain_all_stock(memcg);
 			drained = true;
@@ -1551,6 +1554,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 		if (signal_pending(current))
 			return -EINTR;
 
+		if (memcg_is_dying(memcg))
+			break;
+
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
 						  MEMCG_RECLAIM_MAY_SWAP, NULL))
 			nr_retries--;
--
2.43.0
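The -EBUSY semantics for limit resizing follow from where the bailout sits: it comes after the failed attempt to install the new limit, so the error from that attempt is what the writer gets back. Below is a simplified user-space model of that control flow, offered as a sketch only. `counter_model`, `model_set_max()` and `model_resize_max()` are hypothetical stand-ins, and the model assumes a fixed number of pages freed per round, which the real reclaim path does not guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page counter; not the kernel type. */
struct counter_model {
	unsigned long usage;
	unsigned long max;
};

/* Models the limit update: refuse a limit below current usage. */
static int model_set_max(struct counter_model *c, unsigned long max)
{
	if (c->usage > max)
		return -EBUSY;
	c->max = max;
	return 0;
}

/*
 * Models the resize loop. 'dying' plays the role of memcg_is_dying();
 * 'per_round' is an assumed number of pages freed per reclaim round.
 */
static int model_resize_max(struct counter_model *c, unsigned long max,
			    const bool *dying, unsigned long per_round)
{
	int ret;

	assert(c != NULL && dying != NULL);
	do {
		ret = model_set_max(c, max);
		if (!ret)
			break;

		/* Bail out with ret still holding -EBUSY. */
		if (*dying)
			break;

		/* No progress possible: give up, as the real loop does. */
		if (per_round == 0) {
			ret = -EBUSY;
			break;
		}

		c->usage -= per_round < c->usage ? per_round : c->usage;
	} while (true);

	return ret;
}
```

In the dying case the limit is left untouched and no reclaim round runs, which matches the commit message's description of the resize path.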
* Re: [PATCH v2 0/4] memcg: bail out reclaim when memcg is dying
From: Johannes Weiner @ 2026-06-30 20:05 UTC (permalink / raw)
To: Jiayuan Chen
Cc: linux-mm, jiayuan.chen, yingfu.zhou, Michal Hocko, Roman Gushchin,
Shakeel Butt, Muchun Song, Andrew Morton, Kairui Song, Qi Zheng,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
David Hildenbrand, Lorenzo Stoakes, cgroups, linux-kernel,
Tejun Heo
The series looks good to me. But please add
/* cgroup_rmdir() waits for us with cgroup_mutex held. */
to these bailouts. It's a bit unfortunate that we need to have these
inside memcg. But decoupling this on the cgroup core/kernfs side looks
like a bigger project, and we should get this bug fixed.
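For illustration only, here is a user-space sketch of how an annotated bailout could read inside a memory.high-style loop. It is not kernel code: `memcg_model`, `model_is_dying()`, `model_reclaim_round()` and `model_high_write()` are hypothetical stand-ins, and the model assumes each reclaim round frees exactly one page and that removal starts when usage reaches `dying_at`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup; not the kernel type. */
struct memcg_model {
	unsigned long usage;    /* pages currently charged */
	unsigned long dying_at; /* usage at which removal is assumed to start */
	bool dying;
};

/* Stand-in for memcg_is_dying(). */
static bool model_is_dying(const struct memcg_model *m)
{
	return m->dying;
}

/* One slow reclaim round: free one page, maybe observe removal. */
static void model_reclaim_round(struct memcg_model *m)
{
	if (m->usage > 0)
		m->usage--;
	if (m->usage == m->dying_at)
		m->dying = true;
}

/* Loop until usage is at or below 'high'; returns rounds performed. */
static unsigned long model_high_write(struct memcg_model *m,
				      unsigned long high)
{
	unsigned long rounds = 0;

	assert(m != NULL);
	for (;;) {
		if (m->usage <= high)
			break;

		/* cgroup_rmdir() waits for us with cgroup_mutex held. */
		if (model_is_dying(m))
			break;

		model_reclaim_round(m);
		rounds++;
	}

	return rounds;
}
```

With usage starting at 100, a target of 10 and removal beginning at 90, the model stops after ten rounds instead of ninety, which is the effect the bailouts are after.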
With that, please feel free to include in your patches:
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
CCing Tejun as well, full quote follows.
On Tue, Jun 30, 2026 at 09:29:00AM +0800, Jiayuan Chen wrote:
> Hi,
>
> This series mitigates a system-wide stall we hit when a cgroup is
> removed while one of its memory control files is doing synchronous
> reclaim.
>
> Problem Description
> ===================
>
> Writing to memory.high, memory.max or memory.reclaim runs reclaim
> synchronously in the writer's context, looping until the usage drops
> below the target (or, for memory.reclaim, until the requested amount has
> been reclaimed). On a large cgroup this can take a long time. The
> latency is especially bad when reclaim has to perform swap I/O, where it
> is bound by the swap device write bandwidth, and under thrashing it is
> effectively unbounded - each round reclaims a few pages that the
> workload immediately faults back in, so the loop keeps making "progress"
> and never converges.
>
> The legacy (v1) reclaim loops in memory.limit_in_bytes,
> memory.memsw.limit_in_bytes and memory.force_empty share the same
> pattern.
>
> These writes go through cgroup_file_write(), which does not take
> cgroup_mutex and does not pin the css. Instead, kernfs guarantees the
> node (and thus the css) stays alive for the duration of the operation by
> holding an active reference. So while the reclaim loop runs, the active
> reference on the file is held.
>
> If another task removes the same cgroup in parallel, cgroup_rmdir()
> takes cgroup_mutex and then blocks in kernfs_drain() waiting for that
> active reference to drain. Because cgroup_mutex is held throughout the
> wait, every other task that needs it piles up behind the remover - in
> our case the whole machine ground to a halt, with hung_task reports for
> the remover and for unrelated tasks merely reading /proc/<pid>/cgroup:
>
> INFO: task cgdelete:366634 blocked for more than 159 seconds.
> Not tainted 6.6.102+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Call Trace:
> <TASK>
> __schedule+0x3da/0x1650
> schedule+0x58/0x100
> kernfs_drain+0xe6/0x150
> __kernfs_remove.part.0+0xd0/0x200
> kernfs_remove_by_name_ns+0x75/0xd0
> cgroup_addrm_files+0x325/0x410
> css_clear_dir+0x50/0xf0
> cgroup_destroy_locked+0xdf/0x1e0
> cgroup_rmdir+0x2d/0xd0
> kernfs_iop_rmdir+0x53/0x90
> vfs_rmdir+0x98/0x240
> do_rmdir+0x172/0x1b0
> __x64_sys_rmdir+0x42/0x70
> x64_sys_call+0xeb0/0x2210
> do_syscall_64+0x56/0x90
> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
>
> INFO: task systemd-journal:2352 blocked for more than 182 seconds.
> Not tainted 6.6.102+ #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Call Trace:
> <TASK>
> __schedule+0x3da/0x1650
> schedule+0x58/0x100
> schedule_preempt_disabled+0xe/0x20
> __mutex_lock.constprop.0+0x3bb/0x640
> __mutex_lock_slowpath+0x13/0x20
> mutex_lock+0x3c/0x50
> proc_cgroup_show+0x4d/0x380
> proc_single_show+0x53/0xe0
> seq_read_iter+0x12f/0x4b0
> seq_read+0xcd/0x110
> vfs_read+0xb1/0x360
> ? __seccomp_filter+0x368/0x590
> ksys_read+0x73/0x100
> __x64_sys_read+0x19/0x30
> x64_sys_call+0x18d3/0x2210
> do_syscall_64+0x56/0x90
> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
> The system recovers only once the reclaim finally finishes and releases
> the active reference. The reclaim itself is pointless here: the cgroup
> is being torn down and its remaining pages will be reparented to the
> parent anyway.
>
> Even though we check signal_pending(current) in the reclaim loop, the
> typical symptom is that cat /proc/<pid>/cgroup gets stuck.
> By the time someone looks for which task is actually stuck in reclaim,
> the hung task timeout has already been hit. This makes the problem
> particularly nasty to debug from a hung-task report alone, because the
> blocked tasks shown are often the victims, not the reclaim writer itself.
>
> Our Mitigation
> ==============
>
> cgroup destruction sets CSS_DYING in kill_css_sync() *before*
> css_clear_dir() triggers the kernfs_drain() that blocks the remover. The
> in-flight reclaim loop is therefore guaranteed to observe it before
> starting another reclaim iteration. This series checks memcg_is_dying()
> in the v2 reclaim loops (memory.high, memory.max and proactive reclaim)
> and the v1 reclaim loops (memory.limit_in_bytes,
> memory.memsw.limit_in_bytes and memory.force_empty), and bails out early,
> so the writer drops the active reference promptly and the remover can
> make progress.
>
> Unlike the no-progress guard (MAX_RECLAIM_RETRIES), which only fires when
> reclaim makes zero progress, the dying check also covers the slow swap
> I/O and thrashing cases, where reclaim keeps succeeding a little and the
> loop would otherwise never converge.
>
> For memory.reclaim, bailing out because the memcg is dying means the
> requested reclaim amount was not satisfied, so the write returns -EAGAIN.
>
> This is orthogonal to commit c8e6002bd611 ("memcg: introduce
> non-blocking limit setting option"): O_NONBLOCK lets a caller avoid the
> synchronous reclaim up front, while this series handles the case where
> reclaim is already running when the cgroup starts being removed.
>
> Changes since v1:
> - Return -EAGAIN from memory.reclaim when the memcg is dying.
> - Add the same bailout to the legacy v1 reclaim loops.
>
> v1:
> https://lore.kernel.org/linux-mm/20260623062800.298514-1-jiayuan.chen@linux.dev/
>
> Jiayuan Chen (4):
> memcg: bail out memory.high when memcg is dying
> memcg: bail out memory.max when memcg is dying
> memcg: bail out proactive reclaim when memcg is dying
> memcg-v1: bail out reclaim when memcg is dying
>
> mm/memcontrol-v1.c | 6 ++++++
> mm/memcontrol.c | 6 ++++++
> mm/vmscan.c | 3 +++
> 3 files changed, 15 insertions(+)
>
> --
> 2.43.0