* [PATCH v3 0/4] memcg: bail out reclaim when memcg is dying
@ 2026-07-02 12:02 Jiayuan Chen
2026-07-02 12:02 ` [PATCH v3 1/4] memcg: bail out memory.high " Jiayuan Chen
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Jiayuan Chen @ 2026-07-02 12:02 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, Jiayuan Chen, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
Kairui Song, Qi Zheng, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, David Hildenbrand, Lorenzo Stoakes, cgroups, linux-kernel
Hi,
This series mitigates a system-wide stall we hit when a cgroup is
removed while a write to one of its memory control files is still doing
synchronous reclaim.
v2 -> v3:
- Add Acked-by: Johannes Weiner <hannes@cmpxchg.org>
- Add comment suggested by Johannes
https://lore.kernel.org/linux-mm/akQhcC60mufcxVHm@cmpxchg.org/
v1 -> v2:
- Return -EAGAIN from memory.reclaim when the memcg is dying.
- Add the same bailout to the legacy v1 reclaim loops.
https://lore.kernel.org/linux-mm/20260623062800.298514-1-jiayuan.chen@linux.dev/
Problem Description
===================
Writing to memory.high, memory.max or memory.reclaim runs reclaim
synchronously in the writer's context, looping until the usage drops
below the target (or, for memory.reclaim, until the requested amount has
been reclaimed). On a large cgroup this can take a long time. The
latency is especially bad when reclaim has to perform swap I/O, where it
is bound by the swap device write bandwidth, and under thrashing it is
effectively unbounded - each round reclaims a few pages that the
workload immediately faults back in, so the loop keeps making "progress"
and never converges.
The legacy (v1) reclaim loops in memory.limit_in_bytes,
memory.memsw.limit_in_bytes and memory.force_empty share the same
pattern.
These writes go through cgroup_file_write(), which does not take
cgroup_mutex and does not pin the css. Instead, kernfs guarantees the
node (and thus the css) stays alive for the duration of the operation by
holding an active reference. So while the reclaim loop runs, the active
reference on the file is held.
If another task removes the same cgroup in parallel, cgroup_rmdir()
takes cgroup_mutex and then blocks in kernfs_drain() waiting for that
active reference to drain. Because cgroup_mutex is held throughout the
wait, every other task that needs it piles up behind the remover - in
our case the whole machine ground to a halt, with hung_task reports for
the remover and for unrelated tasks merely reading /proc/<pid>/cgroup:
INFO: task cgdelete:366634 blocked for more than 159 seconds.
Not tainted 6.6.102+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Call Trace:
<TASK>
__schedule+0x3da/0x1650
schedule+0x58/0x100
kernfs_drain+0xe6/0x150
__kernfs_remove.part.0+0xd0/0x200
kernfs_remove_by_name_ns+0x75/0xd0
cgroup_addrm_files+0x325/0x410
css_clear_dir+0x50/0xf0
cgroup_destroy_locked+0xdf/0x1e0
cgroup_rmdir+0x2d/0xd0
kernfs_iop_rmdir+0x53/0x90
vfs_rmdir+0x98/0x240
do_rmdir+0x172/0x1b0
__x64_sys_rmdir+0x42/0x70
x64_sys_call+0xeb0/0x2210
do_syscall_64+0x56/0x90
entry_SYSCALL_64_after_hwframe+0x78/0xe2
INFO: task systemd-journal:2352 blocked for more than 182 seconds.
Not tainted 6.6.102+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Call Trace:
<TASK>
__schedule+0x3da/0x1650
schedule+0x58/0x100
schedule_preempt_disabled+0xe/0x20
__mutex_lock.constprop.0+0x3bb/0x640
__mutex_lock_slowpath+0x13/0x20
mutex_lock+0x3c/0x50
proc_cgroup_show+0x4d/0x380
proc_single_show+0x53/0xe0
seq_read_iter+0x12f/0x4b0
seq_read+0xcd/0x110
vfs_read+0xb1/0x360
? __seccomp_filter+0x368/0x590
ksys_read+0x73/0x100
__x64_sys_read+0x19/0x30
x64_sys_call+0x18d3/0x2210
do_syscall_64+0x56/0x90
entry_SYSCALL_64_after_hwframe+0x78/0xe2
The system recovers only once the reclaim finally finishes and releases
the active reference. The reclaim itself is pointless here: the cgroup
is being torn down and its remaining pages will be reparented to the
parent anyway.
Although the reclaim loop checks signal_pending(current), that rarely
helps in practice: the visible symptom is an unrelated task, such as
cat /proc/<pid>/cgroup, getting stuck.
By the time someone looks for which task is actually stuck in reclaim,
the hung task timeout has already been hit. This makes the problem
particularly nasty to debug from a hung-task report alone, because the
blocked tasks shown are often the victims, not the reclaim writer itself.
Our Mitigation
==============
cgroup destruction sets CSS_DYING in kill_css_sync() *before*
css_clear_dir() triggers the kernfs_drain() that blocks the remover. The
in-flight reclaim loop is therefore guaranteed to observe it before
starting another reclaim iteration. This series checks memcg_is_dying()
in the v2 reclaim loops (memory.high, memory.max and proactive reclaim)
and the v1 reclaim loops (memory.limit_in_bytes,
memory.memsw.limit_in_bytes and memory.force_empty), and bails out early,
so the writer drops the active reference promptly and the remover can
make progress.
Unlike the no-progress guard (MAX_RECLAIM_RETRIES), which only fires when
reclaim makes zero progress, the dying check also covers the slow swap
I/O and thrashing cases, where reclaim keeps succeeding a little and the
loop would otherwise never converge.
For memory.reclaim, bailing out because the memcg is dying means the
requested reclaim amount was not satisfied, so the write returns -EAGAIN.
This is orthogonal to commit c8e6002bd611 ("memcg: introduce
non-blocking limit setting option"): O_NONBLOCK lets a caller avoid the
synchronous reclaim up front, while this series handles the case where
reclaim is already running when the cgroup starts being removed.
Jiayuan Chen (4):
memcg: bail out memory.high when memcg is dying
memcg: bail out memory.max when memcg is dying
memcg: bail out proactive reclaim when memcg is dying
memcg-v1: bail out reclaim when memcg is dying
mm/memcontrol-v1.c | 8 ++++++++
mm/memcontrol.c | 8 ++++++++
mm/vmscan.c | 4 ++++
3 files changed, 20 insertions(+)
--
2.43.0
* [PATCH v3 1/4] memcg: bail out memory.high when memcg is dying
2026-07-02 12:02 [PATCH v3 0/4] memcg: bail out reclaim when memcg is dying Jiayuan Chen
@ 2026-07-02 12:02 ` Jiayuan Chen
2026-07-02 12:02 ` [PATCH v3 2/4] memcg: bail out memory.max " Jiayuan Chen
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Jiayuan Chen @ 2026-07-02 12:02 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, Jiayuan Chen, Zhou Yingfu, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
cgroups, linux-kernel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
memory.high reclaims synchronously in the writer's context, and the
latency can be very high - especially when reclaim performs swap I/O, or
under thrashing where the loop may not converge for a long time.
While this runs, the kernfs active reference on the file is held, so a
concurrent removal of the same cgroup blocks in kernfs_drain() under
cgroup_mutex until it finishes. Reclaiming a dying cgroup is pointless,
as its pages are reparented to the parent anyway.
Mitigate this by bailing out of the reclaim loop once memcg_is_dying().
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
mm/memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d20ffc827306..4519dc9eae33 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4794,6 +4794,10 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
if (signal_pending(current))
break;
+ /* cgroup_rmdir() waits for us with cgroup_mutex held. */
+ if (memcg_is_dying(memcg))
+ break;
+
if (!drained) {
drain_all_stock(memcg);
drained = true;
--
2.43.0
* [PATCH v3 2/4] memcg: bail out memory.max when memcg is dying
2026-07-02 12:02 [PATCH v3 0/4] memcg: bail out reclaim when memcg is dying Jiayuan Chen
2026-07-02 12:02 ` [PATCH v3 1/4] memcg: bail out memory.high " Jiayuan Chen
@ 2026-07-02 12:02 ` Jiayuan Chen
2026-07-02 12:02 ` [PATCH v3 3/4] memcg: bail out proactive reclaim " Jiayuan Chen
2026-07-02 12:02 ` [PATCH v3 4/4] memcg-v1: bail out " Jiayuan Chen
3 siblings, 0 replies; 6+ messages in thread
From: Jiayuan Chen @ 2026-07-02 12:02 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, Jiayuan Chen, Zhou Yingfu, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Kairui Song, Qi Zheng, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, David Hildenbrand, Lorenzo Stoakes, cgroups,
linux-kernel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
memory.max has the same high-latency reclaim loop as memory.high, and
may additionally invoke the OOM killer on a cgroup that is already going
away, further delaying its removal.
Mitigate this by bailing out of the loop once memcg_is_dying().
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
mm/memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4519dc9eae33..938f190a98fe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4849,6 +4849,10 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (signal_pending(current))
break;
+ /* cgroup_rmdir() waits for us with cgroup_mutex held. */
+ if (memcg_is_dying(memcg))
+ break;
+
if (!drained) {
drain_all_stock(memcg);
drained = true;
--
2.43.0
* [PATCH v3 3/4] memcg: bail out proactive reclaim when memcg is dying
2026-07-02 12:02 [PATCH v3 0/4] memcg: bail out reclaim when memcg is dying Jiayuan Chen
2026-07-02 12:02 ` [PATCH v3 1/4] memcg: bail out memory.high " Jiayuan Chen
2026-07-02 12:02 ` [PATCH v3 2/4] memcg: bail out memory.max " Jiayuan Chen
@ 2026-07-02 12:02 ` Jiayuan Chen
2026-07-02 12:55 ` Jiayuan Chen
2026-07-02 12:02 ` [PATCH v3 4/4] memcg-v1: bail out " Jiayuan Chen
3 siblings, 1 reply; 6+ messages in thread
From: Jiayuan Chen @ 2026-07-02 12:02 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, Jiayuan Chen, Zhou Yingfu, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, David Hildenbrand, Qi Zheng, Lorenzo Stoakes,
Kairui Song, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
cgroups, linux-kernel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
Proactive reclaim via memory.reclaim can run for a long time - swap I/O
or thrashing again dominating the latency - and delays cgroup removal in
the same way.
Mitigate this by stopping the reclaim once memcg_is_dying().
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
mm/vmscan.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 754c5f5d716a..6ae61be2fab8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7912,6 +7912,10 @@ int user_proactive_reclaim(char *buf,
if (signal_pending(current))
return -EINTR;
+ /* cgroup_rmdir() waits for us with cgroup_mutex held. */
+ if (memcg && memcg_is_dying(memcg))
+ return -EAGAIN;
+
/*
* This is the final attempt, drain percpu lru caches in the
* hope of introducing more evictable pages.
--
2.43.0
* [PATCH v3 4/4] memcg-v1: bail out reclaim when memcg is dying
2026-07-02 12:02 [PATCH v3 0/4] memcg: bail out reclaim when memcg is dying Jiayuan Chen
` (2 preceding siblings ...)
2026-07-02 12:02 ` [PATCH v3 3/4] memcg: bail out proactive reclaim " Jiayuan Chen
@ 2026-07-02 12:02 ` Jiayuan Chen
3 siblings, 0 replies; 6+ messages in thread
From: Jiayuan Chen @ 2026-07-02 12:02 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, Jiayuan Chen, Zhou Yingfu, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
Andrew Morton, Kairui Song, Qi Zheng, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, David Hildenbrand, Lorenzo Stoakes, cgroups,
linux-kernel
From: Jiayuan Chen <jiayuan.chen@shopee.com>
The legacy memory.limit_in_bytes and memory.memsw.limit_in_bytes writers
retry page_counter_set_max() by reclaiming synchronously in the writer's
context. memory.force_empty similarly loops in synchronous reclaim until
the cgroup is empty or reclaim stops making progress.
These writes hold a kernfs active reference on the file. If cgroup removal
starts in parallel, the remover sets CSS_DYING and then waits in
kernfs_drain() under cgroup_mutex for the active reference to drain.
Continuing reclaim after the memcg is dying can therefore delay cgroup
removal and keep cgroup_mutex held for a long time.
Stop the v1 reclaim loops once the memcg is dying. For limit resizing,
keep the existing -EBUSY semantics when the new limit could not be
installed. For memory.force_empty, keep the existing best-effort success
semantics.
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
---
mm/memcontrol-v1.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 765069211567..b868a58c52b8 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1513,6 +1513,10 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
if (!ret)
break;
+ /* cgroup_rmdir() waits for us with cgroup_mutex held. */
+ if (memcg_is_dying(memcg))
+ break;
+
if (!drained) {
drain_all_stock(memcg);
drained = true;
@@ -1551,6 +1555,10 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
if (signal_pending(current))
return -EINTR;
+ /* cgroup_rmdir() waits for us with cgroup_mutex held. */
+ if (memcg_is_dying(memcg))
+ break;
+
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
MEMCG_RECLAIM_MAY_SWAP, NULL))
nr_retries--;
--
2.43.0
* Re: [PATCH v3 3/4] memcg: bail out proactive reclaim when memcg is dying
2026-07-02 12:02 ` [PATCH v3 3/4] memcg: bail out proactive reclaim " Jiayuan Chen
@ 2026-07-02 12:55 ` Jiayuan Chen
0 siblings, 0 replies; 6+ messages in thread
From: Jiayuan Chen @ 2026-07-02 12:55 UTC (permalink / raw)
To: linux-mm
Cc: jiayuan.chen, Zhou Yingfu, Johannes Weiner, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
David Hildenbrand, Qi Zheng, Lorenzo Stoakes, Kairui Song,
Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, cgroups,
linux-kernel
On 7/2/26 8:02 PM, Jiayuan Chen wrote:
> From: Jiayuan Chen <jiayuan.chen@shopee.com>
>
> Proactive reclaim via memory.reclaim can run for a long time - swap I/O
> or thrashing again dominating the latency - and delays cgroup removal in
> the same way.
>
> Mitigate this by stopping the reclaim once memcg_is_dying().
>
> Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
> Reported-by: Zhou Yingfu <yingfu.zhou@shopee.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
> ---
> mm/vmscan.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 754c5f5d716a..6ae61be2fab8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -7912,6 +7912,10 @@ int user_proactive_reclaim(char *buf,
> if (signal_pending(current))
> return -EINTR;
>
> + /* cgroup_rmdir() waits for us with cgroup_mutex held. */
> + if (memcg && memcg_is_dying(memcg))
> + return -EAGAIN;
> +
> /*
> * This is the final attempt, drain percpu lru caches in the
> * hope of introducing more evictable pages.
The issue reported by the AI review is benign:
https://sashiko.dev/#/patchset/20260702120235.376752-1-jiayuan.chen%40linux.dev
do_try_to_free_pages(), which try_to_free_pages() calls, already has
multiple points where it breaks out or returns:
'''
static unsigned long do_try_to_free_pages()
{
retry:
	do {
		shrink_zones(zonelist, sc);
		// break 1
		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;
	} while (--sc->priority >= 0); // priority starts at DEF_PRIORITY (12)
	// break 2
	if (sc->nr_reclaimed)
		return sc->nr_reclaimed;
	// retry twice logic
	...
}
'''