linux-mm.kvack.org archive mirror
* [PATCH] memcg: Always call cond_resched() after fn()
@ 2025-05-23 17:21 Breno Leitao
  2025-05-23 18:21 ` Shakeel Butt
  0 siblings, 1 reply; 5+ messages in thread
From: Breno Leitao @ 2025-05-23 17:21 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Chen Ridong, Greg Kroah-Hartman
  Cc: Michal Hocko, cgroups, linux-mm, linux-kernel, kernel-team,
	Michael van der Westhuizen, Usama Arif, Pavel Begunkov,
	Rik van Riel, Breno Leitao

I am seeing soft lockups on certain machine types when a cgroup OOMs.
This happens because killing a process on these machines can be very
slow, which causes the soft lockups and RCU stalls. It usually happens
when the cgroup has MANY processes and memory.oom.group is set.

Example from production:

       [462012.244552] Memory cgroup out of memory: Killed process 3370438 (crosvm) ....
       ....
       [462037.318059] Memory cgroup out of memory: Killed process 4171372 (adb) ....
       [462037.348314] watchdog: BUG: soft lockup - CPU#64 stuck for 26s! [stat_manager-ag:1618982]
       ....

A quick look at why this is so slow suggests that it is related to the
serial flush on certain machine types. In all the crashes I saw, the
target CPU was in console_flush_all().

In the case above, there are thousands of processes in the cgroup, and
the soft lockup triggers before the loop reaches the 1024-iteration mark
at which it would call cond_resched(). So calling cond_resched() once
every 1024 iterations is not sufficient.
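
To make the arithmetic concrete, here is a small userspace model (not
kernel code) of the counter-based check. It computes the longest stretch
the loop can run between two yield points. The function name
max_gap_ms() and the per-call cost are hypothetical, chosen only for
illustration; the cost is not a measured value.

```c
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: return the longest time (in ms)
 * the scan loop runs without reaching a yield point, given a fixed
 * per-call cost for fn(). The counter check sits before the fn() call,
 * as in the pre-patch loop.
 */
uint64_t max_gap_ms(uint64_t ntasks, uint64_t yield_every, uint64_t cost_ms)
{
	uint64_t i = 0, gap = 0, worst = 0;

	for (uint64_t t = 0; t < ntasks; t++) {
		/* Yield point: resets the time spent without rescheduling. */
		if ((++i % yield_every) == 0)
			gap = 0;

		/* fn() runs and costs cost_ms. */
		gap += cost_ms;
		if (gap > worst)
			worst = gap;
	}
	return worst;
}
```

With an assumed cost of 25 ms per call and 3000 tasks,
max_gap_ms(3000, 1024, 25) returns 25600, i.e. more than 25 seconds
without a yield point, while max_gap_ms(3000, 1, 25) returns 25.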

Remove the counter-based conditional rescheduling logic and call
cond_resched() unconditionally after each task iteration, after fn() is
called. This avoids the lockup independently of how slow fn() is.
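
As a rough illustration of the resulting loop shape, the sketch below is
a userspace analogue only: sched_yield() stands in for cond_resched(),
and scan_tasks(), scan_fn and stop_at() are hypothetical names that do
not exist in the kernel. The point is that the yield happens once per
callback invocation, so its frequency no longer depends on a counter.

```c
#include <sched.h>
#include <stddef.h>

/* Callback type: a nonzero return value stops the scan early. */
typedef int (*scan_fn)(int task_id, void *arg);

/*
 * Hypothetical userspace analogue of the patched loop.
 * Returns the number of times fn() was called.
 */
size_t scan_tasks(const int *tasks, size_t n, scan_fn fn, void *arg)
{
	size_t calls = 0;
	int ret = 0;

	for (size_t i = 0; !ret && i < n; i++) {
		ret = fn(tasks[i], arg);
		calls++;
		/* Yield after every call, however long fn() took. */
		sched_yield();
	}
	return calls;
}

/* Example callback: stop once the task id stored in *arg is seen. */
int stop_at(int task_id, void *arg)
{
	return task_id == *(const int *)arg;
}
```

The cost of this shape is one extra yield check per task, which is small
compared to a slow fn().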

Cc: Michael van der Westhuizen <rmikey@meta.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Suggested-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Fixes: 46576834291869457 ("memcg: fix soft lockup in the OOM process")
---
 mm/memcontrol.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c96c1f2b9cf57..2d4d65f25fecd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1168,7 +1168,6 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 {
 	struct mem_cgroup *iter;
 	int ret = 0;
-	int i = 0;
 
 	BUG_ON(mem_cgroup_is_root(memcg));
 
@@ -1178,10 +1177,9 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 
 		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
 		while (!ret && (task = css_task_iter_next(&it))) {
-			/* Avoid potential softlockup warning */
-			if ((++i & 1023) == 0)
-				cond_resched();
 			ret = fn(task, arg);
+			/* Avoid potential softlockup warning */
+			cond_resched();
 		}
 		css_task_iter_end(&it);
 		if (ret) {

---
base-commit: ea15e046263b19e91ffd827645ae5dfa44ebd044
change-id: 20250523-memcg_fix-012257f3109e

Best regards,
-- 
Breno Leitao <leitao@debian.org>



end of thread, other threads:[~2025-05-28  9:18 UTC | newest]

Thread overview: 5+ messages
2025-05-23 17:21 [PATCH] memcg: Always call cond_resched() after fn() Breno Leitao
2025-05-23 18:21 ` Shakeel Butt
2025-05-27 10:03   ` Breno Leitao
2025-05-27 16:54     ` Shakeel Butt
2025-05-28  9:18       ` Breno Leitao
