[PATCH 0/2] Postpone memcg reclaim to return-to-user path

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] Postpone memcg reclaim to return-to-user path
@ 2025-06-18 11:39 Zhongkun He
  2025-06-18 11:39 ` [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path Zhongkun He
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Zhongkun He @ 2025-06-18 11:39 UTC (permalink / raw)
  To: akpm, tytso, jack, hannes, mhocko
  Cc: muchun.song, linux-ext4, linux-kernel, linux-mm, cgroups,
	Zhongkun He

# Introduction

This patchset aims to introduce an approach to ensure that memory
allocations are forced to be accounted to the memory cgroup, even if
they exceed the cgroup's maximum limit. In such cases, the reclaim
process is postponed until the task returns to the user. This is
beneficial for users who perform over-max reclaim while holding multiple
locks or other resources (especially resources related to file system
writeback). If a task needs any of these resources, it would otherwise
have to wait until the other task completes reclaim and releases the
resources. Postponing reclaim to the return-to-user path helps avoid this issue.

# Background

We have been encountering an hungtask issue for a long time. Specifically,
when a task holds the jbd2 handler and subsequently enters direct reclaim
because it reaches the hard limit within a memory cgroup, the system may become
blocked for a long time. The stack trace of waiting thread holding the jbd2
handle is as follows (and so many other threads are waiting on the same jbd2
handle):

 #0 __schedule at ffffffff97abc6c9
 #1 preempt_schedule_common at ffffffff97abcdaa
 #2 __cond_resched at ffffffff97abcddd
 #3 shrink_active_list at ffffffff9744dca2
 #4 shrink_lruvec at ffffffff97451407
 #5 shrink_node at ffffffff974517c9
 #6 do_try_to_free_pages at ffffffff97451dae
 #7 try_to_free_mem_cgroup_pages at ffffffff974542b8
 #8 try_charge_memcg at ffffffff974f0ede
 #9 charge_memcg at ffffffff974f1d0e
#10 __mem_cgroup_charge at ffffffff974f391c
#11 __add_to_page_cache_locked at ffffffff974313e5
#12 add_to_page_cache_lru at ffffffff974324b2
#13 pagecache_get_page at ffffffff974338e3
#14 __getblk_gfp at ffffffff97556798
#15 __ext4_get_inode_loc at ffffffffc07a5518 [ext4]
#16 ext4_get_inode_loc at ffffffffc07a7fec [ext4]
#17 ext4_reserve_inode_write at ffffffffc07a9fb1 [ext4]
#18 __ext4_mark_inode_dirty at ffffffffc07aa249 [ext4]
#19 __ext4_new_inode at ffffffffc079cbae [ext4]
#20 ext4_create at ffffffffc07c3e56 [ext4]
#21 path_openat at ffffffff9751f471
#22 do_filp_open at ffffffff97521384
#23 do_sys_openat2 at ffffffff97508fd6
#24 do_sys_open at ffffffff9750a65b
#25 do_syscall_64 at ffffffff97aaed14

We've obtained a coredump and dumped struct scan_control from it by using crash tool.

struct scan_control {
  nr_to_reclaim = 32,
  order = 0 '\000',
  priority = 1 '\001',
  reclaim_idx = 4 '\004',
  gfp_mask = 17861706, __GFP_NOFAIL
  nr_scanned = 27810, 
  nr_reclaimed = 0,
  nr = {
        dirty = 27797,
        unqueued_dirty = 27797,
        congested = 0,
        writeback = 0,
        immediate = 0,
        file_taken = 27810,
        taken = 27810
  },
}

The ->nr_reclaimed is zero meaning there is no memory we have reclaimed because
most of the file pages are unqueued dirty. And ->priority is 1 also meaning we
spent so much time on memory reclamation. Since this thread has held the jbd2
handler, the jbd2 thread was waiting for the same jbd2 handler, which blocked
so many other threads from writing dirty pages as well.

0 [] __schedule at ffffffff97abc6c9
1 [] schedule at ffffffff97abcd01
2 [] jbd2_journal_wait_updates at ffffffffc05a522f [jbd2]
3 [] jbd2_journal_commit_transaction at ffffffffc05a72c6 [jbd2]
4 [] kjournald2 at ffffffffc05ad66d [jbd2]
5 [] kthread at ffffffff972bc4c0
6 [] ret_from_fork at ffffffff9720440f

Furthermore, we observed that memory usage far exceeded the configured memory maximum,
reaching around 38GB.

memory.max  : 134896020    514 GB
memory.usage: 144747169    552 GB

We investigated this issue and identified the root cause:
  try_charge_memcg:
    retry charge
     charge failed
       -> direct reclaim
        -> mem_cgroup_oom    return true，but selected task is in an uninterruptible state
           -> retry charge

In which cases, we saw many tasks in the uninterruptible (D) state with a pending
SIGKILL signal. The OOM killer selects a victim and returns success, allowing the
current thread to retry the memory charge. However, the selected task cannot acknowledge
the SIGKILL signal because it is stuck in an uninterruptible state. As a result,
the charging task resets nr_retries and attempts to reclaim again, but the victim
task never exits. This causes the current thread to enter a prolonged retry loop
during direct reclaim, holding the jbd2 handler for much more time and leading to
system-wide blocking. Why are there so many uninterruptible (D) state tasks?
Check the most common stack trace.

crash> task_struct.__state ffff8c53a15b3080
  __state = 2,   #define TASK_UNINTERRUPTIBLE        0x0002
 0 [] __schedule at ffffffff97abc6c9
 1 [] schedule at ffffffff97abcd01
 2 [] schedule_preempt_disabled at ffffffff97abdf1a
 3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
 4 [] down_read at ffffffff97ac06b1
 5 [] do_user_addr_fault at ffffffff9727f1e7
 6 [] exc_page_fault at ffffffff97ab286e
 7 [] asm_exc_page_fault at ffffffff97c00d42

Check the owner of mm_struct.mmap_lock. The task below was entering memory reclaim
holding mmap lock and there are 68 tasks in this memory cgroup, with 23 of them in
the memory reclaim context.

 7 [] shrink_active_list at ffffffff9744dd46
 8 [] shrink_lruvec at ffffffff97451407
 9 [] shrink_node at ffffffff974517c9
10 [] do_try_to_free_pages at ffffffff97451dae
11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
12 [] try_charge_memcg at ffffffff974f0ede
13 [] obj_cgroup_charge_pages at ffffffff974f1dae
14 [] obj_cgroup_charge at ffffffff974f2fc2
15 [] kmem_cache_alloc at ffffffff974d054c
16 [] vm_area_dup at ffffffff972923f1
17 [] __split_vma at ffffffff97486c16
18 [] __do_munmap at ffffffff97486e78
19 [] __vm_munmap at ffffffff97487307
20 [] __x64_sys_munmap at ffffffff974873e7
21 [] do_syscall_64 at ffffffff97aaed14

Many threads was entering the memory reclaim in UN state, other threads was blocking
on mmap_lock. Although the OOM killer selects a victim, it cannot terminate it. The
task holding the jbd2 handle retries memory charge, but it fails. Reclaiming continues
while holding the jbd2 handler. write_pages also fails while waiting for the same jbd2
handler, causing repeated shrink failures and potentially leading to a system-wide block.

ps | grep UN | wc -l
1463

While the system has 1463 UN state tasks, so the way to break this akin to "deadlock" is
to let the thread holding jbd2 handler quickly exit the memory reclamation process.

We found that a related issue was reported and partially fixed in previous patches [1][2].
However, those fixes only skip direct reclamation and return a failure for some cases such
as readahead requests. As sb_getblk() is called multiple times in __ext4_get_inode_loc()
with the NOFAIL flag, the problem still exists. And it is not feasible to simply remove
__GFP_RECLAIMABLE when holding jbd2 handle to avoid potential very long memory reclaim
latency,  as __GFP_NOFAIL is not supported without __GFP_DIRECT_RECLAIM.

# Fundamentals

This patchset introduces a new task flag of PF_MEMALLOC_ACFORCE to indicate that memory
allocations are forced to be accounted to the memory cgroup, even if they exceed the cgroup's
maximum limit. The reclaim process is deferred until the task returns to the user without
holding any kernel resources for memory reclamation, thereby preventing priority inversion
problems. Any users who might encounter potential similar issues can utilize this new flag
to allocate memory and prevent long-term latency for the entire system.

# References

[1] https://lore.kernel.org/linux-fsdevel/20230811071519.1094-1-teawaterz@linux.alibaba.com/
[2] https://lore.kernel.org/all/20230914150011.843330-1-willy@infradead.org/T/#u

Zhongkun He (2):
  mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to
    return-to-userland path
  jbd2: mark the transaction context with the scope PF_MEMALLOC_ACFORCE
    context

 fs/jbd2/transaction.c            | 15 +++++--
 include/linux/memcontrol.h       |  6 +++
 include/linux/resume_user_mode.h |  1 +
 include/linux/sched.h            | 11 ++++-
 include/linux/sched/mm.h         | 35 ++++++++++++++++
 mm/memcontrol.c                  | 71 ++++++++++++++++++++++++++++++++
 6 files changed, 133 insertions(+), 6 deletions(-)

-- 
2.39.5

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path
  2025-06-18 11:39 [PATCH 0/2] Postpone memcg reclaim to return-to-user path Zhongkun He
@ 2025-06-18 11:39 ` Zhongkun He
  2025-06-24 14:47   ` Dan Carpenter
  2025-06-18 11:39 ` [PATCH 2/2] jbd2: mark the transaction context with the scope PF_MEMALLOC_ACFORCE context Zhongkun He
  2025-06-18 22:37 ` [PATCH 0/2] Postpone memcg reclaim to return-to-user path Shakeel Butt
  2 siblings, 1 reply; 8+ messages in thread
From: Zhongkun He @ 2025-06-18 11:39 UTC (permalink / raw)
  To: akpm, tytso, jack, hannes, mhocko
  Cc: muchun.song, linux-ext4, linux-kernel, linux-mm, cgroups,
	Zhongkun He, Muchun Song

The PF_MEMALLOC_ACCOUNTFORCE ensures that memory allocations are forced
to be accounted to the memory cgroup, even if they exceed the cgroup's
maximum limit. In such cases, the reclaim process is postponed until
the task returns to userland. This is beneficial for users who perform
over-max reclaim while holding multiple locks or other resources
(Especially resources related to file system writeback). If a task
needs any of these resources, it would otherwise have to wait until
the other task completes reclaim and releases the resources. Postponing
reclaim to the return-to-userland path helps avoid this issue.

We have long been experiencing an issue where, if a task
holds the jbd2 handler and then enters direct reclaim due
to hitting the hard limit in a memory cgroup, the system
can become blocked for an extended period of time.
The stack trace is as follows:

0 [] __schedule at
1 [] preempt_schedule_common at
2 [] __cond_resched at
3 [] shrink_active_list at
4 [] shrink_lruvec at
5 [] shrink_node at
6 [] do_try_to_free_pages at
7 [] try_to_free_mem_cgroup_pages at
8 [] try_charge_memcg at
9 [] charge_memcg at
10 [] __mem_cgroup_charge at
11 [] __add_to_page_cache_locked at
12 [] add_to_page_cache_lru at
13 [] pagecache_get_page at
14 [] __getblk_gfp at
15 [] __ext4_get_inode_loc at  [ext4]
16 [] ext4_get_inode_loc at  [ext4]
17 [] ext4_reserve_inode_write at  [ext4]
18 [] __ext4_mark_inode_dirty at  [ext4]
19 [] __ext4_new_inode at  [ext4]
20 [] ext4_create at  [ext4]

struct scan_control {
  nr_to_reclaim = 32,
  order = 0 '\000',
  priority = 1 '\001',
  reclaim_idx = 4 '\004',
  gfp_mask = 17861706,
  nr_scanned = 27810,
  nr_reclaimed = 0,
  nr = {
    dirty = 27797,
    unqueued_dirty = 27797,
    congested = 0,
    writeback = 0,
    immediate = 0,
    file_taken = 27810,
    taken = 27810
  },
}
The direct reclaim in memcg is unable to flush dirty pages
and ends up looping with the jbd2 handler. As a result,
other tasks are blocked from writing pages that require
the jbd2 handler.

Furthermore, we observed that the memory usage far exceeds
the configured memory max, reaching around 38GB.
Max  : 134896020    514 GB
usage: 144747169    552 GB
We investigated this issue and identified the root cause:

try_charge_memcg:
    retry charge
        charge failed
          -> direct reclaim  nr_retries--
           -> memcg_oom   true-> reset the nr_retries
            -> retry charge
In this cases, the OOM killer selects a task and returns
success, and retry charge. but that task does not acknowledge
the SIGKILL signal because it is stuck in an uninterruptible
state. As a result, the current task gets stuck in a long
retry loop inside direct reclaim.

Why are there so many uninterruptible (D) state tasks?
Check the most common stack.

 __state = 2
PID: 992582   TASK: ffff8c53a15b3080  CPU: 40   COMMAND: "xx"
0 [] __schedule at ffffffff97abc6c9
1 [] schedule at ffffffff97abcd01
2 [] schedule_preempt_disabled at ffffffff97abdf1a
3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
4 [] down_read at ffffffff97ac06b1
5 [] do_user_addr_fault at ffffffff9727f1e7
6 [] exc_page_fault at ffffffff97ab286e
7 [] asm_exc_page_fault at ffffffff97c00d42

Check the owner of mm_struct.mmap_lock; the current task is
waiting on lruvec->lru_lock. There are 68 tasks in this group,
with 23 of them in the shrink page context.

5 [] native_queued_spin_lock_slowpath at ffffffff972fce02
6 [] _raw_spin_lock_irq at ffffffff97ac3bb1
7 [] shrink_active_list at ffffffff9744dd46
8 [] shrink_lruvec at ffffffff97451407
9 [] shrink_node at ffffffff974517c9
10 [] do_try_to_free_pages at ffffffff97451dae
11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
12 [] try_charge_memcg at ffffffff974f0ede
13 [] obj_cgroup_charge_pages at ffffffff974f1dae
14 [] obj_cgroup_charge at ffffffff974f2fc2
15 [] kmem_cache_alloc at ffffffff974d054c
16 [] vm_area_dup at ffffffff972923f1
17 [] __split_vma at ffffffff97486c16

Many tasks enter a memory shrinking loop in UN state, other threads
blocked on mmap_lock. Although the OOM killer selects a victim,
it cannot terminate it. The task holding the jbd2 handle retries
memory charge, which fails, and reclaim continues with the handle
held. write_pages also fails waiting for jbd2, causing repeated
shrink failures and potentially leading to a system-wide block.

ps | grep UN | wc -l
1463
While the system has 1463 UN state tasks, so the way to break
this akin to "deadlock" is to let the thread holding jbd2 handler
quickly exit the memory reclamation process.

We found that a related issue has been reported and partially
addressed in previous fixes [1][2]. However, those fixes only
skip direct reclaim and return a failure for some cases like
readahead requests. Since sb_getblk() is called multiple times
in __ext4_get_inode_loc() with the NOFAIL flag, the problem
still persists.

With this patch, we can force the memory charge and defer
direct reclaim until the task returns to user space. By doing
so, all global resources such as the jbd2 handler will be
released, provided that if __GFP_ACCOUNT_FORCE flag is set.

Why not combine  __GFP_NOFAIL and ~__GFP_DIRECT_RECLAIM to bypass
direct reclaim and force charge success?

Because we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
otherwise, we may result in lockup.[3], Besides, the flag
__GFP_DIRECT_RECLAIM is useful in global memory reclaim in
__alloc_pages_slowpath().

[1]:https://lore.kernel.org/linux-fsdevel/20230811071519.1094-1-teawaterz@linux.alibaba.com/
[2]:https://lore.kernel.org/all/20230914150011.843330-1-willy@infradead.org/T/#u
[3]:https://lore.kernel.org/all/20240830202823.21478-4-21cnbao@gmail.com/T/#u

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 include/linux/memcontrol.h       |  6 +++
 include/linux/resume_user_mode.h |  1 +
 include/linux/sched.h            | 11 ++++-
 include/linux/sched/mm.h         | 35 ++++++++++++++++
 mm/memcontrol.c                  | 71 ++++++++++++++++++++++++++++++++
 5 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..3b4393de553e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -900,6 +900,8 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
 
 void mem_cgroup_handle_over_high(gfp_t gfp_mask);
 
+void mem_cgroup_handle_over_max(gfp_t gfp_mask);
+
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
 
 unsigned long mem_cgroup_size(struct mem_cgroup *memcg);
@@ -1354,6 +1356,10 @@ static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 {
 }
 
+static inline void mem_cgroup_handle_over_max(gfp_t gfp_mask)
+{
+}
+
 static inline struct mem_cgroup *mem_cgroup_get_oom_group(
 	struct task_struct *victim, struct mem_cgroup *oom_domain)
 {
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0adae0..6189ebb8795b 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -56,6 +56,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
 	}
 #endif
 
+	mem_cgroup_handle_over_max(GFP_KERNEL);
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f78a64beb52..6eadd7be6810 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1549,9 +1549,12 @@ struct task_struct {
 #endif
 
 #ifdef CONFIG_MEMCG
-	/* Number of pages to reclaim on returning to userland: */
+	/* Number of pages over high to reclaim on returning to userland: */
 	unsigned int			memcg_nr_pages_over_high;
 
+	/* Number of pages over max to reclaim on returning to userland: */
+	unsigned int			memcg_nr_pages_over_max;
+
 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup		*active_memcg;
 
@@ -1745,7 +1748,11 @@ extern struct pid *cad_pid;
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocations constrained to zones which allow long term pinning.
 						 * See memalloc_pin_save() */
 #define PF_BLOCK_TS		0x20000000	/* plug has ts that needs updating */
-#define PF__HOLE__40000000	0x40000000
+#ifdef CONFIG_MEMCG
+#define PF_MEMALLOC_ACCOUNTFORCE 0x40000000 /* See memalloc_account_force_save() */
+#else
+#define PF_MEMALLOC_ACCOUNTFORCE 0
+#endif
 #define PF_SUSPEND_TASK		0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
 /*
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index b13474825130..648c03b6250c 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -468,6 +468,41 @@ static inline void memalloc_pin_restore(unsigned int flags)
 	memalloc_flags_restore(flags);
 }
 
+/**
+ * memalloc_account_force_save - Marks implicit PF_MEMALLOC_ACCOUNTFORCE
+ * allocation scope.
+ *
+ * The PF_MEMALLOC_ACCOUNTFORCE ensures that memory allocations are forced
+ * to be accounted to the memory cgroup, even if they exceed the cgroup's
+ * maximum limit. In such cases, the reclaim process is postponed until
+ * the task returns to userland. This is beneficial for users who perform
+ * over-max reclaim while holding multiple locks or other resources
+ * (especially resources related to file system writeback). If a task
+ * needs any of these resources, it would otherwise have to wait until
+ * the other task completes reclaim and releases the resources. Postponing
+ * reclaim to the return-to-userland path helps avoid this issue.
+ *
+ * Context: This function is safe to be used from any context.
+ * Return: The saved flags to be passed to memalloc_account_force_restore.
+ */
+static inline unsigned int memalloc_account_force_save(void)
+{
+	return memalloc_flags_save(PF_MEMALLOC_ACCOUNTFORCE);
+}
+
+/**
+ * memalloc_account_force_restore - Ends the implicit PF_MEMALLOC_ACCOUNTFORCE.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit PF_MEMALLOC_ACCOUNTFORCE scope started by memalloc_account_force_save
+ * function. Always make sure that the given flags is the return value from the pairing
+ * memalloc_account_force_save call.
+ */
+static inline void memalloc_account_force_restore(void)
+{
+	return memalloc_flags_restore(PF_MEMALLOC_ACCOUNTFORCE);
+}
+
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
 /**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 902da8a9c643..8484c3a15151 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2301,6 +2301,67 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	css_put(&memcg->css);
 }
 
+static inline struct mem_cgroup *get_over_limit_memcg(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *mem_over_limit = NULL;
+
+	do {
+		if (page_counter_read(&memcg->memory) <=
+		    READ_ONCE(memcg->memory.max))
+			continue;
+
+		mem_over_limit = memcg;
+		break;
+	} while ((memcg = parent_mem_cgroup(memcg)));
+
+	return mem_over_limit;
+}
+
+void mem_cgroup_handle_over_max(gfp_t gfp_mask)
+{
+	unsigned long nr_reclaimed = 0;
+	unsigned int nr_pages = current->memcg_nr_pages_over_max;
+	int nr_retries = MAX_RECLAIM_RETRIES;
+	struct mem_cgroup *memcg, *mem_over_limit;
+
+	if (likely(!nr_pages))
+		return;
+
+	memcg = get_mem_cgroup_from_mm(current->mm);
+	current->memcg_nr_pages_over_max = 0;
+
+retry:
+	mem_over_limit = get_over_limit_memcg(memcg);
+	if (!mem_over_limit)
+		goto out;
+
+	while (nr_reclaimed < nr_pages) {
+		unsigned long reclaimed;
+
+		reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit,
+					nr_pages, GFP_KERNEL,
+					MEMCG_RECLAIM_MAY_SWAP,
+					NULL);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	if ((nr_reclaimed < nr_pages) &&
+	    (page_counter_read(&mem_over_limit->memory) >
+	    READ_ONCE(mem_over_limit->memory.max)) &&
+	    mem_cgroup_oom(mem_over_limit, gfp_mask,
+			  get_order((nr_pages - nr_reclaimed)  * PAGE_SIZE))) {
+		nr_retries = MAX_RECLAIM_RETRIES;
+		goto retry;
+	}
+
+out:
+	css_put(&memcg->css);
+}
+
 static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			    unsigned int nr_pages)
 {
@@ -2349,6 +2410,16 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (unlikely(current->flags & PF_MEMALLOC))
 		goto force;
 
+	/*
+	 * Avoid blocking on heavyweight resources (e.g., jbd2 handle)
+	 * which may otherwise lead to system-wide stalls.
+	 */
+	if (current->flags & PF_MEMALLOC_ACCOUNTFORCE) {
+		current->memcg_nr_pages_over_max += nr_pages;
+		set_notify_resume(current);
+		goto force;
+	}
+
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
 
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path
  2025-06-18 11:39 ` [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path Zhongkun He
@ 2025-06-24 14:47   ` Dan Carpenter
  0 siblings, 0 replies; 8+ messages in thread
From: Dan Carpenter @ 2025-06-24 14:47 UTC (permalink / raw)
  To: oe-kbuild, Zhongkun He, akpm, tytso, jack, hannes, mhocko
  Cc: lkp, oe-kbuild-all, muchun.song, linux-ext4, linux-kernel,
	linux-mm, cgroups, Zhongkun He, Muchun Song

Hi Zhongkun,

kernel test robot noticed the following build warnings:

https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Zhongkun-He/mm-memcg-introduce-PF_MEMALLOC_ACCOUNTFORCE-to-postpone-reclaim-to-return-to-userland-path/20250618-194101
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/71a4bbc284048ceb38eaac53dfa1031f92ac52b7.1750234270.git.hezhongkun.hzk%40bytedance.com
patch subject: [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path
config: i386-randconfig-141-20250619 (https://download.01.org/0day-ci/archive/20250624/202506242032.uShv7ASV-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202506242032.uShv7ASV-lkp@intel.com/

smatch warnings:
mm/memcontrol.c:2341 mem_cgroup_handle_over_max() warn: use 'gfp_mask' here instead of GFP_KERNEL?

vim +/gfp_mask +2341 mm/memcontrol.c

b5db553cc19549 Zhongkun He 2025-06-18  2320  void mem_cgroup_handle_over_max(gfp_t gfp_mask)
                                                                                   ^^^^^^^^
b5db553cc19549 Zhongkun He 2025-06-18  2321  {
b5db553cc19549 Zhongkun He 2025-06-18  2322  	unsigned long nr_reclaimed = 0;
b5db553cc19549 Zhongkun He 2025-06-18  2323  	unsigned int nr_pages = current->memcg_nr_pages_over_max;
b5db553cc19549 Zhongkun He 2025-06-18  2324  	int nr_retries = MAX_RECLAIM_RETRIES;
b5db553cc19549 Zhongkun He 2025-06-18  2325  	struct mem_cgroup *memcg, *mem_over_limit;
b5db553cc19549 Zhongkun He 2025-06-18  2326  
b5db553cc19549 Zhongkun He 2025-06-18  2327  	if (likely(!nr_pages))
b5db553cc19549 Zhongkun He 2025-06-18  2328  		return;
b5db553cc19549 Zhongkun He 2025-06-18  2329  
b5db553cc19549 Zhongkun He 2025-06-18  2330  	memcg = get_mem_cgroup_from_mm(current->mm);
b5db553cc19549 Zhongkun He 2025-06-18  2331  	current->memcg_nr_pages_over_max = 0;
b5db553cc19549 Zhongkun He 2025-06-18  2332  
b5db553cc19549 Zhongkun He 2025-06-18  2333  retry:
b5db553cc19549 Zhongkun He 2025-06-18  2334  	mem_over_limit = get_over_limit_memcg(memcg);
b5db553cc19549 Zhongkun He 2025-06-18  2335  	if (!mem_over_limit)
b5db553cc19549 Zhongkun He 2025-06-18  2336  		goto out;
b5db553cc19549 Zhongkun He 2025-06-18  2337  
b5db553cc19549 Zhongkun He 2025-06-18  2338  	while (nr_reclaimed < nr_pages) {
b5db553cc19549 Zhongkun He 2025-06-18  2339  		unsigned long reclaimed;
b5db553cc19549 Zhongkun He 2025-06-18  2340  
b5db553cc19549 Zhongkun He 2025-06-18 @2341  		reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit,
b5db553cc19549 Zhongkun He 2025-06-18  2342  					nr_pages, GFP_KERNEL,

I guess GFP_KERNEL is fine.  The gfp_mask is used below.  Don't worry
about this one if the GFP_KERNEL is intended.  Just ignore the warning
message.

b5db553cc19549 Zhongkun He 2025-06-18  2343  					MEMCG_RECLAIM_MAY_SWAP,
b5db553cc19549 Zhongkun He 2025-06-18  2344  					NULL);
b5db553cc19549 Zhongkun He 2025-06-18  2345  
b5db553cc19549 Zhongkun He 2025-06-18  2346  		if (!reclaimed && !nr_retries--)
b5db553cc19549 Zhongkun He 2025-06-18  2347  			break;
b5db553cc19549 Zhongkun He 2025-06-18  2348  
b5db553cc19549 Zhongkun He 2025-06-18  2349  		nr_reclaimed += reclaimed;
b5db553cc19549 Zhongkun He 2025-06-18  2350  	}
b5db553cc19549 Zhongkun He 2025-06-18  2351  
b5db553cc19549 Zhongkun He 2025-06-18  2352  	if ((nr_reclaimed < nr_pages) &&
b5db553cc19549 Zhongkun He 2025-06-18  2353  	    (page_counter_read(&mem_over_limit->memory) >
b5db553cc19549 Zhongkun He 2025-06-18  2354  	    READ_ONCE(mem_over_limit->memory.max)) &&
b5db553cc19549 Zhongkun He 2025-06-18  2355  	    mem_cgroup_oom(mem_over_limit, gfp_mask,
b5db553cc19549 Zhongkun He 2025-06-18  2356  			  get_order((nr_pages - nr_reclaimed)  * PAGE_SIZE))) {
b5db553cc19549 Zhongkun He 2025-06-18  2357  		nr_retries = MAX_RECLAIM_RETRIES;
b5db553cc19549 Zhongkun He 2025-06-18  2358  		goto retry;
b5db553cc19549 Zhongkun He 2025-06-18  2359  	}
b5db553cc19549 Zhongkun He 2025-06-18  2360  
b5db553cc19549 Zhongkun He 2025-06-18  2361  out:
b5db553cc19549 Zhongkun He 2025-06-18  2362  	css_put(&memcg->css);
b5db553cc19549 Zhongkun He 2025-06-18  2363  }

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/2] jbd2: mark the transaction context with the scope PF_MEMALLOC_ACFORCE context
  2025-06-18 11:39 [PATCH 0/2] Postpone memcg reclaim to return-to-user path Zhongkun He
  2025-06-18 11:39 ` [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path Zhongkun He
@ 2025-06-18 11:39 ` Zhongkun He
  2025-06-18 22:37 ` [PATCH 0/2] Postpone memcg reclaim to return-to-user path Shakeel Butt
  2 siblings, 0 replies; 8+ messages in thread
From: Zhongkun He @ 2025-06-18 11:39 UTC (permalink / raw)
  To: akpm, tytso, jack, hannes, mhocko
  Cc: muchun.song, linux-ext4, linux-kernel, linux-mm, cgroups,
	Zhongkun He, Muchun Song

The jbd2 handle, associated with filesystem metadata,
can be held during direct reclaim when a memcg limit is hit.
This prevents other tasks from writing pages, resulting
in shrink failures due to dirty pages that cannot be written
back. These shrink failures may leave many tasks stuck in
the uninterruptible (D) state. The OOM killer may select a
victim and return success, allowing the current thread to
retry the memory charge. However, the selected task cannot
respond to the SIGKILL because it is also stuck in the
uninterruptible state. As a result, the charging task resets
nr_retries and attempts reclaim again, but the victim never
exits. This leads to a prolonged retry loop in direct reclaim
with the jbd2 handle held, significantly extending its hold
time and potentially causing a system-wide block.

We found that a related issue has been reported and
partially addressed in previous fixes [1][2].
However, those fixes only skip direct reclaim and
return a failure for some cases like readahead
requests. Since sb_getblk() is called multiple
times in __ext4_get_inode_loc() with the NOFAIL flag,
the problem still persists.

So call the memalloc_account_force_save() to charge
the pages and delay the direct reclaim util return to
userland, to release the global resource jbd2 handle.

[1]:https://lore.kernel.org/linux-fsdevel/20230811071519.1094-1-teawaterz@linux.alibaba.com/
[2]:https://lore.kernel.org/all/20230914150011.843330-1-willy@infradead.org/T/#u

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 fs/jbd2/transaction.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index c7867139af69..d05847301a8f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -448,6 +448,13 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 	 * going to recurse back to the fs layer.
 	 */
 	handle->saved_alloc_context = memalloc_nofs_save();
+
+	/*
+	 * Avoid blocking on jbd2 handler in memcg direct reclaim
+	 * which may otherwise lead to system-wide stalls.
+	 */
+	handle->saved_alloc_context |= memalloc_account_force_save();
+
 	return 0;
 }

@@ -733,10 +740,10 @@ static void stop_this_handle(handle_t *handle)

 	rwsem_release(&journal->j_trans_commit_map, _THIS_IP_);
 	/*
-	 * Scope of the GFP_NOFS context is over here and so we can restore the
-	 * original alloc context.
+	 * Scope of the GFP_NOFS and PF_MEMALLOC_ACCOUNTFORCE context
+	 * is over here and so we can restore the original alloc context.
 	 */
-	memalloc_nofs_restore(handle->saved_alloc_context);
+	memalloc_flags_restore(handle->saved_alloc_context);
 }

 /**
@@ -1838,7 +1845,7 @@ int jbd2_journal_stop(handle_t *handle)
 		 * Handle is already detached from the transaction so there is
 		 * nothing to do other than free the handle.
 		 */
-		memalloc_nofs_restore(handle->saved_alloc_context);
+		memalloc_flags_restore(handle->saved_alloc_context);
 		goto free_and_exit;
 	}
 	journal = transaction->t_journal;
-- 
2.39.5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/2] Postpone memcg reclaim to return-to-user path
  2025-06-18 11:39 [PATCH 0/2] Postpone memcg reclaim to return-to-user path Zhongkun He
  2025-06-18 11:39 ` [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path Zhongkun He
  2025-06-18 11:39 ` [PATCH 2/2] jbd2: mark the transaction context with the scope PF_MEMALLOC_ACFORCE context Zhongkun He
@ 2025-06-18 22:37 ` Shakeel Butt
  2025-06-19  8:04   ` Jan Kara
  2025-06-24  7:44   ` Zhongkun He
  2 siblings, 2 replies; 8+ messages in thread
From: Shakeel Butt @ 2025-06-18 22:37 UTC (permalink / raw)
  To: Zhongkun He
  Cc: akpm, tytso, jack, hannes, mhocko, muchun.song, linux-ext4,
	linux-kernel, linux-mm, cgroups

Hi Zhongkun,

Thanks for a very detailed and awesome description of the problem. This
is a real issue and we at Meta face similar scenarios as well. However I
would not go for PF_MEMALLOC_ACFORCE approach as it is easy to abuse and
is very manual and requires detecting the code which can cause such
scenarios and then case-by-case opting them in. I would prefer a dynamic
or automated approach where the kernel detects such an issue is
happening and recover from it. Though a case can be made where we avoid
such scenarios from happening but that might not be possible everytime.

Also this is very memcg specific, I can clearly see the same scenario
can happen for global reclaim as well.

I have a couple of questions below:

On Wed, Jun 18, 2025 at 07:39:56PM +0800, Zhongkun He wrote:
> # Introduction
> 
> This patchset aims to introduce an approach to ensure that memory
> allocations are forced to be accounted to the memory cgroup, even if
> they exceed the cgroup's maximum limit. In such cases, the reclaim
> process is postponed until the task returns to the user.

This breaks memory.max semantics. Any reason memory.high is not used
here. Basically instead of memory.max, use memory.high as job limit. I
would like to know how memory.high is lacking for your use-case. Maybe
we can fix that or introduce a new form of limit. However this is memcg
specific and will not resolve the global reclaim case.

> This is
> beneficial for users who perform over-max reclaim while holding multiple
> locks or other resources (especially resources related to file system
> writeback). If a task needs any of these resources, it would otherwise
> have to wait until the other task completes reclaim and releases the
> resources. Postponing reclaim to the return-to-user path helps avoid this issue.
> 
> # Background
> 
> We have been encountering an hungtask issue for a long time. Specifically,
> when a task holds the jbd2 handler 

Can you explain a bit more about jbd2 handler? Is it some global shared
lock or a workqueue which can only run single thread at a time.
Basically is there a way to get the current holder/owner of jbd2 handler
programmatically?

> and subsequently enters direct reclaim
> because it reaches the hard limit within a memory cgroup, the system may become
> blocked for a long time. The stack trace of waiting thread holding the jbd2
> handle is as follows (and so many other threads are waiting on the same jbd2
> handle):
> 
>  #0 __schedule at ffffffff97abc6c9
>  #1 preempt_schedule_common at ffffffff97abcdaa
>  #2 __cond_resched at ffffffff97abcddd
>  #3 shrink_active_list at ffffffff9744dca2
>  #4 shrink_lruvec at ffffffff97451407
>  #5 shrink_node at ffffffff974517c9
>  #6 do_try_to_free_pages at ffffffff97451dae
>  #7 try_to_free_mem_cgroup_pages at ffffffff974542b8
>  #8 try_charge_memcg at ffffffff974f0ede
>  #9 charge_memcg at ffffffff974f1d0e
> #10 __mem_cgroup_charge at ffffffff974f391c
> #11 __add_to_page_cache_locked at ffffffff974313e5
> #12 add_to_page_cache_lru at ffffffff974324b2
> #13 pagecache_get_page at ffffffff974338e3
> #14 __getblk_gfp at ffffffff97556798
> #15 __ext4_get_inode_loc at ffffffffc07a5518 [ext4]
> #16 ext4_get_inode_loc at ffffffffc07a7fec [ext4]
> #17 ext4_reserve_inode_write at ffffffffc07a9fb1 [ext4]
> #18 __ext4_mark_inode_dirty at ffffffffc07aa249 [ext4]
> #19 __ext4_new_inode at ffffffffc079cbae [ext4]
> #20 ext4_create at ffffffffc07c3e56 [ext4]
> #21 path_openat at ffffffff9751f471
> #22 do_filp_open at ffffffff97521384
> #23 do_sys_openat2 at ffffffff97508fd6
> #24 do_sys_open at ffffffff9750a65b
> #25 do_syscall_64 at ffffffff97aaed14
> 
> We've obtained a coredump and dumped struct scan_control from it by using crash tool.
> 
> struct scan_control {
>   nr_to_reclaim = 32,
>   order = 0 '\000',
>   priority = 1 '\001',
>   reclaim_idx = 4 '\004',
>   gfp_mask = 17861706, __GFP_NOFAIL
>   nr_scanned = 27810, 
>   nr_reclaimed = 0,
>   nr = {
>         dirty = 27797,
>         unqueued_dirty = 27797,
>         congested = 0,
>         writeback = 0,
>         immediate = 0,
>         file_taken = 27810,
>         taken = 27810
>   },
> }
> 

What is the kernel version? Can you run scripts/gfp-translate on the
gfp_mask above? Does this kernel have a75ffa26122b ("memcg, oom: do not
bypass oom killer for dying tasks")?

> The ->nr_reclaimed is zero meaning there is no memory we have reclaimed because
> most of the file pages are unqueued dirty. And ->priority is 1 also meaning we
> spent so much time on memory reclamation.

Is there a way to get how many times this thread has looped within
try_charge_memcg()?

> Since this thread has held the jbd2
> handler, the jbd2 thread was waiting for the same jbd2 handler, which blocked
> so many other threads from writing dirty pages as well.
> 
> 0 [] __schedule at ffffffff97abc6c9
> 1 [] schedule at ffffffff97abcd01
> 2 [] jbd2_journal_wait_updates at ffffffffc05a522f [jbd2]
> 3 [] jbd2_journal_commit_transaction at ffffffffc05a72c6 [jbd2]
> 4 [] kjournald2 at ffffffffc05ad66d [jbd2]
> 5 [] kthread at ffffffff972bc4c0
> 6 [] ret_from_fork at ffffffff9720440f
> 
> Furthermore, we observed that memory usage far exceeded the configured memory maximum,
> reaching around 38GB.
> 
> memory.max  : 134896020    514 GB
> memory.usage: 144747169    552 GB

This is unexpected and most probably our hacks to allow overcharge to
avoid similar situations are causing this. 

> 
> We investigated this issue and identified the root cause:
>   try_charge_memcg:
>     retry charge
>      charge failed
>        -> direct reclaim
>         -> mem_cgroup_oom    return true，but selected task is in an uninterruptible state
>            -> retry charge

Oh oom reaper didn't help?

> 
> In which cases, we saw many tasks in the uninterruptible (D) state with a pending
> SIGKILL signal. The OOM killer selects a victim and returns success, allowing the
> current thread to retry the memory charge. However, the selected task cannot acknowledge
> the SIGKILL signal because it is stuck in an uninterruptible state.

OOM reaper usually helps in such cases but I see below why it didn't
help.

> As a result,
> the charging task resets nr_retries and attempts to reclaim again, but the victim
> task never exits. This causes the current thread to enter a prolonged retry loop
> during direct reclaim, holding the jbd2 handler for much more time and leading to
> system-wide blocking. Why are there so many uninterruptible (D) state tasks?
> Check the most common stack trace.
> 
> crash> task_struct.__state ffff8c53a15b3080
>   __state = 2,   #define TASK_UNINTERRUPTIBLE        0x0002
>  0 [] __schedule at ffffffff97abc6c9
>  1 [] schedule at ffffffff97abcd01
>  2 [] schedule_preempt_disabled at ffffffff97abdf1a
>  3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
>  4 [] down_read at ffffffff97ac06b1
>  5 [] do_user_addr_fault at ffffffff9727f1e7
>  6 [] exc_page_fault at ffffffff97ab286e
>  7 [] asm_exc_page_fault at ffffffff97c00d42
> 
> Check the owner of mm_struct.mmap_lock. The task below was entering memory reclaim
> holding mmap lock and there are 68 tasks in this memory cgroup, with 23 of them in
> the memory reclaim context.
> 

The following thread has mmap_lock in write mode and thus oom-reaper is
not helping. Do you see "oom_reaper: unable to reap pid..." messages in
dmesg?

>  7 [] shrink_active_list at ffffffff9744dd46
>  8 [] shrink_lruvec at ffffffff97451407
>  9 [] shrink_node at ffffffff974517c9
> 10 [] do_try_to_free_pages at ffffffff97451dae
> 11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
> 12 [] try_charge_memcg at ffffffff974f0ede
> 13 [] obj_cgroup_charge_pages at ffffffff974f1dae
> 14 [] obj_cgroup_charge at ffffffff974f2fc2
> 15 [] kmem_cache_alloc at ffffffff974d054c
> 16 [] vm_area_dup at ffffffff972923f1
> 17 [] __split_vma at ffffffff97486c16
> 18 [] __do_munmap at ffffffff97486e78
> 19 [] __vm_munmap at ffffffff97487307
> 20 [] __x64_sys_munmap at ffffffff974873e7
> 21 [] do_syscall_64 at ffffffff97aaed14
> 
> Many threads was entering the memory reclaim in UN state, other threads was blocking
> on mmap_lock. Although the OOM killer selects a victim, it cannot terminate it.

Can you please confirm the above? Is the kernel able to oom-kill more
processes or if it is returning early because the current thread is
dying. However if the cgroup has just one big process, this doesn't
matter.

> The
> task holding the jbd2 handle retries memory charge, but it fails. Reclaiming continues
> while holding the jbd2 handler. write_pages also fails while waiting for the same jbd2
> handler, causing repeated shrink failures and potentially leading to a system-wide block.
> 
> ps | grep UN | wc -l
> 1463
> 
> While the system has 1463 UN state tasks, so the way to break this akin to "deadlock" is
> to let the thread holding jbd2 handler quickly exit the memory reclamation process.
> 
> We found that a related issue was reported and partially fixed in previous patches [1][2].
> However, those fixes only skip direct reclamation and return a failure for some cases such
> as readahead requests. As sb_getblk() is called multiple times in __ext4_get_inode_loc()
> with the NOFAIL flag, the problem still exists. And it is not feasible to simply remove
> __GFP_RECLAIMABLE when holding jbd2 handle to avoid potential very long memory reclaim
> latency,  as __GFP_NOFAIL is not supported without __GFP_DIRECT_RECLAIM.
> 
> # Fundamentals
> 
> This patchset introduces a new task flag of PF_MEMALLOC_ACFORCE to indicate that memory
> allocations are forced to be accounted to the memory cgroup, even if they exceed the cgroup's
> maximum limit. The reclaim process is deferred until the task returns to the user without
> holding any kernel resources for memory reclamation, thereby preventing priority inversion
> problems. Any users who might encounter potential similar issues can utilize this new flag
> to allocate memory and prevent long-term latency for the entire system.

I already explained upfront why this is not the approach we want.

We do see similar situations/scenarios but due to global/shared locks in
btrfs but I expect any global lock or global shared resource can cause
such priority inversion situations.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/2] Postpone memcg reclaim to return-to-user path
  2025-06-18 22:37 ` [PATCH 0/2] Postpone memcg reclaim to return-to-user path Shakeel Butt
@ 2025-06-19  8:04   ` Jan Kara
  2025-06-24  7:45     ` [External] " Zhongkun He
  2025-06-24  7:44   ` Zhongkun He
  1 sibling, 1 reply; 8+ messages in thread
From: Jan Kara @ 2025-06-19  8:04 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Zhongkun He, akpm, tytso, jack, hannes, mhocko, muchun.song,
	linux-ext4, linux-kernel, linux-mm, cgroups

On Wed 18-06-25 15:37:20, Shakeel Butt wrote:
> > This is
> > beneficial for users who perform over-max reclaim while holding multiple
> > locks or other resources (especially resources related to file system
> > writeback). If a task needs any of these resources, it would otherwise
> > have to wait until the other task completes reclaim and releases the
> > resources. Postponing reclaim to the return-to-user path helps avoid this issue.
> > 
> > # Background
> > 
> > We have been encountering an hungtask issue for a long time. Specifically,
> > when a task holds the jbd2 handler 
> 
> Can you explain a bit more about jbd2 handler? Is it some global shared
> lock or a workqueue which can only run single thread at a time.
> Basically is there a way to get the current holder/owner of jbd2 handler
> programmatically?

There's a typo in the original email :). It should be "jbd2 handle". And
that is just a reference to the currently running transaction in ext4
filesystem. There can be always at most one running transaction in ext4
filesystem and until the last reference is dropped it cannot commit. This
eventually (once the transaction reaches its maximum size) blocks all the
other modifications to the filesystem. So it is shared global resource
that's held by the process doing reclaim.

Since there can be many holders of references to the currently running
transaction there's no easy way to iterate processes that are holding the
references... That being said ext4 sets current->journal_info when
acquiring a journal handle but other filesystems use this field for other
purposes so current->journal_info being non-NULL does not mean jbd2 handle
is held.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [External] Re: [PATCH 0/2] Postpone memcg reclaim to return-to-user path
  2025-06-19  8:04   ` Jan Kara
@ 2025-06-24  7:45     ` Zhongkun He
  0 siblings, 0 replies; 8+ messages in thread
From: Zhongkun He @ 2025-06-24  7:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Shakeel Butt, akpm, tytso, jack, hannes, mhocko, muchun.song,
	linux-ext4, linux-kernel, linux-mm, cgroups

On Thu, Jun 19, 2025 at 4:05 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 18-06-25 15:37:20, Shakeel Butt wrote:
> > > This is
> > > beneficial for users who perform over-max reclaim while holding multiple
> > > locks or other resources (especially resources related to file system
> > > writeback). If a task needs any of these resources, it would otherwise
> > > have to wait until the other task completes reclaim and releases the
> > > resources. Postponing reclaim to the return-to-user path helps avoid this issue.
> > >
> > > # Background
> > >
> > > We have been encountering an hungtask issue for a long time. Specifically,
> > > when a task holds the jbd2 handler
> >
> > Can you explain a bit more about jbd2 handler? Is it some global shared
> > lock or a workqueue which can only run single thread at a time.
> > Basically is there a way to get the current holder/owner of jbd2 handler
> > programmatically?
>
> There's a typo in the original email :). It should be "jbd2 handle". And
> that is just a reference to the currently running transaction in ext4
> filesystem. There can be always at most one running transaction in ext4
> filesystem and until the last reference is dropped it cannot commit. This
> eventually (once the transaction reaches its maximum size) blocks all the
> other modifications to the filesystem. So it is shared global resource
> that's held by the process doing reclaim.
>
> Since there can be many holders of references to the currently running
> transaction there's no easy way to iterate processes that are holding the
> references... That being said ext4 sets current->journal_info when
> acquiring a journal handle but other filesystems use this field for other
> purposes so current->journal_info being non-NULL does not mean jbd2 handle
> is held.

Hi Jan,
Thanks for your feedback and explanations.

>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [External] Re: [PATCH 0/2] Postpone memcg reclaim to return-to-user path
  2025-06-18 22:37 ` [PATCH 0/2] Postpone memcg reclaim to return-to-user path Shakeel Butt
  2025-06-19  8:04   ` Jan Kara
@ 2025-06-24  7:44   ` Zhongkun He
  1 sibling, 0 replies; 8+ messages in thread
From: Zhongkun He @ 2025-06-24  7:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: akpm, tytso, jack, hannes, mhocko, muchun.song, linux-ext4,
	linux-kernel, linux-mm, cgroups

Hi Shakeel,

Thank you for your detailed feedback and sorry for the late reply.
I need a bit more time to thoroughly review the case.

On Thu, Jun 19, 2025 at 6:37 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Hi Zhongkun,
>
> Thanks for a very detailed and awesome description of the problem. This
> is a real issue and we at Meta face similar scenarios as well. However I
> would not go for PF_MEMALLOC_ACFORCE approach as it is easy to abuse and
> is very manual and requires detecting the code which can cause such
> scenarios and then case-by-case opting them in. I would prefer a dynamic

Currently, our goal is to address the hung task issue caused by the
jbd2 handle which
has a high probability of occurring in our scenario and is more likely
to cause problems.
The patch only involves APIs used within the start/stop_this_handle()
context in jbd2,
such as memalloc_nofs_save() and can be used in other lock contexts.
Please refer to patch 2 for further details. It appears
that Btrfs employs a similar transactional mechanism via btrfs_trans_handle, and
makes use of memalloc_nofs_save() to safeguard metadata allocations.
For example,
see commit 84de76a2fb21 ("btrfs: protect space cache inode alloc with
GFP_NOFS").
But I'm not sure if the same problem exists.

> or automated approach where the kernel detects such an issue is
> happening and recover from it. Though a case can be made where we avoid
> such scenarios from happening but that might not be possible everytime.
>

Yes, you are right — we need to identify and fix these issues case by
case. However,
we currently have a similar mechanism, like memalloc_nofs_save(), to
handle this.
An automated approach would definitely be better, or alternatively, we
could introduce
a new memory interface to delay memory reclaim until returning to userspace."

> Also this is very memcg specific, I can clearly see the same scenario
> can happen for global reclaim as well.

Yes, the same scenario can occur during global reclaim — we have encountered it.
However, global memory shortages are difficult to resolve. Maybe we
can aggressively
trigger the OOM killer before tasks enter the uninterruptible (UN) state.
Or, a more aggressive approach would be to reduce the number of direct
reclaim retries,
allowing earlier access to memory reserves if the task holds the
global resource.
It is hard to balance.

>
> I have a couple of questions below:
>
> On Wed, Jun 18, 2025 at 07:39:56PM +0800, Zhongkun He wrote:
> > # Introduction
> >
> > This patchset aims to introduce an approach to ensure that memory
> > allocations are forced to be accounted to the memory cgroup, even if
> > they exceed the cgroup's maximum limit. In such cases, the reclaim
> > process is postponed until the task returns to the user.
>
> This breaks memory.max semantics. Any reason memory.high is not used
> here. Basically instead of memory.max, use memory.high as job limit. I
> would like to know how memory.high is lacking for your use-case. Maybe
> we can fix that or introduce a new form of limit. However this is memcg
> specific and will not resolve the global reclaim case.

The gpf_nofail is already breaking the memory.max semantics and this patchset
is just breaking the limit in kernel context, and trying to reclaim or
oom on return
back to userspace. IIUC, memory.high will slow down the task, but we
would prefer
to kill the pod and let it be rescheduled on another machine, rather
than having it blocked.

>
> > This is
> > beneficial for users who perform over-max reclaim while holding multiple
> > locks or other resources (especially resources related to file system
> > writeback). If a task needs any of these resources, it would otherwise
> > have to wait until the other task completes reclaim and releases the
> > resources. Postponing reclaim to the return-to-user path helps avoid this issue.
> >
> > # Background
> >
> > We have been encountering an hungtask issue for a long time. Specifically,
> > when a task holds the jbd2 handler
>
> Can you explain a bit more about jbd2 handler? Is it some global shared
> lock or a workqueue which can only run single thread at a time.
> Basically is there a way to get the current holder/owner of jbd2 handler
> programmatically?

Thanks to Jan Kara for providing a detailed explanation in the letter
above. It is a shared global resource that's held by the process doing reclaim.


>
> > and subsequently enters direct reclaim
> > because it reaches the hard limit within a memory cgroup, the system may become
> > blocked for a long time. The stack trace of waiting thread holding the jbd2
> > handle is as follows (and so many other threads are waiting on the same jbd2
> > handle):
> >
> >  #0 __schedule at ffffffff97abc6c9
> >  #1 preempt_schedule_common at ffffffff97abcdaa
> >  #2 __cond_resched at ffffffff97abcddd
> >  #3 shrink_active_list at ffffffff9744dca2
> >  #4 shrink_lruvec at ffffffff97451407
> >  #5 shrink_node at ffffffff974517c9
> >  #6 do_try_to_free_pages at ffffffff97451dae
> >  #7 try_to_free_mem_cgroup_pages at ffffffff974542b8
> >  #8 try_charge_memcg at ffffffff974f0ede
> >  #9 charge_memcg at ffffffff974f1d0e
> > #10 __mem_cgroup_charge at ffffffff974f391c
> > #11 __add_to_page_cache_locked at ffffffff974313e5
> > #12 add_to_page_cache_lru at ffffffff974324b2
> > #13 pagecache_get_page at ffffffff974338e3
> > #14 __getblk_gfp at ffffffff97556798
> > #15 __ext4_get_inode_loc at ffffffffc07a5518 [ext4]
> > #16 ext4_get_inode_loc at ffffffffc07a7fec [ext4]
> > #17 ext4_reserve_inode_write at ffffffffc07a9fb1 [ext4]
> > #18 __ext4_mark_inode_dirty at ffffffffc07aa249 [ext4]
> > #19 __ext4_new_inode at ffffffffc079cbae [ext4]
> > #20 ext4_create at ffffffffc07c3e56 [ext4]
> > #21 path_openat at ffffffff9751f471
> > #22 do_filp_open at ffffffff97521384
> > #23 do_sys_openat2 at ffffffff97508fd6
> > #24 do_sys_open at ffffffff9750a65b
> > #25 do_syscall_64 at ffffffff97aaed14
> >
> > We've obtained a coredump and dumped struct scan_control from it by using crash tool.
> >
> > struct scan_control {
> >   nr_to_reclaim = 32,
> >   order = 0 '\000',
> >   priority = 1 '\001',
> >   reclaim_idx = 4 '\004',
> >   gfp_mask = 17861706, __GFP_NOFAIL
> >   nr_scanned = 27810,
> >   nr_reclaimed = 0,
> >   nr = {
> >         dirty = 27797,
> >         unqueued_dirty = 27797,
> >         congested = 0,
> >         writeback = 0,
> >         immediate = 0,
> >         file_taken = 27810,
> >         taken = 27810
> >   },
> > }
> >
>
> What is the kernel version? Can you run scripts/gfp-translate on the
> gfp_mask above? Does this kernel have a75ffa26122b ("memcg, oom: do not
> bypass oom killer for dying tasks")?

This kernel is 5.15+,  I think this problem is still here

Parsing: 17861706
#define ___GFP_HIGHMEM 0x02
#define ___GFP_MOVABLE 0x08
#define ___GFP_IO 0x40
#define ___GFP_DIRECT_RECLAIM 0x400
#define ___GFP_KSWAPD_RECLAIM 0x800
#define ___GFP_NOFAIL 0x8000
#define ___GFP_HARDWALL 0x100000
#define ___GFP_SKIP_KASAN_POISON 0x1000000
There is no such patch a75ffa26122b, and I think it is reasonable but
the task holding
the jbd2 handle is not a dying task.

>
> > The ->nr_reclaimed is zero meaning there is no memory we have reclaimed because
> > most of the file pages are unqueued dirty. And ->priority is 1 also meaning we
> > spent so much time on memory reclamation.
>
> Is there a way to get how many times this thread has looped within
> try_charge_memcg()?

The nr_retries is MAX_RECLAIM_RETRIES, and the mem_cgroup not in oom,
mem_cgroup 0xffff8c82a27c0800 | grep oom
  oom_group = false,
  oom_lock = false,
  under_oom = 0

Maybe we enter the try_charge_memcg in the first loop.
I will explain it later.

>
> > Since this thread has held the jbd2
> > handler, the jbd2 thread was waiting for the same jbd2 handler, which blocked
> > so many other threads from writing dirty pages as well.
> >
> > 0 [] __schedule at ffffffff97abc6c9
> > 1 [] schedule at ffffffff97abcd01
> > 2 [] jbd2_journal_wait_updates at ffffffffc05a522f [jbd2]
> > 3 [] jbd2_journal_commit_transaction at ffffffffc05a72c6 [jbd2]
> > 4 [] kjournald2 at ffffffffc05ad66d [jbd2]
> > 5 [] kthread at ffffffff972bc4c0
> > 6 [] ret_from_fork at ffffffff9720440f
> >
> > Furthermore, we observed that memory usage far exceeded the configured memory maximum,
> > reaching around 38GB.
> >
> > memory.max  : 134896020    514 GB
> > memory.usage: 144747169    552 GB
>
> This is unexpected and most probably our hacks to allow overcharge to
> avoid similar situations are causing this.

Yes.

>
> >
> > We investigated this issue and identified the root cause:
> >   try_charge_memcg:
> >     retry charge
> >      charge failed
> >        -> direct reclaim
> >         -> mem_cgroup_oom    return true，but selected task is in an uninterruptible state
> >            -> retry charge
>
> Oh oom reaper didn't help?
>
> >
> > In which cases, we saw many tasks in the uninterruptible (D) state with a pending
> > SIGKILL signal. The OOM killer selects a victim and returns success, allowing the
> > current thread to retry the memory charge. However, the selected task cannot acknowledge
> > the SIGKILL signal because it is stuck in an uninterruptible state.
>
> OOM reaper usually helps in such cases but I see below why it didn't
> help.
>
> > As a result,
> > the charging task resets nr_retries and attempts to reclaim again, but the victim
> > task never exits. This causes the current thread to enter a prolonged retry loop
> > during direct reclaim, holding the jbd2 handler for much more time and leading to
> > system-wide blocking. Why are there so many uninterruptible (D) state tasks?
> > Check the most common stack trace.
> >
> > crash> task_struct.__state ffff8c53a15b3080
> >   __state = 2,   #define TASK_UNINTERRUPTIBLE        0x0002
> >  0 [] __schedule at ffffffff97abc6c9
> >  1 [] schedule at ffffffff97abcd01
> >  2 [] schedule_preempt_disabled at ffffffff97abdf1a
> >  3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
> >  4 [] down_read at ffffffff97ac06b1
> >  5 [] do_user_addr_fault at ffffffff9727f1e7
> >  6 [] exc_page_fault at ffffffff97ab286e
> >  7 [] asm_exc_page_fault at ffffffff97c00d42
> >
> > Check the owner of mm_struct.mmap_lock. The task below was entering memory reclaim
> > holding mmap lock and there are 68 tasks in this memory cgroup, with 23 of them in
> > the memory reclaim context.
> >
>
> The following thread has mmap_lock in write mode and thus oom-reaper is
> not helping. Do you see "oom_reaper: unable to reap pid..." messages in
> dmesg?
>
> >  7 [] shrink_active_list at ffffffff9744dd46
> >  8 [] shrink_lruvec at ffffffff97451407
> >  9 [] shrink_node at ffffffff974517c9
> > 10 [] do_try_to_free_pages at ffffffff97451dae
> > 11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
> > 12 [] try_charge_memcg at ffffffff974f0ede
> > 13 [] obj_cgroup_charge_pages at ffffffff974f1dae
> > 14 [] obj_cgroup_charge at ffffffff974f2fc2
> > 15 [] kmem_cache_alloc at ffffffff974d054c
> > 16 [] vm_area_dup at ffffffff972923f1
> > 17 [] __split_vma at ffffffff97486c16
> > 18 [] __do_munmap at ffffffff97486e78
> > 19 [] __vm_munmap at ffffffff97487307
> > 20 [] __x64_sys_munmap at ffffffff974873e7
> > 21 [] do_syscall_64 at ffffffff97aaed14
> >
> > Many threads was entering the memory reclaim in UN state, other threads was blocking
> > on mmap_lock. Although the OOM killer selects a victim, it cannot terminate it.
>
> Can you please confirm the above? Is the kernel able to oom-kill more
> processes or if it is returning early because the current thread is
> dying. However if the cgroup has just one big process, this doesn't
> matter.

Hi Shakeel

I'm a bit confused here because I don't see any "oom_reaper: unable
to reap pid..." or "Killed process %d (%s) total-vm:%lukB ..." messages
in the dmesg output. However, I did find many tasks with a pending
SIGKILL signal.

If these tasks weren't killed by the OOM killer, why do they have SIGKILL
pending in shrink context? That's why I'm unsure about this part.
Please have a look：

crash> task_struct.pending.signal.sig ffff8bc4c143e100
  pending.signal.sig = {256}
crash> task_struct.thread_info.flags ffff8bc4c143e100
  thread_info.flags = 16388,  bit 2 ; #defineTIF_SIGPENDING      2
crash> bt
PID: 1269110  TASK: ffff8bc4c143e100  CPU: 5    COMMAND: "xx"
 #0 [] __schedule at ffffffff97abc6c9
 #1 [] preempt_schedule_common at ffffffff97abcdaa
 #2 [] __cond_resched at ffffffff97abcddd
 #3 [] shrink_lruvec at ffffffff9745120a
 #4 [] shrink_node at ffffffff974517c9
 #5 [] do_try_to_free_pages at ffffffff97451dae
 #6 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
 #7 [] try_charge_memcg at ffffffff974f0ede
 #8 [] charge_memcg at ffffffff974f1d0e
 #9 [] __mem_cgroup_charge at ffffffff974f391c
#10 [] __add_to_page_cache_locked at ffffffff974313e5
#11 [] add_to_page_cache_lru at ffffffff974324b2
#12 [] page_cache_ra_unbounded at ffffffff97442227
#13 [] filemap_fault at ffffffff97434b80
#14 [] __do_fault at ffffffff9747b870
#15 [] __handle_mm_fault at ffffffff97480329
#16 [] handle_mm_fault at ffffffff9748081e
#17 [] do_user_addr_fault at ffffffff9727f01b
#18 [] exc_page_fault at ffffffff97ab286e
#19 [] asm_exc_page_fault at ffffffff97c00d42

>
> > The
> > task holding the jbd2 handle retries memory charge, but it fails. Reclaiming continues
> > while holding the jbd2 handler. write_pages also fails while waiting for the same jbd2
> > handler, causing repeated shrink failures and potentially leading to a system-wide block.
> >
> > ps | grep UN | wc -l
> > 1463
> >
> > While the system has 1463 UN state tasks, so the way to break this akin to "deadlock" is
> > to let the thread holding jbd2 handler quickly exit the memory reclamation process.
> >
> > We found that a related issue was reported and partially fixed in previous patches [1][2].
> > However, those fixes only skip direct reclamation and return a failure for some cases such
> > as readahead requests. As sb_getblk() is called multiple times in __ext4_get_inode_loc()
> > with the NOFAIL flag, the problem still exists. And it is not feasible to simply remove
> > __GFP_RECLAIMABLE when holding jbd2 handle to avoid potential very long memory reclaim
> > latency,  as __GFP_NOFAIL is not supported without __GFP_DIRECT_RECLAIM.
> >
> > # Fundamentals
> >
> > This patchset introduces a new task flag of PF_MEMALLOC_ACFORCE to indicate that memory
> > allocations are forced to be accounted to the memory cgroup, even if they exceed the cgroup's
> > maximum limit. The reclaim process is deferred until the task returns to the user without
> > holding any kernel resources for memory reclamation, thereby preventing priority inversion
> > problems. Any users who might encounter potential similar issues can utilize this new flag
> > to allocate memory and prevent long-term latency for the entire system.
>
> I already explained upfront why this is not the approach we want.
>
> We do see similar situations/scenarios but due to global/shared locks in
> btrfs but I expect any global lock or global shared resource can cause
> such priority inversion situations.

I completely agree. Doesn’t this mean that we should have a mechanism
or a flag to indicate when a task is holding global resources or locks, so
that appropriate decisions can be made in global and memcg memory
reclaim contexts, in order to avoid holding locks or resources for an extended
period of time?  It would be better if we can have an automated approach that
solves all the problems.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-06-24 14:47 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-18 11:39 [PATCH 0/2] Postpone memcg reclaim to return-to-user path Zhongkun He
2025-06-18 11:39 ` [PATCH 1/2] mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to return-to-userland path Zhongkun He
2025-06-24 14:47   ` Dan Carpenter
2025-06-18 11:39 ` [PATCH 2/2] jbd2: mark the transaction context with the scope PF_MEMALLOC_ACFORCE context Zhongkun He
2025-06-18 22:37 ` [PATCH 0/2] Postpone memcg reclaim to return-to-user path Shakeel Butt
2025-06-19  8:04   ` Jan Kara
2025-06-24  7:45     ` [External] " Zhongkun He
2025-06-24  7:44   ` Zhongkun He

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).