linux-mm.kvack.org archive mirror
* [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
@ 2025-08-14 13:55 zhongjinji
  2025-08-14 13:55 ` [PATCH v4 1/3] futex: Introduce function process_has_robust_futex() zhongjinji
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-14 13:55 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mhocko, rientjes, shakeel.butt, npache, linux-kernel, tglx,
	mingo, peterz, dvhart, dave, andrealmeid, liam.howlett, liulu.liu,
	feng.han, zhongjinji

From: zhongjinji <zhongjinji@honor.com>

The OOM reaper quickly reclaims a process's memory when the system hits OOM,
helping the system recover. Without the OOM reaper, if a process frozen by
cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
system in a poor state. Even if the process is not frozen by cgroup v1,
reclaiming victims' memory remains important, as having one more process
working speeds up memory release.

When processes holding robust futexes are OOM killed but waiters on those
futexes remain alive, the robust futexes might be reaped before
futex_cleanup() runs. This can cause the waiters to block indefinitely [1].

To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1]. Since
many killed processes exit within 2 seconds, the OOM reaper rarely runs after
this delay. However, robust futex users are few, so delaying OOM reap for all
victims is unnecessary.

If each thread's robust_list in a process is NULL, the process holds no robust
futexes. For such processes, the OOM reaper should not be delayed. For
processes holding robust futexes, to avoid issue [1], the OOM reaper must
still be delayed.
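
For reference, a typical robust futex user looks like the sketch below.
The C library registers each thread's robust_list head with the kernel via
set_robust_list(), and robust mutexes are linked onto that list while held
(minimal example, error handling omitted):

	#include <pthread.h>

	int main(void)
	{
		pthread_mutexattr_t attr;
		pthread_mutex_t mu;

		pthread_mutexattr_init(&attr);
		/* A robust mutex is tracked on the owner's robust_list so
		 * the kernel can wake waiters with EOWNERDEAD if the owner
		 * dies while holding it. */
		pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
		pthread_mutex_init(&mu, &attr);

		pthread_mutex_lock(&mu);
		/* An OOM kill here is the case this series is about: the
		 * waiters rely on futex_cleanup() running before the
		 * victim's memory is reaped. */
		pthread_mutex_unlock(&mu);
		return 0;
	}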

Patch 1 introduces process_has_robust_futex() to detect whether a process uses
robust futexes. Patch 2 delays the OOM reaper only for processes holding robust
futexes, improving OOM reaper performance. Patch 3 makes the OOM reaper and
exit_mmap() traverse the maple tree in opposite orders to reduce PTE lock
contention caused by unmapping the same vma.

Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]

---

v3 -> v4:

1. Rename check_robust_futex() to process_has_robust_futex() for clearer
   intent.
2. Because the delay_reap parameter was added to task_will_free_mem(),
   the function is renamed to should_reap_task() to better clarify
   its purpose.
3. Add should_delay_oom_reap() to decide whether to delay OOM reap.
4. Modify the OOM reaper to traverse the maple tree in reverse order; see patch
   3 for details.
These changes improve code readability and enhance OOM reaper behavior.

zhongjinji (3):
  futex: Introduce function process_has_robust_futex()
  mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple
    tree in opposite orders

 include/linux/futex.h |  5 ++++
 include/linux/mm.h    |  3 +++
 kernel/futex/core.c   | 30 +++++++++++++++++++++++
 mm/oom_kill.c         | 55 +++++++++++++++++++++++++++++++------------
 4 files changed, 78 insertions(+), 15 deletions(-)

-- 
2.17.1




* [PATCH v4 1/3] futex: Introduce function process_has_robust_futex()
  2025-08-14 13:55 [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
@ 2025-08-14 13:55 ` zhongjinji
  2025-08-14 13:55 ` [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-14 13:55 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mhocko, rientjes, shakeel.butt, npache, linux-kernel, tglx,
	mingo, peterz, dvhart, dave, andrealmeid, liam.howlett, liulu.liu,
	feng.han, zhongjinji

From: zhongjinji <zhongjinji@honor.com>

When the holders of robust futexes are OOM killed but the waiters on robust
futexes are still alive, the robust futexes might be reaped before
futex_cleanup() runs. This can cause the waiters to block indefinitely [1].
To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
However, the OOM reaper now rarely runs since many killed processes exit
within 2 seconds.

Because robust futex users are few, delay the reaper's execution only for
processes holding robust futexes to improve the performance of the OOM
reaper.

Introduce the function process_has_robust_futex() to detect whether a process
uses robust futexes. If every thread's robust_list in a process is NULL, the
process holds no robust futexes; otherwise, it does.

Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 include/linux/futex.h |  5 +++++
 kernel/futex/core.c   | 30 ++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/futex.h b/include/linux/futex.h
index 9e9750f04980..39540b7ae2a1 100644
--- a/include/linux/futex.h
+++ b/include/linux/futex.h
@@ -81,6 +81,7 @@ void futex_exec_release(struct task_struct *tsk);
 long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 	      u32 __user *uaddr2, u32 val2, u32 val3);
 int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4);
+bool process_has_robust_futex(struct task_struct *tsk);
 
 #ifdef CONFIG_FUTEX_PRIVATE_HASH
 int futex_hash_allocate_default(void);
@@ -108,6 +109,10 @@ static inline int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsig
 {
 	return -EINVAL;
 }
+static inline bool process_has_robust_futex(struct task_struct *tsk)
+{
+	return false;
+}
 static inline int futex_hash_allocate_default(void)
 {
 	return 0;
diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index d9bb5567af0c..01b6561ab4f6 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -1961,6 +1961,36 @@ int futex_hash_prctl(unsigned long arg2, unsigned long arg3, unsigned long arg4)
 	return ret;
 }
 
+/*
+ * process_has_robust_futex() - check whether the given task holds robust futexes.
+ * @tsk: task_struct of the process to consider
+ *
+ * If any thread in the task has a non-NULL robust_list or compat_robust_list,
+ * it indicates that the task holds robust futexes.
+ */
+bool process_has_robust_futex(struct task_struct *tsk)
+{
+	struct task_struct *t;
+	bool ret = false;
+
+	rcu_read_lock();
+	for_each_thread(tsk, t) {
+		if (unlikely(t->robust_list)) {
+			ret = true;
+			break;
+		}
+#ifdef CONFIG_COMPAT
+		if (unlikely(t->compat_robust_list)) {
+			ret = true;
+			break;
+		}
+#endif
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static int __init futex_init(void)
 {
 	unsigned long hashsize, i;
-- 
2.17.1




* [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-14 13:55 [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
  2025-08-14 13:55 ` [PATCH v4 1/3] futex: Introduce function process_has_robust_futex() zhongjinji
@ 2025-08-14 13:55 ` zhongjinji
  2025-08-15 14:41   ` Lorenzo Stoakes
  2025-08-17 19:37   ` Michal Hocko
  2025-08-14 13:55 ` [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders zhongjinji
  2025-08-14 23:13 ` [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes Andrew Morton
  3 siblings, 2 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-14 13:55 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mhocko, rientjes, shakeel.butt, npache, linux-kernel, tglx,
	mingo, peterz, dvhart, dave, andrealmeid, liam.howlett, liulu.liu,
	feng.han, zhongjinji

From: zhongjinji <zhongjinji@honor.com>

The OOM reaper can quickly reap a process's memory when the system encounters
OOM, helping the system recover. Without the OOM reaper, if a process frozen
by cgroup v1 is OOM killed, the victim's memory cannot be freed, and the
system stays in a poor state. Even if the process is not frozen by cgroup v1,
reaping the victim's memory is still worthwhile, because one more process
working on reclaim speeds up memory release.

When processes holding robust futexes are OOM killed but waiters on those
futexes remain alive, the robust futexes might be reaped before
futex_cleanup() runs. This can cause the waiters to block indefinitely.
To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
The OOM reaper now rarely runs since many killed processes exit within 2
seconds.

Because robust futex users are few, it is unreasonable to delay the OOM reap
for all victims. For processes that do not hold robust futexes, the OOM reaper
should not be delayed; for processes holding robust futexes, the OOM reaper
must still be delayed to prevent the waiters from blocking indefinitely [1].

Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..7ae4001e47c1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -39,6 +39,7 @@
 #include <linux/ptrace.h>
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
+#include <linux/futex.h>
 #include <linux/ratelimit.h>
 #include <linux/kthread.h>
 #include <linux/init.h>
@@ -692,7 +693,7 @@ static void wake_oom_reaper(struct timer_list *timer)
  * before the exit path is able to wake the futex waiters.
  */
 #define OOM_REAPER_DELAY (2*HZ)
-static void queue_oom_reaper(struct task_struct *tsk)
+static void queue_oom_reaper(struct task_struct *tsk, bool delay)
 {
 	/* mm is already queued? */
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
@@ -700,7 +701,7 @@ static void queue_oom_reaper(struct task_struct *tsk)
 
 	get_task_struct(tsk);
 	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
-	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
 	add_timer(&tsk->oom_reaper_timer);
 }
 
@@ -742,7 +743,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static inline void queue_oom_reaper(struct task_struct *tsk)
+static inline void queue_oom_reaper(struct task_struct *tsk, bool delay)
 {
 }
 #endif /* CONFIG_MMU */
@@ -843,6 +844,16 @@ bool oom_killer_disable(signed long timeout)
 	return true;
 }
 
+/*
+ * If the owner thread of robust futexes is killed by OOM, the robust futexes might be freed
+ * by the OOM reaper before futex_cleanup() runs, which could cause the waiters to
+ * block indefinitely. So when the task holds robust futexes, delay the OOM reaper.
+ */
+static inline bool should_delay_oom_reap(struct task_struct *task)
+{
+	return process_has_robust_futex(task);
+}
+
 static inline bool __task_will_free_mem(struct task_struct *task)
 {
 	struct signal_struct *sig = task->signal;
@@ -865,17 +876,19 @@ static inline bool __task_will_free_mem(struct task_struct *task)
 }
 
 /*
- * Checks whether the given task is dying or exiting and likely to
- * release its address space. This means that all threads and processes
+ * Determine whether the given task should be reaped based on
+ * whether it is dying or exiting and likely to release its
+ * address space. This means that all threads and processes
  * sharing the same mm have to be killed or exiting.
  * Caller has to make sure that task->mm is stable (hold task_lock or
  * it operates on the current).
  */
-static bool task_will_free_mem(struct task_struct *task)
+static bool should_reap_task(struct task_struct *task, bool *delay_reap)
 {
 	struct mm_struct *mm = task->mm;
 	struct task_struct *p;
 	bool ret = true;
+	bool delay;
 
 	/*
 	 * Skip tasks without mm because it might have passed its exit_mm and
@@ -888,6 +901,8 @@ static bool task_will_free_mem(struct task_struct *task)
 	if (!__task_will_free_mem(task))
 		return false;
 
+	delay = should_delay_oom_reap(task);
+
 	/*
 	 * This task has already been drained by the oom reaper so there are
 	 * only small chances it will free some more
@@ -912,8 +927,11 @@ static bool task_will_free_mem(struct task_struct *task)
 		ret = __task_will_free_mem(p);
 		if (!ret)
 			break;
+		if (!delay)
+			delay = should_delay_oom_reap(p);
 	}
 	rcu_read_unlock();
+	*delay_reap = delay;
 
 	return ret;
 }
@@ -923,6 +941,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	struct task_struct *p;
 	struct mm_struct *mm;
 	bool can_oom_reap = true;
+	bool delay_reap;
 
 	p = find_lock_task_mm(victim);
 	if (!p) {
@@ -959,6 +978,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 		from_kuid(&init_user_ns, task_uid(victim)),
 		mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
 	task_unlock(victim);
+	delay_reap = should_delay_oom_reap(victim);
 
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
@@ -990,11 +1010,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
 		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
+		if (!delay_reap)
+			delay_reap = should_delay_oom_reap(p);
 	}
 	rcu_read_unlock();
 
 	if (can_oom_reap)
-		queue_oom_reaper(victim);
+		queue_oom_reaper(victim, delay_reap);
 
 	mmdrop(mm);
 	put_task_struct(victim);
@@ -1020,6 +1042,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	struct mem_cgroup *oom_group;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool delay_reap = false;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -1027,9 +1050,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 * so it can die quickly
 	 */
 	task_lock(victim);
-	if (task_will_free_mem(victim)) {
+	if (should_reap_task(victim, &delay_reap)) {
 		mark_oom_victim(victim);
-		queue_oom_reaper(victim);
+		queue_oom_reaper(victim, delay_reap);
 		task_unlock(victim);
 		put_task_struct(victim);
 		return;
@@ -1112,6 +1135,7 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 bool out_of_memory(struct oom_control *oc)
 {
 	unsigned long freed = 0;
+	bool delay_reap = false;
 
 	if (oom_killer_disabled)
 		return false;
@@ -1128,9 +1152,9 @@ bool out_of_memory(struct oom_control *oc)
 	 * select it.  The goal is to allow it to allocate so that it may
 	 * quickly exit and free its memory.
 	 */
-	if (task_will_free_mem(current)) {
+	if (should_reap_task(current, &delay_reap)) {
 		mark_oom_victim(current);
-		queue_oom_reaper(current);
+		queue_oom_reaper(current, delay_reap);
 		return true;
 	}
 
@@ -1209,6 +1233,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	struct task_struct *p;
 	unsigned int f_flags;
 	bool reap = false;
+	bool delay_reap = false;
 	long ret = 0;
 
 	if (flags)
@@ -1231,7 +1256,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	mm = p->mm;
 	mmgrab(mm);
 
-	if (task_will_free_mem(p))
+	if (should_reap_task(p, &delay_reap))
 		reap = true;
 	else {
 		/* Error only if the work has not been done already */
@@ -1240,7 +1265,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	}
 	task_unlock(p);
 
-	if (!reap)
+	if (!reap || delay_reap)
 		goto drop_mm;
 
 	if (mmap_read_lock_killable(mm)) {
-- 
2.17.1




* [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-14 13:55 [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
  2025-08-14 13:55 ` [PATCH v4 1/3] futex: Introduce function process_has_robust_futex() zhongjinji
  2025-08-14 13:55 ` [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
@ 2025-08-14 13:55 ` zhongjinji
  2025-08-14 23:09   ` Andrew Morton
                     ` (2 more replies)
  2025-08-14 23:13 ` [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes Andrew Morton
  3 siblings, 3 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-14 13:55 UTC (permalink / raw)
  To: linux-mm
  Cc: akpm, mhocko, rientjes, shakeel.butt, npache, linux-kernel, tglx,
	mingo, peterz, dvhart, dave, andrealmeid, liam.howlett, liulu.liu,
	feng.han, zhongjinji

From: zhongjinji <zhongjinji@honor.com>

When a process is OOM killed, if the OOM reaper and the thread running
exit_mmap() execute at the same time, both will traverse the vma's maple
tree along the same path. They may easily unmap the same vma, causing them
to compete for the pte spinlock. This increases unnecessary load, causing
the execution time of the OOM reaper and the thread running exit_mmap() to
increase.

When a process exits, exit_mmap() traverses the vma's maple tree from low to high
address. To reduce the chance of unmapping the same vma simultaneously,
the OOM reaper should traverse vma's tree from high to low address. This reduces
lock contention when unmapping the same vma.
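
Illustratively (a sketch, not actual code), with four vmas A-D:

	exit_mmap():  A -> B -> C -> D   (low address to high)
	oom_reaper:   D -> C -> B -> A   (high address to low)

The two walkers then meet at most once in the middle instead of contending
on the same vma at every step.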

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 include/linux/mm.h | 3 +++
 mm/oom_kill.c      | 9 +++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0c44bb8ce544..b665ea3c30eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
 #define for_each_vma_range(__vmi, __vma, __end)				\
 	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
 
+#define for_each_vma_reverse(__vmi, __vma)					\
+	while (((__vma) = vma_prev(&(__vmi))) != NULL)
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ae4001e47c1..602d6836098a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	VMA_ITERATOR(vmi, mm, ULONG_MAX);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend for the
+	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
+	 * the vma maple tree in reverse order.
+	 */
+	for_each_vma_reverse(vmi, vma) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
-- 
2.17.1




* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-14 13:55 ` [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders zhongjinji
@ 2025-08-14 23:09   ` Andrew Morton
  2025-08-15 16:32     ` zhongjinji
  2025-08-15 17:52     ` gio
  2025-08-15 14:29   ` Lorenzo Stoakes
  2025-08-15 14:41   ` Liam R. Howlett
  2 siblings, 2 replies; 26+ messages in thread
From: Andrew Morton @ 2025-08-14 23:09 UTC (permalink / raw)
  To: zhongjinji
  Cc: linux-mm, mhocko, rientjes, shakeel.butt, npache, linux-kernel,
	tglx, mingo, peterz, dvhart, dave, andrealmeid, liam.howlett,
	liulu.liu, feng.han

On Thu, 14 Aug 2025 21:55:55 +0800 <zhongjinji@honor.com> wrote:

> When a process is OOM killed, if the OOM reaper and the thread running
> exit_mmap() execute at the same time, both will traverse the vma's maple
> tree along the same path. They may easily unmap the same vma, causing them
> to compete for the pte spinlock. This increases unnecessary load, causing
> the execution time of the OOM reaper and the thread running exit_mmap() to
> increase.

Please tell me what I'm missing here.

OOM kills are a rare event.  And this race sounds like it will rarely
occur even if an oom-killing is happening.  And the delay will be
relatively short.

If I'm correct then we're addressing rare*rare*small, so why bother?

> When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> address. To reduce the chance of unmapping the same vma simultaneously,
> the OOM reaper should traverse vma's tree from high to low address. This reduces
> lock contention when unmapping the same vma.

Sharing some before-and-after runtime measurements would be useful.  Or
at least, detailed anecdotes.




* Re: [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-14 13:55 [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
                   ` (2 preceding siblings ...)
  2025-08-14 13:55 ` [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders zhongjinji
@ 2025-08-14 23:13 ` Andrew Morton
  2025-08-15 17:06   ` zhongjinji
  3 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2025-08-14 23:13 UTC (permalink / raw)
  To: zhongjinji
  Cc: linux-mm, mhocko, rientjes, shakeel.butt, npache, linux-kernel,
	tglx, mingo, peterz, dvhart, dave, andrealmeid, liam.howlett,
	liulu.liu, feng.han, Joel Savitz, Thomas Gleixner

On Thu, 14 Aug 2025 21:55:52 +0800 <zhongjinji@honor.com> wrote:

> The OOM reaper quickly reclaims a process's memory when the system hits OOM,
> helping the system recover. Without the OOM reaper, if a process frozen by
> cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
> system in a poor state. Even if the process is not frozen by cgroup v1,
> reclaiming victims' memory remains important, as having one more process
> working speeds up memory release.
> 
> When processes holding robust futexes are OOM killed but waiters on those
> futexes remain alive, the robust futexes might be reaped before
> futex_cleanup() runs. This can cause the waiters to block indefinitely [1].
> 
> To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1]. Since
> many killed processes exit within 2 seconds, the OOM reaper rarely runs after
> this delay. However, robust futex users are few, so delaying OOM reap for all
> victims is unnecessary.
> 
> If each thread's robust_list in a process is NULL, the process holds no robust
> futexes. For such processes, the OOM reaper should not be delayed. For
> processes holding robust futexes, to avoid issue [1], the OOM reaper must
> still be delayed.
> 
> Patch 1 introduces process_has_robust_futex() to detect whether a process uses
> robust futexes. Patch 2 delays the OOM reaper only for processes holding robust
> futexes, improving OOM reaper performance. Patch 3 makes the OOM reaper and
> exit_mmap() traverse the maple tree in opposite orders to reduce PTE lock
> contention caused by unmapping the same vma.

This all sounds sensible, given that we appear to be stuck with the
2-second hack.

What prevents one of the process's threads from creating a robust mutex
after we've inspected it with process_has_robust_futex()?



* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-14 13:55 ` [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders zhongjinji
  2025-08-14 23:09   ` Andrew Morton
@ 2025-08-15 14:29   ` Lorenzo Stoakes
  2025-08-15 15:01     ` Lorenzo Stoakes
                       ` (2 more replies)
  2025-08-15 14:41   ` Liam R. Howlett
  2 siblings, 3 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-15 14:29 UTC (permalink / raw)
  To: zhongjinji
  Cc: linux-mm, akpm, mhocko, rientjes, shakeel.butt, npache,
	linux-kernel, tglx, mingo, peterz, dvhart, dave, andrealmeid,
	liam.howlett, liulu.liu, feng.han

On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> From: zhongjinji <zhongjinji@honor.com>
>
> When a process is OOM killed, if the OOM reaper and the thread running
> exit_mmap() execute at the same time, both will traverse the vma's maple
> tree along the same path. They may easily unmap the same vma, causing them
> to compete for the pte spinlock. This increases unnecessary load, causing
> the execution time of the OOM reaper and the thread running exit_mmap() to
> increase.

You're not giving any numbers, and this seems pretty niche, you really
exiting that many processes with the reaper running at the exact same time
that this is an issue? Waiting on a spinlock also?

This commit message is very unconvincing.

>
> When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> address. To reduce the chance of unmapping the same vma simultaneously,
> the OOM reaper should traverse vma's tree from high to low address. This reduces
> lock contention when unmapping the same vma.

Are they going to run through and do their work in exactly the same time,
or might one 'run past' the other and you still have an issue?

Seems very vague and timing dependent and again, not convincing.

>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/oom_kill.c      | 9 +++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0c44bb8ce544..b665ea3c30eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
>  #define for_each_vma_range(__vmi, __vma, __end)				\
>  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
>
> +#define for_each_vma_reverse(__vmi, __vma)					\
> +	while (((__vma) = vma_prev(&(__vmi))) != NULL)

Please don't casually add an undocumented public VMA iterator hidden in a
patch doing something else :)

Won't this skip the first VMA? Not sure this is really worth having as a
general thing anyway, it's not many people who want to do this in reverse.

> +
>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ae4001e47c1..602d6836098a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
>
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend for the
> +	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
> +	 * the vma maple tree in reverse order.
> +	 */

Except you won't necessarily avoid anything, as if one walker is faster
than the other they'll run ahead, plus of course they'll have a cross-over
where they share the same PTE anyway.

I feel like maybe you've got a fairly specific situation that indicates an
issue elsewhere and you're maybe solving the wrong problem here?

> +	for_each_vma_reverse(vmi, vma) {
>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
>
>



* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-14 13:55 ` [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders zhongjinji
  2025-08-14 23:09   ` Andrew Morton
  2025-08-15 14:29   ` Lorenzo Stoakes
@ 2025-08-15 14:41   ` Liam R. Howlett
  2025-08-15 16:05     ` Liam R. Howlett
  2 siblings, 1 reply; 26+ messages in thread
From: Liam R. Howlett @ 2025-08-15 14:41 UTC (permalink / raw)
  To: zhongjinji
  Cc: linux-mm, akpm, mhocko, rientjes, shakeel.butt, npache,
	linux-kernel, tglx, mingo, peterz, dvhart, dave, andrealmeid,
	liulu.liu, feng.han

* zhongjinji@honor.com <zhongjinji@honor.com> [250814 09:56]:
> From: zhongjinji <zhongjinji@honor.com>
> 
> When a process is OOM killed, if the OOM reaper and the thread running
> exit_mmap() execute at the same time, both will traverse the vma's maple
> tree along the same path. They may easily unmap the same vma, causing them
> to compete for the pte spinlock. This increases unnecessary load, causing
> the execution time of the OOM reaper and the thread running exit_mmap() to
> increase.
> 
> When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> address. To reduce the chance of unmapping the same vma simultaneously,
> the OOM reaper should traverse vma's tree from high to low address. This reduces
> lock contention when unmapping the same vma.
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  include/linux/mm.h | 3 +++
>  mm/oom_kill.c      | 9 +++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0c44bb8ce544..b665ea3c30eb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
>  #define for_each_vma_range(__vmi, __vma, __end)				\
>  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
>  
> +#define for_each_vma_reverse(__vmi, __vma)					\
> +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> +

This does not do what you think it does, nor does it do what others
will think it will do.  It's not the opposite of the
for_each_vma_range() above.

vma_find() calls mas_find(), which has a different meaning than
mas_next().  mas_find()'s behaviour is a hold-over from the vma_find()
of yesteryears: it will find the first entry at the address (if it's the
first time called) or the entry after it.

mas_prev() is trying to replace the linked list behaviour of "go to the
previous one", so it'll walk to the index you specified and go to the
previous one.  It will skip the index you passed in regardless of its
existence or not.
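
Sketching the difference (illustrative only, assuming a hypothetical mm
whose highest vma runs all the way up to ULONG_MAX):

	VMA_ITERATOR(vmi, mm, 0);
	for_each_vma(vmi, vma) {
		/* vma_find()/mas_find(): the first call returns the vma
		 * at or after address 0, so the lowest vma is visited. */
	}

	VMA_ITERATOR(vmi, mm, ULONG_MAX);
	for_each_vma_reverse(vmi, vma) {
		/* vma_prev()/mas_prev(): each call returns the entry
		 * strictly below the current index, so a vma containing
		 * ULONG_MAX would never be visited. */
	}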

So what you have here is a broken interface, you just don't see it with
your code because you don't happen to have a mapping at ULONG_MAX.

This should not be merged as-is.

Also, there was zero mention of the new interface in the subject so I
almost missed this being added.

>  #ifdef CONFIG_SHMEM
>  /*
>   * The vma_is_shmem is not inline because it is used only by slow
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 7ae4001e47c1..602d6836098a 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
>  
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend for the
> +	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
> +	 * the vma maple tree in reverse order.
> +	 */
> +	for_each_vma_reverse(vmi, vma) {

How is this possible?  Both need the same lock..?  What part of
exit_mmap() will race here?

Why aren't we using the MMF_UNSTABLE flag set above to avoid it?  Or the
MMF_OOM_SKIP?

>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>  
> -- 
> 2.17.1
> 



* Re: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-14 13:55 ` [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
@ 2025-08-15 14:41   ` Lorenzo Stoakes
  2025-08-18 14:14     ` zhongjinji
  2025-08-17 19:37   ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-15 14:41 UTC (permalink / raw)
  To: zhongjinji
  Cc: linux-mm, akpm, mhocko, rientjes, shakeel.butt, npache,
	linux-kernel, tglx, mingo, peterz, dvhart, dave, andrealmeid,
	liam.howlett, liulu.liu, feng.han

On Thu, Aug 14, 2025 at 09:55:54PM +0800, zhongjinji@honor.com wrote:
> From: zhongjinji <zhongjinji@honor.com>
>
> The OOM reaper can quickly reap a process's memory when the system encounters
> OOM, helping the system recover. Without the OOM reaper, if a process frozen
> by cgroup v1 is OOM killed, the victim's memory cannot be freed, and the
> system stays in a poor state. Even if the process is not frozen by cgroup v1,
> reaping the victim's memory is still worthwhile, because one more process
> working on reclaim speeds up memory release.
>
> When processes holding robust futexes are OOM killed but waiters on those
> futexes remain alive, the robust futexes might be reaped before
> futex_cleanup() runs. This can cause the waiters to block indefinitely.
> To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
> The OOM reaper now rarely runs since many killed processes exit within 2
> seconds.

God, I really don't love that that got merged. So arbitrary. Are futexes really
this broken?

>
> Because robust futex users are few, it is unreasonable to delay the OOM reap
> for all victims. For processes that do not hold robust futexes, the OOM reaper
> should not be delayed; for processes holding robust futexes, the OOM reaper
> must still be delayed to prevent the waiters from blocking indefinitely [1].

I really hate that we do this :/

>
> Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>



> ---
>  mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 38 insertions(+), 13 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..7ae4001e47c1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -39,6 +39,7 @@
>  #include <linux/ptrace.h>
>  #include <linux/freezer.h>
>  #include <linux/ftrace.h>
> +#include <linux/futex.h>
>  #include <linux/ratelimit.h>
>  #include <linux/kthread.h>
>  #include <linux/init.h>
> @@ -692,7 +693,7 @@ static void wake_oom_reaper(struct timer_list *timer)
>   * before the exit path is able to wake the futex waiters.
>   */
>  #define OOM_REAPER_DELAY (2*HZ)
> -static void queue_oom_reaper(struct task_struct *tsk)
> +static void queue_oom_reaper(struct task_struct *tsk, bool delay)
>  {
>  	/* mm is already queued? */
>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
> @@ -700,7 +701,7 @@ static void queue_oom_reaper(struct task_struct *tsk)
>
>  	get_task_struct(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);

Since this is predicated on the task_struct you have here, can we avoid all this
horrible threading of 'delay' and just check here?
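
Something like the below, I mean (untested sketch; for simplicity it only
checks the queued task's process, not every other process sharing the mm):

	static void queue_oom_reaper(struct task_struct *tsk)
	{
		unsigned long delay;

		/* mm is already queued? */
		if (test_and_set_bit(MMF_OOM_REAP_QUEUED,
				     &tsk->signal->oom_mm->flags))
			return;

		/* Decide here rather than threading a bool through every
		 * caller; rechecking the robust list on this cold path is
		 * cheap. */
		delay = should_delay_oom_reap(tsk) ? OOM_REAPER_DELAY : 0;

		get_task_struct(tsk);
		timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
		tsk->oom_reaper_timer.expires = jiffies + delay;
		add_timer(&tsk->oom_reaper_timer);
	}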

>  	add_timer(&tsk->oom_reaper_timer);
>  }
>
> @@ -742,7 +743,7 @@ static int __init oom_init(void)
>  }
>  subsys_initcall(oom_init)
>  #else
> -static inline void queue_oom_reaper(struct task_struct *tsk)
> +static inline void queue_oom_reaper(struct task_struct *tsk, bool delay)
>  {
>  }
>  #endif /* CONFIG_MMU */
> @@ -843,6 +844,16 @@ bool oom_killer_disable(signed long timeout)
>  	return true;
>  }
>
> +/*
> + * If the owner thread of robust futexes is killed by OOM, the robust futexes might be freed
> + * by the OOM reaper before futex_cleanup() runs, which could cause the waiters to
> + * block indefinitely. So when the task holds robust futexes, delay the OOM reaper.
> + */
> +static inline bool should_delay_oom_reap(struct task_struct *task)
> +{
> +	return process_has_robust_futex(task);
> +}
> +
>  static inline bool __task_will_free_mem(struct task_struct *task)
>  {
>  	struct signal_struct *sig = task->signal;
> @@ -865,17 +876,19 @@ static inline bool __task_will_free_mem(struct task_struct *task)
>  }
>
>  /*
> - * Checks whether the given task is dying or exiting and likely to
> - * release its address space. This means that all threads and processes
> + * Determine whether the given task should be reaped based on
> + * whether it is dying or exiting and likely to release its
> + * address space. This means that all threads and processes
>   * sharing the same mm have to be killed or exiting.
>   * Caller has to make sure that task->mm is stable (hold task_lock or
>   * it operates on the current).
>   */
> -static bool task_will_free_mem(struct task_struct *task)
> +static bool should_reap_task(struct task_struct *task, bool *delay_reap)
>  {
>  	struct mm_struct *mm = task->mm;
>  	struct task_struct *p;
>  	bool ret = true;
> +	bool delay;
>
>  	/*
>  	 * Skip tasks without mm because it might have passed its exit_mm and
> @@ -888,6 +901,8 @@ static bool task_will_free_mem(struct task_struct *task)
>  	if (!__task_will_free_mem(task))
>  		return false;
>
> +	delay = should_delay_oom_reap(task);
> +
>  	/*
>  	 * This task has already been drained by the oom reaper so there are
>  	 * only small chances it will free some more
> @@ -912,8 +927,11 @@ static bool task_will_free_mem(struct task_struct *task)
>  		ret = __task_will_free_mem(p);
>  		if (!ret)
>  			break;
> +		if (!delay)
> +			delay = should_delay_oom_reap(p);
>  	}
>  	rcu_read_unlock();
> +	*delay_reap = delay;
>
>  	return ret;
>  }
> @@ -923,6 +941,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  	struct task_struct *p;
>  	struct mm_struct *mm;
>  	bool can_oom_reap = true;
> +	bool delay_reap;
>
>  	p = find_lock_task_mm(victim);
>  	if (!p) {
> @@ -959,6 +978,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  		from_kuid(&init_user_ns, task_uid(victim)),
>  		mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
>  	task_unlock(victim);
> +	delay_reap = should_delay_oom_reap(victim);
>

Yeah I really think we can just simplify by testing this at the point where we
decide whether or not to do the horrible 2s thing.

Let's not try to 'generalise' this just yet.

>  	/*
>  	 * Kill all user processes sharing victim->mm in other thread groups, if
> @@ -990,11 +1010,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  		if (unlikely(p->flags & PF_KTHREAD))
>  			continue;
>  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
> +		if (!delay_reap)
> +			delay_reap = should_delay_oom_reap(p);
>  	}
>  	rcu_read_unlock();
>
>  	if (can_oom_reap)
> -		queue_oom_reaper(victim);
> +		queue_oom_reaper(victim, delay_reap);
>
>  	mmdrop(mm);
>  	put_task_struct(victim);
> @@ -1020,6 +1042,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	struct mem_cgroup *oom_group;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> +	bool delay_reap = false;
>
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -1027,9 +1050,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	 * so it can die quickly
>  	 */
>  	task_lock(victim);
> -	if (task_will_free_mem(victim)) {
> +	if (should_reap_task(victim, &delay_reap)) {
>  		mark_oom_victim(victim);
> -		queue_oom_reaper(victim);
> +		queue_oom_reaper(victim, delay_reap);
>  		task_unlock(victim);
>  		put_task_struct(victim);
>  		return;
> @@ -1112,6 +1135,7 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
>  bool out_of_memory(struct oom_control *oc)
>  {
>  	unsigned long freed = 0;
> +	bool delay_reap = false;
>
>  	if (oom_killer_disabled)
>  		return false;
> @@ -1128,9 +1152,9 @@ bool out_of_memory(struct oom_control *oc)
>  	 * select it.  The goal is to allow it to allocate so that it may
>  	 * quickly exit and free its memory.
>  	 */
> -	if (task_will_free_mem(current)) {
> +	if (should_reap_task(current, &delay_reap)) {
>  		mark_oom_victim(current);
> -		queue_oom_reaper(current);
> +		queue_oom_reaper(current, delay_reap);
>  		return true;
>  	}
>
> @@ -1209,6 +1233,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	struct task_struct *p;
>  	unsigned int f_flags;
>  	bool reap = false;
> +	bool delay_reap = false;
>  	long ret = 0;
>
>  	if (flags)
> @@ -1231,7 +1256,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	mm = p->mm;
>  	mmgrab(mm);
>
> -	if (task_will_free_mem(p))
> +	if (should_reap_task(p, &delay_reap))

You can figure out whether you delayed or not from the task again, right? No need
to thread this.

I don't think it's a big deal to check twice, we are in the OOM code path which
is not (or should not be... :) a hot path so we're good.

>  		reap = true;
>  	else {
>  		/* Error only if the work has not been done already */
> @@ -1240,7 +1265,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
>  	}
>  	task_unlock(p);
>
> -	if (!reap)
> +	if (!reap || delay_reap)
>  		goto drop_mm;
>
>  	if (mmap_read_lock_killable(mm)) {
> --
> 2.17.1
>
>



* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-15 14:29   ` Lorenzo Stoakes
@ 2025-08-15 15:01     ` Lorenzo Stoakes
  2025-08-15 17:37     ` zhongjinji
  2025-08-19 15:18     ` zhongjinji
  2 siblings, 0 replies; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-15 15:01 UTC (permalink / raw)
  To: zhongjinji
  Cc: linux-mm, akpm, mhocko, rientjes, shakeel.butt, npache,
	linux-kernel, tglx, mingo, peterz, dvhart, dave, andrealmeid,
	liam.howlett, liulu.liu, feng.han

On Fri, Aug 15, 2025 at 03:29:24PM +0100, Lorenzo Stoakes wrote:
> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, if the OOM reaper and the thread running
> > exit_mmap() execute at the same time, both will traverse the vma's maple
> > tree along the same path. They may easily unmap the same vma, causing them
> > to compete for the pte spinlock. This increases unnecessary load, causing
> > the execution time of the OOM reaper and the thread running exit_mmap() to
> > increase.
>
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same time
> that this is an issue? Waiting on a spinlock also?
>
> This commit message is very unconvincing.
>
> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> > address. To reduce the chance of unmapping the same vma simultaneously,
> > the OOM reaper should traverse vma's tree from high to low address. This reduces
> > lock contention when unmapping the same vma.
>
> Are they going to run through and do their work in exactly the same time,
> or might one 'run past' the other and you still have an issue?
>
> Seems very vague and timing dependent and again, not convincing.
>
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end)				\
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma)					\
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
>
> Please don't casually add an undocumented public VMA iterator hidden in a
> patch doing something else :)
>
> Won't this skip the first VMA? Not sure this is really worth having as a
> general thing anyway, it's not many people who want to do this in reverse.
>
> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend for the
> > +	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
> > +	 * the vma maple tree in reverse order.
> > +	 */
>
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a cross-over
> where they share the same PTE anyway.

OK, I guess what is happening here is that very likely one task will be faster,
the other slower, but like a slow train ahead of a fast one on a single line, if
it happens to get the lock first it'll hold up the other one over and over again
if the same PTEs are traversed.

Still, again, this is super timing-dependent; it still feels like the wrong
solution and something of a hack, and it really needs to be backed/explained
more thoroughly.

The remaining comments still apply.

>
> I feel like maybe you've got a fairly specific situation that indicates an
> issue elsewhere and you're maybe solving the wrong problem here?
>
> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >



* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-15 14:41   ` Liam R. Howlett
@ 2025-08-15 16:05     ` Liam R. Howlett
  0 siblings, 0 replies; 26+ messages in thread
From: Liam R. Howlett @ 2025-08-15 16:05 UTC (permalink / raw)
  To: zhongjinji, linux-mm, akpm, mhocko, rientjes, shakeel.butt,
	npache, linux-kernel, tglx, mingo, peterz, dvhart, dave,
	andrealmeid, liulu.liu, feng.han

* Liam R. Howlett <Liam.Howlett@oracle.com> [250815 10:41]:
> * zhongjinji@honor.com <zhongjinji@honor.com> [250814 09:56]:
> > From: zhongjinji <zhongjinji@honor.com>
...

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >  
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >  
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend for the
> > +	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
> > +	 * the vma maple tree in reverse order.
> > +	 */
> > +	for_each_vma_reverse(vmi, vma) {
> 
> How is this possible?  Both need the same lock..?  What part of
> exit_mmap() will race here?

I see, exit_mmap() and the OOM reaper both use unmap_page_range() under the
mmap read lock, so they can race, but since they'll contend on the pte
lock it doesn't really matter.

This is so rare, I don't think this is worth fixing.

Thanks,
Liam



* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-14 23:09   ` Andrew Morton
@ 2025-08-15 16:32     ` zhongjinji
  2025-08-15 17:52     ` gio
  1 sibling, 0 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-15 16:32 UTC (permalink / raw)
  To: akpm
  Cc: andrealmeid, dave, dvhart, feng.han, liam.howlett, linux-kernel,
	linux-mm, liulu.liu, mhocko, mingo, npache, peterz, rientjes,
	shakeel.butt, tglx, zhongjinji

On Thu, 14 Aug 2025 21:55:55 +0800 <zhongjinji@honor.com> wrote:

> > When a process is OOM killed, if the OOM reaper and the thread running
> > exit_mmap() execute at the same time, both will traverse the vma's maple
> > tree along the same path. They may easily unmap the same vma, causing them
> > to compete for the pte spinlock. This increases unnecessary load, causing
> > the execution time of the OOM reaper and the thread running exit_mmap() to
> > increase.
> 
> Please tell me what I'm missing here.
> 
> OOM kills are a rare event.  And this race sounds like it will rarely
> occur even if an oom-killing is happening.  And the delay will be
> relatively short.
> 
> If I'm correct then we're addressing rare*rare*small, so why bother?

When apps consume large amounts of memory, encountering OOM on low-memory
Android devices is not uncommon. On Android, programs like lmkd (a user-space
daemon in the Android system) also call process_mrelease() to reap a victim's
memory when an app is killed, as sketched below.
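
For reference, that reap path looks roughly like this from userspace
(sketch only; the syscall numbers are the asm-generic values, and error
handling is trimmed):

	#include <signal.h>
	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <unistd.h>

	#ifndef __NR_pidfd_open
	#define __NR_pidfd_open 434
	#endif
	#ifndef __NR_process_mrelease
	#define __NR_process_mrelease 448
	#endif

	/* Kill a process, then ask the kernel to reap its address space
	 * right away instead of waiting for the victim's own exit path. */
	static int kill_and_reap(pid_t pid)
	{
		int pidfd = syscall(__NR_pidfd_open, pid, 0);

		if (pidfd < 0)
			return -1;
		kill(pid, SIGKILL);
		syscall(__NR_process_mrelease, pidfd, 0);
		close(pidfd);
		return 0;
	}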

> > When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> > address. To reduce the chance of unmapping the same vma simultaneously,
> > the OOM reaper should traverse vma's tree from high to low address. This reduces
> > lock contention when unmapping the same vma.
> 
> Sharing some before-and-after runtime measurements would be useful.  Or
> at least, detailed anecdotes.

Here is my test data on Android. The test process is as follows: start the
same app, then kill it, and finally capture the perfetto trace.
In the test, the OOM reaper is triggered by intercepting the kill signal and
actively adding the process to the OOM reaper queue, just as the OOM killer
does.

Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the same process,
and they are the last threads to exit.

Thread             TID         State        Wall duration (ms)
# with oom reaper, traversing in reverse order
#RxComputationT    13708       Running      60.690572
oom_reaper         81          Running      46.492032

# with oom reaper, traversing forward as before
vdp:vidtask:m      14040       Running      81.848297
oom_reaper         81          Running      69.32

# without oom reaper
tp-background      12424       Running      106.021874




* Re: [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-14 23:13 ` [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes Andrew Morton
@ 2025-08-15 17:06   ` zhongjinji
  0 siblings, 0 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-15 17:06 UTC (permalink / raw)
  To: akpm
  Cc: andrealmeid, dave, dvhart, feng.han, jsavitz, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mhocko, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx, zhongjinji

On Thu, 14 Aug 2025 21:55:52 +0800 <zhongjinji@honor.com> wrote:

> > The OOM reaper quickly reclaims a process's memory when the system hits OOM,
> > helping the system recover. Without the OOM reaper, if a process frozen by
> > cgroup v1 is OOM killed, the victim's memory cannot be freed, leaving the
> > system in a poor state. Even if the process is not frozen by cgroup v1,
> > reclaiming victims' memory remains important, as having one more process
> > working speeds up memory release.
> > 
> > When processes holding robust futexes are OOM killed but waiters on those
> > futexes remain alive, the robust futexes might be reaped before
> > futex_cleanup() runs. This can cause the waiters to block indefinitely [1].
> > 
> > To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1]. Since
> > many killed processes exit within 2 seconds, the OOM reaper rarely runs after
> > this delay. However, robust futex users are few, so delaying OOM reap for all
> > victims is unnecessary.
> > 
> > If each thread's robust_list in a process is NULL, the process holds no robust
> > futexes. For such processes, the OOM reaper should not be delayed. For
> > processes holding robust futexes, to avoid issue [1], the OOM reaper must
> > still be delayed.
> > 
> > Patch 1 introduces process_has_robust_futex() to detect whether a process uses
> > robust futexes. Patch 2 delays the OOM reaper only for processes holding robust
> > futexes, improving OOM reaper performance. Patch 3 makes the OOM reaper and
> > exit_mmap() traverse the maple tree in opposite orders to reduce PTE lock
> > contention caused by unmapping the same vma.
> 
> This all sounds sensible, given that we appear to be stuck with the
> 2-second hack.
> 
> What prevents one of the process's threads from creating a robust mutex
> after we've inspected it with process_has_robust_futex()?

Thank you, I didn't consider this situation.
Since process_has_robust_futex() is called after the kill signal is sent,
the process will already have the SIGNAL_GROUP_EXIT flag by the time
process_has_robust_futex() runs.
We can check whether task->signal->flags contains the SIGNAL_GROUP_EXIT
flag in set_robust_list() to ensure that a process which is already being
killed cannot register a new robust list.
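
Roughly like this (untested idea; the unlocked read of signal->flags
would need more thought):

	/* Hypothetical change to set_robust_list() in
	 * kernel/futex/syscalls.c; not part of this series. */
	SYSCALL_DEFINE2(set_robust_list, struct robust_list_head __user *, head,
			size_t, len)
	{
		if (unlikely(len != sizeof(*head)))
			return -EINVAL;

		/* Refuse to register a robust list once the whole thread
		 * group is already being killed, so the OOM path's check
		 * cannot be raced. */
		if (current->signal->flags & SIGNAL_GROUP_EXIT)
			return -EPERM;

		current->robust_list = head;
		return 0;
	}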




* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-15 14:29   ` Lorenzo Stoakes
  2025-08-15 15:01     ` Lorenzo Stoakes
@ 2025-08-15 17:37     ` zhongjinji
  2025-08-19 15:18     ` zhongjinji
  2 siblings, 0 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-15 17:37 UTC (permalink / raw)
  To: lorenzo.stoakes
  Cc: akpm, andrealmeid, dave, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mhocko, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx, zhongjinji

> 
> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, if the OOM reaper and the thread running
> > exit_mmap() execute at the same time, both will traverse the vma's maple
> > tree along the same path. They may easily unmap the same vma, causing them
> > to compete for the pte spinlock. This increases unnecessary load, causing
> > the execution time of the OOM reaper and the thread running exit_mmap() to
> > increase.
> 
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same time
> that this is an issue? Waiting on a spinlock also?
> 
> This commit message is very unconvincing.

Thank you, I will reconfirm this issue.

> 
> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> > address. To reduce the chance of unmapping the same vma simultaneously,
> > the OOM reaper should traverse vma's tree from high to low address. This reduces
> > lock contention when unmapping the same vma.
> 
> Are they going to run through and do their work in exactly the same time,
> or might one 'run past' the other and you still have an issue?
> 
> Seems very vague and timing dependent and again, not convincing.

Well, thank you. I should capture a perf trace for the OOM reaper, not perfetto.

> 
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end)				\
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma)					\
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> 
> Please don't casually add an undocumented public VMA iterator hidden in a
> patch doing something else :)

Sorry, I got it.

> 
> Won't this skip the first VMA? Not sure this is really worth having as a
> general thing anyway, it's not many people who want to do this in reverse.
> 
> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend for the
> > +	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
> > +	 * the vma maple tree in reverse order.
> > +	 */
> 
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a cross-over
> where they share the same PTE anyway.
> 
> I feel like maybe you've got a fairly specific situation that indicates an
> issue elsewhere and you're maybe solving the wrong problem here?

Thank you, I will reconfirm this issue.

> 
> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-14 23:09   ` Andrew Morton
  2025-08-15 16:32     ` zhongjinji
@ 2025-08-15 17:52     ` gio
  2025-08-15 17:53       ` gio
  1 sibling, 1 reply; 26+ messages in thread
From: gio @ 2025-08-15 17:52 UTC (permalink / raw)
  To: Andrew Morton, zhongjinji
  Cc: linux-mm, mhocko, rientjes, shakeel.butt, npache, linux-kernel,
	tglx, mingo, peterz, dvhart, dave, andrealmeid, liam.howlett,
	liulu.liu, feng.han

I believe the patch should be accepted. While the race condition might be rare in simple workloads, it can become significant in containerized systems or Android devices. The proposed solution is simple, low-risk, and directly addresses the identified problem with minimal code changes.

On August 15, 2025 3:09:14 AM GMT+04:00, Andrew Morton <akpm@linux-foundation.org> wrote:
>On Thu, 14 Aug 2025 21:55:55 +0800 <zhongjinji@honor.com> wrote:
>
>> When a process is OOM killed, if the OOM reaper and the thread running
>> exit_mmap() execute at the same time, both will traverse the vma's maple
>> tree along the same path. They may easily unmap the same vma, causing them
>> to compete for the pte spinlock. This increases unnecessary load, causing
>> the execution time of the OOM reaper and the thread running exit_mmap() to
>> increase.
>
>Please tell me what I'm missing here.
>
>OOM kills are a rare event.  And this race sounds like it will rarely
>occur even if an oom-killing is happening.  And the delay will be
>relatively short.
>
>If I'm correct then we're addressing rare*rare*small, so why bother?
>
>> When a process exits, exit_mmap() traverses the vma's maple tree from low to high
>> address. To reduce the chance of unmapping the same vma simultaneously,
>> the OOM reaper should traverse vma's tree from high to low address. This reduces
>> lock contention when unmapping the same vma.
>
>Sharing some before-and-after runtime measurements would be useful.  Or
>at least, detailed anecdotes.
>
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-15 17:52     ` gio
@ 2025-08-15 17:53       ` gio
  0 siblings, 0 replies; 26+ messages in thread
From: gio @ 2025-08-15 17:53 UTC (permalink / raw)
  To: Andrew Morton, zhongjinji
  Cc: linux-mm, mhocko, rientjes, shakeel.butt, npache, linux-kernel,
	tglx, mingo, peterz, dvhart, dave, andrealmeid, liam.howlett,
	liulu.liu, feng.han

I believe the patch should be accepted. While the race condition might be rare in simple workloads, it can become significant in containerized systems or Android devices. The proposed solution is simple, low-risk, and directly addresses the identified problem with minimal code changes.

On August 15, 2025 9:52:17 PM GMT+04:00, gio <giorgitchankvetadze1997@gmail.com> wrote:
>I believe the patch should be accepted. While the race condition might be rare in simple workloads, it can become significant in containerized systems or Android devices. The proposed solution is simple, low-risk, and directly addresses the identified problem with minimal code changes.
>
>On August 15, 2025 3:09:14 AM GMT+04:00, Andrew Morton <akpm@linux-foundation.org> wrote:
>>On Thu, 14 Aug 2025 21:55:55 +0800 <zhongjinji@honor.com> wrote:
>>
>>> When a process is OOM killed, if the OOM reaper and the thread running
>>> exit_mmap() execute at the same time, both will traverse the vma's maple
>>> tree along the same path. They may easily unmap the same vma, causing them
>>> to compete for the pte spinlock. This increases unnecessary load, causing
>>> the execution time of the OOM reaper and the thread running exit_mmap() to
>>> increase.
>>
>>Please tell me what I'm missing here.
>>
>>OOM kills are a rare event.  And this race sounds like it will rarely
>>occur even if an oom-killing is happening.  And the delay will be
>>relatively short.
>>
>>If I'm correct then we're addressing rare*rare*small, so why bother?
>>
>>> When a process exits, exit_mmap() traverses the vma's maple tree from low to high
>>> address. To reduce the chance of unmapping the same vma simultaneously,
>>> the OOM reaper should traverse vma's tree from high to low address. This reduces
>>> lock contention when unmapping the same vma.
>>
>>Sharing some before-and-after runtime measurements would be useful.  Or
>>at least, detailed anecdotes.
>>
>>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-14 13:55 ` [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
  2025-08-15 14:41   ` Lorenzo Stoakes
@ 2025-08-17 19:37   ` Michal Hocko
  2025-08-18 12:08     ` zhongjinji
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2025-08-17 19:37 UTC (permalink / raw)
  To: zhongjinji
  Cc: linux-mm, akpm, rientjes, shakeel.butt, npache, linux-kernel,
	tglx, mingo, peterz, dvhart, dave, andrealmeid, liam.howlett,
	liulu.liu, feng.han

On Thu 14-08-25 21:55:54, zhongjinji@honor.com wrote:
> From: zhongjinji <zhongjinji@honor.com>
> 
> The OOM reaper can quickly reap a process's memory when the system encounters
> OOM, helping the system recover. Without the OOM reaper, if a process frozen
> by cgroup v1 is OOM killed, the victims' memory cannot be freed, and the
> system stays in a poor state. Even if the process is not frozen by cgroup v1,
> reaping victims' memory is still meaningful, because having one more process
> working speeds up memory release.
> 
> When processes holding robust futexes are OOM killed but waiters on those
> futexes remain alive, the robust futexes might be reaped before
> futex_cleanup() runs. It would cause the waiters to block indefinitely.
> To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
> The OOM reaper now rarely runs since many killed processes exit within 2
> seconds.
> 
> Because robust futex users are few, it is unreasonable to delay OOM reap for
> all victims. For processes that do not hold robust futexes, the OOM reaper
> should not be delayed and for processes holding robust futexes, the OOM
> reaper must still be delayed to prevent the waiters to block indefinitely [1].
> 
> Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]

What has happened to
https://lore.kernel.org/all/aJGiHyTXS_BqxoK2@tiehlicka/T/#u ?

Generally speaking it would be great to provide a link to previous
versions of the patchset. I do not see v3 in my inbox (which is quite
messy ATM so I might have easily missed it).
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-17 19:37   ` Michal Hocko
@ 2025-08-18 12:08     ` zhongjinji
  2025-08-19 10:49       ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: zhongjinji @ 2025-08-18 12:08 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, andrealmeid, dave, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx, zhongjinji

> On Thu 14-08-25 21:55:54, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> > 
> > The OOM reaper can quickly reap a process's memory when the system encounters
> > OOM, helping the system recover. Without the OOM reaper, if a process frozen
> > by cgroup v1 is OOM killed, the victims' memory cannot be freed, and the
> > system stays in a poor state. Even if the process is not frozen by cgroup v1,
> > reaping victims' memory is still meaningful, because having one more process
> > working speeds up memory release.
> > 
> > When processes holding robust futexes are OOM killed but waiters on those
> > futexes remain alive, the robust futexes might be reaped before
> > futex_cleanup() runs. It would cause the waiters to block indefinitely.
> > To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
> > The OOM reaper now rarely runs since many killed processes exit within 2
> > seconds.
> > 
> > Because robust futex users are few, it is unreasonable to delay OOM reap for
> > all victims. For processes that do not hold robust futexes, the OOM reaper
> > should not be delayed and for processes holding robust futexes, the OOM
> > reaper must still be delayed to prevent the waiters to block indefinitely [1].
> > 
> > Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]
> 
> What has happened to
> https://lore.kernel.org/all/aJGiHyTXS_BqxoK2@tiehlicka/T/#u ?

If a process holding robust futexes gets frozen, robust futexes might be reaped before
futex_cleanup() runs when an OOM occurs. I am not sure if this will actually happen.

> 
> Generally speaking it would be great to provide a link to previous
> versions of the patchset. I do not see v3 in my inbox (which is quite
> messy ATM so I might have easily missed it).

This is version v3, where I mainly fixed the error in the Subject prefix,
changing it from futex to mm/oom_kill.

https://lore.kernel.org/all/20250804030341.18619-1-zhongjinji@honor.com/
https://lore.kernel.org/all/20250804030341.18619-2-zhongjinji@honor.com/

> -- 
> Michal Hocko
> SUSE Labs



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-15 14:41   ` Lorenzo Stoakes
@ 2025-08-18 14:14     ` zhongjinji
  0 siblings, 0 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-18 14:14 UTC (permalink / raw)
  To: lorenzo.stoakes
  Cc: akpm, andrealmeid, dave, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mhocko, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx, zhongjinji

> On Thu, Aug 14, 2025 at 09:55:54PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > The OOM reaper can quickly reap a process's memory when the system encounters
> > OOM, helping the system recover. Without the OOM reaper, if a process frozen
> > by cgroup v1 is OOM killed, the victims' memory cannot be freed, and the
> > system stays in a poor state. Even if the process is not frozen by cgroup v1,
> > reaping victims' memory is still meaningful, because having one more process
> > working speeds up memory release.
> >
> > When processes holding robust futexes are OOM killed but waiters on those
> > futexes remain alive, the robust futexes might be reaped before
> > futex_cleanup() runs. It would cause the waiters to block indefinitely.
> > To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
> > The OOM reaper now rarely runs since many killed processes exit within 2
> > seconds.
> 
> God, I really don't love that that got merged. So arbitrary. Are futexes really
> this broken?
> 
> >
> > Because robust futex users are few, it is unreasonable to delay OOM reap for
> > all victims. For processes that do not hold robust futexes, the OOM reaper
> > should not be delayed and for processes holding robust futexes, the OOM
> > reaper must still be delayed to prevent the waiters to block indefinitely [1].
> 
> I really hate that we do this :/
> 
> >
> > Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> 
> 
> 
> > ---
> >  mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 38 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 25923cfec9c6..7ae4001e47c1 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -39,6 +39,7 @@
> >  #include <linux/ptrace.h>
> >  #include <linux/freezer.h>
> >  #include <linux/ftrace.h>
> > +#include <linux/futex.h>
> >  #include <linux/ratelimit.h>
> >  #include <linux/kthread.h>
> >  #include <linux/init.h>
> > @@ -692,7 +693,7 @@ static void wake_oom_reaper(struct timer_list *timer)
> >   * before the exit path is able to wake the futex waiters.
> >   */
> >  #define OOM_REAPER_DELAY (2*HZ)
> > -static void queue_oom_reaper(struct task_struct *tsk)
> > +static void queue_oom_reaper(struct task_struct *tsk, bool delay)
> >  {
> >  	/* mm is already queued? */
> >  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
> > @@ -700,7 +701,7 @@ static void queue_oom_reaper(struct task_struct *tsk)
> >
> >  	get_task_struct(tsk);
> >  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> > -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> > +	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
> 
> Since this is predicated on the task_struct you have here, can we avoid all this
> horrible threading of 'delay' and just check here?

Yeah, that is great! I will update it in the next version.
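
Something along these lines, I assume (untested sketch; for brevity it
checks only the victim task, while the series also looks at the other
processes sharing the mm):

static void queue_oom_reaper(struct task_struct *tsk)
{
	unsigned long delay;

	/* mm is already queued? */
	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
		return;

	/* Decide the delay here instead of threading a flag through callers. */
	delay = should_delay_oom_reap(tsk) ? OOM_REAPER_DELAY : 0;

	get_task_struct(tsk);
	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
	tsk->oom_reaper_timer.expires = jiffies + delay;
	add_timer(&tsk->oom_reaper_timer);
}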

> 
> >  	add_timer(&tsk->oom_reaper_timer);
> >  }
> >
> > @@ -742,7 +743,7 @@ static int __init oom_init(void)
> >  }
> >  subsys_initcall(oom_init)
> >  #else
> > -static inline void queue_oom_reaper(struct task_struct *tsk)
> > +static inline void queue_oom_reaper(struct task_struct *tsk, bool delay)
> >  {
> >  }
> >  #endif /* CONFIG_MMU */
> > @@ -843,6 +844,16 @@ bool oom_killer_disable(signed long timeout)
> >  	return true;
> >  }
> >
> > +/*
> > + * If the owner thread of robust futexes is killed by OOM, the robust futexes might be freed
> > + * by the OOM reaper before futex_cleanup() runs, which could cause the waiters to
> > + * block indefinitely. So when the task hold robust futexes, delay oom reaper.
> > + */
> > +static inline bool should_delay_oom_reap(struct task_struct *task)
> > +{
> > +	return process_has_robust_futex(task);
> > +}
> > +
> >  static inline bool __task_will_free_mem(struct task_struct *task)
> >  {
> >  	struct signal_struct *sig = task->signal;
> > @@ -865,17 +876,19 @@ static inline bool __task_will_free_mem(struct task_struct *task)
> >  }
> >
> >  /*
> > - * Checks whether the given task is dying or exiting and likely to
> > - * release its address space. This means that all threads and processes
> > + * Determine whether the given task should be reaped based on
> > + * whether it is dying or exiting and likely to release its
> > + * address space. This means that all threads and processes
> >   * sharing the same mm have to be killed or exiting.
> >   * Caller has to make sure that task->mm is stable (hold task_lock or
> >   * it operates on the current).
> >   */
> > -static bool task_will_free_mem(struct task_struct *task)
> > +static bool should_reap_task(struct task_struct *task, bool *delay_reap)
> >  {
> >  	struct mm_struct *mm = task->mm;
> >  	struct task_struct *p;
> >  	bool ret = true;
> > +	bool delay;
> >
> >  	/*
> >  	 * Skip tasks without mm because it might have passed its exit_mm and
> > @@ -888,6 +901,8 @@ static bool task_will_free_mem(struct task_struct *task)
> >  	if (!__task_will_free_mem(task))
> >  		return false;
> >
> > +	delay = should_delay_oom_reap(task);
> > +
> >  	/*
> >  	 * This task has already been drained by the oom reaper so there are
> >  	 * only small chances it will free some more
> > @@ -912,8 +927,11 @@ static bool task_will_free_mem(struct task_struct *task)
> >  		ret = __task_will_free_mem(p);
> >  		if (!ret)
> >  			break;
> > +		if (!delay)
> > +			delay = should_delay_oom_reap(p);
> >  	}
> >  	rcu_read_unlock();
> > +	*delay_reap = delay;
> >
> >  	return ret;
> >  }
> > @@ -923,6 +941,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  	struct task_struct *p;
> >  	struct mm_struct *mm;
> >  	bool can_oom_reap = true;
> > +	bool delay_reap;
> >
> >  	p = find_lock_task_mm(victim);
> >  	if (!p) {
> > @@ -959,6 +978,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  		from_kuid(&init_user_ns, task_uid(victim)),
> >  		mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
> >  	task_unlock(victim);
> > +	delay_reap = should_delay_oom_reap(victim);
> >
> 
> Yeah I really think we can just simplify by testing this at the point where we
> decide whether or not to do the horrible 2s thing.
> 
> Let's not try to 'generalise' this just yet.
> 
> >  	/*
> >  	 * Kill all user processes sharing victim->mm in other thread groups, if
> > @@ -990,11 +1010,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  		if (unlikely(p->flags & PF_KTHREAD))
> >  			continue;
> >  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
> > +		if (!delay_reap)
> > +			delay_reap = should_delay_oom_reap(p);
> >  	}
> >  	rcu_read_unlock();
> >
> >  	if (can_oom_reap)
> > -		queue_oom_reaper(victim);
> > +		queue_oom_reaper(victim, delay_reap);
> >
> >  	mmdrop(mm);
> >  	put_task_struct(victim);
> > @@ -1020,6 +1042,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	struct mem_cgroup *oom_group;
> >  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> >  					      DEFAULT_RATELIMIT_BURST);
> > +	bool delay_reap = false;
> >
> >  	/*
> >  	 * If the task is already exiting, don't alarm the sysadmin or kill
> > @@ -1027,9 +1050,9 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  	 * so it can die quickly
> >  	 */
> >  	task_lock(victim);
> > -	if (task_will_free_mem(victim)) {
> > +	if (should_reap_task(victim, &delay_reap)) {
> >  		mark_oom_victim(victim);
> > -		queue_oom_reaper(victim);
> > +		queue_oom_reaper(victim, delay_reap);
> >  		task_unlock(victim);
> >  		put_task_struct(victim);
> >  		return;
> > @@ -1112,6 +1135,7 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
> >  bool out_of_memory(struct oom_control *oc)
> >  {
> >  	unsigned long freed = 0;
> > +	bool delay_reap = false;
> >
> >  	if (oom_killer_disabled)
> >  		return false;
> > @@ -1128,9 +1152,9 @@ bool out_of_memory(struct oom_control *oc)
> >  	 * select it.  The goal is to allow it to allocate so that it may
> >  	 * quickly exit and free its memory.
> >  	 */
> > -	if (task_will_free_mem(current)) {
> > +	if (should_reap_task(current, &delay_reap)) {
> >  		mark_oom_victim(current);
> > -		queue_oom_reaper(current);
> > +		queue_oom_reaper(current, delay_reap);
> >  		return true;
> >  	}
> >
> > @@ -1209,6 +1233,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	struct task_struct *p;
> >  	unsigned int f_flags;
> >  	bool reap = false;
> > +	bool delay_reap = false;
> >  	long ret = 0;
> >
> >  	if (flags)
> > @@ -1231,7 +1256,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	mm = p->mm;
> >  	mmgrab(mm);
> >
> > -	if (task_will_free_mem(p))
> > +	if (should_reap_task(p, &delay_reap))
> 
> You can figure out whether you dealyed or not from the task again right? No need
> to thread this.
> 
> I don't think it's a big deal to check twice, we are in the OOM code path which
> is not (or should not be... :) a hot path so we're good.
> 
> >  		reap = true;
> >  	else {
> >  		/* Error only if the work has not been done already */
> > @@ -1240,7 +1265,7 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> >  	}
> >  	task_unlock(p);
> >
> > -	if (!reap)
> > +	if (!reap || delay_reap)
> >  		goto drop_mm;
> >
> >  	if (mmap_read_lock_killable(mm)) {
> > --
> > 2.17.1
> >
> >



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-18 12:08     ` zhongjinji
@ 2025-08-19 10:49       ` Michal Hocko
  2025-08-20  2:53         ` Davidlohr Bueso
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2025-08-19 10:49 UTC (permalink / raw)
  To: zhongjinji
  Cc: akpm, andrealmeid, dave, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx

On Mon 18-08-25 20:08:19, zhongjinji wrote:
> > On Thu 14-08-25 21:55:54, zhongjinji@honor.com wrote:
> > > From: zhongjinji <zhongjinji@honor.com>
> > > 
> > > The OOM reaper can quickly reap a process's memory when the system encounters
> > > OOM, helping the system recover. Without the OOM reaper, if a process frozen
> > > by cgroup v1 is OOM killed, the victims' memory cannot be freed, and the
> > > system stays in a poor state. Even if the process is not frozen by cgroup v1,
> > > reaping victims' memory is still meaningful, because having one more process
> > > working speeds up memory release.
> > > 
> > > When processes holding robust futexes are OOM killed but waiters on those
> > > futexes remain alive, the robust futexes might be reaped before
> > > futex_cleanup() runs. It would cause the waiters to block indefinitely.
> > > To prevent this issue, the OOM reaper's work is delayed by 2 seconds [1].
> > > The OOM reaper now rarely runs since many killed processes exit within 2
> > > seconds.
> > > 
> > > Because robust futex users are few, it is unreasonable to delay OOM reap for
> > > all victims. For processes that do not hold robust futexes, the OOM reaper
> > > should not be delayed and for processes holding robust futexes, the OOM
> > > reaper must still be delayed to prevent the waiters to block indefinitely [1].
> > > 
> > > Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]
> > 
> > What has happened to
> > https://lore.kernel.org/all/aJGiHyTXS_BqxoK2@tiehlicka/T/#u ?
> 
> If a process holding robust futexes gets frozen, robust futexes might be reaped before
> futex_cleanup() runs when an OOM occurs. I am not sure if this will actually happen.

Yes, and a 2s delay will never rule that out, especially for frozen tasks,
which could stay frozen indefinitely. That is not the point I was trying
to make. I was suggesting not treating futexes specially because no matter
what we do this will always be racy, a hack that merely reduces the risk. We
simply cannot deal with that case more gracefully without major
surgery to the futex implementation, which is not desirable for this
specific reason.

So instead of checking for futexes, which Thomas was not happy about
either, let's just reap _frozen_/_freezing_ tasks right away, as that makes
at least some sense and it also handles your primary problem AFAIU.
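
For illustration, a minimal sketch of that idea (untested; it assumes the
frozen()/freezing() helpers from include/linux/freezer.h cover the cgroup
v1 freezer case discussed here):

static bool should_delay_oom_reap(struct task_struct *tsk)
{
	/*
	 * Reap frozen/freezing victims immediately, since their exit
	 * path cannot make progress anyway; everyone else keeps the
	 * delay.
	 */
	return !(frozen(tsk) || freezing(tsk));
}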

> > Generally speaking it would be great to provide a link to previous
> > versions of the patchset. I do not see v3 in my inbox (which is quite
> > messy ATM so I might have easily missed it).
> 
> This is version v3, where I mainly fixed the error in the Subject prefix,
> changing it from futex to mm/oom_kill.
> 
> https://lore.kernel.org/all/20250804030341.18619-1-zhongjinji@honor.com/
> https://lore.kernel.org/all/20250804030341.18619-2-zhongjinji@honor.com/

please always mention that in the cover letter.

Thanks.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-15 14:29   ` Lorenzo Stoakes
  2025-08-15 15:01     ` Lorenzo Stoakes
  2025-08-15 17:37     ` zhongjinji
@ 2025-08-19 15:18     ` zhongjinji
  2025-08-21  9:32       ` Lorenzo Stoakes
  2 siblings, 1 reply; 26+ messages in thread
From: zhongjinji @ 2025-08-19 15:18 UTC (permalink / raw)
  To: lorenzo.stoakes
  Cc: akpm, andrealmeid, dave, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mhocko, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx, zhongjinji

> On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > From: zhongjinji <zhongjinji@honor.com>
> >
> > When a process is OOM killed, if the OOM reaper and the thread running
> > exit_mmap() execute at the same time, both will traverse the vma's maple
> > tree along the same path. They may easily unmap the same vma, causing them
> > to compete for the pte spinlock. This increases unnecessary load, causing
> > the execution time of the OOM reaper and the thread running exit_mmap() to
> > increase.
> 
> You're not giving any numbers, and this seems pretty niche, you really
> exiting that many processes with the reaper running at the exact same time
> that this is an issue? Waiting on a spinlock also?
> 
> This commit message is very unconvincing.

Here is the perf data: the first trace is without this patch applied, and the
second is with it applied. It is clear that without this patch, contention
on the pte spinlock is very intense.

|--99.74%-- oom_reaper
|    |--76.67%-- unmap_page_range
|    |    |--33.70%-- __pte_offset_map_lock
|    |    |    |--98.46%-- _raw_spin_lock
|    |    |--27.61%-- free_swap_and_cache_nr
|    |    |--16.40%-- folio_remove_rmap_ptes
|    |    |--12.25%-- tlb_flush_mmu
|    |--12.61%-- tlb_finish_mmu


|--98.84%-- oom_reaper
|    |--53.45%-- unmap_page_range
|    |    |--24.29%-- [hit in function]
|    |    |--48.06%-- folio_remove_rmap_ptes
|    |    |--17.99%-- tlb_flush_mmu
|    |    |--1.72%-- __pte_offset_map_lock
|    |
|    |--30.43%-- tlb_finish_mmu

> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> > address. To reduce the chance of unmapping the same vma simultaneously,
> > the OOM reaper should traverse vma's tree from high to low address. This reduces
> > lock contention when unmapping the same vma.
> 
> Are they going to run through and do their work in exactly the same time,
> or might one 'run past' the other and you still have an issue?
> 
> Seems very vague and timing dependent and again, not convincing.
> 
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  include/linux/mm.h | 3 +++
> >  mm/oom_kill.c      | 9 +++++++--
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0c44bb8ce544..b665ea3c30eb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> >  #define for_each_vma_range(__vmi, __vma, __end)				\
> >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> >
> > +#define for_each_vma_reverse(__vmi, __vma)					\
> > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> 
> Please don't casually add an undocumented public VMA iterator hidden in a
> patch doing something else :)
> 
> Won't this skip the first VMA? Not sure this is really worth having as a
> general thing anyway, it's not many people who want to do this in reverse.

I got it. mas_find_rev() should be used instead of vma_prev().
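
An untested sketch of what I mean; mas_find_rev() is inclusive of the
starting index, so with VMA_ITERATOR(vmi, mm, ULONG_MAX) the highest VMA
is visited as well, unlike with vma_prev():

#define for_each_vma_reverse(__vmi, __vma)					\
	while (((__vma) = mas_find_rev(&(__vmi).mas, 0)) != NULL)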

> > +
> >  #ifdef CONFIG_SHMEM
> >  /*
> >   * The vma_is_shmem is not inline because it is used only by slow
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 7ae4001e47c1..602d6836098a 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend for the
> > +	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
> > +	 * the vma maple tree in reverse order.
> > +	 */
> 
> Except you won't necessarily avoid anything, as if one walker is faster
> than the other they'll run ahead, plus of course they'll have a cross-over
> where they share the same PTE anyway.
> 
> I feel like maybe you've got a fairly specific situation that indicates an
> issue elsewhere and you're maybe solving the wrong problem here?
> 
> > +	for_each_vma_reverse(vmi, vma) {
> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
> >


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-19 10:49       ` Michal Hocko
@ 2025-08-20  2:53         ` Davidlohr Bueso
  2025-08-21 18:13           ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Davidlohr Bueso @ 2025-08-20  2:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: zhongjinji, akpm, andrealmeid, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx

On Tue, 19 Aug 2025, Michal Hocko wrote:

>On Mon 18-08-25 20:08:19, zhongjinji wrote:
>> If a process holding robust futexes gets frozen, robust futexes might be reaped before
>> futex_cleanup() runs when an OOM occurs. I am not sure if this will actually happen.
>
>Yes, and a 2s delay will never rule that out, especially for frozen tasks,
>which could stay frozen indefinitely. That is not the point I was trying
>to make. I was suggesting not treating futexes specially because no matter
>what we do this will always be racy, a hack that merely reduces the risk. We
>simply cannot deal with that case more gracefully without major
>surgery to the futex implementation, which is not desirable for this
>specific reason.

Yeah, relying on time as a fix is never a good idea. I was going to suggest
skipping the reaping for tasks with a robust list, but that still requires
the racy check, and your suggested workaround seems more practical.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-19 15:18     ` zhongjinji
@ 2025-08-21  9:32       ` Lorenzo Stoakes
  2025-08-25 14:12         ` zhongjinji
  0 siblings, 1 reply; 26+ messages in thread
From: Lorenzo Stoakes @ 2025-08-21  9:32 UTC (permalink / raw)
  To: zhongjinji
  Cc: akpm, andrealmeid, dave, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mhocko, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx

On Tue, Aug 19, 2025 at 11:18:34PM +0800, zhongjinji wrote:
> > On Thu, Aug 14, 2025 at 09:55:55PM +0800, zhongjinji@honor.com wrote:
> > > From: zhongjinji <zhongjinji@honor.com>
> > >
> > > When a process is OOM killed, if the OOM reaper and the thread running
> > > exit_mmap() execute at the same time, both will traverse the vma's maple
> > > tree along the same path. They may easily unmap the same vma, causing them
> > > to compete for the pte spinlock. This increases unnecessary load, causing
> > > the execution time of the OOM reaper and the thread running exit_mmap() to
> > > increase.
> >
> > You're not giving any numbers, and this seems pretty niche, you really
> > exiting that many processes with the reaper running at the exact same time
> > that this is an issue? Waiting on a spinlock also?
> >
> > This commit message is very unconvincing.
>
> Here is the perf data: the first trace is without this patch applied, and the
> second is with it applied. It is clear that without this patch, contention
> on the pte spinlock is very intense.

>
> |--99.74%-- oom_reaper
> |    |--76.67%-- unmap_page_range
> |    |    |--33.70%-- __pte_offset_map_lock
> |    |    |    |--98.46%-- _raw_spin_lock
> |    |    |--27.61%-- free_swap_and_cache_nr
> |    |    |--16.40%-- folio_remove_rmap_ptes
> |    |    |--12.25%-- tlb_flush_mmu
> |    |--12.61%-- tlb_finish_mmu
>
>
> |--98.84%-- oom_reaper
> |    |--53.45%-- unmap_page_range
> |    |    |--24.29%-- [hit in function]
> |    |    |--48.06%-- folio_remove_rmap_ptes
> |    |    |--17.99%-- tlb_flush_mmu
> |    |    |--1.72%-- __pte_offset_map_lock
> |    |
> |    |--30.43%-- tlb_finish_mmu

Right yes thanks for providing this.

I'm still not convinced by this approach however, it feels like you're papering
over a crack for a problematic hack that needs to be solved at a different
level.

It feels like the whole waiting around thing is a hack to paper over something
and then we're introducing another hack to make that work in a specific
scenario.

I also am not clear (perhaps you answered it elsewhere) how you're encountering
this at a scale for it to be a meaningful issue?

Also not sure we should be changing core mm to support perf issues with using an
effectively-deprecated interface (cgroup v1)?

>
> > >
> > > When a process exits, exit_mmap() traverses the vma's maple tree from low to high
> > > address. To reduce the chance of unmapping the same vma simultaneously,
> > > the OOM reaper should traverse vma's tree from high to low address. This reduces
> > > lock contention when unmapping the same vma.
> >
> > Are they going to run through and do their work in exactly the same time,
> > or might one 'run past' the other and you still have an issue?
> >
> > Seems very vague and timing dependent and again, not convincing.
> >
> > >
> > > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > > ---
> > >  include/linux/mm.h | 3 +++
> > >  mm/oom_kill.c      | 9 +++++++--
> > >  2 files changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 0c44bb8ce544..b665ea3c30eb 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -923,6 +923,9 @@ static inline void vma_iter_set(struct vma_iterator *vmi, unsigned long addr)
> > >  #define for_each_vma_range(__vmi, __vma, __end)				\
> > >  	while (((__vma) = vma_find(&(__vmi), (__end))) != NULL)
> > >
> > > +#define for_each_vma_reverse(__vmi, __vma)					\
> > > +	while (((__vma) = vma_prev(&(__vmi))) != NULL)
> >
> > Please don't casually add an undocumented public VMA iterator hidden in a
> > patch doing something else :)
> >
> > Won't this skip the first VMA? Not sure this is really worth having as a
> > general thing anyway, it's not many people who want to do this in reverse.
>
> I got it. mas_find_rev() should be used instead of vma_prev().
>
> > > +
> > >  #ifdef CONFIG_SHMEM
> > >  /*
> > >   * The vma_is_shmem is not inline because it is used only by slow
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 7ae4001e47c1..602d6836098a 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -517,7 +517,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  {
> > >  	struct vm_area_struct *vma;
> > >  	bool ret = true;
> > > -	VMA_ITERATOR(vmi, mm, 0);
> > > +	VMA_ITERATOR(vmi, mm, ULONG_MAX);
> > >
> > >  	/*
> > >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > > @@ -527,7 +527,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> > >  	 */
> > >  	set_bit(MMF_UNSTABLE, &mm->flags);
> > >
> > > -	for_each_vma(vmi, vma) {
> > > +	/*
> > > +	 * When two tasks unmap the same vma at the same time, they may contend for the
> > > +	 * pte spinlock. To avoid traversing the same vma as exit_mmap unmap, traverse
> > > +	 * the vma maple tree in reverse order.
> > > +	 */
> >
> > Except you won't necessarily avoid anything, as if one walker is faster
> > than the other they'll run ahead, plus of course they'll have a cross-over
> > where they share the same PTE anyway.
> >
> > I feel like maybe you've got a fairly specific situation that indicates an
> > issue elsewhere and you're maybe solving the wrong problem here?
> >
> > > +	for_each_vma_reverse(vmi, vma) {
> > >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> > >  			continue;
> > >
> > > --
> > > 2.17.1
> > >
> > >


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-20  2:53         ` Davidlohr Bueso
@ 2025-08-21 18:13           ` Michal Hocko
  2025-08-21 19:45             ` Davidlohr Bueso
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2025-08-21 18:13 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: zhongjinji, akpm, andrealmeid, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx

On Tue 19-08-25 19:53:08, Davidlohr Bueso wrote:
> On Tue, 19 Aug 2025, Michal Hocko wrote:
> 
> > On Mon 18-08-25 20:08:19, zhongjinji wrote:
> > > If a process holding robust futexes gets frozen, robust futexes might be reaped before
> > > futex_cleanup() runs when an OOM occurs. I am not sure if this will actually happen.
> > 
> > Yes, and a 2s delay will never rule that out, especially for frozen tasks,
> > which could stay frozen indefinitely. That is not the point I was trying
> > to make. I was suggesting not treating futexes specially because no matter
> > what we do this will always be racy, a hack that merely reduces the risk. We
> > simply cannot deal with that case more gracefully without major
> > surgery to the futex implementation, which is not desirable for this
> > specific reason.
> 
> Yeah, relying on time as a fix is never a good idea. I was going to suggest
> skipping the reaping for tasks with a robust list, 

let me reiterate that the purpose of the oom reaper is not to optimize
the oom-killing process. It is crucial for guaranteeing forward
progress in the OOM situation by a) asynchronously reclaiming the oom
victim's memory and b) unblocking oom selection of a different process
after a) is done. That means that the victim cannot block the oom
situation forever. Therefore we cannot really skip tasks with robust
futexes, or any other user processes, without giving up b) at the same
time.

The current delay is something we can tune and still keep b) in place.
The normal mode of operation is that the oom reaper has nothing to do,
and that is really a good thing.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes
  2025-08-21 18:13           ` Michal Hocko
@ 2025-08-21 19:45             ` Davidlohr Bueso
  0 siblings, 0 replies; 26+ messages in thread
From: Davidlohr Bueso @ 2025-08-21 19:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: zhongjinji, akpm, andrealmeid, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx

On Thu, 21 Aug 2025, Michal Hocko wrote:

>On Tue 19-08-25 19:53:08, Davidlohr Bueso wrote:
>> Yeah, relying on time as a fix is never a good idea. I was going to suggest
>> skipping the reaping for tasks with a robust list,
>
>let me reiterate that the purpose of the oom reaper is not to optimize
>the oom-killing process. It is crucial for guaranteeing forward
>progress in the OOM situation by a) asynchronously reclaiming the oom
>victim's memory and b) unblocking oom selection of a different process
>after a) is done. That means that the victim cannot block the oom
>situation forever. Therefore we cannot really skip tasks with robust
>futexes, or any other user processes, without giving up b) at the same
>time.

Yes, which is why I indicated that skipping it was less practical.

In the real world, users who care enough to use robust futexes should
make sure that their applications keep the OOM killer away altogether.

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders
  2025-08-21  9:32       ` Lorenzo Stoakes
@ 2025-08-25 14:12         ` zhongjinji
  0 siblings, 0 replies; 26+ messages in thread
From: zhongjinji @ 2025-08-25 14:12 UTC (permalink / raw)
  To: lorenzo.stoakes
  Cc: akpm, andrealmeid, dave, dvhart, feng.han, liam.howlett,
	linux-kernel, linux-mm, liulu.liu, mhocko, mingo, npache, peterz,
	rientjes, shakeel.butt, tglx, zhongjinji

> >
> > |--99.74%-- oom_reaper
> > |    |--76.67%-- unmap_page_range
> > |    |    |--33.70%-- __pte_offset_map_lock
> > |    |    |    |--98.46%-- _raw_spin_lock
> > |    |    |--27.61%-- free_swap_and_cache_nr
> > |    |    |--16.40%-- folio_remove_rmap_ptes
> > |    |    |--12.25%-- tlb_flush_mmu
> > |    |--12.61%-- tlb_finish_mmu
> >
> >
> > |--98.84%-- oom_reaper
> > |    |--53.45%-- unmap_page_range
> > |    |    |--24.29%-- [hit in function]
> > |    |    |--48.06%-- folio_remove_rmap_ptes
> > |    |    |--17.99%-- tlb_flush_mmu
> > |    |    |--1.72%-- __pte_offset_map_lock
> > |    |
> > |    |--30.43%-- tlb_finish_mmu
> 
> Right yes thanks for providing this.
> 
> I'm still not convinced by this approach however, it feels like you're papering
> over a crack for a problematic hack that needs to be solved at a different
> level.
> 
> It feels like the whole waiting around thing is a hack to paper over something
> and then we're introducing another hack to make that work in a specific
> scenario.
> 
> I also am not clear (perhaps you answered it elsewhere) how you're encountering
> this at a scale for it to be a meaningful issue?

On low-memory Android devices, high memory pressure often requires killing
processes to free memory, which is generally accepted there. When a process
is killed on Android, an asynchronous reap mechanism also runs; it is
implemented through process_mrelease and is similar to the oom reaper. OOM
events are not rare either. Therefore, it makes sense to reduce the load on
the reaper.
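
As a rough userspace sketch of that Android-style flow (my own
illustration, not Android's actual code; it assumes a v5.15+ kernel that
provides process_mrelease):

#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kill a process via its pidfd, then reap its memory asynchronously. */
static int kill_and_reap(pid_t pid)
{
	int pidfd = syscall(SYS_pidfd_open, pid, 0);

	if (pidfd < 0)
		return -1;
	if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0 ||
	    syscall(SYS_process_mrelease, pidfd, 0) < 0) {
		close(pidfd);
		return -1;
	}
	close(pidfd);
	return 0;
}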

> Also not sure we should be changing core mm to support perf issues with using an
> effectively-deprecated interface (cgroup v1)?

Yeah, it is not that appealing.



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2025-08-25 14:12 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-14 13:55 [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
2025-08-14 13:55 ` [PATCH v4 1/3] futex: Introduce function process_has_robust_futex() zhongjinji
2025-08-14 13:55 ` [PATCH v4 2/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes zhongjinji
2025-08-15 14:41   ` Lorenzo Stoakes
2025-08-18 14:14     ` zhongjinji
2025-08-17 19:37   ` Michal Hocko
2025-08-18 12:08     ` zhongjinji
2025-08-19 10:49       ` Michal Hocko
2025-08-20  2:53         ` Davidlohr Bueso
2025-08-21 18:13           ` Michal Hocko
2025-08-21 19:45             ` Davidlohr Bueso
2025-08-14 13:55 ` [PATCH v4 3/3] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite orders zhongjinji
2025-08-14 23:09   ` Andrew Morton
2025-08-15 16:32     ` zhongjinji
2025-08-15 17:52     ` gio
2025-08-15 17:53       ` gio
2025-08-15 14:29   ` Lorenzo Stoakes
2025-08-15 15:01     ` Lorenzo Stoakes
2025-08-15 17:37     ` zhongjinji
2025-08-19 15:18     ` zhongjinji
2025-08-21  9:32       ` Lorenzo Stoakes
2025-08-25 14:12         ` zhongjinji
2025-08-15 14:41   ` Liam R. Howlett
2025-08-15 16:05     ` Liam R. Howlett
2025-08-14 23:13 ` [PATCH v4 0/3] mm/oom_kill: Only delay OOM reaper for processes using robust futexes Andrew Morton
2025-08-15 17:06   ` zhongjinji
