linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/2] Do not delay oom reaper when the victim is frozen
@ 2025-08-25 13:38 zhongjinji
  2025-08-25 13:38 ` [PATCH v5 1/2] mm/oom_kill: " zhongjinji
  2025-08-25 13:38 ` [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order zhongjinji
  0 siblings, 2 replies; 20+ messages in thread
From: zhongjinji @ 2025-08-25 13:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, liulu.liu, feng.han, zhongjinji

Patch 1 avoids delaying the oom reaper when the victim is frozen; patch 2
makes the OOM reaper and exit_mmap() traverse the maple tree in opposite
orders to reduce PTE lock contention caused by unmapping the same vma.

About patch 1:
Patch 1 uses frozen() to check the frozen state of a single thread to
determine if a process is frozen, rather than checking all threads,
because the frozen state of all threads in a process will eventually be
consistent. There is no need to strictly confirm that all threads are
frozen; it is only necessary to check whether the process has been frozen
or is about to be frozen.
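For reference, on recent kernels (the ~6.1 freezer rework) frozen() is
essentially a single state-bit test on the task, which is why checking one
thread is cheap; roughly:

```
/* Sketch per include/linux/freezer.h on ~6.1+ kernels; consult the
 * running kernel's headers for the authoritative definition. */
static inline bool frozen(struct task_struct *p)
{
	return READ_ONCE(p->__state) & TASK_FROZEN;
}
```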

When a process is frozen and cannot be unfrozen promptly, even the delayed
two-second oom reaper cannot guarantee that robust futexes will not be
reaped, so processes holding robust futexes should not be frozen in the
first place. This patch does not make issue [1] worse.

About patch 2:
I tested the changes of patch 2 on Android. The reproduction steps are as
follows: start a process, kill it the way an oom kill does, and actively
queue it to the oom reaper.

The perf data applying patch 1 but not patch 2:
|--99.74%-- oom_reaper
|  |--76.67%-- unmap_page_range
|  |  |--33.70%-- __pte_offset_map_lock
|  |  |  |--98.46%-- _raw_spin_lock
|  |  |--27.61%-- free_swap_and_cache_nr
|  |  |--16.40%-- folio_remove_rmap_ptes
|  |  |--12.25%-- tlb_flush_mmu
|  |--12.61%-- tlb_finish_mmu

The perf data applying patch 1 and patch 2:
|--98.84%-- oom_reaper
|  |--53.45%-- unmap_page_range
|  |  |--24.29%-- [hit in function]
|  |  |--48.06%-- folio_remove_rmap_ptes
|  |  |--17.99%-- tlb_flush_mmu
|  |  |--1.72%-- __pte_offset_map_lock
|  |  
|  |--30.43%-- tlb_finish_mmu

It is obvious that contention on the pte spinlock is very intense when
both tasks traverse the tree along the same path.

On low-memory Android devices, high memory pressure often requires killing
processes to free memory, which is generally accepted on Android. lmkd, a
user-space program that actively kills processes, needs to asynchronously
call process_mrelease to release memory from killed processes, similar to
the oom reaper. At the same time, OOM events are not rare. Therefore,
reducing lock contention on __oom_reap_task_mm is meaningful.

Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]

---
v4 -> v5:
1. Detect the frozen state of the process instead of checking the futex state,
   as special handling of futex locks should be avoided during OOM kill [2].
2. Use mas_find_rev() to traverse the VMA tree instead of vma_prev(), because
   vma_prev() may skip the first VMA and should not be used here. [3]
3. Just check should_delay_oom_reap() in queue_oom_reaper() since it is not
   a hot path. [4]

v4 link:
https://lore.kernel.org/linux-mm/20250814135555.17493-1-zhongjinji@honor.com/

v3 -> v4:
1. Rename check_robust_futex() to process_has_robust_futex() for clearer
   intent.
2. Because the delay_reap parameter was added to task_will_free_mem(),
   the function is renamed to should_reap_task() to better clarify
   its purpose.
3. Add should_delay_oom_reap() to decide whether to delay OOM reap.
4. Modify the OOM reaper to traverse the maple tree in reverse order; see patch
   3 for details.
These changes improve code readability and enhance OOM reaper behavior.

v3 link:  
https://lore.kernel.org/all/20250804030341.18619-1-zhongjinji@honor.com/
https://lore.kernel.org/all/20250804030341.18619-2-zhongjinji@honor.com/

v2 -> v3:
1. Fixed the error in the Subject prefix, changing it from futex to
   mm/oom_kill.
v2 link:
https://lore.kernel.org/linux-mm/20250801153649.23244-1-zhongjinji@honor.com/
https://lore.kernel.org/linux-mm/20250801153649.23244-2-zhongjinji@honor.com/

v1 -> v2:
1. Check the robust_list of all threads instead of just a single thread.
v1 link:
https://lore.kernel.org/linux-mm/20250731102904.8615-1-zhongjinji@honor.com/

Reference:
https://lore.kernel.org/linux-mm/aKRWtjRhE_HgFlp2@tiehlicka/ [2]
https://lore.kernel.org/linux-mm/26larxehoe3a627s4fxsqghriwctays4opm4hhme3uk7ybjc5r@pmwh4s4yv7lm/ [3]
https://lore.kernel.org/linux-mm/d5013a33-c08a-44c5-a67f-9dc8fd73c969@lucifer.local/ [4]


zhongjinji (2):
  mm/oom_kill: Do not delay oom reaper when the victim is frozen
  mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple
    tree in opposite order

 mm/oom_kill.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-25 13:38 [PATCH v5 0/2] Do not delay oom reaper when the victim is frozen zhongjinji
@ 2025-08-25 13:38 ` zhongjinji
  2025-08-25 19:41   ` Shakeel Butt
                     ` (2 more replies)
  2025-08-25 13:38 ` [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order zhongjinji
  1 sibling, 3 replies; 20+ messages in thread
From: zhongjinji @ 2025-08-25 13:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, liulu.liu, feng.han, zhongjinji

The OOM reaper can quickly reap a process's memory when the system
encounters OOM, helping the system recover. If the victim process is
frozen and cannot be unfrozen in time, the reaper delayed by two seconds
will cause the system to fail to recover quickly from the OOM state.

When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
will keep the system in a bad state for two seconds. Before scheduling the
oom_reaper task, check whether the victim is in a frozen state. If the
victim is frozen, do not delay the OOM reaper.

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..4b4d73b1e00d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
 	wake_up(&oom_reaper_wait);
 }
 
+/*
+ * When the victim is frozen, the OOM reaper should not be delayed, because
+ * if the victim cannot be unfrozen promptly, it may block the system from
+ * quickly recovering from the OOM state.
+ */
+static bool should_delay_oom_reap(struct task_struct *tsk)
+{
+	struct mm_struct *mm = tsk->mm;
+	struct task_struct *p;
+	bool ret = false;
+
+	if (!mm)
+		return true;
+
+	if (!frozen(tsk))
+		return true;
+
+	if (atomic_read(&mm->mm_users) <= 1)
+		return false;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(tsk, p))
+			continue;
+		ret = !frozen(p);
+		if (ret)
+			break;
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Give the OOM victim time to exit naturally before invoking the oom_reaping.
  * The timers timeout is arbitrary... the longer it is, the longer the worst
@@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
 #define OOM_REAPER_DELAY (2*HZ)
 static void queue_oom_reaper(struct task_struct *tsk)
 {
+	bool delay;
+
 	/* mm is already queued? */
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
+	delay = should_delay_oom_reap(tsk);
 	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
-	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
 	add_timer(&tsk->oom_reaper_timer);
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-25 13:38 [PATCH v5 0/2] Do not delay oom reaper when the victim is frozen zhongjinji
  2025-08-25 13:38 ` [PATCH v5 1/2] mm/oom_kill: " zhongjinji
@ 2025-08-25 13:38 ` zhongjinji
  2025-08-26 12:53   ` Lorenzo Stoakes
  1 sibling, 1 reply; 20+ messages in thread
From: zhongjinji @ 2025-08-25 13:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, liulu.liu, feng.han, zhongjinji

When a process is OOM killed without reaper delay, the oom reaper and the
exit_mmap() thread likely run simultaneously. They traverse the vma's maple
tree along the same path and may easily unmap the same vma, causing them to
compete for the pte spinlock.

When a process exits, exit_mmap() traverses the vma's maple tree from low
to high addresses. To reduce the chance of unmapping the same vma
simultaneously, the OOM reaper should traverse the vma's tree from high to
low address.

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 mm/oom_kill.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b4d73b1e00d..a0650da9ec9c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend for the
+	 * pte spinlock. To reduce the probability of them unmapping the same vma, the
+	 * oom reaper traverse the vma maple tree in reverse order.
+	 */
+	while ((vma = mas_find_rev(&mas, 0)) != NULL) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-25 13:38 ` [PATCH v5 1/2] mm/oom_kill: " zhongjinji
@ 2025-08-25 19:41   ` Shakeel Butt
  2025-08-26 13:01     ` zhongjinji
  2025-08-26 12:43   ` Lorenzo Stoakes
  2025-08-27 12:08   ` Michal Hocko
  2 siblings, 1 reply; 20+ messages in thread
From: Shakeel Butt @ 2025-08-25 19:41 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, liulu.liu, feng.han, cgroups

+cgroups

On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.
> 
> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..4b4d73b1e00d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
>  	wake_up(&oom_reaper_wait);
>  }
>  
> +/*
> + * When the victim is frozen, the OOM reaper should not be delayed, because
> + * if the victim cannot be unfrozen promptly, it may block the system from
> + * quickly recovering from the OOM state.
> + */
> +static bool should_delay_oom_reap(struct task_struct *tsk)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *p;
> +	bool ret;
> +

On v2, shouldn't READ_ONCE(tsk->frozen) be enough instead of the mm check
and the checks inside for_each_process()?

> +	if (!mm)
> +		return true;
> +
> +	if (!frozen(tsk))
> +		return true;
> +
> +	if (atomic_read(&mm->mm_users) <= 1)
> +		return false;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (!process_shares_mm(p, mm))
> +			continue;
> +		if (same_thread_group(tsk, p))
> +			continue;
> +		ret = !frozen(p);
> +		if (ret)
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  /*
>   * Give the OOM victim time to exit naturally before invoking the oom_reaping.
>   * The timers timeout is arbitrary... the longer it is, the longer the worst
> @@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
>  #define OOM_REAPER_DELAY (2*HZ)
>  static void queue_oom_reaper(struct task_struct *tsk)
>  {
> +	bool delay;
> +
>  	/* mm is already queued? */
>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>  		return;
>  
>  	get_task_struct(tsk);
> +	delay = should_delay_oom_reap(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
>  	add_timer(&tsk->oom_reaper_timer);
>  }
>  
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-25 13:38 ` [PATCH v5 1/2] mm/oom_kill: " zhongjinji
  2025-08-25 19:41   ` Shakeel Butt
@ 2025-08-26 12:43   ` Lorenzo Stoakes
  2025-08-27 12:08   ` Michal Hocko
  2 siblings, 0 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-08-26 12:43 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel,
	tglx, liam.howlett, liulu.liu, feng.han

On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.

Be good to reference the commit where this was introduced.

>
> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

This is a lot better than the previous version, thanks! :)

> ---
>  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..4b4d73b1e00d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
>  	wake_up(&oom_reaper_wait);
>  }
>
> +/*
> + * When the victim is frozen, the OOM reaper should not be delayed, because
> + * if the victim cannot be unfrozen promptly, it may block the system from
> + * quickly recovering from the OOM state.
> + */

You should put comments like this with each of the predicates, so e.g. this
comment should be above the frozen check, and then you should write equivalent
ones for the rest.

However, if Shakeel's correct, you can vastly simplify this further, so
obviously in that instance you can reduce to the single comment.

> +static bool should_delay_oom_reap(struct task_struct *tsk)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *p;
> +	bool ret;
> +
> +	if (!mm)
> +		return true;
> +
> +	if (!frozen(tsk))
> +		return true;
> +
> +	if (atomic_read(&mm->mm_users) <= 1)
> +		return false;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (!process_shares_mm(p, mm))
> +			continue;
> +		if (same_thread_group(tsk, p))
> +			continue;
> +		ret = !frozen(p);
> +		if (ret)
> +			break;
> +	}
> +	rcu_read_unlock();

Surely this must already exist as a helper somewhere (being lazy + not
checking); if not, it seems a prime candidate for one.

> +
> +	return ret;
> +}
> +
>  /*
>   * Give the OOM victim time to exit naturally before invoking the oom_reaping.
>   * The timers timeout is arbitrary... the longer it is, the longer the worst
> @@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
>  #define OOM_REAPER_DELAY (2*HZ)
>  static void queue_oom_reaper(struct task_struct *tsk)
>  {
> +	bool delay;
> +
>  	/* mm is already queued? */
>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>  		return;
>
>  	get_task_struct(tsk);
> +	delay = should_delay_oom_reap(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);

I mean, unless there's some reason not to, why not simplify to:

	tsk->oom_reaper_timer.expires = jiffies;
	if (should_delay_oom_reap(tsk))
		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;

Which spells things out and avoids the other noise.

>  	add_timer(&tsk->oom_reaper_timer);
>  }
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-25 13:38 ` [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order zhongjinji
@ 2025-08-26 12:53   ` Lorenzo Stoakes
  2025-08-26 13:37     ` Liam R. Howlett
  2025-08-29  7:14     ` Michal Hocko
  0 siblings, 2 replies; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-08-26 12:53 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel,
	tglx, liam.howlett, liulu.liu, feng.han

On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> When a process is OOM killed without reaper delay, the oom reaper and the
> exit_mmap() thread likely run simultaneously. They traverse the vma's maple
> tree along the same path and may easily unmap the same vma, causing them to
> compete for the pte spinlock.
>
> When a process exits, exit_mmap() traverses the vma's maple tree from low
> to high addresses. To reduce the chance of unmapping the same vma
> simultaneously, the OOM reaper should traverse the vma's tree from high to
> low address.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

I will leave it to Liam to confirm the maple tree bit is ok, but I guess
I'm softening to the idea of doing this - because it should have no impact
on most users, so even if it's some rare edge case that triggers the
situation, then it's worth doing it in reverse just to help you guys out :)

Liam - please confirm this is good from your side, and then I can add a tag!

Cheers, Lorenzo

> ---
>  mm/oom_kill.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4b4d73b1e00d..a0650da9ec9c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
>
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend for the
> +	 * pte spinlock. To reduce the probability of them unmapping the same vma, the
> +	 * oom reaper traverse the vma maple tree in reverse order.
> +	 */
> +	while ((vma = mas_find_rev(&mas, 0)) != NULL) {

It's a pity there isn't a nicer formulation of this but this is probably
the least worst way of doing it.

>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-25 19:41   ` Shakeel Butt
@ 2025-08-26 13:01     ` zhongjinji
  0 siblings, 0 replies; 20+ messages in thread
From: zhongjinji @ 2025-08-26 13:01 UTC (permalink / raw)
  To: shakeel.butt
  Cc: akpm, cgroups, feng.han, liam.howlett, linux-kernel, linux-mm,
	liulu.liu, lorenzo.stoakes, mhocko, rientjes, tglx, zhongjinji

> +cgroups
> 
> On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> > The OOM reaper can quickly reap a process's memory when the system
> > encounters OOM, helping the system recover. If the victim process is
> > frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> > will cause the system to fail to recover quickly from the OOM state.
> > 
> > When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> > will keep the system in a bad state for two seconds. Before scheduling the
> > oom_reaper task, check whether the victim is in a frozen state. If the
> > victim is frozen, do not delay the OOM reaper.
> > 
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 39 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 25923cfec9c6..4b4d73b1e00d 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
> >  	wake_up(&oom_reaper_wait);
> >  }
> >  
> > +/*
> > + * When the victim is frozen, the OOM reaper should not be delayed, because
> > + * if the victim cannot be unfrozen promptly, it may block the system from
> > + * quickly recovering from the OOM state.
> > + */
> > +static bool should_delay_oom_reap(struct task_struct *tsk)
> > +{
> > +	struct mm_struct *mm = tsk->mm;
> > +	struct task_struct *p;
> > +	bool ret;
> > +
> 
> On v2, shouldn't READ_ONCE(tsk->frozen) be enough instead of the mm check
> and the checks inside for_each_process()?

Thank you.
It is mainly to check the processes created by vfork. I believe no one would
put them into different cgroups, so I also think that READ_ONCE(tsk->frozen)
is enough. I will update it.

> > +	if (!mm)
> > +		return true;
> > +
> > +	if (!frozen(tsk))
> > +		return true;
> > +
> > +	if (atomic_read(&mm->mm_users) <= 1)
> > +		return false;
> > +
> > +	rcu_read_lock();
> > +	for_each_process(p) {

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-26 12:53   ` Lorenzo Stoakes
@ 2025-08-26 13:37     ` Liam R. Howlett
  2025-08-26 13:50       ` Lorenzo Stoakes
  2025-08-29  7:14     ` Michal Hocko
  1 sibling, 1 reply; 20+ messages in thread
From: Liam R. Howlett @ 2025-08-26 13:37 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: zhongjinji, mhocko, rientjes, shakeel.butt, akpm, linux-mm,
	linux-kernel, tglx, liulu.liu, feng.han

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 08:53]:
> On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> > When a process is OOM killed without reaper delay, the oom reaper and the
> > exit_mmap() thread likely run simultaneously. They traverse the vma's maple
> > tree along the same path and may easily unmap the same vma, causing them to
> > compete for the pte spinlock.
> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from low
> > to high addresses. To reduce the chance of unmapping the same vma
> > simultaneously, the OOM reaper should traverse the vma's tree from high to
> > low address.
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> 
> I will leave it to Liam to confirm the maple tree bit is ok, but I guess
> I'm softening to the idea of doing this - because it should have no impact
> on most users, so even if it's some rare edge case that triggers the
> situation, then it's worth doing it in reverse just to help you guys out :)
> 

I really don't think this is worth doing.  We're avoiding a race between
oom and a task unmap - the MMF bits should be used to avoid this race -
or at least mitigate it.

They are probably both under the read lock, but considering how rare it
would be, would a racy flag check be enough - it is hardly critical to
get right.  Either would reduce the probability.

> Liam - please confirm this is good from your side, and then I can add a tag!
> 
> Cheers, Lorenzo
> 
> > ---
> >  mm/oom_kill.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 4b4d73b1e00d..a0650da9ec9c 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
                                  ^^^^^^^^^  ^^
You have set the index larger than the last.  It (probably?) works, but
isn't correct and may stop working, so let's fix it.

MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);


> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend for the
> > +	 * pte spinlock. To reduce the probability of them unmapping the same vma, the
> > +	 * oom reaper traverse the vma maple tree in reverse order.
> > +	 */
> > +	while ((vma = mas_find_rev(&mas, 0)) != NULL) {
> 
> It's a pity there isn't a nicer formulation of this but this is probably
> the least worst way of doing it.
> 

mas_for_each_rev() exists for this use case.

You will find that the implementation is very close to what you see
here. :)
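(For reference, the reversed walk with that helper would look roughly like
this — an untested sketch against the maple tree API:)

```
	struct vm_area_struct *vma;
	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

	/* Walk from the highest-addressed VMA down to address 0. */
	mas_for_each_rev(&mas, vma, 0) {
		/* ... unmap vma ... */
	}
```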

> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-26 13:37     ` Liam R. Howlett
@ 2025-08-26 13:50       ` Lorenzo Stoakes
  2025-08-26 15:21         ` Liam R. Howlett
  0 siblings, 1 reply; 20+ messages in thread
From: Lorenzo Stoakes @ 2025-08-26 13:50 UTC (permalink / raw)
  To: Liam R. Howlett, zhongjinji, mhocko, rientjes, shakeel.butt, akpm,
	linux-mm, linux-kernel, tglx, liulu.liu, feng.han

On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> I really don't think this is worth doing.  We're avoiding a race between
> oom and a task unmap - the MMF bits should be used to avoid this race -
> or at least mitigate it.

Yes for sure, as explored at length in previous discussions this feels like
we're papering over cracks here.

_However_, I'm sort of ok with a minimalistic fix that solves the proximate
issue even if it is that, as long as it doesn't cause issues in doing so.

So this is my take on the below and why I'm open to it!

>
> They are probably both under the read lock, but considering how rare it
> would be, would a racy flag check be enough - it is hardly critical to
> get right.  Either would reduce the probability.

zhongjinji - I'm still not sure that you've really indicated _why_ you're
seeing such a tight and unusual race. Presumably some truly massive number
of tasks being OOM'd and unmapping but... yeah that seems odd anyway.

But again, if we can safely fix this in a way that doesn't hurt stuff too
much I'm ok with it (of course, these are famous last words in the kernel
often...!)

Liam - are you open to a solution on the basis above, or do you feel we
ought simply to fix the underlying issue here?

To me we're at a simple enough implementation of this (esp. utilising the
helper you mention) that probably this is fine (like the meme,
or... hopefully not :)

I will go with your judgment here!

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-26 13:50       ` Lorenzo Stoakes
@ 2025-08-26 15:21         ` Liam R. Howlett
  2025-08-26 22:26           ` Shakeel Butt
  0 siblings, 1 reply; 20+ messages in thread
From: Liam R. Howlett @ 2025-08-26 15:21 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: zhongjinji, mhocko, rientjes, shakeel.butt, akpm, linux-mm,
	linux-kernel, tglx, liulu.liu, feng.han

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > I really don't think this is worth doing.  We're avoiding a race between
> > oom and a task unmap - the MMF bits should be used to avoid this race -
> > or at least mitigate it.
> 
> Yes for sure, as explored at length in previous discussions this feels like
> we're papering over cracks here.
> 
> _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> issue even if it is that, as long as it doesn't cause issues in doing so.
> 
> So this is my take on the below and why I'm open to it!
> 
> >
> > They are probably both under the read lock, but considering how rare it
> > would be, would a racy flag check be enough - it is hardly critical to
> > get right.  Either would reduce the probability.
> 
> Zongjinji - I'm stil not sure that you've really indicated _why_ you're
> seeing such a tight and unusual race. Presumably some truly massive number
> of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> 
> But again, if we can safely fix this in a way that doesn't hurt stuff too
> much I'm ok with it (of course, these are famous last words in the kernel
> often...!)
> 
> Liam - are you open to a solution on the basis above, or do you feel we
> ought simply to fix the underlying issue here?

At least this is a benign race.  I'd think using MMF_ to reduce the race
would achieve the same goal with less risk - which is why I bring it up.

Really, both methods should be low risk, so I'm fine with either way.

But I am interested in hearing how this race is happening enough to
necessitate a fix.  Reversing the iterator is a one-spot fix - if this
happens elsewhere then we're out of options.  Using the MMF_ flags is
more of a scalable fix, if it achieves the same results.

> 
> To me we're at a simple enough implementation of this (esp. utilising the
> helper you mention) that probably this is fine (like the meme,
> or... hopefully not :)
> 
> I will go with your judgment here!
> 
> Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-26 15:21         ` Liam R. Howlett
@ 2025-08-26 22:26           ` Shakeel Butt
  2025-08-27  4:12             ` Liam R. Howlett
  0 siblings, 1 reply; 20+ messages in thread
From: Shakeel Butt @ 2025-08-26 22:26 UTC (permalink / raw)
  To: Liam R. Howlett, Lorenzo Stoakes, zhongjinji, mhocko, rientjes,
	akpm, linux-mm, linux-kernel, tglx, liulu.liu, feng.han

On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > I really don't think this is worth doing.  We're avoiding a race between
> > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > or at least mitigate it.
> > 
> > Yes for sure, as explored at length in previous discussions this feels like
> > we're papering over cracks here.
> > 
> > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > issue even if it is that, as long as it doesn't cause issues in doing so.
> > 
> > So this is my take on the below and why I'm open to it!
> > 
> > >
> > > They are probably both under the read lock, but considering how rare it
> > > would be, would a racy flag check be enough - it is hardly critical to
> > > get right.  Either would reduce the probability.
> > 
> > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > seeing such a tight and unusual race. Presumably some truly massive number
> > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > 
> > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > much I'm ok with it (of course, these are famous last words in the kernel
> > often...!)
> > 
> > Liam - are you open to a solution on the basis above, or do you feel we
> > ought simply to fix the underlying issue here?
> 
> At least this is a benign race. 

Is this really a race or rather a contention? IIUC exit_mmap and the oom
reaper are trying to unmap the address space of the oom-killed process
and can compete on page table locks. If both are running concurrently on
two cpus then the contention can continue for whole address space and
can slow down the actual memory freeing. Making the oom reaper traverse
in the opposite direction can drastically reduce the contention and speed
up memory freeing.

> I'd think using MMF_ to reduce the race
> would achieve the same goal with less risk - which is why I bring it up.
> 

With MMF_ flag, are you suggesting oom reaper to skip the unmapping of
the oom-killed process?

> Really, both methods should be low risk, so I'm fine with either way.
> 
> But I am interested in hearing how this race is happening enough to
> necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> happens elsewhere then we're out of options.  Using the MMF_ flags is
> more of a scalable fix, if it achieves the same results.

On the question of whether this is a rare situation and worth the patch:
I would say this scenario is not that rare, particularly on low-memory
devices and on highly utilized, overcommitted systems. Memory pressure
and oom-kills are the norm on such systems. The point of the oom reaper
is to bring the system out of the oom situation quickly, and having two
cpus unmapping the oom-killed process can potentially bring the system
out of the oom situation faster.

I think the patch (with your suggestions) is simple enough and I don't
see any risk in including it.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-26 22:26           ` Shakeel Butt
@ 2025-08-27  4:12             ` Liam R. Howlett
  2025-08-27  4:25               ` Liam R. Howlett
                                 ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Liam R. Howlett @ 2025-08-27  4:12 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Lorenzo Stoakes, zhongjinji, mhocko, rientjes, akpm, linux-mm,
	linux-kernel, tglx, liulu.liu, feng.han

+ Cc Suren since he has worked on the exit_mmap() path a lot.

* Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > I really don't think this is worth doing.  We're avoiding a race between
> > > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > > or at least mitigate it.
> > > 
> > > Yes for sure, as explored at length in previous discussions this feels like
> > > we're papering over cracks here.
> > > 
> > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > > issue even if it is that, as long as it doesn't cause issues in doing so.
> > > 
> > > So this is my take on the below and why I'm open to it!
> > > 
> > > >
> > > > They are probably both under the read lock, but considering how rare it
> > > > would be, would a racy flag check be enough - it is hardly critical to
> > > > get right.  Either would reduce the probability.
> > > 
> > > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > > seeing such a tight and unusual race. Presumably some truly massive number
> > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > > 
> > > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > > much I'm ok with it (of course, these are famous last words in the kernel
> > > often...!)
> > > 
> > > Liam - are you open to a solution on the basis above, or do you feel we
> > > ought simply to fix the underlying issue here?
> > 
> > At least this is a benign race. 
> 
> Is this really a race or rather a contention? IIUC exit_mmap and the oom
> reaper are trying to unmap the address space of the oom-killed process
> and can compete on page table locks. If both are running concurrently on
> two cpus then the contention can continue for whole address space and
> > can slow down the actual memory freeing. Making the oom reaper traverse
> > in the opposite direction can drastically reduce the contention and speed
> > up memory freeing.

It is two readers of the vma tree racing to lock the page tables for
each vma, so I guess you can see it as contention as well.. but since
the pte is a split lock, I see it as racing through vmas to see who hits
which lock first.  The smart money is on the oom killer as it skips some
vmas :)

If it were just contention, then the loop direction wouldn't matter..
but I do see your point.

> > I'd think using MMF_ to reduce the race
> > would achieve the same goal with less risk - which is why I bring it up.
> > 
> 
> With MMF_ flag, are you suggesting oom reaper to skip the unmapping of
> the oom-killed process?

Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
path to reduce the possibility of the race/contention.

> 
> > Really, both methods should be low risk, so I'm fine with either way.
> > 
> > But I am interested in hearing how this race is happening enough to
> > necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> > happens elsewhere then we're out of options.  Using the MMF_ flags is
> > more of a scalable fix, if it achieves the same results.
> 
> On the question of whether this is a rare situation and worth the patch:
> I would say this scenario is not that rare, particularly on low-memory
> devices and on highly utilized, overcommitted systems. Memory pressure
> and oom-kills are the norm on such systems. The point of the oom reaper
> is to bring the system out of the oom situation quickly, and having two
> cpus unmapping the oom-killed process can potentially bring the system
> out of the oom situation faster.

The exit_mmap() path used to run the oom reaper if it was an oom victim,
until recently [1].  The part that makes me nervous is the exit_mmap()
call to mmu_notifier_release(mm), while the oom reaper uses an
mmu_notifier.  I am not sure if there is an issue in ordering on any of
the platforms of such things.  Or the associated cost of the calls.

I mean, it's already pretty crazy that we have this in the exit:
mmap_read_lock()
   tlb_gather_mmu_fullmm()
     unmap vmas..
mmap_read_unlock()
mmap_write_lock()
   tlb_finish_mmu()..

So not only do we now have two tasks iterating over the vmas, but we
also have mmu notifiers and tlb calls happening across the ranges..  At
least doing all the work on a single cpu means that the hardware view is
consistent.  But I don't see this being worse than a forward race?

As it is written here, we'll have one CPU working in one direction while
the other works in the other, until both hit the end of the VMAs.  Only
when both tasks stop iterating the vmas can the exit continue since it
requires the write lock.

So the tlb_finish_mmu() in exit_mmap() will always be called after
tlb_finish_mmu() on each individual vma has run in the
__oom_reap_task_mm() context (when the race happens).

There is also a window here, between the exit_mmap() dropping the read
lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
will iterate through a list of vmas with zero memory to free and delay
the task exiting.  That is, wasting cpu and stopping the memory
associated with the mm_struct (vmas and such) from being freed.

I'm also not sure on the cpu cache effects of what we are doing and how
much that would play into the speedup.  My guess is that it's
insignificant compared to the time we spend under the pte, but we have
no numbers to go on.

So I'd like to know how likely the simultaneous runs are and if there is
a measurable gain?

I agree, that at face value, two cpus should be able to split the work..
but I don't know about the notifier or the holding up the mm_struct
associated memory.  And it could slow things down by holding up an
exiting task.

> 
> I think the patch (with your suggestions) is simple enough and I don't
> see any risk in including it.
> 

Actually, the more I look at this, the worse I feel about it..  Am I
overreacting?

Thanks,
Liam

[1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-27  4:12             ` Liam R. Howlett
@ 2025-08-27  4:25               ` Liam R. Howlett
  2025-08-27  9:55               ` zhongjinji
  2025-08-29  7:11               ` Michal Hocko
  2 siblings, 0 replies; 20+ messages in thread
From: Liam R. Howlett @ 2025-08-27  4:25 UTC (permalink / raw)
  To: Shakeel Butt, Lorenzo Stoakes, zhongjinji, mhocko, rientjes, akpm,
	linux-mm, linux-kernel, tglx, liulu.liu, feng.han,
	surenb@google.com

..actually add Suren this time.

* Liam R. Howlett <Liam.Howlett@oracle.com> [250827 00:12]:
> + Cc Suren since he has worked on the exit_mmap() path a lot.
> 
> * Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> > On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > > I really don't think this is worth doing.  We're avoiding a race between
> > > > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > > > or at least mitigate it.
> > > > 
> > > > Yes for sure, as explored at length in previous discussions this feels like
> > > > we're papering over cracks here.
> > > > 
> > > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > > > issue even if it is that, as long as it doesn't cause issues in doing so.
> > > > 
> > > > So this is my take on the below and why I'm open to it!
> > > > 
> > > > >
> > > > > They are probably both under the read lock, but considering how rare it
> > > > > would be, would a racy flag check be enough - it is hardly critical to
> > > > > get right.  Either would reduce the probability.
> > > > 
> > > > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > > > seeing such a tight and unusual race. Presumably some truly massive number
> > > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > > > 
> > > > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > > > much I'm ok with it (of course, these are famous last words in the kernel
> > > > often...!)
> > > > 
> > > > Liam - are you open to a solution on the basis above, or do you feel we
> > > > ought simply to fix the underlying issue here?
> > > 
> > > At least this is a benign race. 
> > 
> > Is this really a race or rather a contention? IIUC exit_mmap and the oom
> > reaper are trying to unmap the address space of the oom-killed process
> > and can compete on page table locks. If both are running concurrently on
> > two cpus then the contention can continue for whole address space and
> > can slow down the actual memory freeing. Making the oom reaper traverse
> > in the opposite direction can drastically reduce the contention and speed
> > up memory freeing.
> 
> It is two readers of the vma tree racing to lock the page tables for
> each vma, so I guess you can see it as contention as well.. but since
> the pte is a split lock, I see it as racing through vmas to see who hits
> which lock first.  The smart money is on the oom killer as it skips some
> vmas :)
> 
> If it were just contention, then the loop direction wouldn't matter..
> but I do see your point.
> 
> > > I'd think using MMF_ to reduce the race
> > > would achieve the same goal with less risk - which is why I bring it up.
> > > 
> > 
> > With MMF_ flag, are you suggesting oom reaper to skip the unmapping of
> > the oom-killed process?
> 
> Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
> path to reduce the possibility of the race/contention.
> 
> > 
> > > Really, both methods should be low risk, so I'm fine with either way.
> > > 
> > > But I am interested in hearing how this race is happening enough to
> > > necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> > > happens elsewhere then we're out of options.  Using the MMF_ flags is
> > > more of a scalable fix, if it achieves the same results.
> > 
> > On the question of whether this is a rare situation and worth the patch:
> > I would say this scenario is not that rare, particularly on low-memory
> > devices and on highly utilized, overcommitted systems. Memory pressure
> > and oom-kills are the norm on such systems. The point of the oom reaper
> > is to bring the system out of the oom situation quickly, and having two
> > cpus unmapping the oom-killed process can potentially bring the system
> > out of the oom situation faster.
> 
> The exit_mmap() path used to run the oom reaper if it was an oom victim,
> until recently [1].  The part that makes me nervous is the exit_mmap()
> call to mmu_notifier_release(mm), while the oom reaper uses an
> mmu_notifier.  I am not sure if there is an issue in ordering on any of
> the platforms of such things.  Or the associated cost of the calls.
> 
> I mean, it's already pretty crazy that we have this in the exit:
> mmap_read_lock()
>    tlb_gather_mmu_fullmm()
>      unmap vmas..
> mmap_read_unlock()
> mmap_write_lock()
>    tlb_finish_mmu()..
> 
> So not only do we now have two tasks iterating over the vmas, but we
> also have mmu notifiers and tlb calls happening across the ranges..  At
> least doing all the work on a single cpu means that the hardware view is
> consistent.  But I don't see this being worse than a forward race?
> 
> As it is written here, we'll have one CPU working in one direction while
> the other works in the other, until both hit the end of the VMAs.  Only
> when both tasks stop iterating the vmas can the exit continue since it
> requires the write lock.
> 
> So the tlb_finish_mmu() in exit_mmap() will always be called after
> tlb_finish_mmu() on each individual vma has run in the
> __oom_reap_task_mm() context (when the race happens).
> 
> There is also a window here, between the exit_mmap() dropping the read
> lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
> will iterate through a list of vmas with zero memory to free and delay
> the task exiting.  That is, wasting cpu and stopping the memory
> associated with the mm_struct (vmas and such) from being freed.
> 
> I'm also not sure on the cpu cache effects of what we are doing and how
> much that would play into the speedup.  My guess is that it's
> insignificant compared to the time we spend under the pte, but we have
> no numbers to go on.
> 
> So I'd like to know how likely the simultaneous runs are and if there is
> a measurable gain?
> 
> I agree, that at face value, two cpus should be able to split the work..
> but I don't know about the notifier or the holding up the mm_struct
> associated memory.  And it could slow things down by holding up an
> exiting task.
> 
> > 
> > I think the patch (with your suggestions) is simple enough and I don't
> > see any risk in including it.
> > 
> 
> Actually, the more I look at this, the worse I feel about it..  Am I
> overreacting?
> 
> Thanks,
> Liam
> 
> [1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-27  4:12             ` Liam R. Howlett
  2025-08-27  4:25               ` Liam R. Howlett
@ 2025-08-27  9:55               ` zhongjinji
  2025-08-27 15:57                 ` Suren Baghdasaryan
  2025-08-29  7:11               ` Michal Hocko
  2 siblings, 1 reply; 20+ messages in thread
From: zhongjinji @ 2025-08-27  9:55 UTC (permalink / raw)
  To: liam.howlett
  Cc: akpm, feng.han, linux-kernel, linux-mm, liulu.liu,
	lorenzo.stoakes, mhocko, rientjes, shakeel.butt, surenb, tglx,
	zhongjinji

> + Cc Suren since he has worked on the exit_mmap() path a lot.
 
Thank you for your assistance. I realize now that I should have
Cc'd Suren earlier.

> * Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> > On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > > I really don't think this is worth doing.  We're avoiding a race between
> > > > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > > > or at least mitigate it.
> > > > 
> > > > Yes for sure, as explored at length in previous discussions this feels like
> > > > we're papering over cracks here.
> > > > 
> > > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > > > issue even if it is that, as long as it doesn't cause issues in doing so.
> > > > 
> > > > So this is my take on the below and why I'm open to it!
> > > > 
> > > > >
> > > > > They are probably both under the read lock, but considering how rare it
> > > > > would be, would a racy flag check be enough - it is hardly critical to
> > > > > get right.  Either would reduce the probability.
> > > > 
> > > > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > > > seeing such a tight and unusual race. Presumably some truly massive number
> > > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > > > 
> > > > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > > > much I'm ok with it (of course, these are famous last words in the kernel
> > > > often...!)
> > > > 
> > > > Liam - are you open to a solution on the basis above, or do you feel we
> > > > ought simply to fix the underlying issue here?
> > > 
> > > At least this is a benign race. 
> > 
> > Is this really a race or rather a contention? IIUC exit_mmap and the oom
> > reaper are trying to unmap the address space of the oom-killed process
> > and can compete on page table locks. If both are running concurrently on
> > two cpus then the contention can continue for whole address space and
> > can slow down the actual memory freeing. Making the oom reaper traverse
> > in the opposite direction can drastically reduce the contention and speed
> > up memory freeing.
> 
> It is two readers of the vma tree racing to lock the page tables for
> each vma, so I guess you can see it as contention as well.. but since
> the pte is a split lock, I see it as racing through vmas to see who hits
> which lock first.  The smart money is on the oom killer as it skips some
> vmas :)
> 
> If it were just contention, then the loop direction wouldn't matter..
> but I do see your point.
> 
> > > I'd think using MMF_ to reduce the race
> > > would achieve the same goal with less risk - which is why I bring it up.
> > > 
> > 
> > With MMF_ flag, are you suggesting oom reaper to skip the unmapping of
> > the oom-killed process?
> 
> Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
> path to reduce the possibility of the race/contention.
> 
> > 
> > > Really, both methods should be low risk, so I'm fine with either way.
> > > 
> > > But I am interested in hearing how this race is happening enough to
> > > necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> > > happens elsewhere then we're out of options.  Using the MMF_ flags is
> > > more of a scalable fix, if it achieves the same results.
> > 
> > On the question of whether this is a rare situation and worth the patch:
> > I would say this scenario is not that rare, particularly on low-memory
> > devices and on highly utilized, overcommitted systems. Memory pressure
> > and oom-kills are the norm on such systems. The point of the oom reaper
> > is to bring the system out of the oom situation quickly, and having two
> > cpus unmapping the oom-killed process can potentially bring the system
> > out of the oom situation faster.
> 
> The exit_mmap() path used to run the oom reaper if it was an oom victim,
> until recently [1].  The part that makes me nervous is the exit_mmap()
> call to mmu_notifier_release(mm), while the oom reaper uses an
> mmu_notifier.  I am not sure if there is an issue in ordering on any of
> the platforms of such things.  Or the associated cost of the calls.
> 
> I mean, it's already pretty crazy that we have this in the exit:
> mmap_read_lock()
>    tlb_gather_mmu_fullmm()
>      unmap vmas..
> mmap_read_unlock()
> mmap_write_lock()
>    tlb_finish_mmu()..
> 
> So not only do we now have two tasks iterating over the vmas, but we
> also have mmu notifiers and tlb calls happening across the ranges..  At
> least doing all the work on a single cpu means that the hardware view is
> consistent.  But I don't see this being worse than a forward race?
> 
> As it is written here, we'll have one CPU working in one direction while
> the other works in the other, until both hit the end of the VMAs.  Only
> when both tasks stop iterating the vmas can the exit continue since it
> requires the write lock.
> 
> So the tlb_finish_mmu() in exit_mmap() will always be called after
> tlb_finish_mmu() on each individual vma has run in the
> __oom_reap_task_mm() context (when the race happens).
> 
> There is also a window here, between the exit_mmap() dropping the read
> lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
> will iterate through a list of vmas with zero memory to free and delay
> the task exiting.  That is, wasting cpu and stopping the memory
> associated with the mm_struct (vmas and such) from being freed.
> 
> I'm also not sure on the cpu cache effects of what we are doing and how
> much that would play into the speedup.  My guess is that it's
> insignificant compared to the time we spend under the pte, but we have
> no numbers to go on.
> 
> So I'd like to know how likely the simultaneous runs are and if there is
> a measurable gain?

Since process killing events are very frequent on Android, the likelihood of
exit_mmap and reaper work (not only OOM, but also some proactive reaping
actions such as process_mrelease) occurring at the same time is much higher.
When lmkd kills a process, it actively reaps the process using
process_mrelease, similar to the way the OOM reaper works. Surenb may be
able to clarify this point, as he is the author of lmkd.

I referenced this data at [1], and I should have included it in the cover
letter. The test data below was collected on Android. Before testing, in
order to simulate the OOM kill process, I intercepted the kill signal and
added the killed process to the OOM reaper queue.

The reproduction steps are as follows:
1. Start a process.
2. Kill the process.
3. Capture a perfetto trace.

Below are the load benefit data, measured by process running time:

Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the
same process, and they are the last threads to exit.

Thread             TID         State        Wall duration (ms)          total running
# with oom reaper but traverse reverse
#RxComputationT    13708       Running      60.690572
oom_reaper         81          Running      46.492032                   107.182604

# with oom reaper
vdp:vidtask:m      14040       Running      81.848297
oom_reaper         81          Running      69.32                       151.168297

# without oom reaper
tp-background      12424       Running      106.021874                  106.021874

Compared to reaping in the same direction when a process is killed,
reverse traversal provides approximately a 30% load benefit.
Compared to not reaping at all when a process is killed, it releases
memory earlier, with roughly 40% faster memory release.

[1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/

> I agree, that at face value, two cpus should be able to split the work..
> but I don't know about the notifier or the holding up the mm_struct
> associated memory.  And it could slow things down by holding up an
> exiting task.
> 
> > 
> > I think the patch (with your suggestions) is simple enough and I don't
> > see any risk in including it.
> > 
> 
> Actually, the more I look at this, the worse I feel about it..  Am I
> overreacting?
> 
> Thanks,
> Liam
> 
> [1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-25 13:38 ` [PATCH v5 1/2] mm/oom_kill: " zhongjinji
  2025-08-25 19:41   ` Shakeel Butt
  2025-08-26 12:43   ` Lorenzo Stoakes
@ 2025-08-27 12:08   ` Michal Hocko
  2025-08-27 12:14     ` zhongjinji
  2 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2025-08-27 12:08 UTC (permalink / raw)
  To: zhongjinji
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, liulu.liu, feng.han

On Mon 25-08-25 21:38:54, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.
> 
> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.

I do not think this changelog captures the essence of the change really
well and it suggests that this might be a performance optimization. As I
have explained on several occasions the oom reaper is not meant to be a
performance optimization but rather a forward progress guarantee. I
would suggest this wording instead.

"
The oom reaper is a mechanism to guarantee forward progress during an OOM
situation when the oom victim cannot terminate on its own (e.g. being
blocked in uninterruptible state or frozen by cgroup freezer). In order
to give the victim some time to terminate properly the oom reaper is
delayed in its invocation. This is particularly beneficial when the oom
victim is holding robust futex resources as the anonymous memory tear
down can break those.

On the other hand, tasks deliberately frozen by the freezer cgroup will
not wake up until they are thawed from userspace, so the delay is
effectively pointless. Therefore opt out from the delay for cgroup
frozen oom victims.
"

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-27 12:08   ` Michal Hocko
@ 2025-08-27 12:14     ` zhongjinji
  0 siblings, 0 replies; 20+ messages in thread
From: zhongjinji @ 2025-08-27 12:14 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, feng.han, liam.howlett, linux-kernel, linux-mm, liulu.liu,
	lorenzo.stoakes, rientjes, shakeel.butt, tglx, zhongjinji

> On Mon 25-08-25 21:38:54, zhongjinji wrote:
> > The OOM reaper can quickly reap a process's memory when the system
> > encounters OOM, helping the system recover. If the victim process is
> > frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> > will cause the system to fail to recover quickly from the OOM state.
> > 
> > When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> > will keep the system in a bad state for two seconds. Before scheduling the
> > oom_reaper task, check whether the victim is in a frozen state. If the
> > victim is frozen, do not delay the OOM reaper.
> 
> I do not think this changelog captures the essence of the change really
> well and it suggests that this might be a performance optimization. As I
> have explained on several occasions the oom reaper is not meant to be a
> performance optimization but rather a forward progress guarantee. I
> would suggest this wording instead.
> 
> "
> The oom reaper is a mechanism to guarantee forward progress during an OOM
> situation when the oom victim cannot terminate on its own (e.g. being
> blocked in uninterruptible state or frozen by cgroup freezer). In order
> to give the victim some time to terminate properly the oom reaper is
> delayed in its invocation. This is particularly beneficial when the oom
> victim is holding robust futex resources as the anonymous memory tear
> down can break those.
> 
> On the other hand, tasks deliberately frozen by the freezer cgroup will
> not wake up until they are thawed from userspace, so the delay is
> effectively pointless. Therefore opt out from the delay for cgroup
> frozen oom victims.
> "

Thank you, I will update it.

> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-27  9:55               ` zhongjinji
@ 2025-08-27 15:57                 ` Suren Baghdasaryan
  2025-08-28  0:38                   ` Liam R. Howlett
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27 15:57 UTC (permalink / raw)
  To: zhongjinji
  Cc: liam.howlett, akpm, feng.han, linux-kernel, linux-mm, liulu.liu,
	lorenzo.stoakes, mhocko, rientjes, shakeel.butt, tglx

On Wed, Aug 27, 2025 at 2:55 AM zhongjinji <zhongjinji@honor.com> wrote:
>
> > + Cc Suren since he has worked on the exit_mmap() path a lot.
>
> Thank you for your assistance. I realize now that I should have
> Cc Suren earlier.

Thanks for adding me!

>
> > * Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> > > On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > > > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > > > I really don't think this is worth doing.  We're avoiding a race between
> > > > > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > > > > or at least mitigate it.
> > > > >
> > > > > Yes for sure, as explored at length in previous discussions this feels like
> > > > > we're papering over cracks here.
> > > > >
> > > > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > > > > issue even if it is that, as long as it doesn't cause issues in doing so.
> > > > >
> > > > > So this is my take on the below and why I'm open to it!
> > > > >
> > > > > >
> > > > > > They are probably both under the read lock, but considering how rare it
> > > > > > would be, would a racy flag check be enough - it is hardly critical to
> > > > > > get right.  Either would reduce the probability.
> > > > >
> > > > > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > > > > seeing such a tight and unusual race. Presumably some truly massive number
> > > > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > > > >
> > > > > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > > > > much I'm ok with it (of course, these are famous last words in the kernel
> > > > > often...!)
> > > > >
> > > > > Liam - are you open to a solution on the basis above, or do you feel we
> > > > > ought simply to fix the underlying issue here?
> > > >
> > > > At least this is a benign race.
> > >
> > > Is this really a race or rather a contention? IIUC exit_mmap and the oom
> > > reaper are trying to unmap the address space of the oom-killed process
> > > and can compete on page table locks. If both are running concurrently on
> > > two cpus then the contention can continue for whole address space and
> > > can slow down the actual memory freeing. Making oom reaper traverse in
> > > opposite direction can drastically reduce the contention and faster
> > > memory freeing.
> >
> > It is two readers of the vma tree racing to lock the page tables for
> > each vma, so I guess you can see it as contention as well.. but since
> > the pte is a split lock, I see it as racing through vmas to see who hits
> > which lock first.  The smart money is on the oom killer as it skips some
> > vmas :)
> >
> > If it were just contention, then the loop direction wouldn't matter..
> > but I do see your point.
> >
> > > > I'd think using MMF_ to reduce the race
> > > > would achieve the same goal with less risk - which is why I bring it up.
> > > >
> > >
> > > With MMF_ flag, are you suggesting oom reaper to skip the unmapping of
> > > the oom-killed process?
> >
> > Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
> > path to reduce the possibility of the race/contention.
> >
> > >
> > > > Really, both methods should be low risk, so I'm fine with either way.
> > > >
> > > > But I am interested in hearing how this race is happening enough to
> > > > necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> > > > happens elsewhere then we're out of options.  Using the MMF_ flags is
> > > > more of a scalable fix, if it achieves the same results.
> > >
> > > On the question of whether this is a rare situation and worth the patch: I
> > > would say this scenario is not that rare particularly on low memory
> > > devices and on highly utilized overcommitted systems. Memory pressure
> > > and oom-kills are norm on such systems. The point of oom reaper is to
> > > bring the system out of the oom situation quickly and having two cpus
> > > unmapping the oom-killed process can potentially bring the system out of
> > > oom situation faster.
> >
> > The exit_mmap() path used to run the oom reaper if it was an oom victim,
> > until recently [1].  The part that makes me nervous is the exit_mmap()
> > call to mmu_notifier_release(mm), while the oom reaper uses an
> > mmu_notifier.  I am not sure if there is an issue in ordering on any of
> > the platforms of such things.  Or the associated cost of the calls.
> >
> > I mean, it's already pretty crazy that we have this in the exit:
> > mmap_read_lock()
> >    tlb_gather_mmu_fullmm()
> >      unmap vmas..
> > mmap_read_unlock()
> > mmap_write_lock()
> >    tlb_finish_mmu()..
> >
> > So not only do we now have two tasks iterating over the vmas, but we
> > also have mmu notifiers and tlb calls happening across the ranges..  At
> > least doing all the work on a single cpu means that the hardware view is
> > consistent.  But I don't see this being worse than a forward race?

This part seems to have changed quite a bit since I last looked into
it closely and it's worth re-checking, however that seems orthogonal
to what this patch is trying to do.

> >
> > As it is written here, we'll have one CPU working in one direction while
> > the other works in the other, until both hit the end of the VMAs.  Only
> > when both tasks stop iterating the vmas can the exit continue since it
> > requires the write lock.
> >
> > So the tlb_finish_mmu() in exit_mmap() will always be called after
> > tlb_finish_mmu() on each individual vma has run in the
> > __oom_reap_task_mm() context (when the race happens).

Correct.

> >
> > There is also a window here, between the exit_mmap() dropping the read
> > lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
> > will iterate through a list of vmas with zero memory to free and delay
> > the task exiting.  That is, wasting cpu and stopping the memory
> > associated with the mm_struct (vmas and such) from being freed.

Might be an opportunity to optimize but again, this is happening with
or without this patch, no?

> >
> > I'm also not sure on the cpu cache effects of what we are doing and how
> > much that would play into the speedup.  My guess is that it's
> > insignificant compared to the time we spend under the pte, but we have
> > no numbers to go on.
> >
> > So I'd like to know how likely the simultaneous runs are and if there is
> > a measurable gain?
>
> Since process killing events are very frequent on Android, the likelihood of
> exit_mmap and reaper work (not only OOM, but also some proactive reaping
> actions such as process_mrelease) occurring at the same time is much higher.
> When lmkd kills a process, it actively reaps the process using
> process_mrelease, similar to the way the OOM reaper works. Surenb may be
> able to clarify this point, as he is the author of lmkd.

Yes, on Android process_mrelease() is used after lmkd kills a process
to expedite memory release in case the victim process is blocked for
some reason. This makes the race between __oom_reap_task_mm() and
exit_mmap() much more frequent. That is probably the disconnect in
thinking that this race is rare. I don't see any harm in
__oom_reap_task_mm() walking the tree backwards to minimize the
contention with exit_mmap(). Liam, is there a performance difference
between mas_for_each_rev() and mas_for_each() ?

>
> I referenced this data in link[1], and I should have included it in the cover
> letter. The attached test data was collected on Android. Before testing, in
> order to simulate the OOM kill process, I intercepted the kill signal and added
> the killed process to the OOM reaper queue.
>
> The reproduction steps are as follows:
> 1. Start a process.
> 2. Kill the process.
> 3. Capture a perfetto trace.
>
> Below are the load benefit data, measured by process running time:
>
> Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the
> same process, and they are the last threads to exit.
>
> Thread             TID         State        Wall duration (ms)          total running
> # with oom reaper but traverse reverse
> #RxComputationT    13708       Running      60.690572
> oom_reaper         81          Running      46.492032                   107.182604
>
> # with oom reaper
> vdp:vidtask:m      14040       Running      81.848297
> oom_reaper         81          Running      69.32                       151.168297
>
> # without oom reaper
> tp-background      12424       Running      106.021874                  106.021874
>
> Compared to reaping when a process is killed, it provides approximately
> 30% load benefit.
> Compared to not performing reap when a process is killed, it can release
> memory earlier, with 40% faster memory release.

That looks like a nice performance improvement for reaping the memory
with low churn and risk.

>
> [1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/
>
> > I agree, that at face value, two cpus should be able to split the work..
> > but I don't know about the notifier or the holding up the mm_struct
> > associated memory.  And it could slow things down by holding up an
> > exiting task.
> >
> > >
> > > I think the patch (with your suggestions) is simple enough and I don't
> > > see any risk in including it.
> > >
> >
> > Actually, the more I look at this, the worse I feel about it..  Am I
> > overreacting?
> >
> > Thanks,
> > Liam
> >
> > [1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-27 15:57                 ` Suren Baghdasaryan
@ 2025-08-28  0:38                   ` Liam R. Howlett
  0 siblings, 0 replies; 20+ messages in thread
From: Liam R. Howlett @ 2025-08-28  0:38 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: zhongjinji, akpm, feng.han, linux-kernel, linux-mm, liulu.liu,
	lorenzo.stoakes, mhocko, rientjes, shakeel.butt, tglx

* Suren Baghdasaryan <surenb@google.com> [250827 11:57]:

...

> > >
> > > The exit_mmap() path used to run the oom reaper if it was an oom victim,
> > > until recently [1].  The part that makes me nervous is the exit_mmap()
> > > call to mmu_notifier_release(mm), while the oom reaper uses an
> > > mmu_notifier.  I am not sure if there is an issue in ordering on any of
> > > the platforms of such things.  Or the associated cost of the calls.
> > >
> > > I mean, it's already pretty crazy that we have this in the exit:
> > > mmap_read_lock()
> > >    tlb_gather_mmu_fullmm()
> > >      unmap vmas..
> > > mmap_read_unlock()
> > > mmap_write_lock()
> > >    tlb_finish_mmu()..
> > >
> > > So not only do we now have two tasks iterating over the vmas, but we
> > > also have mmu notifiers and tlb calls happening across the ranges..  At
> > > least doing all the work on a single cpu means that the hardware view is
> > > consistent.  But I don't see this being worse than a forward race?
> 
> This part seems to have changed quite a bit since I last looked into
> it closely and it's worth re-checking, however that seems orthogonal
> to what this patch is trying to do.

I was concerned about how a reverse iterator may affect what is
considered accurate for the mmu/tlb so I thought it worth pointing out.

...
> > >
> > > There is also a window here, between the exit_mmap() dropping the read
> > > lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer
> > > will iterate through a list of vmas with zero memory to free and delay
> > > the task exiting.  That is, wasting cpu and stopping the memory
> > > associated with the mm_struct (vmas and such) from being freed.
> 
> Might be an opportunity to optimize but again, this is happening with
> or without this patch, no?

Correct, but with the numbers it looks better to go with two loops.

> 
> > >
> > > I'm also not sure on the cpu cache effects of what we are doing and how
> > > much that would play into the speedup.  My guess is that it's
> > > insignificant compared to the time we spend under the pte, but we have
> > > no numbers to go on.
> > >
> > > So I'd like to know how likely the simultaneous runs are and if there is
> > > a measurable gain?
> >
> > Since process killing events are very frequent on Android, the likelihood of
> > exit_mmap and reaper work (not only OOM, but also some proactive reaping
> > actions such as process_mrelease) occurring at the same time is much higher.
> > When lmkd kills a process, it actively reaps the process using
> > process_mrelease, similar to the way the OOM reaper works. Surenb may be
> > able to clarify this point, as he is the author of lmkd.
> 
> Yes, on Android process_mrelease() is used after lmkd kills a process
> to expedite memory release in case the victim process is blocked for
> some reason. This makes the race between __oom_reap_task_mm() and
> exit_mmap() much more frequent. That is probably the disconnect in
> thinking that this race is rare. I don't see any harm in
> __oom_reap_task_mm() walking the tree backwards to minimize the
> contention with exit_mmap(). Liam, is there a performance difference
> between mas_for_each_rev() and mas_for_each() ?

There should be no performance difference.

> >
> > I referenced this data in link[1], and I should have included it in the cover
> > letter. The attached test data was collected on Android. Before testing, in
> > order to simulate the OOM kill process, I intercepted the kill signal and added
> > the killed process to the OOM reaper queue.

Sorry I missed your response in v4 on this.

> >
> > The reproduction steps are as follows:
> > 1. Start a process.
> > 2. Kill the process.
> > 3. Capture a perfetto trace.
> >
> > Below are the load benefit data, measured by process running time:
> >
> > Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the
> > same process, and they are the last threads to exit.
> >
> > Thread             TID         State        Wall duration (ms)          total running
> > # with oom reaper but traverse reverse
> > #RxComputationT    13708       Running      60.690572
> > oom_reaper         81          Running      46.492032                   107.182604
> >
> > # with oom reaper
> > vdp:vidtask:m      14040       Running      81.848297
> > oom_reaper         81          Running      69.32                       151.168297
> >
> > # without oom reaper
> > tp-background      12424       Running      106.021874                  106.021874
> >
> > Compared to reaping when a process is killed, it provides approximately
> > 30% load benefit.
> > Compared to not performing reap when a process is killed, it can release
> > memory earlier, with 40% faster memory release.
> 
> That looks like a nice performance improvement for reaping the memory
> with low churn and risk.

Agreed.


Please include the numbers in the change log in the next revision so it
is recorded in git.

I think all my questions are resolved, thanks.

I look forward to v6.

Regards,
Liam

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-27  4:12             ` Liam R. Howlett
  2025-08-27  4:25               ` Liam R. Howlett
  2025-08-27  9:55               ` zhongjinji
@ 2025-08-29  7:11               ` Michal Hocko
  2 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2025-08-29  7:11 UTC (permalink / raw)
  To: Liam R. Howlett
  Cc: Shakeel Butt, Lorenzo Stoakes, zhongjinji, rientjes, akpm,
	linux-mm, linux-kernel, tglx, liulu.liu, feng.han

On Wed 27-08-25 00:12:34, Liam R. Howlett wrote:
[...]
> > > I'd think using MMF_ to reduce the race
> > > would achieve the same goal with less risk - which is why I bring it up.
> > > 
> > 
> > With MMF_ flag, are you suggesting oom reaper to skip the unmapping of
> > the oom-killed process?
> 
> Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
> path to reduce the possibility of the race/contention.

MMF_OOM_SKIP is a "hide from the oom killer" flag, which means that as
soon as you set it, the next oom killer invocation could kill another
task while the current one hasn't freed up its memory yet, and therefore
you have a potentially pointless kill.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-26 12:53   ` Lorenzo Stoakes
  2025-08-26 13:37     ` Liam R. Howlett
@ 2025-08-29  7:14     ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2025-08-29  7:14 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: zhongjinji, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel,
	tglx, liam.howlett, liulu.liu, feng.han

On Tue 26-08-25 13:53:43, Lorenzo Stoakes wrote:
> On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> > When a process is OOM killed without reaper delay, the oom reaper and the
> > exit_mmap() thread likely run simultaneously. They traverse the vma's maple
> > tree along the same path and may easily unmap the same vma, causing them to
> > compete for the pte spinlock.
> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from low
> > to high addresses. To reduce the chance of unmapping the same vma
> > simultaneously, the OOM reaper should traverse the vma's tree from high to
> > low address.
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> 
> I will leave it to Liam to confirm the maple tree bit is ok, but I guess
> I'm softening to the idea of doing this - because it should have no impact
> on most users, so even if it's some rare edge case that triggers the
> situation, then it's worth doing it in reverse just to help you guys out :)

I tend to agree on this. I would expect that oom_reaper would race with
exit_mmap only for seriously stalled processes or huge memory consumers
where exit_mmap takes a while. oom_reaper would be quick to catch up with
exit_mmap as it would have less work to do on already released memory.
Going from the other end of the address space might reduce this chance
while not really introducing any actual issues.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2025-08-29  7:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-25 13:38 [PATCH v5 0/2] Do not delay oom reaper when the victim is frozen zhongjinji
2025-08-25 13:38 ` [PATCH v5 1/2] mm/oom_kill: " zhongjinji
2025-08-25 19:41   ` Shakeel Butt
2025-08-26 13:01     ` zhongjinji
2025-08-26 12:43   ` Lorenzo Stoakes
2025-08-27 12:08   ` Michal Hocko
2025-08-27 12:14     ` zhongjinji
2025-08-25 13:38 ` [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order zhongjinji
2025-08-26 12:53   ` Lorenzo Stoakes
2025-08-26 13:37     ` Liam R. Howlett
2025-08-26 13:50       ` Lorenzo Stoakes
2025-08-26 15:21         ` Liam R. Howlett
2025-08-26 22:26           ` Shakeel Butt
2025-08-27  4:12             ` Liam R. Howlett
2025-08-27  4:25               ` Liam R. Howlett
2025-08-27  9:55               ` zhongjinji
2025-08-27 15:57                 ` Suren Baghdasaryan
2025-08-28  0:38                   ` Liam R. Howlett
2025-08-29  7:11               ` Michal Hocko
2025-08-29  7:14     ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).