[PATCH v6 0/2] Do not delay OOM reaper when the victim is frozen

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v6 0/2] Do not delay OOM reaper when the victim is frozen
@ 2025-08-29  6:55 zhongjinji
  2025-08-29  6:55 ` [PATCH v6 1/2] mm/oom_kill: Do not delay oom " zhongjinji
  2025-08-29  6:55 ` [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
  0 siblings, 2 replies; 16+ messages in thread
From: zhongjinji @ 2025-08-29  6:55 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
	tianxiaobin, fengbaopeng, zhongjinji

An overview of the relationship between patch 1 and patch 2:
With patch 1 applied, the OOM reaper is no longer delayed when the victim
process is frozen. If the victim process is thawed in time, the OOM reaper
and the exit_mmap() thread may run concurrently, which can lead to
significant spinlock contention. Patch 2 mitigates this issue by traversing
the maple tree in reverse order, reducing the likelihood of such lock
contention.

The attached test data was collected on Android. It shows that when the OOM
reaper and exit_mmap are executed at the same time, pte spinlock contention
becomes more intense. This results in increased running time for both
processes, which in turn means higher system load. It also shows that
reverse-order traversal of the VMA maple tree by the OOM reaper can
significantly reduce pte spinlock contention.

The test data indicate that it can significantly reduce spinlock contention
and decrease the load (measured by process running time) of both oom_reaper
and exit_mmap by 30%.

The perf data applying patch 1 but not patch 2:
|--99.74%-- oom_reaper
|  |--76.67%-- unmap_page_range
|  |  |--33.70%-- __pte_offset_map_lock
|  |  |  |--98.46%-- _raw_spin_lock
|  |  |--27.61%-- free_swap_and_cache_nr
|  |  |--16.40%-- folio_remove_rmap_ptes
|  |  |--12.25%-- tlb_flush_mmu
|  |--12.61%-- tlb_finish_mmu

The perf data applying patch 1 and patch 2:
|--98.84%-- oom_reaper
|  |--53.45%-- unmap_page_range
|  |  |--24.29%-- [hit in function]
|  |  |--48.06%-- folio_remove_rmap_ptes
|  |  |--17.99%-- tlb_flush_mmu
|  |  |--1.72%-- __pte_offset_map_lock
|  |  
|  |--30.43%-- tlb_finish_mmu

This is test data regarding the process running time.

With oom reaper (reverse traverse):
  Thread            TID     State     Wall duration (ms)
  RxComputationT   13708    Running   60.69
  oom_reaper        81      Running   46.49
  Total (ms): 107.18

With oom reaper:
  Thread            TID     State     Wall duration (ms)
  vdp:vidtask:m    14040    Running   81.85
  oom_reaper        81      Running   69.32
  Total (ms): 151.17

Without oom reaper:
  Thread            TID     State     Wall duration (ms)
  tp-background     12424   Running   106.02
  Total (ms): 106.02

Note: RxComputationT, vdp:vidtask:m, and tp-background are threads of the
same process, and they are the last threads to exit.

---
v5 -> v6:
- Use mas_for_each_rev() for VMA traversal [6]
- Simplify the judgment of whether to delay in queue_oom_reaper() [7]
- Refine changelog to better capture the essence of the changes [8]
- Use READ_ONCE(tsk->frozen) instead of checking mm and additional
  checks inside for_each_process(), as it is sufficient [9]
- Add report tags because fengbaopeng and tianxiaobin reported the
  high load issue of the reaper

v4 -> v5:
- Detect frozen state directly, avoid special futex handling. [3]
- Use mas_find_rev() for VMA traversal to avoid skipping entries. [4]
- Only check should_delay_oom_reap() in queue_oom_reaper(). [5]

v3 -> v4:
- Renamed functions and parameters for clarity. [2]
- Added should_delay_oom_reap() for OOM reap decisions.
- Traverse maple tree in reverse for improved behavior.

v2 -> v3:
- Fixed Subject prefix error.

v1 -> v2:
- Check robust_list for all threads, not just one. [1]

Reference:
[1] https://lore.kernel.org/linux-mm/u3mepw3oxj7cywezna4v72y2hvyc7bafkmsbirsbfuf34zpa7c@b23sc3rvp2gp/
[2] https://lore.kernel.org/linux-mm/87cy99g3k6.ffs@tglx/
[3] https://lore.kernel.org/linux-mm/aKRWtjRhE_HgFlp2@tiehlicka/
[4] https://lore.kernel.org/linux-mm/26larxehoe3a627s4fxsqghriwctays4opm4hhme3uk7ybjc5r@pmwh4s4yv7lm/
[5] https://lore.kernel.org/linux-mm/d5013a33-c08a-44c5-a67f-9dc8fd73c969@lucifer.local/
[6] https://lore.kernel.org/linux-mm/nwh7gegmvoisbxlsfwslobpbqku376uxdj2z32owkbftvozt3x@4dfet73fh2yy/
[7] https://lore.kernel.org/linux-mm/af4edeaf-d3c9-46a9-a300-dbaf5936e7d6@lucifer.local/
[8] https://lore.kernel.org/linux-mm/aK71W1ITmC_4I_RY@tiehlicka/
[9] https://lore.kernel.org/linux-mm/jzzdeczuyraup2zrspl6b74muf3bly2a3acejfftcldfmz4ekk@s5mcbeim34my/

The earlier post:
v5: https://lore.kernel.org/linux-mm/20250825133855.30229-1-zhongjinji@honor.com/
v4: https://lore.kernel.org/linux-mm/20250814135555.17493-1-zhongjinji@honor.com/
v3: https://lore.kernel.org/linux-mm/20250804030341.18619-1-zhongjinji@honor.com/
v2: https://lore.kernel.org/linux-mm/20250801153649.23244-1-zhongjinji@honor.com/
v1: https://lore.kernel.org/linux-mm/20250731102904.8615-1-zhongjinji@honor.com/

zhongjinji (2):
  mm/oom_kill: Do not delay oom reaper when the victim is frozen
  mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse
    order

 mm/oom_kill.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

-- 
2.17.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-29  6:55 [PATCH v6 0/2] Do not delay OOM reaper when the victim is frozen zhongjinji
@ 2025-08-29  6:55 ` zhongjinji
  2025-08-29  9:57   ` Lorenzo Stoakes
                     ` (3 more replies)
  2025-08-29  6:55 ` [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
  1 sibling, 4 replies; 16+ messages in thread
From: zhongjinji @ 2025-08-29  6:55 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
	tianxiaobin, fengbaopeng, zhongjinji

The oom reaper is a mechanism to guarantee a forward process during OOM
situation when the oom victim cannot terminate on its own (e.g. being
blocked in uninterruptible state or frozen by cgroup freezer). In order
to give the victim some time to terminate properly the oom reaper is
delayed in its invocation. This is particularly beneficial when the oom
victim is holding robust futex resources as the anonymous memory tear
down can break those. [1]

On the other hand deliberately frozen tasks by the freezer cgroup will
not wake up until they are thawed in the userspace and delay is
effectively pointless. Therefore opt out from the delay for cgroup
frozen oom victims.

Reference:
[1] https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 mm/oom_kill.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..a5e9074896a1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -700,7 +700,14 @@ static void queue_oom_reaper(struct task_struct *tsk)
 
 	get_task_struct(tsk);
 	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
-	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	tsk->oom_reaper_timer.expires = jiffies;
+
+	/*
+	 * If the task is frozen by the cgroup freezer, the delay is unnecessary
+	 * because it cannot exit until thawed. Skip the delay for frozen victims.
+	 */
+	if (!frozen(tsk))
+		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;
 	add_timer(&tsk->oom_reaper_timer);
 }
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-29  6:55 ` [PATCH v6 1/2] mm/oom_kill: Do not delay oom " zhongjinji
@ 2025-08-29  9:57   ` Lorenzo Stoakes
  2025-08-29 17:30   ` Liam R. Howlett
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-08-29  9:57 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel,
	tglx, liam.howlett, surenb, liulu.liu, feng.han, tianxiaobin,
	fengbaopeng

On Fri, Aug 29, 2025 at 02:55:49PM +0800, zhongjinji wrote:
> The oom reaper is a mechanism to guarantee a forward process during OOM
> situation when the oom victim cannot terminate on its own (e.g. being
> blocked in uninterruptible state or frozen by cgroup freezer). In order
> to give the victim some time to terminate properly the oom reaper is
> delayed in its invocation. This is particularly beneficial when the oom
> victim is holding robust futex resources as the anonymous memory tear
> down can break those. [1]
>
> On the other hand deliberately frozen tasks by the freezer cgroup will
> not wake up until they are thawed in the userspace and delay is
> effectively pointless. Therefore opt out from the delay for cgroup
> frozen oom victims.
>
> Reference:
> [1] https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

Nice :) now this is very simple.

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/oom_kill.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..a5e9074896a1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -700,7 +700,14 @@ static void queue_oom_reaper(struct task_struct *tsk)
>
>  	get_task_struct(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies;
> +
> +	/*
> +	 * If the task is frozen by the cgroup freezer, the delay is unnecessary
> +	 * because it cannot exit until thawed. Skip the delay for frozen victims.
> +	 */
> +	if (!frozen(tsk))
> +		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;
>  	add_timer(&tsk->oom_reaper_timer);
>  }
>
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-29  6:55 ` [PATCH v6 1/2] mm/oom_kill: Do not delay oom " zhongjinji
  2025-08-29  9:57   ` Lorenzo Stoakes
@ 2025-08-29 17:30   ` Liam R. Howlett
  2025-08-29 23:20   ` Shakeel Butt
  2025-09-01  7:25   ` Michal Hocko
  3 siblings, 0 replies; 16+ messages in thread
From: Liam R. Howlett @ 2025-08-29 17:30 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel,
	tglx, lorenzo.stoakes, surenb, liulu.liu, feng.han, tianxiaobin,
	fengbaopeng

* zhongjinji <zhongjinji@honor.com> [250829 02:56]:
> The oom reaper is a mechanism to guarantee a forward process during OOM
> situation when the oom victim cannot terminate on its own (e.g. being
> blocked in uninterruptible state or frozen by cgroup freezer). In order
> to give the victim some time to terminate properly the oom reaper is
> delayed in its invocation. This is particularly beneficial when the oom
> victim is holding robust futex resources as the anonymous memory tear
> down can break those. [1]
> 
> On the other hand deliberately frozen tasks by the freezer cgroup will
> not wake up until they are thawed in the userspace and delay is
> effectively pointless. Therefore opt out from the delay for cgroup
> frozen oom victims.
> 
> Reference:
> [1] https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> ---
>  mm/oom_kill.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..a5e9074896a1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -700,7 +700,14 @@ static void queue_oom_reaper(struct task_struct *tsk)
>  
>  	get_task_struct(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies;
> +
> +	/*
> +	 * If the task is frozen by the cgroup freezer, the delay is unnecessary
> +	 * because it cannot exit until thawed. Skip the delay for frozen victims.
> +	 */
> +	if (!frozen(tsk))
> +		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;
>  	add_timer(&tsk->oom_reaper_timer);
>  }
>  
> -- 
> 2.17.1
> 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-29  6:55 ` [PATCH v6 1/2] mm/oom_kill: Do not delay oom " zhongjinji
  2025-08-29  9:57   ` Lorenzo Stoakes
  2025-08-29 17:30   ` Liam R. Howlett
@ 2025-08-29 23:20   ` Shakeel Butt
  2025-09-01 13:17     ` zhongjinji
  2025-09-01  7:25   ` Michal Hocko
  3 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-08-29 23:20 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
	tianxiaobin, fengbaopeng

On Fri, Aug 29, 2025 at 02:55:49PM +0800, zhongjinji wrote:
> The oom reaper is a mechanism to guarantee a forward process during OOM
> situation when the oom victim cannot terminate on its own (e.g. being
> blocked in uninterruptible state or frozen by cgroup freezer). In order
> to give the victim some time to terminate properly the oom reaper is
> delayed in its invocation. This is particularly beneficial when the oom
> victim is holding robust futex resources as the anonymous memory tear
> down can break those. [1]
> 
> On the other hand deliberately frozen tasks by the freezer cgroup will
> not wake up until they are thawed in the userspace and delay is
> effectively pointless. Therefore opt out from the delay for cgroup
> frozen oom victims.
> 
> Reference:
> [1] https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  mm/oom_kill.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..a5e9074896a1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -700,7 +700,14 @@ static void queue_oom_reaper(struct task_struct *tsk)
>  
>  	get_task_struct(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies;
> +
> +	/*
> +	 * If the task is frozen by the cgroup freezer, the delay is unnecessary
> +	 * because it cannot exit until thawed. Skip the delay for frozen victims.
> +	 */
> +	if (!frozen(tsk))

Can you please change the above condition with the following to handle
v2 as well?

	if (!frozen(tsk) && !(READ_ONCE(tsk->frozen)))

> +		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;
>  	add_timer(&tsk->oom_reaper_timer);
>  }
>  
> -- 
> 2.17.1
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-29 23:20   ` Shakeel Butt
@ 2025-09-01 13:17     ` zhongjinji
  0 siblings, 0 replies; 16+ messages in thread
From: zhongjinji @ 2025-09-01 13:17 UTC (permalink / raw)
  To: shakeel.butt
  Cc: akpm, feng.han, fengbaopeng, liam.howlett, linux-kernel, linux-mm,
	liulu.liu, lorenzo.stoakes, mhocko, rientjes, surenb, tglx,
	tianxiaobin, zhongjinji

> >  	get_task_struct(tsk);
> >  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> > -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> > +	tsk->oom_reaper_timer.expires = jiffies;
> > +
> > +	/*
> > +	 * If the task is frozen by the cgroup freezer, the delay is unnecessary
> > +	 * because it cannot exit until thawed. Skip the delay for frozen victims.
> > +	 */
> > +	if (!frozen(tsk))
> 
> Can you please change the above condition with the following to handle
> v2 as well?

Thank you, but I think the cgroupv2 check isn't needed, since a process frozen
by the cgroup v2 freezer won't block exit after being killed. Would it be
better to note in the comment or changelog that this change is for cgroup v1?

> 	if (!frozen(tsk) && !(READ_ONCE(tsk->frozen)))
> 
> > +		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;
> >  	add_timer(&tsk->oom_reaper_timer);
> >  }
> >  
> > -- 
> > 2.17.1
> >



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-08-29  6:55 ` [PATCH v6 1/2] mm/oom_kill: Do not delay oom " zhongjinji
                     ` (2 preceding siblings ...)
  2025-08-29 23:20   ` Shakeel Butt
@ 2025-09-01  7:25   ` Michal Hocko
  2025-09-01  9:30     ` zhongjinji
  3 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2025-09-01  7:25 UTC (permalink / raw)
  To: zhongjinji
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
	tianxiaobin, fengbaopeng

On Fri 29-08-25 14:55:49, zhongjinji wrote:
> The oom reaper is a mechanism to guarantee a forward process during OOM
> situation when the oom victim cannot terminate on its own (e.g. being
> blocked in uninterruptible state or frozen by cgroup freezer). In order
> to give the victim some time to terminate properly the oom reaper is
> delayed in its invocation. This is particularly beneficial when the oom
> victim is holding robust futex resources as the anonymous memory tear
> down can break those. [1]
> 
> On the other hand deliberately frozen tasks by the freezer cgroup will
> not wake up until they are thawed in the userspace and delay is
> effectively pointless. Therefore opt out from the delay for cgroup
> frozen oom victims.
> 
> Reference:
> [1] https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks

> ---
>  mm/oom_kill.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..a5e9074896a1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -700,7 +700,14 @@ static void queue_oom_reaper(struct task_struct *tsk)
>  
>  	get_task_struct(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies;
> +
> +	/*
> +	 * If the task is frozen by the cgroup freezer, the delay is unnecessary
> +	 * because it cannot exit until thawed. Skip the delay for frozen victims.
> +	 */
> +	if (!frozen(tsk))
> +		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;
>  	add_timer(&tsk->oom_reaper_timer);
>  }
>  
> -- 
> 2.17.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-09-01  7:25   ` Michal Hocko
@ 2025-09-01  9:30     ` zhongjinji
  2025-09-01 13:58       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: zhongjinji @ 2025-09-01  9:30 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, feng.han, fengbaopeng, liam.howlett, linux-kernel, linux-mm,
	liulu.liu, lorenzo.stoakes, rientjes, shakeel.butt, surenb, tglx,
	tianxiaobin, zhongjinji

> On Fri 29-08-25 14:55:49, zhongjinji wrote:
> > The oom reaper is a mechanism to guarantee a forward process during OOM
> > situation when the oom victim cannot terminate on its own (e.g. being
> > blocked in uninterruptible state or frozen by cgroup freezer). In order
> > to give the victim some time to terminate properly the oom reaper is
> > delayed in its invocation. This is particularly beneficial when the oom
> > victim is holding robust futex resources as the anonymous memory tear
> > down can break those. [1]
> > 
> > On the other hand deliberately frozen tasks by the freezer cgroup will
> > not wake up until they are thawed in the userspace and delay is
> > effectively pointless. Therefore opt out from the delay for cgroup
> > frozen oom victims.
> > 
> > Reference:
> > [1] https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
> > 
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Thanks

Sorry, I found that it doesn't work now (because I previously tested it by
simulating OOM, which made testing easier but also caused the mistake. I will
re-run the new test). Calling __thaw_task in mark_oom_victim will change the
victim's state to running. However, other threads are still in the frozen state,
so the process still can't exit. We should update it again by moving __thaw_task
to after frozen (this way, executing __thaw_task and frozen in the same function
looks more reasonable). Since mark_oom_victim and queue_oom_reaper always appear
in pairs, this won't introduce any risky changes.

 static void queue_oom_reaper(struct task_struct *tsk)
 {
+       bool delay = !frozen(tsk);
+
+       /*
+        * Make sure that the task is woken up from uninterruptible sleep
+        * if it is frozen because OOM killer wouldn't be able to free
+        * any memory and livelock. freezing_slow_path will tell the freezer
+        * that TIF_MEMDIE tasks should be ignored.
+        */
+       __thaw_task(tsk);
+
        /* mm is already queued? */
        if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
                return;
@@ -711,7 +721,7 @@ static void queue_oom_reaper(struct task_struct *tsk)
         * If the task is frozen by the cgroup freezer, the delay is unnecessary
         * because it cannot exit until thawed. Skip the delay for frozen victims.
         */
-       if (!frozen(tsk))
+       if (delay)
                tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;
        add_timer(&tsk->oom_reaper_timer);
 }
@@ -783,13 +793,6 @@ static void mark_oom_victim(struct task_struct *tsk)
        if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
                mmgrab(tsk->signal->oom_mm);

-       /*
-        * Make sure that the task is woken up from uninterruptible sleep
-        * if it is frozen because OOM killer wouldn't be able to free
-        * any memory and livelock. freezing_slow_path will tell the freezer
-        * that TIF_MEMDIE tasks should be ignored.
-        */
-       __thaw_task(tsk);
        atomic_inc(&oom_victims);
        cred = get_task_cred(tsk);
        trace_mark_victim(tsk, cred->uid.val);

> 
> > ---
> >  mm/oom_kill.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 25923cfec9c6..a5e9074896a1 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -700,7 +700,14 @@ static void queue_oom_reaper(struct task_struct *tsk)
> >  
> >  	get_task_struct(tsk);
> >  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> > -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> > +	tsk->oom_reaper_timer.expires = jiffies;
> > +
> > +	/*
> > +	 * If the task is frozen by the cgroup freezer, the delay is unnecessary
> > +	 * because it cannot exit until thawed. Skip the delay for frozen victims.
> > +	 */
> > +	if (!frozen(tsk))
> > +		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;
> >  	add_timer(&tsk->oom_reaper_timer);
> >  }
> >  
> > -- 
> > 2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-09-01  9:30     ` zhongjinji
@ 2025-09-01 13:58       ` Michal Hocko
  2025-09-02 16:01         ` zhongjinji
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2025-09-01 13:58 UTC (permalink / raw)
  To: zhongjinji
  Cc: akpm, feng.han, fengbaopeng, liam.howlett, linux-kernel, linux-mm,
	liulu.liu, lorenzo.stoakes, rientjes, shakeel.butt, surenb, tglx,
	tianxiaobin

On Mon 01-09-25 17:30:57, zhongjinji wrote:
> > On Fri 29-08-25 14:55:49, zhongjinji wrote:
> > > The oom reaper is a mechanism to guarantee a forward process during OOM
> > > situation when the oom victim cannot terminate on its own (e.g. being
> > > blocked in uninterruptible state or frozen by cgroup freezer). In order
> > > to give the victim some time to terminate properly the oom reaper is
> > > delayed in its invocation. This is particularly beneficial when the oom
> > > victim is holding robust futex resources as the anonymous memory tear
> > > down can break those. [1]
> > > 
> > > On the other hand deliberately frozen tasks by the freezer cgroup will
> > > not wake up until they are thawed in the userspace and delay is
> > > effectively pointless. Therefore opt out from the delay for cgroup
> > > frozen oom victims.
> > > 
> > > Reference:
> > > [1] https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u
> > > 
> > > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Thanks
> 
> Sorry, I found that it doesn't work now (because I previously tested it by
> simulating OOM, which made testing easier but also caused the mistake. I will
> re-run the new test). Calling __thaw_task in mark_oom_victim will change the
> victim's state to running. However, other threads are still in the frozen state,
> so the process still can't exit. We should update it again by moving __thaw_task
> to after frozen (this way, executing __thaw_task and frozen in the same function
> looks more reasonable). Since mark_oom_victim and queue_oom_reaper always appear
> in pairs, this won't introduce any risky changes.

Hmm, I must have completely forgot that we are actually thawing the
frozen task! That means that the actual argument for not delaying the
oom reaper doesn't hold.
Now I do see why the existing implementation doesn't really work as you
would expect though. Is there any reason why we are not thawing the
whole process group? I guess I just didn't realize that __thaw_task is
per thread rather than per process back then when I have introduced it.
Because thread specific behavior makes very little sense to me TBH.
So rather than plaing with __thaw_task placement which doesn't really
make much sense wrt to delaying the reaper we should look into that
part.

Sorry, I should have realized earlier when proposing that.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-09-01 13:58       ` Michal Hocko
@ 2025-09-02 16:01         ` zhongjinji
  2025-09-03  7:00           ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: zhongjinji @ 2025-09-02 16:01 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, feng.han, fengbaopeng, liam.howlett, linux-kernel, linux-mm,
	liulu.liu, lorenzo.stoakes, rientjes, shakeel.butt, surenb, tglx,
	tianxiaobin, zhongjinji

> > Sorry, I found that it doesn't work now (because I previously tested it by
> > simulating OOM, which made testing easier but also caused the mistake. I will
> > re-run the new test). Calling __thaw_task in mark_oom_victim will change the
> > victim's state to running. However, other threads are still in the frozen state,
> > so the process still can't exit. We should update it again by moving __thaw_task
> > to after frozen (this way, executing __thaw_task and frozen in the same function
> > looks more reasonable). Since mark_oom_victim and queue_oom_reaper always appear
> > in pairs, this won't introduce any risky changes.
> 
> Hmm, I must have completely forgot that we are actually thawing the
> frozen task! That means that the actual argument for not delaying the
> oom reaper doesn't hold.
> Now I do see why the existing implementation doesn't really work as you
> would expect though. Is there any reason why we are not thawing the
> whole process group? I guess I just didn't realize that __thaw_task is
> per thread rather than per process back then when I have introduced it.

Previously, I didn't know why we needed to call __thaw_task() in 
mark_oom_victim(). Now I understand.

> Because thread specific behavior makes very little sense to me TBH.
> So rather than plaing with __thaw_task placement which doesn't really
> make much sense wrt to delaying the reaper we should look into that
> part.
> 
> Sorry, I should have realized earlier when proposing that.

Is this modification acceptable?

This change only thaws the process previously identified as the victim,
and does not thaw the process being killed in for_each_process.

The reason is that the process being killed in for_each_process is usually
a vfork process, which is only temporary and rarely encountered.

@@ -772,12 +773,18 @@ static void mark_oom_victim(struct task_struct *tsk)
                mmgrab(tsk->signal->oom_mm);

        /*
-        * Make sure that the task is woken up from uninterruptible sleep
+        * Make sure that the process is woken up from uninterruptible sleep
         * if it is frozen because OOM killer wouldn't be able to free
         * any memory and livelock. freezing_slow_path will tell the freezer
-        * that TIF_MEMDIE tasks should be ignored.
+        * that TIF_MEMDIE thread should be ignored.
         */
-       __thaw_task(tsk);
+       rcu_read_lock();
+       for_each_thread(tsk, t) {
+               set_tsk_thread_flag(t, TIF_MEMDIE);
+               __thaw_task(t);
+       }
+       rcu_read_unlock();
+
        atomic_inc(&oom_victims);
        cred = get_task_cred(tsk);
        trace_mark_victim(tsk, cred->uid.val);


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  2025-09-02 16:01         ` zhongjinji
@ 2025-09-03  7:00           ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2025-09-03  7:00 UTC (permalink / raw)
  To: zhongjinji
  Cc: akpm, feng.han, fengbaopeng, liam.howlett, linux-kernel, linux-mm,
	liulu.liu, lorenzo.stoakes, rientjes, shakeel.butt, surenb, tglx,
	tianxiaobin

On Wed 03-09-25 00:01:29, zhongjinji wrote:
[...]
> @@ -772,12 +773,18 @@ static void mark_oom_victim(struct task_struct *tsk)
>                 mmgrab(tsk->signal->oom_mm);
> 
>         /*
> -        * Make sure that the task is woken up from uninterruptible sleep
> +        * Make sure that the process is woken up from uninterruptible sleep
>          * if it is frozen because OOM killer wouldn't be able to free
>          * any memory and livelock. freezing_slow_path will tell the freezer
> -        * that TIF_MEMDIE tasks should be ignored.
> +        * that TIF_MEMDIE thread should be ignored.
>          */
> -       __thaw_task(tsk);
> +       rcu_read_lock();
> +       for_each_thread(tsk, t) {
> +               set_tsk_thread_flag(t, TIF_MEMDIE);
> +               __thaw_task(t);
> +       }
> +       rcu_read_unlock();
> +

I would prefer if we had thaw_process() rather than open code it here.
But the implementation matches what I would expect it to do.

>         atomic_inc(&oom_victims);
>         cred = get_task_cred(tsk);
>         trace_mark_victim(tsk, cred->uid.val);

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order
  2025-08-29  6:55 [PATCH v6 0/2] Do not delay OOM reaper when the victim is frozen zhongjinji
  2025-08-29  6:55 ` [PATCH v6 1/2] mm/oom_kill: Do not delay oom " zhongjinji
@ 2025-08-29  6:55 ` zhongjinji
  2025-08-29 10:00   ` Lorenzo Stoakes
                     ` (3 more replies)
  1 sibling, 4 replies; 16+ messages in thread
From: zhongjinji @ 2025-08-29  6:55 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
	tianxiaobin, fengbaopeng, zhongjinji

When a process is OOM killed without reaper delay, the oom reaper and the
exit_mmap() thread likely run simultaneously. They traverse the vma's maple
tree along the same path and may easily unmap the same vma, causing them to
compete for the pte spinlock.

When a process exits, exit_mmap() traverses the vma's maple tree from low
to high addresses. To reduce the chance of unmapping the same vma
simultaneously, the OOM reaper should traverse the vma's tree from high to
low address.

Reported-by: tianxiaobin <tianxiaobin@honor.com>
Reported-by: fengbaopeng <fengbaopeng@honor.com>

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 mm/oom_kill.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a5e9074896a1..01665a666bf1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend
+	 * for the pte spinlock. To reduce the probability of unmapping the same vma
+	 * as exit_mmap, the OOM reaper traverses the vma maple tree in reverse order.
+	 */
+	mas_for_each_rev(&mas, vma, 0) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order
  2025-08-29  6:55 ` [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
@ 2025-08-29 10:00   ` Lorenzo Stoakes
  2025-08-29 17:31   ` Liam R. Howlett
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2025-08-29 10:00 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel,
	tglx, liam.howlett, surenb, liulu.liu, feng.han, tianxiaobin,
	fengbaopeng

On Fri, Aug 29, 2025 at 02:55:50PM +0800, zhongjinji wrote:
> When a process is OOM killed without reaper delay, the oom reaper and the
> exit_mmap() thread likely run simultaneously. They traverse the vma's maple
> tree along the same path and may easily unmap the same vma, causing them to
> compete for the pte spinlock.
>
> When a process exits, exit_mmap() traverses the vma's maple tree from low
> to high addresses. To reduce the chance of unmapping the same vma
> simultaneously, the OOM reaper should traverse the vma's tree from high to
> low address.
>
> Reported-by: tianxiaobin <tianxiaobin@honor.com>
> Reported-by: fengbaopeng <fengbaopeng@honor.com>
>

Very nitty but weird gap here haha, not really worth any kind of fixup though
Andrew will take care.

> Signed-off-by: zhongjinji <zhongjinji@honor.com>

As far as I'm concerned this isn't too bad, doesn't really impact anything
negatively and fixes an issue, so while it's gross to have to do this, it LGTM
now you've addressed Liam's feedback.

So:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/oom_kill.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index a5e9074896a1..01665a666bf1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
>
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To reduce the probability of unmapping the same vma
> +	 * as exit_mmap, the OOM reaper traverses the vma maple tree in reverse order.
> +	 */
> +	mas_for_each_rev(&mas, vma, 0) {
>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order
  2025-08-29  6:55 ` [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
  2025-08-29 10:00   ` Lorenzo Stoakes
@ 2025-08-29 17:31   ` Liam R. Howlett
  2025-08-29 23:21   ` Shakeel Butt
  2025-09-01  7:41   ` Michal Hocko
  3 siblings, 0 replies; 16+ messages in thread
From: Liam R. Howlett @ 2025-08-29 17:31 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel,
	tglx, lorenzo.stoakes, surenb, liulu.liu, feng.han, tianxiaobin,
	fengbaopeng

* zhongjinji <zhongjinji@honor.com> [250829 02:56]:
> When a process is OOM killed without reaper delay, the oom reaper and the
> exit_mmap() thread likely run simultaneously. They traverse the vma's maple
> tree along the same path and may easily unmap the same vma, causing them to
> compete for the pte spinlock.
> 
> When a process exits, exit_mmap() traverses the vma's maple tree from low
> to high addresses. To reduce the chance of unmapping the same vma
> simultaneously, the OOM reaper should traverse the vma's tree from high to
> low address.
> 
> Reported-by: tianxiaobin <tianxiaobin@honor.com>
> Reported-by: fengbaopeng <fengbaopeng@honor.com>
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>


> ---
>  mm/oom_kill.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index a5e9074896a1..01665a666bf1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
>  
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To reduce the probability of unmapping the same vma
> +	 * as exit_mmap, the OOM reaper traverses the vma maple tree in reverse order.
> +	 */
> +	mas_for_each_rev(&mas, vma, 0) {
>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>  
> -- 
> 2.17.1
> 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order
  2025-08-29  6:55 ` [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
  2025-08-29 10:00   ` Lorenzo Stoakes
  2025-08-29 17:31   ` Liam R. Howlett
@ 2025-08-29 23:21   ` Shakeel Butt
  2025-09-01  7:41   ` Michal Hocko
  3 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2025-08-29 23:21 UTC (permalink / raw)
  To: zhongjinji
  Cc: mhocko, rientjes, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
	tianxiaobin, fengbaopeng

On Fri, Aug 29, 2025 at 02:55:50PM +0800, zhongjinji wrote:
> When a process is OOM killed without reaper delay, the oom reaper and the
> exit_mmap() thread likely run simultaneously. They traverse the vma's maple
> tree along the same path and may easily unmap the same vma, causing them to
> compete for the pte spinlock.
> 
> When a process exits, exit_mmap() traverses the vma's maple tree from low
> to high addresses. To reduce the chance of unmapping the same vma
> simultaneously, the OOM reaper should traverse the vma's tree from high to
> low address.
> 
> Reported-by: tianxiaobin <tianxiaobin@honor.com>
> Reported-by: fengbaopeng <fengbaopeng@honor.com>
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order
  2025-08-29  6:55 ` [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
                     ` (2 preceding siblings ...)
  2025-08-29 23:21   ` Shakeel Butt
@ 2025-09-01  7:41   ` Michal Hocko
  3 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2025-09-01  7:41 UTC (permalink / raw)
  To: zhongjinji
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
	liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
	tianxiaobin, fengbaopeng

On Fri 29-08-25 14:55:50, zhongjinji wrote:
> When a process is OOM killed without reaper delay, the oom reaper and the
> exit_mmap() thread likely run simultaneously. They traverse the vma's maple
> tree along the same path and may easily unmap the same vma, causing them to
> compete for the pte spinlock.
> 
> When a process exits, exit_mmap() traverses the vma's maple tree from low
> to high addresses. To reduce the chance of unmapping the same vma
> simultaneously, the OOM reaper should traverse the vma's tree from high to
> low address.
> 
> Reported-by: tianxiaobin <tianxiaobin@honor.com>
> Reported-by: fengbaopeng <fengbaopeng@honor.com>
> 
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

The changelog could be improved because it is a bit confusing at this
stage. I haven't payed a close attention to previous discussion (sorry)
but there are two Reported-bys without any actual problem statement
(sure contention could happen but so what? What was the observed
behavior). Also the first paragraph states that "without reaper delay"
there is a problem but the only situation we do not have a dealay is
when the task is frozen and there is no racing there.

As already said in the previous response I think this makes conceptual
sense especially for oom victims with large address spaces which take
more that the OOM_REAPER_DELAY to die. Maybe you want to use that as a
justiciation. My wording would be
"
Although the oom_reaper is delayed and it gives the oom victim chance to
clean up its address space this might take a while especially for
processes with a large address space footprint. In those cases
oom_reaper might start racing with the dying task and compete for shared
resources - e.g. page table lock contention has been observed.

Reduce those races by reaping the oom victim from the other end of the
address space.
"

Anyway, with a changelog clarified.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/oom_kill.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index a5e9074896a1..01665a666bf1 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
>  
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>  
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To reduce the probability of unmapping the same vma
> +	 * as exit_mmap, the OOM reaper traverses the vma maple tree in reverse order.
> +	 */
> +	mas_for_each_rev(&mas, vma, 0) {
>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>  
> -- 
> 2.17.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-09-03  7:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-29  6:55 [PATCH v6 0/2] Do not delay OOM reaper when the victim is frozen zhongjinji
2025-08-29  6:55 ` [PATCH v6 1/2] mm/oom_kill: Do not delay oom " zhongjinji
2025-08-29  9:57   ` Lorenzo Stoakes
2025-08-29 17:30   ` Liam R. Howlett
2025-08-29 23:20   ` Shakeel Butt
2025-09-01 13:17     ` zhongjinji
2025-09-01  7:25   ` Michal Hocko
2025-09-01  9:30     ` zhongjinji
2025-09-01 13:58       ` Michal Hocko
2025-09-02 16:01         ` zhongjinji
2025-09-03  7:00           ` Michal Hocko
2025-08-29  6:55 ` [PATCH v6 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
2025-08-29 10:00   ` Lorenzo Stoakes
2025-08-29 17:31   ` Liam R. Howlett
2025-08-29 23:21   ` Shakeel Butt
2025-09-01  7:41   ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).