* [PATCH v7 0/2] Improvements for victim thawing and reaper VMA traversal
@ 2025-09-03 9:27 zhongjinji
2025-09-03 9:27 ` [PATCH v7 1/2] mm/oom_kill: Thaw victim on a per-process basis instead of per-thread zhongjinji
2025-09-03 9:27 ` [PATCH v7 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
0 siblings, 2 replies; 6+ messages in thread
From: zhongjinji @ 2025-09-03 9:27 UTC (permalink / raw)
To: mhocko
Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
zhongjinji
This patch series improves victim process thawing and the OOM reaper's
VMA traversal. Even when the oom_reaper is delayed, patch 2 is still
beneficial for reaping processes with a large address space footprint,
and it also greatly improves process_mrelease().
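For context, process_mrelease() is the syscall that userspace reapers
(e.g. Android's lmkd) use to reclaim a killed process's memory without
waiting for it to exit. A minimal usage sketch (illustrative only; error
handling is omitted, and the fallback syscall numbers assume an
architecture using the unified syscall table, e.g. x86_64/arm64):

	#include <sys/syscall.h>
	#include <sys/types.h>
	#include <signal.h>
	#include <unistd.h>

	#ifndef __NR_pidfd_open
	#define __NR_pidfd_open 434
	#endif
	#ifndef __NR_process_mrelease
	#define __NR_process_mrelease 448
	#endif

	/* Kill a process and immediately reap its address space. */
	static int kill_and_reap(pid_t pid)
	{
		int ret, pidfd = syscall(__NR_pidfd_open, pid, 0);

		if (pidfd < 0)
			return -1;
		kill(pid, SIGKILL);
		/* Runs concurrently with the dying task's own teardown. */
		ret = syscall(__NR_process_mrelease, pidfd, 0);
		close(pidfd);
		return ret;
	}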
---
v6 -> v7:
- Thawing the victim process prevents the OOM killer from being blocked. [10]
- Remove report tags
v5 -> v6:
- Use mas_for_each_rev() for VMA traversal [6]
- Simplify the judgment of whether to delay in queue_oom_reaper() [7]
- Refine changelog to better capture the essence of the changes [8]
- Use READ_ONCE(tsk->frozen) instead of checking mm and doing additional
checks inside for_each_process(), as it is sufficient [9]
- Add report tags because fengbaopeng and tianxiaobin reported the
high reaper load issue
v4 -> v5:
- Detect frozen state directly, avoid special futex handling. [3]
- Use mas_find_rev() for VMA traversal to avoid skipping entries. [4]
- Only check should_delay_oom_reap() in queue_oom_reaper(). [5]
v3 -> v4:
- Renamed functions and parameters for clarity. [2]
- Added should_delay_oom_reap() for OOM reap decisions.
- Traverse maple tree in reverse for improved behavior.
v2 -> v3:
- Fixed Subject prefix error.
v1 -> v2:
- Check robust_list for all threads, not just one. [1]
References:
[1] https://lore.kernel.org/linux-mm/u3mepw3oxj7cywezna4v72y2hvyc7bafkmsbirsbfuf34zpa7c@b23sc3rvp2gp/
[2] https://lore.kernel.org/linux-mm/87cy99g3k6.ffs@tglx/
[3] https://lore.kernel.org/linux-mm/aKRWtjRhE_HgFlp2@tiehlicka/
[4] https://lore.kernel.org/linux-mm/26larxehoe3a627s4fxsqghriwctays4opm4hhme3uk7ybjc5r@pmwh4s4yv7lm/
[5] https://lore.kernel.org/linux-mm/d5013a33-c08a-44c5-a67f-9dc8fd73c969@lucifer.local/
[6] https://lore.kernel.org/linux-mm/nwh7gegmvoisbxlsfwslobpbqku376uxdj2z32owkbftvozt3x@4dfet73fh2yy/
[7] https://lore.kernel.org/linux-mm/af4edeaf-d3c9-46a9-a300-dbaf5936e7d6@lucifer.local/
[8] https://lore.kernel.org/linux-mm/aK71W1ITmC_4I_RY@tiehlicka/
[9] https://lore.kernel.org/linux-mm/jzzdeczuyraup2zrspl6b74muf3bly2a3acejfftcldfmz4ekk@s5mcbeim34my/
[10] https://lore.kernel.org/linux-mm/aLWmf6qZHTA0hMpU@tiehlicka/
Earlier posts:
v6: https://lore.kernel.org/linux-mm/20250829065550.29571-1-zhongjinji@honor.com/
v5: https://lore.kernel.org/linux-mm/20250825133855.30229-1-zhongjinji@honor.com/
v4: https://lore.kernel.org/linux-mm/20250814135555.17493-1-zhongjinji@honor.com/
v3: https://lore.kernel.org/linux-mm/20250804030341.18619-1-zhongjinji@honor.com/
v2: https://lore.kernel.org/linux-mm/20250801153649.23244-1-zhongjinji@honor.com/
v1: https://lore.kernel.org/linux-mm/20250731102904.8615-1-zhongjinji@honor.com/
zhongjinji (2):
mm/oom_kill: Thaw victim on a per-process basis instead of per-thread
mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse
order
mm/oom_kill.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)
--
2.17.1
* [PATCH v7 1/2] mm/oom_kill: Thaw victim on a per-process basis instead of per-thread
2025-09-03 9:27 [PATCH v7 0/2] Improvements for victim thawing and reaper VMA traversal zhongjinji
@ 2025-09-03 9:27 ` zhongjinji
2025-09-03 12:27 ` Michal Hocko
2025-09-03 9:27 ` [PATCH v7 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
1 sibling, 1 reply; 6+ messages in thread
From: zhongjinji @ 2025-09-03 9:27 UTC (permalink / raw)
To: mhocko
Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
zhongjinji
The OOM killer is a mechanism that selects and kills processes when the
system runs out of memory, in order to reclaim resources and keep the
system stable. However, the oom victim cannot terminate on its own when
it is frozen, because __thaw_task() only thaws one thread of the victim
while the other threads remain frozen.
Thaw the entire victim process when OOM occurs so that the oom victim
can terminate on its own.
Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
mm/oom_kill.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..3caaafc896d4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -747,6 +747,19 @@ static inline void queue_oom_reaper(struct task_struct *tsk)
}
#endif /* CONFIG_MMU */
+static void thaw_oom_process(struct task_struct *tsk)
+{
+ struct task_struct *t;
+
+ /* protects against __exit_signal() */
+ read_lock(&tasklist_lock);
+ for_each_thread(tsk, t) {
+ set_tsk_thread_flag(t, TIF_MEMDIE);
+ __thaw_task(t);
+ }
+ read_unlock(&tasklist_lock);
+}
+
/**
* mark_oom_victim - mark the given task as OOM victim
* @tsk: task to mark
@@ -772,12 +785,12 @@ static void mark_oom_victim(struct task_struct *tsk)
mmgrab(tsk->signal->oom_mm);
/*
- * Make sure that the task is woken up from uninterruptible sleep
+ * Make sure that the process is woken up from uninterruptible sleep
* if it is frozen because OOM killer wouldn't be able to free
* any memory and livelock. freezing_slow_path will tell the freezer
- * that TIF_MEMDIE tasks should be ignored.
+ * that TIF_MEMDIE threads should be ignored.
*/
- __thaw_task(tsk);
+ thaw_oom_process(tsk);
atomic_inc(&oom_victims);
cred = get_task_cred(tsk);
trace_mark_victim(tsk, cred->uid.val);
--
2.17.1
* [PATCH v7 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order
2025-09-03 9:27 [PATCH v7 0/2] Improvements for victim thawing and reaper VMA traversal zhongjinji
2025-09-03 9:27 ` [PATCH v7 1/2] mm/oom_kill: Thaw victim on a per-process basis instead of per-thread zhongjinji
@ 2025-09-03 9:27 ` zhongjinji
2025-09-03 12:58 ` Michal Hocko
1 sibling, 1 reply; 6+ messages in thread
From: zhongjinji @ 2025-09-03 9:27 UTC (permalink / raw)
To: mhocko
Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han,
zhongjinji
Although the oom_reaper is delayed and it gives the oom victim a chance
to clean up its address space, this might take a while, especially for
processes with a large address space footprint. In those cases the
oom_reaper might start racing with the dying task and compete for shared
resources - e.g. page table lock contention has been observed.
Reduce those races by reaping the oom victim from the other end of the
address space.
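To make the direction change concrete, below is an illustrative sketch
of the two walkers involved (not the actual exit_mmap() or
__oom_reap_task_mm() bodies; loop bodies are elided):

	/* Dying task side: exit_mmap() tears the address space down from
	 * the low end (address 0) upwards. */
	VMA_ITERATOR(vmi, mm, 0);
	for_each_vma(vmi, vma) {
		/* unmap pages, free page tables, ... */
	}

	/* Reaper side after this patch: start at the high end and walk
	 * down, so the two walkers meet roughly once in the middle
	 * instead of contending for the same page table locks on every
	 * VMA. */
	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
	mas_for_each_rev(&mas, vma, 0) {
		/* skip VM_HUGETLB|VM_PFNMAP, then unmap_page_range() ... */
	}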
It is also a significant improvement for process_mrelease(). When a
process is killed, process_mrelease() is used to reap it and often runs
concurrently with the dying task. The test data shows that, after
applying the patch, lock contention is greatly reduced while reaping the
killed process.
Without the patch:
|--99.74%-- oom_reaper
| |--76.67%-- unmap_page_range
| | |--33.70%-- __pte_offset_map_lock
| | | |--98.46%-- _raw_spin_lock
| | |--27.61%-- free_swap_and_cache_nr
| | |--16.40%-- folio_remove_rmap_ptes
| | |--12.25%-- tlb_flush_mmu
| |--12.61%-- tlb_finish_mmu
With the patch:
|--98.84%-- oom_reaper
| |--53.45%-- unmap_page_range
| | |--24.29%-- [hit in function]
| | |--48.06%-- folio_remove_rmap_ptes
| | |--17.99%-- tlb_flush_mmu
| | |--1.72%-- __pte_offset_map_lock
| |--30.43%-- tlb_finish_mmu
Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
mm/oom_kill.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3caaafc896d4..540b1e5e0e46 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
bool ret = true;
- VMA_ITERATOR(vmi, mm, 0);
+ MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
/*
* Tell all users of get_user/copy_from_user etc... that the content
@@ -526,7 +526,13 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
*/
set_bit(MMF_UNSTABLE, &mm->flags);
- for_each_vma(vmi, vma) {
+ /*
+ * It might start racing with the dying task and compete for shared
+ * resources - e.g. page table lock contention has been observed.
+ * Reduce those races by reaping the oom victim from the other end
+ * of the address space.
+ */
+ mas_for_each_rev(&mas, vma, 0) {
if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
continue;
--
2.17.1
* Re: [PATCH v7 1/2] mm/oom_kill: Thaw victim on a per-process basis instead of per-thread
2025-09-03 9:27 ` [PATCH v7 1/2] mm/oom_kill: Thaw victim on a per-process basis instead of per-thread zhongjinji
@ 2025-09-03 12:27 ` Michal Hocko
0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2025-09-03 12:27 UTC (permalink / raw)
To: zhongjinji
Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han
On Wed 03-09-25 17:27:28, zhongjinji wrote:
> OOM killer is a mechanism that selects and kills processes when the system
> runs out of memory to reclaim resources and keep the system stable.
> However, the oom victim cannot terminate on its own when it is frozen,
> because __thaw_task() only thaws one thread of the victim, while
> the other threads remain in the frozen state.
>
> This change will thaw the entire victim process when OOM occurs,
> ensuring that the oom victim can terminate on its own.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
> mm/oom_kill.c | 19 ++++++++++++++++---
> 1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..3caaafc896d4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -747,6 +747,19 @@ static inline void queue_oom_reaper(struct task_struct *tsk)
> }
> #endif /* CONFIG_MMU */
>
> +static void thaw_oom_process(struct task_struct *tsk)
> +{
> + struct task_struct *t;
> +
> + /* protects against __exit_signal() */
> + read_lock(&tasklist_lock);
> + for_each_thread(tsk, t) {
> + set_tsk_thread_flag(t, TIF_MEMDIE);
> + __thaw_task(t);
> + }
> + read_unlock(&tasklist_lock);
> +}
> +
Sorry, I was probably not clear enough. I meant that thaw_process should
live in the freezer proper (kernel/freezer.c) and oom should just be a
user of it. Please make sure that the freezer maintainers are involved
and approve the change.
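For illustration, such a split might look roughly like the sketch below
(the helper name, its placement, and keeping the TIF_MEMDIE handling in
mm/oom_kill.c are assumptions based on this suggestion, not an agreed
interface):

	/* kernel/freezer.c (sketch): thaw every thread of a process. */
	void thaw_process(struct task_struct *p)
	{
		struct task_struct *t;

		/* protects against __exit_signal() */
		read_lock(&tasklist_lock);
		for_each_thread(p, t)
			__thaw_task(t);
		read_unlock(&tasklist_lock);
	}

mark_oom_victim() would then call thaw_process() instead of open-coding
the loop, keeping the OOM-specific TIF_MEMDIE marking on the OOM side,
with a declaration added to include/linux/freezer.h.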
--
Michal Hocko
SUSE Labs
* Re: [PATCH v7 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order
2025-09-03 9:27 ` [PATCH v7 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order zhongjinji
@ 2025-09-03 12:58 ` Michal Hocko
2025-09-03 19:02 ` Liam R. Howlett
0 siblings, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2025-09-03 12:58 UTC (permalink / raw)
To: zhongjinji
Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
liam.howlett, lorenzo.stoakes, surenb, liulu.liu, feng.han
On Wed 03-09-25 17:27:29, zhongjinji wrote:
> Although the oom_reaper is delayed and it gives the oom victim chance to
> clean up its address space this might take a while especially for
> processes with a large address space footprint. In those cases
> oom_reaper might start racing with the dying task and compete for shared
> resources - e.g. page table lock contention has been observed.
>
> Reduce those races by reaping the oom victim from the other end of the
> address space.
>
> It is also a significant improvement for process_mrelease(). When a process
> is killed, process_mrelease is used to reap the killed process and often
> runs concurrently with the dying task. The test data shows that after
> applying the patch, lock contention is greatly reduced during the procedure
> of reaping the killed process.
Thank you, this is much better!
> Without the patch:
> |--99.74%-- oom_reaper
> | |--76.67%-- unmap_page_range
> | | |--33.70%-- __pte_offset_map_lock
> | | | |--98.46%-- _raw_spin_lock
> | | |--27.61%-- free_swap_and_cache_nr
> | | |--16.40%-- folio_remove_rmap_ptes
> | | |--12.25%-- tlb_flush_mmu
> | |--12.61%-- tlb_finish_mmu
>
> With the patch:
> |--98.84%-- oom_reaper
> | |--53.45%-- unmap_page_range
> | | |--24.29%-- [hit in function]
> | | |--48.06%-- folio_remove_rmap_ptes
> | | |--17.99%-- tlb_flush_mmu
> | | |--1.72%-- __pte_offset_map_lock
> | |--30.43%-- tlb_finish_mmu
Just curious. Do I read this correctly that the overall speedup is
mostly eaten by contention over tlb_finish_mmu?
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
Anyway, the change on its own makes sense to me
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks for working on the changelog improvements.
> ---
> mm/oom_kill.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3caaafc896d4..540b1e5e0e46 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> {
> struct vm_area_struct *vma;
> bool ret = true;
> - VMA_ITERATOR(vmi, mm, 0);
> + MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
>
> /*
> * Tell all users of get_user/copy_from_user etc... that the content
> @@ -526,7 +526,13 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> */
> set_bit(MMF_UNSTABLE, &mm->flags);
>
> - for_each_vma(vmi, vma) {
> + /*
> + * It might start racing with the dying task and compete for shared
> + * resources - e.g. page table lock contention has been observed.
> + * Reduce those races by reaping the oom victim from the other end
> + * of the address space.
> + */
> + mas_for_each_rev(&mas, vma, 0) {
> if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> continue;
>
> --
> 2.17.1
>
--
Michal Hocko
SUSE Labs
* Re: [PATCH v7 2/2] mm/oom_kill: The OOM reaper traverses the VMA maple tree in reverse order
2025-09-03 12:58 ` Michal Hocko
@ 2025-09-03 19:02 ` Liam R. Howlett
0 siblings, 0 replies; 6+ messages in thread
From: Liam R. Howlett @ 2025-09-03 19:02 UTC (permalink / raw)
To: Michal Hocko
Cc: zhongjinji, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel,
tglx, lorenzo.stoakes, surenb, liulu.liu, feng.han
* Michal Hocko <mhocko@suse.com> [250903 08:58]:
> On Wed 03-09-25 17:27:29, zhongjinji wrote:
> > Although the oom_reaper is delayed and it gives the oom victim chance to
> > clean up its address space this might take a while especially for
> > processes with a large address space footprint. In those cases
> > oom_reaper might start racing with the dying task and compete for shared
> > resources - e.g. page table lock contention has been observed.
> >
> > Reduce those races by reaping the oom victim from the other end of the
> > address space.
> >
> > It is also a significant improvement for process_mrelease(). When a process
> > is killed, process_mrelease is used to reap the killed process and often
> > runs concurrently with the dying task. The test data shows that after
> > applying the patch, lock contention is greatly reduced during the procedure
> > of reaping the killed process.
>
> Thank you this is much better!
>
> > Without the patch:
> > |--99.74%-- oom_reaper
> > | |--76.67%-- unmap_page_range
> > | | |--33.70%-- __pte_offset_map_lock
> > | | | |--98.46%-- _raw_spin_lock
> > | | |--27.61%-- free_swap_and_cache_nr
> > | | |--16.40%-- folio_remove_rmap_ptes
> > | | |--12.25%-- tlb_flush_mmu
> > | |--12.61%-- tlb_finish_mmu
> >
> > With the patch:
> > |--98.84%-- oom_reaper
> > | |--53.45%-- unmap_page_range
> > | | |--24.29%-- [hit in function]
> > | | |--48.06%-- folio_remove_rmap_ptes
> > | | |--17.99%-- tlb_flush_mmu
> > | | |--1.72%-- __pte_offset_map_lock
> > | |--30.43%-- tlb_finish_mmu
>
> Just curious. Do I read this correctly that the overall speedup is
> mostly eaten by contention over tlb_finish_mmu?
tlb_finish_mmu() taking less time indicates that it's probably not
doing much work, afaict. These numbers would be better if exit_mmap()
were also included, to show a more complete view of how the system is
affected - I suspect the tlb_finish_mmu time will have disappeared from
that side of things.
Comments in this code contain many arch-specific statements, which
makes me wonder whether this is safe (probably?) and beneficial for
everyone. At the least, it would be worth mentioning which arch was
used for the benchmark - I am guessing arm64 considering the talk of
android; coincidentally, arm64 would benefit the most fwiu.
mmu_notifier_release(mm) is called early in the exit_mmap() path, which
should cause the mmu notifiers to be non-blocking (according to the
comment in the v6.0 source of exit_mmap() [1]).
>
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
>
> Anyway, the change on its own makes sense to me
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks for working on the changelog improvements.
[1]. https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3089
...
Thanks,
Liam