* [PATCH v5 0/2] Do not delay oom reaper when the victim is frozen
  From: zhongjinji
  Date: 2025-08-25 13:38 UTC
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx, liam.howlett,
      lorenzo.stoakes, liulu.liu, feng.han, zhongjinji

Patch 1 avoids delaying the oom reaper when the victim is frozen; patch 2
makes the OOM reaper and exit_mmap() traverse the maple tree in opposite
orders to reduce PTE lock contention caused by unmapping the same vma.

About patch 1:

Patch 1 uses frozen() to check the frozen state of a single thread to
determine whether a process is frozen, rather than checking all threads,
because the frozen state of all threads in a process will eventually be
consistent. There is no need to strictly confirm that all threads are frozen;
it is only necessary to check whether the process has been frozen or is about
to be frozen.

When a process is frozen and cannot be unfrozen promptly, the two-second
delay before the oom reaper runs cannot guarantee that robust futexes will
not be reaped, so processes holding robust futexes should not be frozen.
This patch does not make issue [1] worse.

About patch 2:

I tested the changes of patch 2 on Android. The reproduction steps are as
follows: start a process, then kill it the way oom kill does, and actively
add it to the oom reaper queue.

The perf data with patch 1 applied but not patch 2:

|--99.74%-- oom_reaper
|    |--76.67%-- unmap_page_range
|    |    |--33.70%-- __pte_offset_map_lock
|    |    |    |--98.46%-- _raw_spin_lock
|    |    |--27.61%-- free_swap_and_cache_nr
|    |    |--16.40%-- folio_remove_rmap_ptes
|    |    |--12.25%-- tlb_flush_mmu
|    |--12.61%-- tlb_finish_mmu

The perf data with both patch 1 and patch 2 applied:

|--98.84%-- oom_reaper
|    |--53.45%-- unmap_page_range
|    |    |--24.29%-- [hit in function]
|    |    |--48.06%-- folio_remove_rmap_ptes
|    |    |--17.99%-- tlb_flush_mmu
|    |    |--1.72%-- __pte_offset_map_lock
|    |--30.43%-- tlb_finish_mmu

It is obvious that contention on the pte spinlock is very intense when the
two traverse the tree along the same path.

On low-memory Android devices, high memory pressure often requires killing
processes to free memory, which is generally accepted on Android. lmkd, a
user-space daemon that actively kills processes, needs to asynchronously
call process_mrelease to release the memory of killed processes, similar to
the oom reaper. At the same time, OOM events are not rare. Therefore,
reducing lock contention in __oom_reap_task_mm() is meaningful.

Link: https://lore.kernel.org/all/20220414144042.677008-1-npache@redhat.com/T/#u [1]

---
v4 -> v5:
1. Detect the frozen state of the process instead of checking the futex
   state, as special handling of futex locks should be avoided during OOM
   kill [2].
2. Use mas_find_rev() to traverse the VMA tree instead of vma_prev(), because
   vma_prev() may skip the first VMA and should not be used here. [3]
3. Only check should_delay_oom_reap() in queue_oom_reaper(), since it is not
   a hot path. [4]
v4 link: https://lore.kernel.org/linux-mm/20250814135555.17493-1-zhongjinji@honor.com/

v3 -> v4:
1. Rename check_robust_futex() to process_has_robust_futex() for clearer
   intent.
2. Because the delay_reap parameter was added to task_will_free_mem(), the
   function is renamed to should_reap_task() to better clarify its purpose.
3. Add should_delay_oom_reap() to decide whether to delay OOM reap.
4. Modify the OOM reaper to traverse the maple tree in reverse order; see
   patch 3 for details.
These changes improve code readability and enhance OOM reaper behavior.
v3 link: https://lore.kernel.org/all/20250804030341.18619-1-zhongjinji@honor.com/
         https://lore.kernel.org/all/20250804030341.18619-2-zhongjinji@honor.com/

v2 -> v3:
1. Fix the error in the Subject prefix, changing it from futex to
   mm/oom_kill.
v2 link: https://lore.kernel.org/linux-mm/20250801153649.23244-1-zhongjinji@honor.com/
         https://lore.kernel.org/linux-mm/20250801153649.23244-2-zhongjinji@honor.com/

v1 -> v2:
1. Check the robust_list of all threads instead of just a single thread.
v1 link: https://lore.kernel.org/linux-mm/20250731102904.8615-1-zhongjinji@honor.com/

Reference:
https://lore.kernel.org/linux-mm/aKRWtjRhE_HgFlp2@tiehlicka/ [2]
https://lore.kernel.org/linux-mm/26larxehoe3a627s4fxsqghriwctays4opm4hhme3uk7ybjc5r@pmwh4s4yv7lm/ [3]
https://lore.kernel.org/linux-mm/d5013a33-c08a-44c5-a67f-9dc8fd73c969@lucifer.local/ [4]

zhongjinji (2):
  mm/oom_kill: Do not delay oom reaper when the victim is frozen
  mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple
    tree in opposite order

 mm/oom_kill.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 3 deletions(-)

--
2.17.1
* [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  From: zhongjinji
  Date: 2025-08-25 13:38 UTC
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx, liam.howlett,
      lorenzo.stoakes, liulu.liu, feng.han, zhongjinji

The OOM reaper can quickly reap a process's memory when the system
encounters OOM, helping the system recover. If the victim process is
frozen and cannot be unfrozen in time, delaying the reaper by two seconds
prevents the system from recovering quickly from the OOM state.

When an OOM occurs and the victim is not unfrozen, delaying the OOM reaper
keeps the system in a bad state for two seconds. Before scheduling the
oom_reaper task, check whether the victim is in a frozen state. If the
victim is frozen, do not delay the OOM reaper.

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..4b4d73b1e00d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
 	wake_up(&oom_reaper_wait);
 }
 
+/*
+ * When the victim is frozen, the OOM reaper should not be delayed, because
+ * if the victim cannot be unfrozen promptly, it may block the system from
+ * quickly recovering from the OOM state.
+ */
+static bool should_delay_oom_reap(struct task_struct *tsk)
+{
+	struct mm_struct *mm = tsk->mm;
+	struct task_struct *p;
+	bool ret = false;
+
+	if (!mm)
+		return true;
+
+	if (!frozen(tsk))
+		return true;
+
+	if (atomic_read(&mm->mm_users) <= 1)
+		return false;
+
+	rcu_read_lock();
+	for_each_process(p) {
+		if (!process_shares_mm(p, mm))
+			continue;
+		if (same_thread_group(tsk, p))
+			continue;
+		ret = !frozen(p);
+		if (ret)
+			break;
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
 /*
  * Give the OOM victim time to exit naturally before invoking the oom_reaping.
  * The timers timeout is arbitrary... the longer it is, the longer the worst
@@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
 #define OOM_REAPER_DELAY (2*HZ)
 static void queue_oom_reaper(struct task_struct *tsk)
 {
+	bool delay;
+
 	/* mm is already queued? */
 	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
 		return;
 
 	get_task_struct(tsk);
+	delay = should_delay_oom_reap(tsk);
 	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
-	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
+	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
 	add_timer(&tsk->oom_reaper_timer);
 }
 
--
2.17.1
* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  From: Shakeel Butt
  Date: 2025-08-25 19:41 UTC
  To: zhongjinji
  Cc: mhocko, rientjes, akpm, linux-mm, linux-kernel, tglx, liam.howlett,
      lorenzo.stoakes, liulu.liu, feng.han, cgroups

+cgroups

On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.
>
> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>
> ---
>  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..4b4d73b1e00d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
>  	wake_up(&oom_reaper_wait);
>  }
>
> +/*
> + * When the victim is frozen, the OOM reaper should not be delayed, because
> + * if the victim cannot be unfrozen promptly, it may block the system from
> + * quickly recovering from the OOM state.
> + */
> +static bool should_delay_oom_reap(struct task_struct *tsk)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *p;
> +	bool ret = false;
> +

On v2, shouldn't READ_ONCE(tsk->frozen) be enough instead of the mm check
and the checks inside for_each_process()?

> +	if (!mm)
> +		return true;
> +
> +	if (!frozen(tsk))
> +		return true;
> +
> +	if (atomic_read(&mm->mm_users) <= 1)
> +		return false;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (!process_shares_mm(p, mm))
> +			continue;
> +		if (same_thread_group(tsk, p))
> +			continue;
> +		ret = !frozen(p);
> +		if (ret)
> +			break;
> +	}
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
>  /*
>   * Give the OOM victim time to exit naturally before invoking the oom_reaping.
>   * The timers timeout is arbitrary... the longer it is, the longer the worst
> @@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
>  #define OOM_REAPER_DELAY (2*HZ)
>  static void queue_oom_reaper(struct task_struct *tsk)
>  {
> +	bool delay;
> +
>  	/* mm is already queued? */
>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>  		return;
>
>  	get_task_struct(tsk);
> +	delay = should_delay_oom_reap(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);
>  	add_timer(&tsk->oom_reaper_timer);
>  }
>
> --
> 2.17.1
>
* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  From: zhongjinji
  Date: 2025-08-26 13:01 UTC
  To: shakeel.butt
  Cc: akpm, cgroups, feng.han, liam.howlett, linux-kernel, linux-mm, liulu.liu,
      lorenzo.stoakes, mhocko, rientjes, tglx, zhongjinji

> +cgroups
>
> On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> > The OOM reaper can quickly reap a process's memory when the system
> > encounters OOM, helping the system recover. If the victim process is
> > frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> > will cause the system to fail to recover quickly from the OOM state.
> >
> > When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> > will keep the system in a bad state for two seconds. Before scheduling the
> > oom_reaper task, check whether the victim is in a frozen state. If the
> > victim is frozen, do not delay the OOM reaper.
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
> > ---
> >  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 39 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 25923cfec9c6..4b4d73b1e00d 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
> >  	wake_up(&oom_reaper_wait);
> >  }
> >
> > +/*
> > + * When the victim is frozen, the OOM reaper should not be delayed, because
> > + * if the victim cannot be unfrozen promptly, it may block the system from
> > + * quickly recovering from the OOM state.
> > + */
> > +static bool should_delay_oom_reap(struct task_struct *tsk)
> > +{
> > +	struct mm_struct *mm = tsk->mm;
> > +	struct task_struct *p;
> > +	bool ret = false;
> > +
>
> On v2, shouldn't READ_ONCE(tsk->frozen) be enough instead of the mm check
> and the checks inside for_each_process()?

Thank you. It is mainly to check the processes created by vfork. I believe
no one would put them into different cgroups, so I also think that
READ_ONCE(tsk->frozen) is enough. I will update it.

> > +	if (!mm)
> > +		return true;
> > +
> > +	if (!frozen(tsk))
> > +		return true;
> > +
> > +	if (atomic_read(&mm->mm_users) <= 1)
> > +		return false;
> > +
> > +	rcu_read_lock();
> > +	for_each_process(p) {
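If that simplification holds up, the whole helper collapses to a single
check. A minimal sketch of what the next revision might look like (an
assumption on my part, not the posted code: it takes the per-task frozen
state alone as a sufficient predicate, and whether frozen() or the cgroup
freezer's per-task bit is the right accessor is exactly what the next spin
needs to settle):

	/*
	 * Sketch only: delay the reap unless the victim is frozen and
	 * therefore cannot make progress in exit_mmap() on its own.
	 */
	static bool should_delay_oom_reap(struct task_struct *tsk)
	{
		return !frozen(tsk);
	}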
* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  From: Lorenzo Stoakes
  Date: 2025-08-26 12:43 UTC
  To: zhongjinji
  Cc: mhocko, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
      liam.howlett, liulu.liu, feng.han

On Mon, Aug 25, 2025 at 09:38:54PM +0800, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.

Be good to reference the commit where this was introduced.

> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

This is a lot better than the previous version, thanks! :)

> ---
>  mm/oom_kill.c | 40 +++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..4b4d73b1e00d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -683,6 +683,41 @@ static void wake_oom_reaper(struct timer_list *timer)
>  	wake_up(&oom_reaper_wait);
>  }
>
> +/*
> + * When the victim is frozen, the OOM reaper should not be delayed, because
> + * if the victim cannot be unfrozen promptly, it may block the system from
> + * quickly recovering from the OOM state.
> + */

You should put comments like this with each of the predicates, so e.g. this
comment should be above the frozen check, and then you should write
equivalent ones for the rest.

However, if Shakeel's correct, you can vastly simplify this further, so
obviously in that instance you can reduce to the single comment.

> +static bool should_delay_oom_reap(struct task_struct *tsk)
> +{
> +	struct mm_struct *mm = tsk->mm;
> +	struct task_struct *p;
> +	bool ret = false;
> +
> +	if (!mm)
> +		return true;
> +
> +	if (!frozen(tsk))
> +		return true;
> +
> +	if (atomic_read(&mm->mm_users) <= 1)
> +		return false;
> +
> +	rcu_read_lock();
> +	for_each_process(p) {
> +		if (!process_shares_mm(p, mm))
> +			continue;
> +		if (same_thread_group(tsk, p))
> +			continue;
> +		ret = !frozen(p);
> +		if (ret)
> +			break;
> +	}
> +	rcu_read_unlock();

This surely in any case must exist as a helper somewhere (being lazy + not
checking), seems a prime candidate for that if not.

> +
> +	return ret;
> +}
> +
>  /*
>   * Give the OOM victim time to exit naturally before invoking the oom_reaping.
>   * The timers timeout is arbitrary... the longer it is, the longer the worst
> @@ -694,13 +729,16 @@ static void wake_oom_reaper(struct timer_list *timer)
>  #define OOM_REAPER_DELAY (2*HZ)
>  static void queue_oom_reaper(struct task_struct *tsk)
>  {
> +	bool delay;
> +
>  	/* mm is already queued? */
>  	if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>  		return;
>
>  	get_task_struct(tsk);
> +	delay = should_delay_oom_reap(tsk);
>  	timer_setup(&tsk->oom_reaper_timer, wake_oom_reaper, 0);
> -	tsk->oom_reaper_timer.expires = jiffies + OOM_REAPER_DELAY;
> +	tsk->oom_reaper_timer.expires = jiffies + (delay ? OOM_REAPER_DELAY : 0);

I mean, unless there's some reason not to, why not simplify to:

	tsk->oom_reaper_timer.expires = jiffies;

	if (should_delay_oom_reap(tsk))
		tsk->oom_reaper_timer.expires += OOM_REAPER_DELAY;

Which super spells things out and avoids the other noise.

>  	add_timer(&tsk->oom_reaper_timer);
>  }
>
> --
> 2.17.1
>
* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  From: Michal Hocko
  Date: 2025-08-27 12:08 UTC
  To: zhongjinji
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx, liam.howlett,
      lorenzo.stoakes, liulu.liu, feng.han

On Mon 25-08-25 21:38:54, zhongjinji wrote:
> The OOM reaper can quickly reap a process's memory when the system
> encounters OOM, helping the system recover. If the victim process is
> frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> will cause the system to fail to recover quickly from the OOM state.
>
> When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> will keep the system in a bad state for two seconds. Before scheduling the
> oom_reaper task, check whether the victim is in a frozen state. If the
> victim is frozen, do not delay the OOM reaper.

I do not think this changelog captures the essence of the change really
well, and it suggests that this might be a performance optimization. As I
have explained on several occasions, the oom reaper is not meant to be a
performance optimization but rather a forward-progress guarantee. I would
suggest this wording instead.

"
The oom reaper is a mechanism to guarantee forward progress during an OOM
situation when the oom victim cannot terminate on its own (e.g. being
blocked in an uninterruptible state or frozen by the cgroup freezer). In
order to give the victim some time to terminate properly, the oom reaper
is delayed in its invocation. This is particularly beneficial when the oom
victim is holding robust futex resources, as the anonymous memory teardown
can break those.

On the other hand, tasks deliberately frozen by the freezer cgroup will
not wake up until they are thawed from userspace, so the delay is
effectively pointless. Therefore, opt out of the delay for cgroup-frozen
oom victims.
"

Thanks!
--
Michal Hocko
SUSE Labs
* Re: [PATCH v5 1/2] mm/oom_kill: Do not delay oom reaper when the victim is frozen
  From: zhongjinji
  Date: 2025-08-27 12:14 UTC
  To: mhocko
  Cc: akpm, feng.han, liam.howlett, linux-kernel, linux-mm, liulu.liu,
      lorenzo.stoakes, rientjes, shakeel.butt, tglx, zhongjinji

> On Mon 25-08-25 21:38:54, zhongjinji wrote:
> > The OOM reaper can quickly reap a process's memory when the system
> > encounters OOM, helping the system recover. If the victim process is
> > frozen and cannot be unfrozen in time, the reaper delayed by two seconds
> > will cause the system to fail to recover quickly from the OOM state.
> >
> > When an OOM occurs, if the victim is not unfrozen, delaying the OOM reaper
> > will keep the system in a bad state for two seconds. Before scheduling the
> > oom_reaper task, check whether the victim is in a frozen state. If the
> > victim is frozen, do not delay the OOM reaper.
>
> I do not think this changelog captures the essence of the change really
> well, and it suggests that this might be a performance optimization. As I
> have explained on several occasions, the oom reaper is not meant to be a
> performance optimization but rather a forward-progress guarantee. I would
> suggest this wording instead.
>
> "
> The oom reaper is a mechanism to guarantee forward progress during an OOM
> situation when the oom victim cannot terminate on its own (e.g. being
> blocked in an uninterruptible state or frozen by the cgroup freezer). In
> order to give the victim some time to terminate properly, the oom reaper
> is delayed in its invocation. This is particularly beneficial when the oom
> victim is holding robust futex resources, as the anonymous memory teardown
> can break those.
>
> On the other hand, tasks deliberately frozen by the freezer cgroup will
> not wake up until they are thawed from userspace, so the delay is
> effectively pointless. Therefore, opt out of the delay for cgroup-frozen
> oom victims.
> "

Thank you, I will update it.

> Thanks!
> --
> Michal Hocko
> SUSE Labs
* [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  From: zhongjinji
  Date: 2025-08-25 13:38 UTC
  To: mhocko
  Cc: rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx, liam.howlett,
      lorenzo.stoakes, liulu.liu, feng.han, zhongjinji

When a process is OOM killed without reaper delay, the oom reaper and the
exit_mmap() thread are likely to run simultaneously. They traverse the
vma's maple tree along the same path and may easily unmap the same vma,
causing them to compete for the pte spinlock.

When a process exits, exit_mmap() traverses the vma's maple tree from low
to high addresses. To reduce the chance of unmapping the same vma
simultaneously, the OOM reaper should traverse the vma's tree from high to
low addresses.

Signed-off-by: zhongjinji <zhongjinji@honor.com>
---
 mm/oom_kill.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b4d73b1e00d..a0650da9ec9c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
-	VMA_ITERATOR(vmi, mm, 0);
+	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for_each_vma(vmi, vma) {
+	/*
+	 * When two tasks unmap the same vma at the same time, they may contend
+	 * for the pte spinlock. To reduce the probability of them unmapping the
+	 * same vma, the oom reaper traverses the vma maple tree in reverse order.
+	 */
+	while ((vma = mas_find_rev(&mas, 0)) != NULL) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
--
2.17.1
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  From: Lorenzo Stoakes
  Date: 2025-08-26 12:53 UTC
  To: zhongjinji
  Cc: mhocko, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx,
      liam.howlett, liulu.liu, feng.han

On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> When a process is OOM killed without reaper delay, the oom reaper and the
> exit_mmap() thread are likely to run simultaneously. They traverse the
> vma's maple tree along the same path and may easily unmap the same vma,
> causing them to compete for the pte spinlock.
>
> When a process exits, exit_mmap() traverses the vma's maple tree from low
> to high addresses. To reduce the chance of unmapping the same vma
> simultaneously, the OOM reaper should traverse the vma's tree from high to
> low addresses.
>
> Signed-off-by: zhongjinji <zhongjinji@honor.com>

I will leave it to Liam to confirm the maple tree bit is ok, but I guess
I'm softening to the idea of doing this - because it should have no impact
on most users, so even if it's some rare edge case that triggers the
situation, then it's worth doing it in reverse just to help you guys out :)

Liam - please confirm this is good from your side, and then I can add a tag!

Cheers, Lorenzo

> ---
>  mm/oom_kill.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 4b4d73b1e00d..a0650da9ec9c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
>  	bool ret = true;
> -	VMA_ITERATOR(vmi, mm, 0);
> +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
>
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
>  	 */
>  	set_bit(MMF_UNSTABLE, &mm->flags);
>
> -	for_each_vma(vmi, vma) {
> +	/*
> +	 * When two tasks unmap the same vma at the same time, they may contend
> +	 * for the pte spinlock. To reduce the probability of them unmapping the
> +	 * same vma, the oom reaper traverses the vma maple tree in reverse order.
> +	 */
> +	while ((vma = mas_find_rev(&mas, 0)) != NULL) {

It's a pity there isn't a nicer formulation of this but this is probably
the least worst way of doing it.

>  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
>  			continue;
>
> --
> 2.17.1
>
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  From: Liam R. Howlett
  Date: 2025-08-26 13:37 UTC
  To: Lorenzo Stoakes
  Cc: zhongjinji, mhocko, rientjes, shakeel.butt, akpm, linux-mm,
      linux-kernel, tglx, liulu.liu, feng.han

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 08:53]:
> On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> > When a process is OOM killed without reaper delay, the oom reaper and the
> > exit_mmap() thread are likely to run simultaneously. They traverse the
> > vma's maple tree along the same path and may easily unmap the same vma,
> > causing them to compete for the pte spinlock.
> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from low
> > to high addresses. To reduce the chance of unmapping the same vma
> > simultaneously, the OOM reaper should traverse the vma's tree from high to
> > low addresses.
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
>
> I will leave it to Liam to confirm the maple tree bit is ok, but I guess
> I'm softening to the idea of doing this - because it should have no impact
> on most users, so even if it's some rare edge case that triggers the
> situation, then it's worth doing it in reverse just to help you guys out :)

I really don't think this is worth doing.  We're avoiding a race between
oom and a task unmap - the MMF bits should be used to avoid this race -
or at least mitigate it.

They are probably both under the read lock, but considering how rare it
would be, would a racy flag check be enough - it is hardly critical to
get right.  Either would reduce the probability.

> Liam - please confirm this is good from your side, and then I can add a tag!
>
> Cheers, Lorenzo
>
> > ---
> >  mm/oom_kill.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 4b4d73b1e00d..a0650da9ec9c 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -516,7 +516,7 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  {
> >  	struct vm_area_struct *vma;
> >  	bool ret = true;
> > -	VMA_ITERATOR(vmi, mm, 0);
> > +	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, 0);
                                  ^^^^^^^^^  ^^

You have set the index larger than the last.  It (probably?) works, but
isn't correct and may stop working, so let's fix it.

	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

> >
> >  	/*
> >  	 * Tell all users of get_user/copy_from_user etc... that the content
> > @@ -526,7 +526,12 @@ static bool __oom_reap_task_mm(struct mm_struct *mm)
> >  	 */
> >  	set_bit(MMF_UNSTABLE, &mm->flags);
> >
> > -	for_each_vma(vmi, vma) {
> > +	/*
> > +	 * When two tasks unmap the same vma at the same time, they may contend
> > +	 * for the pte spinlock. To reduce the probability of them unmapping the
> > +	 * same vma, the oom reaper traverses the vma maple tree in reverse order.
> > +	 */
> > +	while ((vma = mas_find_rev(&mas, 0)) != NULL) {
>
> It's a pity there isn't a nicer formulation of this but this is probably
> the least worst way of doing it.

mas_for_each_rev() exists for this use case.  You will find that the
implementation is very close to what you see here. :)

> >  		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
> >  			continue;
> >
> > --
> > 2.17.1
> >
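Taken together, the two review points above would reshape the reap loop
roughly as follows. This is a sketch against the patch, not a tested
replacement; mas_for_each_rev() is the maple-tree helper named in the
review, and it wraps the same mas_find_rev() call the patch open-codes:

	struct vm_area_struct *vma;
	MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

	/*
	 * Walk the VMAs from the highest address down toward 0 so that a
	 * racing exit_mmap(), which walks upward, meets the reaper once
	 * in the middle instead of contending on every PTE lock.
	 */
	mas_for_each_rev(&mas, vma, 0) {
		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
			continue;
		/* ... unmap the vma as __oom_reap_task_mm() already does ... */
	}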
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  From: Lorenzo Stoakes
  Date: 2025-08-26 13:50 UTC
  To: Liam R. Howlett, zhongjinji, mhocko, rientjes, shakeel.butt, akpm,
      linux-mm, linux-kernel, tglx, liulu.liu, feng.han

On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> I really don't think this is worth doing.  We're avoiding a race between
> oom and a task unmap - the MMF bits should be used to avoid this race -
> or at least mitigate it.

Yes for sure, as explored at length in previous discussions this feels like
we're papering over cracks here.

_However_, I'm sort of ok with a minimalistic fix that solves the proximate
issue even if it is that, as long as it doesn't cause issues in doing so.

So this is my take on the below and why I'm open to it!

> They are probably both under the read lock, but considering how rare it
> would be, would a racy flag check be enough - it is hardly critical to
> get right.  Either would reduce the probability.

Zhongjinji - I'm still not sure that you've really indicated _why_ you're
seeing such a tight and unusual race. Presumably some truly massive number
of tasks being OOM'd and unmapping but... yeah that seems odd anyway.

But again, if we can safely fix this in a way that doesn't hurt stuff too
much I'm ok with it (of course, these are famous last words in the kernel
often...!)

Liam - are you open to a solution on the basis above, or do you feel we
ought simply to fix the underlying issue here?

To me we're at a simple enough implementation of this (esp. utilising the
helper you mention) that probably this is fine (like the meme, or...
hopefully not :)

I will go with your judgment here!

Cheers, Lorenzo
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  From: Liam R. Howlett
  Date: 2025-08-26 15:21 UTC
  To: Lorenzo Stoakes
  Cc: zhongjinji, mhocko, rientjes, shakeel.butt, akpm, linux-mm,
      linux-kernel, tglx, liulu.liu, feng.han

* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > I really don't think this is worth doing.  We're avoiding a race between
> > oom and a task unmap - the MMF bits should be used to avoid this race -
> > or at least mitigate it.
>
> Yes for sure, as explored at length in previous discussions this feels like
> we're papering over cracks here.
>
> _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> issue even if it is that, as long as it doesn't cause issues in doing so.
>
> So this is my take on the below and why I'm open to it!
>
> > They are probably both under the read lock, but considering how rare it
> > would be, would a racy flag check be enough - it is hardly critical to
> > get right.  Either would reduce the probability.
>
> Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> seeing such a tight and unusual race. Presumably some truly massive number
> of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
>
> But again, if we can safely fix this in a way that doesn't hurt stuff too
> much I'm ok with it (of course, these are famous last words in the kernel
> often...!)
>
> Liam - are you open to a solution on the basis above, or do you feel we
> ought simply to fix the underlying issue here?

At least this is a benign race.  I'd think using MMF_ to reduce the race
would achieve the same goal with less risk - which is why I bring it up.

Really, both methods should be low risk, so I'm fine with either way.

But I am interested in hearing how this race is happening enough to
necessitate a fix.  Reversing the iterator is a one-spot fix - if this
happens elsewhere then we're out of options.  Using the MMF_ flags is
more of a scalable fix, if it achieves the same results.

> To me we're at a simple enough implementation of this (esp. utilising the
> helper you mention) that probably this is fine (like the meme, or...
> hopefully not :)
>
> I will go with your judgment here!
>
> Cheers, Lorenzo
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  From: Shakeel Butt
  Date: 2025-08-26 22:26 UTC
  To: Liam R. Howlett
  Cc: Lorenzo Stoakes, zhongjinji, mhocko, rientjes, akpm, linux-mm,
      linux-kernel, tglx, liulu.liu, feng.han

On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > I really don't think this is worth doing.  We're avoiding a race between
> > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > or at least mitigate it.
> >
> > Yes for sure, as explored at length in previous discussions this feels like
> > we're papering over cracks here.
> >
> > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > issue even if it is that, as long as it doesn't cause issues in doing so.
> >
> > So this is my take on the below and why I'm open to it!
> >
> > > They are probably both under the read lock, but considering how rare it
> > > would be, would a racy flag check be enough - it is hardly critical to
> > > get right.  Either would reduce the probability.
> >
> > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > seeing such a tight and unusual race. Presumably some truly massive number
> > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> >
> > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > much I'm ok with it (of course, these are famous last words in the kernel
> > often...!)
> >
> > Liam - are you open to a solution on the basis above, or do you feel we
> > ought simply to fix the underlying issue here?
>
> At least this is a benign race.

Is this really a race or rather contention? IIUC exit_mmap() and the oom
reaper are trying to unmap the address space of the oom-killed process
and can compete on page table locks. If both are running concurrently on
two cpus then the contention can continue for the whole address space and
can slow down the actual memory freeing. Making the oom reaper traverse
in the opposite direction can drastically reduce the contention and give
faster memory freeing.

> I'd think using MMF_ to reduce the race
> would achieve the same goal with less risk - which is why I bring it up.

With the MMF_ flag, are you suggesting the oom reaper skip the unmapping
of the oom-killed process?

> Really, both methods should be low risk, so I'm fine with either way.
>
> But I am interested in hearing how this race is happening enough to
> necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> happens elsewhere then we're out of options.  Using the MMF_ flags is
> more of a scalable fix, if it achieves the same results.

On the question of whether this is a rare situation and worth the patch:
I would say this scenario is not that rare, particularly on low-memory
devices and on highly utilized, overcommitted systems. Memory pressure
and oom-kills are the norm on such systems. The point of the oom reaper
is to bring the system out of the oom situation quickly, and having two
cpus unmapping the oom-killed process can potentially bring the system
out of the oom situation faster.

I think the patch (with your suggestions) is simple enough and I don't
see any risk in including it.
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  From: Liam R. Howlett
  Date: 2025-08-27 04:12 UTC
  To: Shakeel Butt
  Cc: Lorenzo Stoakes, zhongjinji, mhocko, rientjes, akpm, linux-mm,
      linux-kernel, tglx, liulu.liu, feng.han

+ Cc Suren since he has worked on the exit_mmap() path a lot.

* Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > I really don't think this is worth doing.  We're avoiding a race between
> > > > oom and a task unmap - the MMF bits should be used to avoid this race -
> > > > or at least mitigate it.
> > >
> > > Yes for sure, as explored at length in previous discussions this feels like
> > > we're papering over cracks here.
> > >
> > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate
> > > issue even if it is that, as long as it doesn't cause issues in doing so.
> > >
> > > So this is my take on the below and why I'm open to it!
> > >
> > > > They are probably both under the read lock, but considering how rare it
> > > > would be, would a racy flag check be enough - it is hardly critical to
> > > > get right.  Either would reduce the probability.
> > >
> > > Zhongjinji - I'm still not sure that you've really indicated _why_ you're
> > > seeing such a tight and unusual race. Presumably some truly massive number
> > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > >
> > > But again, if we can safely fix this in a way that doesn't hurt stuff too
> > > much I'm ok with it (of course, these are famous last words in the kernel
> > > often...!)
> > >
> > > Liam - are you open to a solution on the basis above, or do you feel we
> > > ought simply to fix the underlying issue here?
> >
> > At least this is a benign race.
>
> Is this really a race or rather contention? IIUC exit_mmap() and the oom
> reaper are trying to unmap the address space of the oom-killed process
> and can compete on page table locks. If both are running concurrently on
> two cpus then the contention can continue for the whole address space and
> can slow down the actual memory freeing. Making the oom reaper traverse
> in the opposite direction can drastically reduce the contention and give
> faster memory freeing.

It is two readers of the vma tree racing to lock the page tables for
each vma, so I guess you can see it as contention as well.. but since
the pte is a split lock, I see it as racing through vmas to see who hits
which lock first.  The smart money is on the oom killer as it skips some
vmas :)

If it were just contention, then the loop direction wouldn't matter..
but I do see your point.

> > I'd think using MMF_ to reduce the race
> > would achieve the same goal with less risk - which is why I bring it up.
>
> With the MMF_ flag, are you suggesting the oom reaper skip the unmapping
> of the oom-killed process?

Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit
path to reduce the possibility of the race/contention.

> > Really, both methods should be low risk, so I'm fine with either way.
> >
> > But I am interested in hearing how this race is happening enough to
> > necessitate a fix.  Reversing the iterator is a one-spot fix - if this
> > happens elsewhere then we're out of options.  Using the MMF_ flags is
> > more of a scalable fix, if it achieves the same results.
>
> On the question of whether this is a rare situation and worth the patch:
> I would say this scenario is not that rare, particularly on low-memory
> devices and on highly utilized, overcommitted systems. Memory pressure
> and oom-kills are the norm on such systems. The point of the oom reaper
> is to bring the system out of the oom situation quickly, and having two
> cpus unmapping the oom-killed process can potentially bring the system
> out of the oom situation faster.

The exit_mmap() path used to run the oom reaper if it was an oom victim,
until recently [1].  The part that makes me nervous is the exit_mmap()
call to mmu_notifier_release(mm), while the oom reaper uses an
mmu_notifier.  I am not sure whether there is an ordering issue with
such things on any of the platforms, or what the associated cost of the
calls is.

I mean, it's already pretty crazy that we have this in the exit:

	mmap_read_lock()
	tlb_gather_mmu_fullmm()
	unmap vmas..
	mmap_read_unlock()
	mmap_write_lock()
	tlb_finish_mmu()..

So not only do we now have two tasks iterating over the vmas, but we
also have mmu notifiers and tlb calls happening across the ranges..  At
least doing all the work on a single cpu means that the hardware view is
consistent.  But I don't see this being worse than a forward race?

As it is written here, we'll have one CPU working in one direction while
the other works in the other, until both hit the end of the VMAs.  Only
when both tasks stop iterating the vmas can the exit continue, since it
requires the write lock.

So the tlb_finish_mmu() in exit_mmap() will always be called after
tlb_finish_mmu() on each individual vma has run in the
__oom_reap_task_mm() context (when the race happens).

There is also a window here, between exit_mmap() dropping the read
lock, setting MMF_OOM_SKIP, and taking the write lock - where the oom
reaper will iterate through a list of vmas with zero memory to free and
delay the task exiting.  That is, wasting cpu and stopping the memory
associated with the mm_struct (vmas and such) from being freed.

I'm also not sure on the cpu cache effects of what we are doing and how
much that would play into the speedup.  My guess is that it's
insignificant compared to the time we spend under the pte, but we have
no numbers to go on.

So I'd like to know how likely the simultaneous runs are and if there is
a measurable gain?

I agree that, at face value, two cpus should be able to split the work..
but I don't know about the notifier or the holding up of the mm_struct
associated memory.  And it could slow things down by holding up an
exiting task.

> I think the patch (with your suggestions) is simple enough and I don't
> see any risk in including it.

Actually, the more I look at this, the worse I feel about it.. Am I
overreacting?

Thanks,
Liam

[1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085
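For comparison, the MMF_-flag alternative raised above would amount to
something like the following in exit_mmap(). This is a hypothetical sketch
only, to make the trade-off concrete; the real exit path sets MMF_OOM_SKIP
later, and moving it earlier carries exactly the notifier/ordering
questions discussed in this message:

	mmap_read_lock(mm);
	/*
	 * Hypothetical: publish MMF_OOM_SKIP before tearing down the
	 * address space so that a concurrent oom_reaper backs off
	 * instead of racing this task through the same VMAs.
	 */
	set_bit(MMF_OOM_SKIP, &mm->flags);
	tlb_gather_mmu_fullmm(&tlb, mm);
	/* ... unmap all VMAs as exit_mmap() already does ... */
	mmap_read_unlock(mm);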
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  From: Liam R. Howlett
  Date: 2025-08-27 04:25 UTC
  To: Shakeel Butt, Lorenzo Stoakes, zhongjinji, mhocko, rientjes, akpm,
      linux-mm, linux-kernel, tglx, liulu.liu, feng.han, surenb@google.com

..actually add Suren this time.

* Liam R. Howlett <Liam.Howlett@oracle.com> [250827 00:12]:
> + Cc Suren since he has worked on the exit_mmap() path a lot.
>
> [snip - full quote of the previous message]
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order 2025-08-27 4:12 ` Liam R. Howlett 2025-08-27 4:25 ` Liam R. Howlett @ 2025-08-27 9:55 ` zhongjinji 2025-08-27 15:57 ` Suren Baghdasaryan 2025-08-29 7:11 ` Michal Hocko 2 siblings, 1 reply; 20+ messages in thread From: zhongjinji @ 2025-08-27 9:55 UTC (permalink / raw) To: liam.howlett Cc: akpm, feng.han, linux-kernel, linux-mm, liulu.liu, lorenzo.stoakes, mhocko, rientjes, shakeel.butt, surenb, tglx, zhongjinji > + Cc Suren since he has worked on the exit_mmap() path a lot. Thank you for your assistance. I realize now that I should have Cc Suren earlier. > * Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]: > > On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote: > > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]: > > > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote: > > > > > I really don't think this is worth doing. We're avoiding a race between > > > > > oom and a task unmap - the MMF bits should be used to avoid this race - > > > > > or at least mitigate it. > > > > > > > > Yes for sure, as explored at length in previous discussions this feels like > > > > we're papering over cracks here. > > > > > > > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate > > > > issue even if it is that, as long as it doesn't cause issues in doing so. > > > > > > > > So this is my take on the below and why I'm open to it! > > > > > > > > > > > > > > They are probably both under the read lock, but considering how rare it > > > > > would be, would a racy flag check be enough - it is hardly critical to > > > > > get right. Either would reduce the probability. > > > > > > > > Zongjinji - I'm stil not sure that you've really indicated _why_ you're > > > > seeing such a tight and unusual race. Presumably some truly massive number > > > > of tasks being OOM'd and unmapping but... yeah that seems odd anyway. > > > > > > > > But again, if we can safely fix this in a way that doesn't hurt stuff too > > > > much I'm ok with it (of course, these are famous last words in the kernel > > > > often...!) > > > > > > > > Liam - are you open to a solution on the basis above, or do you feel we > > > > ought simply to fix the underlying issue here? > > > > > > At least this is a benign race. > > > > Is this really a race or rather a contention? IIUC exit_mmap and the oom > > reaper are trying to unmap the address space of the oom-killed process > > and can compete on page table locks. If both are running concurrently on > > two cpus then the contention can continue for whole address space and > > can slow down the actual memory freeing. Making oom reaper traverse in > > opposite direction can drastically reduce the contention and faster > > memory freeing. > > It is two readers of the vma tree racing to lock the page tables for > each vma, so I guess you can see it as contention as well.. but since > the pte is a split lock, I see it as racing through vmas to see who hits > which lock first. The smart money is on the oom killer as it skips some > vmas :) > > If it were just contention, then the loop direction wouldn't matter.. > but I do see your point. > > > > I'd think using MMF_ to reduce the race > > > would achieve the same goal with less risk - which is why I bring it up. > > > > > > > With MMF_ flag, are you suggesting oom reaper to skip the unmapping of > > the oom-killed process? 
> > Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit > path to reduce the possibility of the race/contention. > > > > > > Really, both methods should be low risk, so I'm fine with either way. > > > > > > But I am interested in hearing how this race is happening enough to > > > necessitate a fix. Reversing the iterator is a one-spot fix - if this > > > happens elsewhere then we're out of options. Using the MMF_ flags is > > > more of a scalable fix, if it achieves the same results. > > > > On the question of if this is a rare situaion and worth the patch. I > > would say this scenario is not that rare particularly on low memory > > devices and on highly utilized overcommitted systems. Memory pressure > > and oom-kills are norm on such systems. The point of oom reaper is to > > bring the system out of the oom situation quickly and having two cpus > > unmapping the oom-killed process can potentially bring the system out of > > oom situation faster. > > The exit_mmap() path used to run the oom reaper if it was an oom victim, > until recently [1]. The part that makes me nervous is the exit_mmap() > call to mmu_notifier_release(mm), while the oom reaper uses an > mmu_notifier. I am not sure if there is an issue in ordering on any of > the platforms of such things. Or the associated cost of the calls. > > I mean, it's already pretty crazy that we have this in the exit: > mmap_read_lock() > tlb_gather_mmu_fullmm() > unmap vmas.. > mmap_read_unlock() > mmap_write_lock() > tlb_finish_mmu().. > > So not only do we now have two tasks iterating over the vmas, but we > also have mmu notifiers and tlb calls happening across the ranges.. At > least doing all the work on a single cpu means that the hardware view is > consistent. But I don't see this being worse than a forward race? > > As it is written here, we'll have one CPU working in one direction while > the other works in the other, until both hit the end of the VMAs. Only > when both tasks stop iterating the vmas can the exit continue since it > requires the write lock. > > So the tlb_finish_mmu() in exit_mmap() will always be called after > tlb_finish_mmu() on each individual vma has run in the > __oom_reap_task_mm() context (when the race happens). > > There is also a window here, between the exit_mmap() dropping the read > lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer > will iterate through a list of vmas with zero memory to free and delay > the task exiting. That is, wasting cpu and stopping the memory > associated with the mm_struct (vmas and such) from being freed. > > I'm also not sure on the cpu cache effects of what we are doing and how > much that would play into the speedup. My guess is that it's > insignificant compared to the time we spend under the pte, but we have > no numbers to go on. > > So I'd like to know how likely the simultaneous runs are and if there is > a measurable gain? Since process killing events are very frequent on Android, the likelihood of exit_mmap and reaper work (not only OOM, but also some proactive reaping actions such as process_mrelease) occurring at the same time is much higher. When lmkd kills a process, it actively reaps the process using process_mrelease, similar to the way the OOM reaper works. Surenb may be able to clarify this point, as he is the author of lmkd. I referenced this data in link[1], and I should have included it in the cover letter. The attached test data was collected on Android. 
Before testing, in order to simulate the OOM kill process, I intercepted the kill signal and added the killed process to the OOM reaper queue.

The reproduction steps are as follows:
1. Start a process.
2. Kill the process.
3. Capture a perfetto trace.

Below are the load benefit data, measured by process running time.

Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the same process, and they are the last threads to exit.

Thread           TID    State    Wall duration (ms)  total running
# with oom reaper but traverse reverse
#RxComputationT  13708  Running  60.690572
oom_reaper       81     Running  46.492032           107.182604

# with oom reaper
vdp:vidtask:m    14040  Running  81.848297
oom_reaper       81     Running  69.32               151.168297

# without oom reaper
tp-background    12424  Running  106.021874          106.021874

Compared to reaping in the same traversal direction, the reverse traversal provides approximately a 30% load benefit (107.18 ms vs 151.17 ms total running time).
Compared to not reaping at all when a process is killed, it can release memory earlier, with roughly 40% faster memory release.

[1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/

> I agree, that at face value, two cpus should be able to split the work.. but I don't know about the notifier or the holding up of the mm_struct associated memory. And it could slow things down by holding up an exiting task.
>
> > I think the patch (with your suggestions) is simple enough and I don't see any risk in including it.
>
> Actually, the more I look at this, the worse I feel about it.. Am I overreacting?
>
> Thanks,
> Liam
>
> [1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085

^ permalink raw reply [flat|nested] 20+ messages in thread
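For readers following the thread, here is a minimal sketch of the reverse walk under discussion, assuming the maple tree iteration API that the thread itself names (MA_STATE()/mas_for_each_rev()). It abridges the real __oom_reap_task_mm() and is not the literal patch; the skip checks and per-vma reaping are elided.

	/*
	 * Sketch only: walk the victim's VMAs from the highest address
	 * down while exit_mmap() walks from the lowest address up, so
	 * the two tasks meet once in the middle instead of contending
	 * for the same pte spinlocks vma by vma along an identical path.
	 */
	static bool oom_reap_reverse_sketch(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;
		MA_STATE(mas, &mm->mm_mt, ULONG_MAX, ULONG_MAX);

		mas_for_each_rev(&mas, vma, 0) {
			/* same skips as the existing forward walk */
			if (vma->vm_flags & (VM_HUGETLB | VM_PFNMAP))
				continue;
			/* ... mmu notifier range + unmap_page_range() per vma ... */
		}
		return true;
	}

The only behavioral change is the iteration order; the per-vma work is untouched.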
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-27  9:55 ` zhongjinji
@ 2025-08-27 15:57 ` Suren Baghdasaryan
  2025-08-28  0:38 ` Liam R. Howlett
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2025-08-27 15:57 UTC (permalink / raw)
To: zhongjinji
Cc: liam.howlett, akpm, feng.han, linux-kernel, linux-mm, liulu.liu, lorenzo.stoakes, mhocko, rientjes, shakeel.butt, tglx

On Wed, Aug 27, 2025 at 2:55 AM zhongjinji <zhongjinji@honor.com> wrote:
>
> > + Cc Suren since he has worked on the exit_mmap() path a lot.
>
> Thank you for your assistance. I realize now that I should have Cc'd Suren earlier.

Thanks for adding me!

> > * Shakeel Butt <shakeel.butt@linux.dev> [250826 18:26]:
> > > On Tue, Aug 26, 2025 at 11:21:13AM -0400, Liam R. Howlett wrote:
> > > > * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [250826 09:50]:
> > > > > On Tue, Aug 26, 2025 at 09:37:22AM -0400, Liam R. Howlett wrote:
> > > > > > I really don't think this is worth doing. We're avoiding a race between oom and a task unmap - the MMF bits should be used to avoid this race - or at least mitigate it.
> > > > >
> > > > > Yes for sure, as explored at length in previous discussions this feels like we're papering over cracks here.
> > > > >
> > > > > _However_, I'm sort of ok with a minimalistic fix that solves the proximate issue even if it is that, as long as it doesn't cause issues in doing so.
> > > > >
> > > > > So this is my take on the below and why I'm open to it!
> > > > >
> > > > > > They are probably both under the read lock, but considering how rare it would be, would a racy flag check be enough - it is hardly critical to get right. Either would reduce the probability.
> > > > >
> > > > > Zhongjinji - I'm still not sure that you've really indicated _why_ you're seeing such a tight and unusual race. Presumably some truly massive number of tasks being OOM'd and unmapping but... yeah that seems odd anyway.
> > > > >
> > > > > But again, if we can safely fix this in a way that doesn't hurt stuff too much I'm ok with it (of course, these are famous last words in the kernel often...!)
> > > > >
> > > > > Liam - are you open to a solution on the basis above, or do you feel we ought simply to fix the underlying issue here?
> > > >
> > > > At least this is a benign race.
> > >
> > > Is this really a race or rather contention? IIUC exit_mmap and the oom reaper are both trying to unmap the address space of the oom-killed process and can compete on page table locks. If both are running concurrently on two cpus then the contention can continue across the whole address space and can slow down the actual memory freeing. Making the oom reaper traverse in the opposite direction can drastically reduce the contention and speed up memory freeing.
> >
> > It is two readers of the vma tree racing to lock the page tables for each vma, so I guess you can see it as contention as well.. but since the pte is a split lock, I see it as racing through vmas to see who hits which lock first. The smart money is on the oom killer as it skips some vmas :)
> >
> > If it were just contention, then the loop direction wouldn't matter.. but I do see your point.
> >
> > > > I'd think using MMF_ to reduce the race would achieve the same goal with less risk - which is why I bring it up.
> > > With the MMF_ flag, are you suggesting the oom reaper skip the unmapping of the oom-killed process?
> >
> > Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit path to reduce the possibility of the race/contention.
> >
> > > > Really, both methods should be low risk, so I'm fine with either way.
> > > >
> > > > But I am interested in hearing how this race is happening enough to necessitate a fix. Reversing the iterator is a one-spot fix - if this happens elsewhere then we're out of options. Using the MMF_ flags is more of a scalable fix, if it achieves the same results.
> > >
> > > On the question of whether this is a rare situation and worth the patch: I would say this scenario is not that rare, particularly on low memory devices and on highly utilized overcommitted systems. Memory pressure and oom-kills are the norm on such systems. The point of the oom reaper is to bring the system out of the oom situation quickly, and having two cpus unmapping the oom-killed process can potentially bring the system out of the oom situation faster.
> >
> > The exit_mmap() path used to run the oom reaper if it was an oom victim, until recently [1]. The part that makes me nervous is the exit_mmap() call to mmu_notifier_release(mm), while the oom reaper uses an mmu_notifier. I am not sure whether there is an ordering issue with such things on any of the platforms, or what the associated cost of the calls is.
> >
> > I mean, it's already pretty crazy that we have this in the exit:
> > mmap_read_lock()
> > tlb_gather_mmu_fullmm()
> > unmap vmas..
> > mmap_read_unlock()
> > mmap_write_lock()
> > tlb_finish_mmu()..
> >
> > So not only do we now have two tasks iterating over the vmas, but we also have mmu notifiers and tlb calls happening across the ranges.. At least doing all the work on a single cpu means that the hardware view is consistent. But I don't see this being worse than a forward race?

This part seems to have changed quite a bit since I last looked into it closely, and it's worth re-checking; however, that seems orthogonal to what this patch is trying to do.

> >
> > As it is written here, we'll have one CPU working in one direction while the other works in the other, until both hit the end of the VMAs. Only when both tasks stop iterating the vmas can the exit continue, since it requires the write lock.
> >
> > So the tlb_finish_mmu() in exit_mmap() will always be called after tlb_finish_mmu() on each individual vma has run in the __oom_reap_task_mm() context (when the race happens).

Correct.

> >
> > There is also a window here, between the exit_mmap() dropping the read lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer will iterate through a list of vmas with zero memory to free and delay the task exiting. That is, wasting cpu and stopping the memory associated with the mm_struct (vmas and such) from being freed.

Might be an opportunity to optimize but again, this is happening with or without this patch, no?

> >
> > I'm also not sure on the cpu cache effects of what we are doing and how much that would play into the speedup. My guess is that it's insignificant compared to the time we spend under the pte, but we have no numbers to go on.
> >
> > So I'd like to know how likely the simultaneous runs are and if there is a measurable gain?
>
> Since process killing events are very frequent on Android, the likelihood of exit_mmap and reaper work (not only OOM, but also some proactive reaping actions such as process_mrelease) occurring at the same time is much higher. When lmkd kills a process, it actively reaps the process using process_mrelease, similar to the way the OOM reaper works. Surenb may be able to clarify this point, as he is the author of lmkd.

Yes, on Android process_mrelease() is used after lmkd kills a process to expedite memory release in case the victim process is blocked for some reason. This makes the race between __oom_reap_task_mm() and exit_mmap() much more frequent. That is probably the disconnect in thinking that this race is rare. I don't see any harm in __oom_reap_task_mm() walking the tree backwards to minimize the contention with exit_mmap(). Liam, is there a performance difference between mas_for_each_rev() and mas_for_each()?

> I referenced this data in link [1], and I should have included it in the cover letter. The attached test data was collected on Android. Before testing, in order to simulate the OOM kill process, I intercepted the kill signal and added the killed process to the OOM reaper queue.
>
> The reproduction steps are as follows:
> 1. Start a process.
> 2. Kill the process.
> 3. Capture a perfetto trace.
>
> Below are the load benefit data, measured by process running time.
>
> Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the same process, and they are the last threads to exit.
>
> Thread           TID    State    Wall duration (ms)  total running
> # with oom reaper but traverse reverse
> #RxComputationT  13708  Running  60.690572
> oom_reaper       81     Running  46.492032           107.182604
>
> # with oom reaper
> vdp:vidtask:m    14040  Running  81.848297
> oom_reaper       81     Running  69.32               151.168297
>
> # without oom reaper
> tp-background    12424  Running  106.021874          106.021874
>
> Compared to reaping in the same traversal direction, the reverse traversal provides approximately a 30% load benefit (107.18 ms vs 151.17 ms total running time).
> Compared to not reaping at all when a process is killed, it can release memory earlier, with roughly 40% faster memory release.

That looks like a nice performance improvement for reaping the memory with low churn and risk.

> [1] https://lore.kernel.org/all/20250815163207.7078-1-zhongjinji@honor.com/
>
> > I agree, that at face value, two cpus should be able to split the work.. but I don't know about the notifier or the holding up of the mm_struct associated memory. And it could slow things down by holding up an exiting task.
> >
> > > I think the patch (with your suggestions) is simple enough and I don't see any risk in including it.
> >
> > Actually, the more I look at this, the worse I feel about it.. Am I overreacting?
> >
> > Thanks,
> > Liam
> >
> > [1] https://elixir.bootlin.com/linux/v6.0.19/source/mm/mmap.c#L3085

^ permalink raw reply [flat|nested] 20+ messages in thread
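The lmkd flow Suren describes can be sketched from userspace roughly as follows. pidfd_open(2), pidfd_send_signal(2), and process_mrelease(2) are the real syscalls; the helper name is ours and error handling is trimmed, so treat this as a sketch rather than lmkd's actual code.

	#include <signal.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Kill a process via its pidfd, then reap its memory without
	 * waiting for the victim's own exit_mmap() to run. */
	static int kill_and_reap(pid_t pid)
	{
		int pidfd = syscall(SYS_pidfd_open, pid, 0);

		if (pidfd < 0)
			return -1;
		/* The pidfd avoids pid-reuse races between kill and reap. */
		if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) == 0)
			syscall(SYS_process_mrelease, pidfd, 0);
		close(pidfd);
		return 0;
	}

The reap runs in the caller's context but takes the same reaping path in the kernel as the oom reaper, which is why it races with the victim's exit_mmap() in exactly the way discussed above.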
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-27 15:57 ` Suren Baghdasaryan
@ 2025-08-28  0:38 ` Liam R. Howlett
  0 siblings, 0 replies; 20+ messages in thread
From: Liam R. Howlett @ 2025-08-28 0:38 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: zhongjinji, akpm, feng.han, linux-kernel, linux-mm, liulu.liu, lorenzo.stoakes, mhocko, rientjes, shakeel.butt, tglx

* Suren Baghdasaryan <surenb@google.com> [250827 11:57]:

...

> > > The exit_mmap() path used to run the oom reaper if it was an oom victim, until recently [1]. The part that makes me nervous is the exit_mmap() call to mmu_notifier_release(mm), while the oom reaper uses an mmu_notifier. I am not sure whether there is an ordering issue with such things on any of the platforms, or what the associated cost of the calls is.
> > >
> > > I mean, it's already pretty crazy that we have this in the exit:
> > > mmap_read_lock()
> > > tlb_gather_mmu_fullmm()
> > > unmap vmas..
> > > mmap_read_unlock()
> > > mmap_write_lock()
> > > tlb_finish_mmu()..
> > >
> > > So not only do we now have two tasks iterating over the vmas, but we also have mmu notifiers and tlb calls happening across the ranges.. At least doing all the work on a single cpu means that the hardware view is consistent. But I don't see this being worse than a forward race?
>
> This part seems to have changed quite a bit since I last looked into it closely, and it's worth re-checking; however, that seems orthogonal to what this patch is trying to do.

I was concerned about how a reverse iterator may affect what is considered accurate for the mmu/tlb, so I thought it worth pointing out.

...

> > > There is also a window here, between the exit_mmap() dropping the read lock, setting MMF_OOM_SKIP, and taking the lock - where the oom killer will iterate through a list of vmas with zero memory to free and delay the task exiting. That is, wasting cpu and stopping the memory associated with the mm_struct (vmas and such) from being freed.
>
> Might be an opportunity to optimize but again, this is happening with or without this patch, no?

Correct, but with the numbers it looks to be better to go with two loops.

> > > I'm also not sure on the cpu cache effects of what we are doing and how much that would play into the speedup. My guess is that it's insignificant compared to the time we spend under the pte, but we have no numbers to go on.
> > >
> > > So I'd like to know how likely the simultaneous runs are and if there is a measurable gain?
> >
> > Since process killing events are very frequent on Android, the likelihood of exit_mmap and reaper work (not only OOM, but also some proactive reaping actions such as process_mrelease) occurring at the same time is much higher. When lmkd kills a process, it actively reaps the process using process_mrelease, similar to the way the OOM reaper works. Surenb may be able to clarify this point, as he is the author of lmkd.
>
> Yes, on Android process_mrelease() is used after lmkd kills a process to expedite memory release in case the victim process is blocked for some reason. This makes the race between __oom_reap_task_mm() and exit_mmap() much more frequent. That is probably the disconnect in thinking that this race is rare. I don't see any harm in __oom_reap_task_mm() walking the tree backwards to minimize the contention with exit_mmap().
> Liam, is there a performance difference between mas_for_each_rev() and mas_for_each()?

There should be no performance difference.

> > I referenced this data in link [1], and I should have included it in the cover letter. The attached test data was collected on Android. Before testing, in order to simulate the OOM kill process, I intercepted the kill signal and added the killed process to the OOM reaper queue.

Sorry I missed your response in v4 on this.

> > The reproduction steps are as follows:
> > 1. Start a process.
> > 2. Kill the process.
> > 3. Capture a perfetto trace.
> >
> > Below are the load benefit data, measured by process running time.
> >
> > Note: #RxComputationT, vdp:vidtask:m, and tp-background are threads of the same process, and they are the last threads to exit.
> >
> > Thread           TID    State    Wall duration (ms)  total running
> > # with oom reaper but traverse reverse
> > #RxComputationT  13708  Running  60.690572
> > oom_reaper       81     Running  46.492032           107.182604
> >
> > # with oom reaper
> > vdp:vidtask:m    14040  Running  81.848297
> > oom_reaper       81     Running  69.32               151.168297
> >
> > # without oom reaper
> > tp-background    12424  Running  106.021874          106.021874
> >
> > Compared to reaping in the same traversal direction, the reverse traversal provides approximately a 30% load benefit (107.18 ms vs 151.17 ms total running time).
> > Compared to not reaping at all when a process is killed, it can release memory earlier, with roughly 40% faster memory release.
>
> That looks like a nice performance improvement for reaping the memory with low churn and risk.

Agreed. Please include the numbers in the change log in the next revision so it is recorded in git.

I think all my questions are resolved, thanks. I look forward to v6.

Regards,
Liam

^ permalink raw reply [flat|nested] 20+ messages in thread
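For context on why no difference is expected: both iterators are thin while-loop wrappers, one around mas_find() and one around mas_find_rev(). A usage sketch, assuming the maple tree API:

	struct vm_area_struct *vma;

	MA_STATE(mas_fwd, &mm->mm_mt, 0, 0);
	mas_for_each(&mas_fwd, vma, ULONG_MAX) {
		/* low -> high: the order exit_mmap() uses */
	}

	MA_STATE(mas_rev, &mm->mm_mt, ULONG_MAX, ULONG_MAX);
	mas_for_each_rev(&mas_rev, vma, 0) {
		/* high -> low: the order the reaper switches to */
	}

Walking a leaf node forward or backward costs the same; only the starting index and direction differ.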
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-27  4:12 ` Liam R. Howlett
  2025-08-27  4:25 ` Liam R. Howlett
  2025-08-27  9:55 ` zhongjinji
@ 2025-08-29  7:11 ` Michal Hocko
  2 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2025-08-29 7:11 UTC (permalink / raw)
To: Liam R. Howlett
Cc: Shakeel Butt, Lorenzo Stoakes, zhongjinji, rientjes, akpm, linux-mm, linux-kernel, tglx, liulu.liu, feng.han

On Wed 27-08-25 00:12:34, Liam R. Howlett wrote:
[...]
> > > I'd think using MMF_ to reduce the race would achieve the same goal with less risk - which is why I bring it up.
> >
> > With the MMF_ flag, are you suggesting the oom reaper skip the unmapping of the oom-killed process?
>
> Yes, specifically move the MMF_OOM_SKIP flag to earlier in the exit path to reduce the possibility of the race/contention.

MMF_OOM_SKIP is a "hide from the oom killer" flag, and that means that as soon as you set it, the next oom killer invocation would kill another task while the current victim hasn't freed up its memory yet; you therefore get a potentially pointless kill.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 20+ messages in thread
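The semantics Michal describes live in the victim-selection path; roughly, abridged from oom_evaluate_task() in mm/oom_kill.c (details vary by kernel version):

	/* An existing victim normally makes the oom killer back off and
	 * wait, but MMF_OOM_SKIP makes selection move on to a new victim.
	 * Setting the flag before the memory is actually freed therefore
	 * risks a second, pointless kill. */
	if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
		if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
			goto next;	/* victim hidden: pick another task */
		goto abort;		/* otherwise wait for this victim */
	}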
* Re: [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order
  2025-08-26 12:53 ` Lorenzo Stoakes
  2025-08-26 13:37 ` Liam R. Howlett
@ 2025-08-29  7:14 ` Michal Hocko
  1 sibling, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2025-08-29 7:14 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: zhongjinji, rientjes, shakeel.butt, akpm, linux-mm, linux-kernel, tglx, liam.howlett, liulu.liu, feng.han

On Tue 26-08-25 13:53:43, Lorenzo Stoakes wrote:
> On Mon, Aug 25, 2025 at 09:38:55PM +0800, zhongjinji wrote:
> > When a process is OOM killed without reaper delay, the oom reaper and the exit_mmap() thread likely run simultaneously. They traverse the vma's maple tree along the same path and may easily unmap the same vma, causing them to compete for the pte spinlock.
> >
> > When a process exits, exit_mmap() traverses the vma's maple tree from low to high addresses. To reduce the chance of unmapping the same vma simultaneously, the OOM reaper should traverse the vma's tree from high to low addresses.
> >
> > Signed-off-by: zhongjinji <zhongjinji@honor.com>
>
> I will leave it to Liam to confirm the maple tree bit is ok, but I guess I'm softening to the idea of doing this - because it should have no impact on most users, so even if it's some rare edge case that triggers the situation, then it's worth doing it in reverse just to help you guys out :)

I tend to agree on this. I would expect the oom_reaper to race with exit_mmap only for seriously stalled processes or huge memory consumers where exit_mmap takes a while. The oom_reaper would be quick to catch up with exit_mmap, as it wouldn't have much work to do on already released memory. Going from the other end of the address space might reduce this chance while not really introducing any actual issues.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2025-08-29  7:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-25 13:38 [PATCH v5 0/2] Do not delay oom reaper when the victim is frozen zhongjinji
2025-08-25 13:38 ` [PATCH v5 1/2] mm/oom_kill: " zhongjinji
2025-08-25 19:41 ` Shakeel Butt
2025-08-26 13:01 ` zhongjinji
2025-08-26 12:43 ` Lorenzo Stoakes
2025-08-27 12:08 ` Michal Hocko
2025-08-27 12:14 ` zhongjinji
2025-08-25 13:38 ` [PATCH v5 2/2] mm/oom_kill: Have the OOM reaper and exit_mmap() traverse the maple tree in opposite order zhongjinji
2025-08-26 12:53 ` Lorenzo Stoakes
2025-08-26 13:37 ` Liam R. Howlett
2025-08-26 13:50 ` Lorenzo Stoakes
2025-08-26 15:21 ` Liam R. Howlett
2025-08-26 22:26 ` Shakeel Butt
2025-08-27  4:12 ` Liam R. Howlett
2025-08-27  4:25 ` Liam R. Howlett
2025-08-27  9:55 ` zhongjinji
2025-08-27 15:57 ` Suren Baghdasaryan
2025-08-28  0:38 ` Liam R. Howlett
2025-08-29  7:11 ` Michal Hocko
2025-08-29  7:14 ` Michal Hocko