* [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
@ 2025-04-16 8:24 Tianyang Zhang
2025-04-21 10:00 ` Harry Yoo
0 siblings, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-16 8:24 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, linux-kernel, Tianyang Zhang
__alloc_pages_slowpath has no change detection for ac->nodemask
in its retry path, while cpuset can modify it in parallel.
For some processes that set mempolicy to MPOL_BIND, this results in
ac->nodemask changing. should_reclaim_retry will then
judge based on the latest nodemask and jump to retry, while
get_page_from_freelist only traverses the zonelist from
ac->preferred_zoneref, which was selected using an expired nodemask.
This may cause infinite retries in some cases.
    cpu 64:
    __alloc_pages_slowpath {
        /* ..... */
    retry:
        /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
        if (alloc_flags & ALLOC_KSWAPD)
            wake_all_kswapds(order, gfp_mask, ac);
        /* cpu 1:
        cpuset_write_resmask
          update_nodemask
            update_nodemasks_hier
              update_tasks_nodemask
                mpol_rebind_task
                  mpol_rebind_policy
                    mpol_rebind_nodemask
                      // mempolicy->nodes has been modified,
                      // which ac->nodemask point to
        */
        /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
            goto retry;
    }
Starting multiple instances of the LTP cpuset01 test simultaneously
quickly reproduces this issue on a multi-node server when memory
pressure reaches its maximum and swap is enabled.
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd6b865cb1ab..1e82f5214a42 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 retry:
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * infinite retries.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;
+
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
--
2.20.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
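The livelock described in the patch above can be modelled in user space.
What follows is a minimal sketch under stated assumptions, not kernel code:
a two-node machine, a zonelist that is simply an array of node ids, and a
"concurrent" cpuset write that is injected deterministically after the
first failed allocation attempt. Every name here (struct model,
first_allowed, try_alloc, should_retry, cpuset_update, slowpath_model) is
a hypothetical stand-in chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 2

/* Hypothetical, heavily simplified stand-ins for allocator state. */
struct model {
	unsigned long nodemask;             /* plays the role of *ac->nodemask */
	unsigned int mems_seq;              /* bumped on every simulated cpuset update */
	unsigned long free_pages[NR_NODES]; /* free pages per node */
	int zonelist[NR_NODES];             /* node ids in fallback order */
};

/* First zonelist index at or after 'start' that the nodemask allows. */
static int first_allowed(const struct model *m, int start)
{
	for (int i = start; i < NR_NODES; i++)
		if (m->nodemask & (1UL << m->zonelist[i]))
			return i;
	return -1;
}

/* Walks only from the cached preferred index onward. */
static bool try_alloc(struct model *m, int preferred)
{
	if (preferred < 0)
		return false;
	for (int i = first_allowed(m, preferred); i >= 0;
	     i = first_allowed(m, i + 1)) {
		int nid = m->zonelist[i];

		if (m->free_pages[nid] > 0) {
			m->free_pages[nid]--;
			return true;
		}
	}
	return false;
}

/* Walks the whole zonelist using the current nodemask. */
static bool should_retry(const struct model *m)
{
	for (int i = first_allowed(m, 0); i >= 0; i = first_allowed(m, i + 1))
		if (m->free_pages[m->zonelist[i]] > 0)
			return true;
	return false;
}

/* Simulated cpuset write: change the mask and bump the sequence. */
static void cpuset_update(struct model *m, unsigned long new_mask)
{
	m->nodemask = new_mask;
	m->mems_seq++;
}

/*
 * Returns the number of passes through 'retry' needed to allocate,
 * or -1 if max_loops passes were used up or no retry was warranted.
 */
int slowpath_model(struct model *m, bool check_at_retry, int max_loops)
{
	unsigned int cookie;
	int preferred;
	int loops = 0;
	bool raced = false;

	assert(max_loops > 0);
restart:
	cookie = m->mems_seq;            /* snapshot, like the cpuset cookie */
	preferred = first_allowed(m, 0); /* cached, like ac->preferred_zoneref */
retry:
	if (loops++ >= max_loops)
		return -1;
	/* The check the patch adds at the top of the retry loop. */
	if (check_at_retry && cookie != m->mems_seq)
		goto restart;
	if (try_alloc(m, preferred))
		return loops;
	if (!raced) {
		raced = true;
		cpuset_update(m, 0x3);   /* mask widens while we are looping */
	}
	if (should_retry(m))
		goto retry;
	return -1;
}
```

With zonelist {0, 1}, an initial nodemask of 0x2 and free pages only on
node 0, the model without the retry-entry check keeps looping until it
hits the max_loops cap and returns -1: the cached index never reaches
node 0, yet should_retry keeps seeing free memory there. With the check
enabled it restarts once, recomputes the preferred index and succeeds on
the third pass.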
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-16 8:24 [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race Tianyang Zhang
@ 2025-04-21 10:00 ` Harry Yoo
2025-04-21 20:28 ` Suren Baghdasaryan
2025-04-22 12:10 ` Tianyang Zhang
0 siblings, 2 replies; 16+ messages in thread
From: Harry Yoo @ 2025-04-21 10:00 UTC (permalink / raw)
To: Tianyang Zhang
Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan
On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> __alloc_pages_slowpath has no change detection for ac->nodemask
> in its retry path, while cpuset can modify it in parallel.
> For some processes that set mempolicy to MPOL_BIND, this results in
> ac->nodemask changing. should_reclaim_retry will then
> judge based on the latest nodemask and jump to retry, while
> get_page_from_freelist only traverses the zonelist from
> ac->preferred_zoneref, which was selected using an expired nodemask.
> This may cause infinite retries in some cases.
>
> cpu 64:
> __alloc_pages_slowpath {
> /* ..... */
> retry:
> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> /* cpu 1:
> cpuset_write_resmask
> update_nodemask
> update_nodemasks_hier
> update_tasks_nodemask
> mpol_rebind_task
> mpol_rebind_policy
> mpol_rebind_nodemask
> // mempolicy->nodes has been modified,
> // which ac->nodemask point to
>
> */
> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> did_some_progress > 0, &no_progress_loops))
> goto retry;
> }
>
> Starting multiple instances of the LTP cpuset01 test simultaneously
> quickly reproduces this issue on a multi-node server when memory
> pressure reaches its maximum and swap is enabled.
>
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> ---
What commit does it fix and should it be backported to -stable?
There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
Andrew's mm.git repository now).
Let's Cc the page allocator folks here!
--
Cheers,
Harry / Hyeonggon
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fd6b865cb1ab..1e82f5214a42 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> }
>
> retry:
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * infinite retries.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> + goto restart;
> +
> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> --
> 2.20.1
>
>
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-21 10:00 ` Harry Yoo
@ 2025-04-21 20:28 ` Suren Baghdasaryan
2025-04-23 2:38 ` Tianyang Zhang
2025-04-22 12:10 ` Tianyang Zhang
1 sibling, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-21 20:28 UTC (permalink / raw)
To: Harry Yoo
Cc: Tianyang Zhang, akpm, linux-mm, linux-kernel, Vlastimil Babka,
Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan
On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> > __alloc_pages_slowpath has no change detection for ac->nodemask
> > in its retry path, while cpuset can modify it in parallel.
> > For some processes that set mempolicy to MPOL_BIND, this results in
> > ac->nodemask changing. should_reclaim_retry will then
> > judge based on the latest nodemask and jump to retry, while
> > get_page_from_freelist only traverses the zonelist from
> > ac->preferred_zoneref, which was selected using an expired nodemask.
> > This may cause infinite retries in some cases.
> >
> > cpu 64:
> > __alloc_pages_slowpath {
> > /* ..... */
> > retry:
> > /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> > if (alloc_flags & ALLOC_KSWAPD)
> > wake_all_kswapds(order, gfp_mask, ac);
> > /* cpu 1:
> > cpuset_write_resmask
> > update_nodemask
> > update_nodemasks_hier
> > update_tasks_nodemask
> > mpol_rebind_task
> > mpol_rebind_policy
> > mpol_rebind_nodemask
> > // mempolicy->nodes has been modified,
> > // which ac->nodemask point to
> >
> > */
> > /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> > if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> > did_some_progress > 0, &no_progress_loops))
> > goto retry;
> > }
> >
> > Starting multiple instances of the LTP cpuset01 test simultaneously
> > quickly reproduces this issue on a multi-node server when memory
> > pressure reaches its maximum and swap is enabled.
> >
> > Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > ---
>
> What commit does it fix and should it be backported to -stable?
I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
due to race with cpuset update").
>
> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> Andrew's mm.git repository now).
>
> Let's Cc the page allocator folks here!
>
> --
> Cheers,
> Harry / Hyeonggon
>
> > mm/page_alloc.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index fd6b865cb1ab..1e82f5214a42 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > }
> >
> > retry:
> > + /*
> > + * Deal with possible cpuset update races or zonelist updates to avoid
> > + * infinite retries.
> > + */
> > + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> > + check_retry_zonelist(zonelist_iter_cookie))
> > + goto restart;
> > +
We have this check later in this block:
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
so IIUC you effectively are moving it to be called before
should_reclaim_retry(). If so, I think you should remove the old one
(the one I linked earlier) as it seems to be unnecessary duplication
at this point.
> > /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> > if (alloc_flags & ALLOC_KSWAPD)
> > wake_all_kswapds(order, gfp_mask, ac);
> > --
> > 2.20.1
> >
> >
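As background for the check being discussed: the cookies passed to
check_retry_cpuset() and check_retry_zonelist() behave like snapshots of a
sequence counter, taken after the restart label and compared later. The
sketch below illustrates that general pattern with C11 atomics. It is an
assumption-laden model, not the kernel implementation: struct seq_model
and the model_* functions are hypothetical names, and the real helpers
(for example read_mems_allowed_begin() and read_mems_allowed_retry())
differ in detail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a sequence counter guarding a nodemask. */
struct seq_model {
	atomic_uint seq;       /* even: stable, odd: update in progress */
	atomic_ulong nodemask; /* the data the reader depends on */
};

void model_init(struct seq_model *s, unsigned long mask)
{
	atomic_init(&s->seq, 0);
	atomic_init(&s->nodemask, mask);
}

/* Writer: make the counter odd, publish the new mask, make it even again. */
void model_update(struct seq_model *s, unsigned long new_mask)
{
	atomic_fetch_add(&s->seq, 1);
	atomic_store(&s->nodemask, new_mask);
	atomic_fetch_add(&s->seq, 1);
}

/* Reader: wait for an even counter value and return it as the cookie. */
unsigned int model_begin(struct seq_model *s)
{
	unsigned int v;

	while ((v = atomic_load(&s->seq)) & 1u)
		; /* spin while an update is in flight */
	return v;
}

/* True if any update started after the cookie was taken. */
bool model_retry(struct seq_model *s, unsigned int cookie)
{
	assert((cookie & 1u) == 0); /* cookies from model_begin are even */
	return atomic_load(&s->seq) != cookie;
}
```

A reader that cached a decision derived from the nodemask calls
model_retry() with its cookie before trusting that decision again. Where
in the loop that call is placed is exactly what this review is about.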
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-21 10:00 ` Harry Yoo
2025-04-21 20:28 ` Suren Baghdasaryan
@ 2025-04-22 12:10 ` Tianyang Zhang
2025-04-23 0:11 ` Andrew Morton
1 sibling, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-22 12:10 UTC (permalink / raw)
To: Harry Yoo
Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan
Hi.
On 2025/4/21 at 6:00 PM, Harry Yoo wrote:
> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
>> __alloc_pages_slowpath has no change detection for ac->nodemask
>> in its retry path, while cpuset can modify it in parallel.
>> For some processes that set mempolicy to MPOL_BIND, this results in
>> ac->nodemask changing. should_reclaim_retry will then
>> judge based on the latest nodemask and jump to retry, while
>> get_page_from_freelist only traverses the zonelist from
>> ac->preferred_zoneref, which was selected using an expired nodemask.
>> This may cause infinite retries in some cases.
>>
>> cpu 64:
>> __alloc_pages_slowpath {
>> /* ..... */
>> retry:
>> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>> if (alloc_flags & ALLOC_KSWAPD)
>> wake_all_kswapds(order, gfp_mask, ac);
>> /* cpu 1:
>> cpuset_write_resmask
>> update_nodemask
>> update_nodemasks_hier
>> update_tasks_nodemask
>> mpol_rebind_task
>> mpol_rebind_policy
>> mpol_rebind_nodemask
>> // mempolicy->nodes has been modified,
>> // which ac->nodemask point to
>>
>> */
>> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>> did_some_progress > 0, &no_progress_loops))
>> goto retry;
>> }
>>
>> Starting multiple instances of the LTP cpuset01 test simultaneously
>> quickly reproduces this issue on a multi-node server when memory
>> pressure reaches its maximum and swap is enabled.
>>
>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>> ---
> What commit does it fix and should it be backported to -stable?
>
> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> Andrew's mm.git repository now).
>
> Let's Cc the page allocator folks here!
We first identified this issue in 6.6.52-stable, and root cause analysis
suggests the problem may have existed for a significant period.
We therefore recommend backporting the fix to at least the 6.6-stable
series and later kernels.
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-22 12:10 ` Tianyang Zhang
@ 2025-04-23 0:11 ` Andrew Morton
2025-04-23 0:22 ` Suren Baghdasaryan
0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2025-04-23 0:11 UTC (permalink / raw)
To: Tianyang Zhang
Cc: Harry Yoo, linux-mm, linux-kernel, Vlastimil Babka,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan
On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
>
> ...
>
> >>
> >> Starting multiple instances of the LTP cpuset01 test simultaneously
> >> quickly reproduces this issue on a multi-node server when memory
> >> pressure reaches its maximum and swap is enabled.
> >>
> >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> >> ---
> > What commit does it fix and should it be backported to -stable?
> >
> > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > Andrew's mm.git repository now).
> >
> > Let's Cc the page allocator folks here!
>
> We first identified this issue in 6.6.52-stable, and root cause analysis
> suggests the problem may have existed for a significant period.
> We therefore recommend backporting the fix to at least the 6.6-stable
> series and later kernels.
OK, thanks,
This has been in mm-hotfixes-unstable for six days. Hopefully we'll
see some review activity soon (please).
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-23 0:11 ` Andrew Morton
@ 2025-04-23 0:22 ` Suren Baghdasaryan
2025-05-11 3:07 ` Andrew Morton
0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-23 0:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
Zi Yan
On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
>
> >
> > ...
> >
> > >>
> > >> Starting multiple instances of the LTP cpuset01 test simultaneously
> > >> quickly reproduces this issue on a multi-node server when memory
> > >> pressure reaches its maximum and swap is enabled.
> > >>
> > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > >> ---
> > > What commit does it fix and should it be backported to -stable?
> > >
> > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > Andrew's mm.git repository now).
> > >
> > > Let's Cc the page allocator folks here!
> >
> > We first identified this issue in 6.6.52-stable, and root cause analysis
> > suggests the problem may have existed for a significant period.
> > We therefore recommend backporting the fix to at least the 6.6-stable
> > series and later kernels.
>
> OK, thanks,
>
> This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> see some review activity soon (please).
I reviewed and provided my feedback but saw neither a reply nor a
respin with proposed changes.
>
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-21 20:28 ` Suren Baghdasaryan
@ 2025-04-23 2:38 ` Tianyang Zhang
2025-04-23 15:35 ` Suren Baghdasaryan
0 siblings, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-23 2:38 UTC (permalink / raw)
To: Suren Baghdasaryan, Harry Yoo
Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan
Hi, Suren
On 2025/4/22 at 4:28 AM, Suren Baghdasaryan wrote:
> On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
>>> __alloc_pages_slowpath has no change detection for ac->nodemask
>>> in its retry path, while cpuset can modify it in parallel.
>>> For some processes that set mempolicy to MPOL_BIND, this results in
>>> ac->nodemask changing. should_reclaim_retry will then
>>> judge based on the latest nodemask and jump to retry, while
>>> get_page_from_freelist only traverses the zonelist from
>>> ac->preferred_zoneref, which was selected using an expired nodemask.
>>> This may cause infinite retries in some cases.
>>>
>>> cpu 64:
>>> __alloc_pages_slowpath {
>>> /* ..... */
>>> retry:
>>> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>>> if (alloc_flags & ALLOC_KSWAPD)
>>> wake_all_kswapds(order, gfp_mask, ac);
>>> /* cpu 1:
>>> cpuset_write_resmask
>>> update_nodemask
>>> update_nodemasks_hier
>>> update_tasks_nodemask
>>> mpol_rebind_task
>>> mpol_rebind_policy
>>> mpol_rebind_nodemask
>>> // mempolicy->nodes has been modified,
>>> // which ac->nodemask point to
>>>
>>> */
>>> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>>> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>> did_some_progress > 0, &no_progress_loops))
>>> goto retry;
>>> }
>>>
>>> Starting multiple instances of the LTP cpuset01 test simultaneously
>>> quickly reproduces this issue on a multi-node server when memory
>>> pressure reaches its maximum and swap is enabled.
>>>
>>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>>> ---
>> What commit does it fix and should it be backported to -stable?
> I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
> due to race with cpuset update").
I think this issue is unlikely to have been introduced by commit
902b62810a57, as the section prone to infinite retries, from
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4568
to
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4628
where the cpuset race occurs, was not modified by commit 902b62810a57.
>> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
>> Andrew's mm.git repository now).
>>
>> Let's Cc the page allocator folks here!
>>
>> --
>> Cheers,
>> Harry / Hyeonggon
>>
>>> mm/page_alloc.c | 8 ++++++++
>>> 1 file changed, 8 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index fd6b865cb1ab..1e82f5214a42 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>> }
>>>
>>> retry:
>>> + /*
>>> + * Deal with possible cpuset update races or zonelist updates to avoid
>>> + * infinite retries.
>>> + */
>>> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>>> + check_retry_zonelist(zonelist_iter_cookie))
>>> + goto restart;
>>> +
> We have this check later in this block:
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
> so IIUC you effectively are moving it to be called before
> should_reclaim_retry(). If so, I think you should remove the old one
> (the one I linked earlier) as it seems to be unnecessary duplication
> at this point.
In my understanding, the code at
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
was introduced to prevent unnecessary OOM (out-of-memory) conditions
in __alloc_pages_may_oom.
If the old check is removed, the newly added check (at retry loop entry)
cannot guarantee that the cpuset is still valid
when the flow reaches __alloc_pages_may_oom, especially
if scheduling occurs in between.
Therefore, I think retaining the original code logic is necessary to
ensure correctness under concurrency.
>
>
>>> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>>> if (alloc_flags & ALLOC_KSWAPD)
>>> wake_all_kswapds(order, gfp_mask, ac);
>>> --
>>> 2.20.1
>>>
>>>
Thanks
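The argument above for keeping both check sites can be illustrated with a
small model. This is a sketch under assumptions, not the kernel logic: it
considers a single failed pass through the retry loop in which reclaim
makes no progress and the code reaches the OOM decision, and it injects
one simulated cpuset update either before or after the retry-entry check.
The names (failed_pass, race_point, cookie_begin, cookie_stale) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum outcome { PASS_RESTART, PASS_OOM };
enum race_point { RACE_NONE, RACE_BEFORE_ENTRY_CHECK, RACE_AFTER_ENTRY_CHECK };

/* Hypothetical stand-in for the cpuset sequence state. */
struct cpuset_model {
	unsigned int seq;
};

static unsigned int cookie_begin(const struct cpuset_model *cs)
{
	return cs->seq;
}

static bool cookie_stale(const struct cpuset_model *cs, unsigned int cookie)
{
	return cs->seq != cookie;
}

/*
 * One pass in which every allocation attempt fails. 'entry_check' models
 * the check added at the top of the retry loop, 'pre_oom_check' models the
 * pre-existing check that runs just before the OOM path.
 */
enum outcome failed_pass(enum race_point race, bool entry_check,
			 bool pre_oom_check)
{
	struct cpuset_model cs = { .seq = 0 };
	unsigned int cookie = cookie_begin(&cs);

	assert(!cookie_stale(&cs, cookie)); /* a fresh cookie is never stale */

	if (race == RACE_BEFORE_ENTRY_CHECK)
		cs.seq++;
	if (entry_check && cookie_stale(&cs, cookie))
		return PASS_RESTART;

	/* Allocation, reclaim and compaction attempts would go here; all fail. */

	if (race == RACE_AFTER_ENTRY_CHECK)
		cs.seq++;
	if (pre_oom_check && cookie_stale(&cs, cookie))
		return PASS_RESTART;

	return PASS_OOM; /* decision taken on whatever mask was cached */
}
```

In this model, an update that lands after the entry check leads to an OOM
decision on stale state unless the pre-OOM check is also present, which
is the reason given above for retaining the original check.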
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-23 2:38 ` Tianyang Zhang
@ 2025-04-23 15:35 ` Suren Baghdasaryan
2025-05-14 7:15 ` Vlastimil Babka
0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-23 15:35 UTC (permalink / raw)
To: Tianyang Zhang
Cc: Harry Yoo, akpm, linux-mm, linux-kernel, Vlastimil Babka,
Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan
On Tue, Apr 22, 2025 at 7:39 PM Tianyang Zhang
<zhangtianyang@loongson.cn> wrote:
>
> Hi, Suren
>
> On 2025/4/22 at 4:28 AM, Suren Baghdasaryan wrote:
> > On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> >>> __alloc_pages_slowpath has no change detection for ac->nodemask
> >>> in its retry path, while cpuset can modify it in parallel.
> >>> For some processes that set mempolicy to MPOL_BIND, this results in
> >>> ac->nodemask changing. should_reclaim_retry will then
> >>> judge based on the latest nodemask and jump to retry, while
> >>> get_page_from_freelist only traverses the zonelist from
> >>> ac->preferred_zoneref, which was selected using an expired nodemask.
> >>> This may cause infinite retries in some cases.
> >>>
> >>> cpu 64:
> >>> __alloc_pages_slowpath {
> >>> /* ..... */
> >>> retry:
> >>> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> >>> if (alloc_flags & ALLOC_KSWAPD)
> >>> wake_all_kswapds(order, gfp_mask, ac);
> >>> /* cpu 1:
> >>> cpuset_write_resmask
> >>> update_nodemask
> >>> update_nodemasks_hier
> >>> update_tasks_nodemask
> >>> mpol_rebind_task
> >>> mpol_rebind_policy
> >>> mpol_rebind_nodemask
> >>> // mempolicy->nodes has been modified,
> >>> // which ac->nodemask point to
> >>>
> >>> */
> >>> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> >>> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> >>> did_some_progress > 0, &no_progress_loops))
> >>> goto retry;
> >>> }
> >>>
> >>> Starting multiple instances of the LTP cpuset01 test simultaneously
> >>> quickly reproduces this issue on a multi-node server when memory
> >>> pressure reaches its maximum and swap is enabled.
> >>>
> >>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> >>> ---
> >> What commit does it fix and should it be backported to -stable?
> > I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
> > due to race with cpuset update").
>
> I think this issue is unlikely to have been introduced by commit
> 902b62810a57, as the section prone to infinite retries, from
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4568
> to
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4628
> where the cpuset race occurs, was not modified by commit 902b62810a57.
Yeah, you are right. After looking into it some more, 902b62810a57 is
the wrong patch to blame for this infinite loop.
>
> >> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> >> Andrew's mm.git repository now).
> >>
> >> Let's Cc the page allocator folks here!
> >>
> >> --
> >> Cheers,
> >> Harry / Hyeonggon
> >>
> >>> mm/page_alloc.c | 8 ++++++++
> >>> 1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index fd6b865cb1ab..1e82f5214a42 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>> }
> >>>
> >>> retry:
> >>> + /*
> >>> + * Deal with possible cpuset update races or zonelist updates to avoid
> >>> + * infinite retries.
> >>> + */
> >>> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> >>> + check_retry_zonelist(zonelist_iter_cookie))
> >>> + goto restart;
> >>> +
> > We have this check later in this block:
> > https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
> > so IIUC you effectively are moving it to be called before
> > should_reclaim_retry(). If so, I think you should remove the old one
> > (the one I linked earlier) as it seems to be unnecessary duplication
> > at this point.
> In my understanding, the code at
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> was introduced to prevent unnecessary OOM (out-of-memory) conditions
> in __alloc_pages_may_oom.
> If the old check is removed, the newly added check (at retry loop entry)
> cannot guarantee that the cpuset is still valid
> when the flow reaches __alloc_pages_may_oom, especially
> if scheduling occurs in between.
Well, rescheduling can happen even between
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
and https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4657
but I see your point. Also should_reclaim_retry() does not include
zonelist change detection, so keeping the checks at
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
sounds like a good idea.
>
> Therefore, I think retaining the original code logic is necessary to
> ensure correctness under concurrency.
>
> >
> >
> >>> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> >>> if (alloc_flags & ALLOC_KSWAPD)
> >>> wake_all_kswapds(order, gfp_mask, ac);
> >>> --
> >>> 2.20.1
> >>>
> >>>
> Thanks
>
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-23 0:22 ` Suren Baghdasaryan
@ 2025-05-11 3:07 ` Andrew Morton
2025-05-13 16:26 ` Suren Baghdasaryan
0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2025-05-11 3:07 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
Zi Yan
On Tue, 22 Apr 2025 17:22:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
> >
> > >
> > > ...
> > >
> > > >>
> > > > >> Starting multiple instances of the LTP cpuset01 test simultaneously
> > > > >> quickly reproduces this issue on a multi-node server when memory
> > > > >> pressure reaches its maximum and swap is enabled.
> > > >>
> > > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > > >> ---
> > > > What commit does it fix and should it be backported to -stable?
> > > >
> > > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > > Andrew's mm.git repository now).
> > > >
> > > > Let's Cc the page allocator folks here!
> > >
> > > > We first identified this issue in 6.6.52-stable, and root cause analysis
> > > > suggests the problem may have existed for a significant period.
> > > > We therefore recommend backporting the fix to at least the 6.6-stable
> > > > series and later kernels.
> >
> > OK, thanks,
> >
> > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > see some review activity soon (please).
>
> I reviewed and provided my feedback but saw neither a reply nor a
> respin with proposed changes.
OK, thanks. Do you have time to put together a modified version of this?
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-05-11 3:07 ` Andrew Morton
@ 2025-05-13 16:26 ` Suren Baghdasaryan
2025-05-13 19:16 ` Andrew Morton
0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 16:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
Zi Yan
On Sat, May 10, 2025 at 8:07 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Apr 2025 17:22:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
> > >
> > > >
> > > > ...
> > > >
> > > > >>
> > > > > >> Starting multiple instances of the LTP cpuset01 test simultaneously
> > > > > >> quickly reproduces this issue on a multi-node server when memory
> > > > > >> pressure reaches its maximum and swap is enabled.
> > > > >>
> > > > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > > > >> ---
> > > > > What commit does it fix and should it be backported to -stable?
> > > > >
> > > > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > > > Andrew's mm.git repository now).
> > > > >
> > > > > Let's Cc the page allocator folks here!
> > > >
> > > > > We first identified this issue in 6.6.52-stable, and root cause analysis
> > > > > suggests the problem may have existed for a significant period.
> > > > > We therefore recommend backporting the fix to at least the 6.6-stable
> > > > > series and later kernels.
> > >
> > > OK, thanks,
> > >
> > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > > see some review activity soon (please).
> >
> > I reviewed and provided my feedback but saw neither a reply nor a
> > respin with proposed changes.
>
> OK, thanks. Do you have time to put together a modified version of this?
I think the code is fine as is. Would be good to add Fixes: tag but it
will require some investigation to find the appropriate patch to
reference here.
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-05-13 16:26 ` Suren Baghdasaryan
@ 2025-05-13 19:16 ` Andrew Morton
2025-05-13 19:33 ` Suren Baghdasaryan
2025-05-14 7:34 ` Vlastimil Babka
0 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2025-05-13 19:16 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
Zi Yan
On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > > > see some review activity soon (please).
> > >
> > > I reviewed and provided my feedback but saw neither a reply nor a
> > > respin with proposed changes.
> >
> > OK, thanks. Do you have time to put together a modified version of this?
>
> I think the code is fine as is. Would be good to add Fixes: tag but it
> will require some investigation to find the appropriate patch to
> reference here.
Below is what is in mm-hotfixes. It doesn't actually have any
acked-by's or reviewed-by's.
So... final call for review, please.
From: Tianyang Zhang <zhangtianyang@loongson.cn>
Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
Date: Wed, 16 Apr 2025 16:24:05 +0800
__alloc_pages_slowpath has no change detection for ac->nodemask in its
retry path, while cpuset can modify it in parallel. For some
processes that set mempolicy to MPOL_BIND, this results in ac->nodemask
changing. should_reclaim_retry will then judge based on the latest
nodemask and jump to retry, while get_page_from_freelist only
traverses the zonelist from ac->preferred_zoneref, which was selected
using an expired nodemask. This may cause infinite retries in some cases.
    cpu 64:
    __alloc_pages_slowpath {
        /* ..... */
    retry:
        /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
        if (alloc_flags & ALLOC_KSWAPD)
            wake_all_kswapds(order, gfp_mask, ac);
        /* cpu 1:
        cpuset_write_resmask
          update_nodemask
            update_nodemasks_hier
              update_tasks_nodemask
                mpol_rebind_task
                  mpol_rebind_policy
                    mpol_rebind_nodemask
                      // mempolicy->nodes has been modified,
                      // which ac->nodemask point to
        */
        /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
            goto retry;
    }
Starting multiple instances of the LTP cpuset01 test simultaneously
quickly reproduces this issue on a multi-node server when memory
pressure reaches its maximum and swap is enabled.
Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update")
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
+++ a/mm/page_alloc.c
@@ -4562,6 +4562,14 @@ restart:
 	}
 
 retry:
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * infinite retries.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;
+
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
_
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-05-13 19:16 ` Andrew Morton
@ 2025-05-13 19:33 ` Suren Baghdasaryan
2025-05-14 7:34 ` Vlastimil Babka
1 sibling, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 19:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
Zi Yan
On Tue, May 13, 2025 at 12:16 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > > > > see some review activity soon (please).
> > > >
> > > > I reviewed and provided my feedback but saw neither a reply nor a
> > > > respin with proposed changes.
> > >
> > > OK, thanks. Do you have time to put together a modified version of this?
> >
> > I think the code is fine as is. Would be good to add Fixes: tag but it
> > will require some investigation to find the appropriate patch to
> > reference here.
>
> Below is what is in mm-hotfixes. It doesn't actually have any
> acked-by's or reviewed-by's.
>
> So... final call for review, please.
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
>
>
> From: Tianyang Zhang <zhangtianyang@loongson.cn>
> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
> Date: Wed, 16 Apr 2025 16:24:05 +0800
>
> __alloc_pages_slowpath has no change detection for ac->nodemask in the
> part of retry path, while cpuset can modify it in parallel. For some
> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
> changes, and then the should_reclaim_retry will judge based on the latest
> nodemask and jump to retry, while the get_page_from_freelist only
> traverses the zonelist from ac->preferred_zoneref, which selected by a
> expired nodemask and may cause infinite retries in some cases
>
> cpu 64:
> __alloc_pages_slowpath {
> /* ..... */
> retry:
> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> /* cpu 1:
> cpuset_write_resmask
> update_nodemask
> update_nodemasks_hier
> update_tasks_nodemask
> mpol_rebind_task
> mpol_rebind_policy
> mpol_rebind_nodemask
> // mempolicy->nodes has been modified,
> // which ac->nodemask point to
>
> */
> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> did_some_progress > 0, &no_progress_loops))
> goto retry;
> }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
> this issue on a multi node server when the maximum memory pressure is
> reached and the swap is enabled
>
> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
> +++ a/mm/page_alloc.c
> @@ -4562,6 +4562,14 @@ restart:
> }
>
> retry:
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * infinite retries.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> + goto restart;
> +
> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> _
>
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-04-23 15:35 ` Suren Baghdasaryan
@ 2025-05-14 7:15 ` Vlastimil Babka
0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-14 7:15 UTC (permalink / raw)
To: Suren Baghdasaryan, Tianyang Zhang
Cc: Harry Yoo, akpm, linux-mm, linux-kernel, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan
On 4/23/25 17:35, Suren Baghdasaryan wrote:
>> >> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
>> >> Andrew's mm.git repository now).
>> >>
>> >> Let's Cc the page allocator folks here!
>> >>
>> >> --
>> >> Cheers,
>> >> Harry / Hyeonggon
>> >>
>> >>> mm/page_alloc.c | 8 ++++++++
>> >>> 1 file changed, 8 insertions(+)
>> >>>
>> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> >>> index fd6b865cb1ab..1e82f5214a42 100644
>> >>> --- a/mm/page_alloc.c
>> >>> +++ b/mm/page_alloc.c
>> >>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >>> }
>> >>>
>> >>> retry:
>> >>> + /*
>> >>> + * Deal with possible cpuset update races or zonelist updates to avoid
>> >>> + * infinite retries.
>> >>> + */
>> >>> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>> >>> + check_retry_zonelist(zonelist_iter_cookie))
>> >>> + goto restart;
>> >>> +
>> > We have this check later in this block:
>> > https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
>> > so IIUC you effectively are moving it to be called before
>> > should_reclaim_retry(). If so, I think you should remove the old one
>> > (the one I linked earlier) as it seems to be unnecessary duplication
>> > at this point.
>> In my understanding, the code in
>>
>> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
>>
>> was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
>> in __alloc_pages_may_oom.
>>
>> If the old code is removed, the newly added code (on retry loop entry)
>> cannot guarantee that the cpuset remains valid when the flow reaches
>> __alloc_pages_may_oom, especially if scheduling occurs during this
>> section.
>
> Well, rescheduling can happen even between
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> and https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4657
> but I see your point. Also should_reclaim_retry() does not include
I think the rescheduling isn't a problem because what we're testing is "we
are about to oom, could it have been because we raced?" and the race would
have affected the code before #L4652. If we didn't race and yet determined
it's time for oom, a race between #L4652 and #L4657 shouldn't matter. The
get_page_from_freelist() in __alloc_pages_may_oom() isn't that important for
preventing premature oom AFAICS, given it uses high wmark.
That said, I think the newly added check could be more logically placed
above the call to should_reclaim_retry() instead of right after the retry:
label, but it's not critical.
> zonelist change detection, so keeping the checks at
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> sounds like a good idea.
>
>>
>> Therefore, I think retaining the original code logic is necessary to
>> ensure correctness under concurrency.
>>
>> >
>> >
>> >>> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>> >>> if (alloc_flags & ALLOC_KSWAPD)
>> >>> wake_all_kswapds(order, gfp_mask, ac);
>> >>> --
>> >>> 2.20.1
>> >>>
>> >>>
>> Thanks
>>
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-05-13 19:16 ` Andrew Morton
2025-05-13 19:33 ` Suren Baghdasaryan
@ 2025-05-14 7:34 ` Vlastimil Babka
2025-05-14 22:42 ` Andrew Morton
2025-05-15 3:19 ` Tianyang Zhang
1 sibling, 2 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-14 7:34 UTC (permalink / raw)
To: Andrew Morton, Suren Baghdasaryan
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan
On 5/13/25 21:16, Andrew Morton wrote:
> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
>> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
>> > > > see some review activity soon (please).
>> > >
>> > > I reviewed and provided my feedback but saw neither a reply nor a
>> > > respin with proposed changes.
>> >
>> > OK, thanks. Do you have time to put together a modified version of this?
>>
>> I think the code is fine as is. Would be good to add Fixes: tag but it
>> will require some investigation to find the appropriate patch to
>> reference here.
>
> Below is what is in mm-hotfixes. It doesn't actually have any
> acked-by's or reviewed-by's.
>
> So... final call for review, please.
>
>
> From: Tianyang Zhang <zhangtianyang@loongson.cn>
> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
> Date: Wed, 16 Apr 2025 16:24:05 +0800
>
> __alloc_pages_slowpath has no change detection for ac->nodemask in the
> part of retry path, while cpuset can modify it in parallel. For some
> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
> changes, and then the should_reclaim_retry will judge based on the latest
> nodemask and jump to retry, while the get_page_from_freelist only
> traverses the zonelist from ac->preferred_zoneref, which selected by a
> expired nodemask and may cause infinite retries in some cases
>
> cpu 64:
> __alloc_pages_slowpath {
> /* ..... */
> retry:
> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> /* cpu 1:
> cpuset_write_resmask
> update_nodemask
> update_nodemasks_hier
> update_tasks_nodemask
> mpol_rebind_task
> mpol_rebind_policy
> mpol_rebind_nodemask
> // mempolicy->nodes has been modified,
> // which ac->nodemask point to
>
> */
> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> did_some_progress > 0, &no_progress_loops))
> goto retry;
> }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
> this issue on a multi node server when the maximum memory pressure is
> reached and the swap is enabled
>
> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
After the discussion in this thread, Suren retracted this Fixes: suggestion.
I think it actually goes back to this one which introduced the
preferred_zoneref caching.
Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
zonelist twice")
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I would have placed the check a bit further down, just above the
should_reclaim_retry() call, but it's not important enough to hold up a
fix and can be done later.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>
> mm/page_alloc.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
> +++ a/mm/page_alloc.c
> @@ -4562,6 +4562,14 @@ restart:
> }
>
> retry:
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * infinite retries.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> + goto restart;
> +
> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> if (alloc_flags & ALLOC_KSWAPD)
> wake_all_kswapds(order, gfp_mask, ac);
> _
>
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-05-14 7:34 ` Vlastimil Babka
@ 2025-05-14 22:42 ` Andrew Morton
2025-05-15 3:19 ` Tianyang Zhang
1 sibling, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2025-05-14 22:42 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Suren Baghdasaryan, Tianyang Zhang, Harry Yoo, linux-mm,
linux-kernel, Michal Hocko, Brendan Jackman, Johannes Weiner,
Zi Yan
On Wed, 14 May 2025 09:34:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> On 5/13/25 21:16, Andrew Morton wrote:
> > On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> >> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> >> > > > see some review activity soon (please).
> >> > >
> >> > > I reviewed and provided my feedback but saw neither a reply nor a
> >> > > respin with proposed changes.
> >> >
> >> > OK, thanks. Do you have time to put together a modified version of this?
> >>
> >> I think the code is fine as is. Would be good to add Fixes: tag but it
> >> will require some investigation to find the appropriate patch to
> >> reference here.
> >
> > Below is what is in mm-hotfixes. It doesn't actually have any
> > acked-by's or reviewed-by's.
> >
> > So... final call for review, please.
> >
>
> ...
>
> After the discussion in this thread, Suren retracted this Fixes: suggestion.
> I think it actually goes back to this one which introduced the
> preferred_zoneref caching.
>
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
> zonelist twice")
Updated.
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Thanks.
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
2025-05-14 7:34 ` Vlastimil Babka
2025-05-14 22:42 ` Andrew Morton
@ 2025-05-15 3:19 ` Tianyang Zhang
1 sibling, 0 replies; 16+ messages in thread
From: Tianyang Zhang @ 2025-05-15 3:19 UTC (permalink / raw)
To: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan
Cc: Harry Yoo, linux-mm, linux-kernel, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan
Hi,
On 2025/5/14 at 3:34 PM, Vlastimil Babka wrote:
> On 5/13/25 21:16, Andrew Morton wrote:
>> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>>
>>>>>> This has been in mm-hotfixes-unstable for six days. Hopefully we'll
>>>>>> see some review activity soon (please).
>>>>> I reviewed and provided my feedback but saw neither a reply nor a
>>>>> respin with proposed changes.
>>>> OK, thanks. Do you have time to put together a modified version of this?
>>> I think the code is fine as is. Would be good to add Fixes: tag but it
>>> will require some investigation to find the appropriate patch to
>>> reference here.
>> Below is what is in mm-hotfixes. It doesn't actually have any
>> acked-by's or reviewed-by's.
>>
>> So... final call for review, please.
>>
>>
>> From: Tianyang Zhang <zhangtianyang@loongson.cn>
>> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
>> Date: Wed, 16 Apr 2025 16:24:05 +0800
>>
>> __alloc_pages_slowpath has no change detection for ac->nodemask in the
>> part of retry path, while cpuset can modify it in parallel. For some
>> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
>> changes, and then the should_reclaim_retry will judge based on the latest
>> nodemask and jump to retry, while the get_page_from_freelist only
>> traverses the zonelist from ac->preferred_zoneref, which selected by a
>> expired nodemask and may cause infinite retries in some cases
>>
>> cpu 64:
>> __alloc_pages_slowpath {
>> /* ..... */
>> retry:
>> /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>> if (alloc_flags & ALLOC_KSWAPD)
>> wake_all_kswapds(order, gfp_mask, ac);
>> /* cpu 1:
>> cpuset_write_resmask
>> update_nodemask
>> update_nodemasks_hier
>> update_tasks_nodemask
>> mpol_rebind_task
>> mpol_rebind_policy
>> mpol_rebind_nodemask
>> // mempolicy->nodes has been modified,
>> // which ac->nodemask point to
>>
>> */
>> /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>> did_some_progress > 0, &no_progress_loops))
>> goto retry;
>> }
>>
>> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
>> this issue on a multi node server when the maximum memory pressure is
>> reached and the swap is enabled
>>
>> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
>> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
> After the discussion in this thread, Suren retracted this Fixes: suggestion.
> I think it actually goes back to this one which introduced the
> preferred_zoneref caching.
>
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
> zonelist twice")
Yes, the problem was most likely introduced by that commit, thank you.
>
>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Brendan Jackman <jackmanb@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> I would have placed the check bit further down, just above the
> should_reclaim_retry() call, but it's not that important to hold up a fix
> and can be done later.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
>> ---
>>
>> mm/page_alloc.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
>> +++ a/mm/page_alloc.c
>> @@ -4562,6 +4562,14 @@ restart:
>> }
>>
>> retry:
>> + /*
>> + * Deal with possible cpuset update races or zonelist updates to avoid
>> + * infinite retries.
>> + */
>> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>> + check_retry_zonelist(zonelist_iter_cookie))
>> + goto restart;
>> +
>> /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>> if (alloc_flags & ALLOC_KSWAPD)
>> wake_all_kswapds(order, gfp_mask, ac);
>> _
>>
Thread overview: 16+ messages
2025-04-16 8:24 [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race Tianyang Zhang
2025-04-21 10:00 ` Harry Yoo
2025-04-21 20:28 ` Suren Baghdasaryan
2025-04-23 2:38 ` Tianyang Zhang
2025-04-23 15:35 ` Suren Baghdasaryan
2025-05-14 7:15 ` Vlastimil Babka
2025-04-22 12:10 ` Tianyang Zhang
2025-04-23 0:11 ` Andrew Morton
2025-04-23 0:22 ` Suren Baghdasaryan
2025-05-11 3:07 ` Andrew Morton
2025-05-13 16:26 ` Suren Baghdasaryan
2025-05-13 19:16 ` Andrew Morton
2025-05-13 19:33 ` Suren Baghdasaryan
2025-05-14 7:34 ` Vlastimil Babka
2025-05-14 22:42 ` Andrew Morton
2025-05-15 3:19 ` Tianyang Zhang