[PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
@ 2025-04-16  8:24 Tianyang Zhang
  2025-04-21 10:00 ` Harry Yoo
  0 siblings, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-16  8:24 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, Tianyang Zhang

__alloc_pages_slowpath has no change detection for ac->nodemask
in the part of retry path, while cpuset can modify it in parallel.
For some processes that set mempolicy as MPOL_BIND, this results
ac->nodemask changes, and then the should_reclaim_retry will
judge based on the latest nodemask and jump to retry, while the
get_page_from_freelist only traverses the zonelist from
ac->preferred_zoneref, which selected by a expired nodemask
and may cause infinite retries in some cases

cpu 64:
__alloc_pages_slowpath {
        /* ..... */
retry:
        /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);
        /* cpu 1:
        cpuset_write_resmask
            update_nodemask
                update_nodemasks_hier
                    update_tasks_nodemask
                        mpol_rebind_task
                         mpol_rebind_policy
                          mpol_rebind_nodemask
		// mempolicy->nodes has been modified,
		// which ac->nodemask point to

        */
        /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;
}

Simultaneously starting multiple cpuset01 from LTP can quickly
reproduce this issue on a multi node server when the maximum
memory pressure is reached and the swap is enabled

Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
---
 mm/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd6b865cb1ab..1e82f5214a42 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 retry:
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * infinite retries.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;
+
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-16  8:24 [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race Tianyang Zhang
@ 2025-04-21 10:00 ` Harry Yoo
  2025-04-21 20:28   ` Suren Baghdasaryan
  2025-04-22 12:10   ` Tianyang Zhang
  0 siblings, 2 replies; 16+ messages in thread
From: Harry Yoo @ 2025-04-21 10:00 UTC (permalink / raw)
  To: Tianyang Zhang
  Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> __alloc_pages_slowpath has no change detection for ac->nodemask
> in the part of retry path, while cpuset can modify it in parallel.
> For some processes that set mempolicy as MPOL_BIND, this results
> ac->nodemask changes, and then the should_reclaim_retry will
> judge based on the latest nodemask and jump to retry, while the
> get_page_from_freelist only traverses the zonelist from
> ac->preferred_zoneref, which selected by a expired nodemask
> and may cause infinite retries in some cases
> 
> cpu 64:
> __alloc_pages_slowpath {
>         /* ..... */
> retry:
>         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>         if (alloc_flags & ALLOC_KSWAPD)
>                 wake_all_kswapds(order, gfp_mask, ac);
>         /* cpu 1:
>         cpuset_write_resmask
>             update_nodemask
>                 update_nodemasks_hier
>                     update_tasks_nodemask
>                         mpol_rebind_task
>                          mpol_rebind_policy
>                           mpol_rebind_nodemask
> 		// mempolicy->nodes has been modified,
> 		// which ac->nodemask point to
> 
>         */
>         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>                                  did_some_progress > 0, &no_progress_loops))
>                 goto retry;
> }
> 
> Simultaneously starting multiple cpuset01 from LTP can quickly
> reproduce this issue on a multi node server when the maximum
> memory pressure is reached and the swap is enabled
> 
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> ---

What commit does it fix and should it be backported to -stable?

There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
Andrew's mm.git repository now).

Let's Cc the page allocator folks here!

-- 
Cheers,
Harry / Hyeonggon

>  mm/page_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fd6b865cb1ab..1e82f5214a42 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  retry:
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * infinite retries.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
> +		goto restart;
> +
>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
> -- 
> 2.20.1
> 
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-21 10:00 ` Harry Yoo
@ 2025-04-21 20:28   ` Suren Baghdasaryan
  2025-04-23  2:38     ` Tianyang Zhang
  2025-04-22 12:10   ` Tianyang Zhang
  1 sibling, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-21 20:28 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Tianyang Zhang, akpm, linux-mm, linux-kernel, Vlastimil Babka,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> > __alloc_pages_slowpath has no change detection for ac->nodemask
> > in the part of retry path, while cpuset can modify it in parallel.
> > For some processes that set mempolicy as MPOL_BIND, this results
> > ac->nodemask changes, and then the should_reclaim_retry will
> > judge based on the latest nodemask and jump to retry, while the
> > get_page_from_freelist only traverses the zonelist from
> > ac->preferred_zoneref, which selected by a expired nodemask
> > and may cause infinite retries in some cases
> >
> > cpu 64:
> > __alloc_pages_slowpath {
> >         /* ..... */
> > retry:
> >         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> >         if (alloc_flags & ALLOC_KSWAPD)
> >                 wake_all_kswapds(order, gfp_mask, ac);
> >         /* cpu 1:
> >         cpuset_write_resmask
> >             update_nodemask
> >                 update_nodemasks_hier
> >                     update_tasks_nodemask
> >                         mpol_rebind_task
> >                          mpol_rebind_policy
> >                           mpol_rebind_nodemask
> >               // mempolicy->nodes has been modified,
> >               // which ac->nodemask point to
> >
> >         */
> >         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> >         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> >                                  did_some_progress > 0, &no_progress_loops))
> >                 goto retry;
> > }
> >
> > Simultaneously starting multiple cpuset01 from LTP can quickly
> > reproduce this issue on a multi node server when the maximum
> > memory pressure is reached and the swap is enabled
> >
> > Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > ---
>
> What commit does it fix and should it be backported to -stable?

I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
due to race with cpuset update").

>
> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> Andrew's mm.git repository now).
>
> Let's Cc the page allocator folks here!
>
> --
> Cheers,
> Harry / Hyeonggon
>
> >  mm/page_alloc.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index fd6b865cb1ab..1e82f5214a42 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >       }
> >
> >  retry:
> > +     /*
> > +      * Deal with possible cpuset update races or zonelist updates to avoid
> > +      * infinite retries.
> > +      */
> > +     if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> > +         check_retry_zonelist(zonelist_iter_cookie))
> > +             goto restart;
> > +

We have this check later in this block:
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
so IIUC you effectively are moving it to be called before
should_reclaim_retry(). If so, I think you should remove the old one
(the one I linked earlier) as it seems to be unnecessary duplication
at this point.


> >       /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> >       if (alloc_flags & ALLOC_KSWAPD)
> >               wake_all_kswapds(order, gfp_mask, ac);
> > --
> > 2.20.1
> >
> >


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-21 10:00 ` Harry Yoo
  2025-04-21 20:28   ` Suren Baghdasaryan
@ 2025-04-22 12:10   ` Tianyang Zhang
  2025-04-23  0:11     ` Andrew Morton
  1 sibling, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-22 12:10 UTC (permalink / raw)
  To: Harry Yoo
  Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

Hi.

在 2025/4/21 下午6:00, Harry Yoo 写道:
> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
>> __alloc_pages_slowpath has no change detection for ac->nodemask
>> in the part of retry path, while cpuset can modify it in parallel.
>> For some processes that set mempolicy as MPOL_BIND, this results
>> ac->nodemask changes, and then the should_reclaim_retry will
>> judge based on the latest nodemask and jump to retry, while the
>> get_page_from_freelist only traverses the zonelist from
>> ac->preferred_zoneref, which selected by a expired nodemask
>> and may cause infinite retries in some cases
>>
>> cpu 64:
>> __alloc_pages_slowpath {
>>          /* ..... */
>> retry:
>>          /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>>          if (alloc_flags & ALLOC_KSWAPD)
>>                  wake_all_kswapds(order, gfp_mask, ac);
>>          /* cpu 1:
>>          cpuset_write_resmask
>>              update_nodemask
>>                  update_nodemasks_hier
>>                      update_tasks_nodemask
>>                          mpol_rebind_task
>>                           mpol_rebind_policy
>>                            mpol_rebind_nodemask
>> 		// mempolicy->nodes has been modified,
>> 		// which ac->nodemask point to
>>
>>          */
>>          /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>>          if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>                                   did_some_progress > 0, &no_progress_loops))
>>                  goto retry;
>> }
>>
>> Simultaneously starting multiple cpuset01 from LTP can quickly
>> reproduce this issue on a multi node server when the maximum
>> memory pressure is reached and the swap is enabled
>>
>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>> ---
> What commit does it fix and should it be backported to -stable?
>
> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> Andrew's mm.git repository now).
>
> Let's Cc the page allocator folks here!

We first identified this issue in 6.6.52-stable , and through root cause 
analysis,

it appears the problem may have existed for a significant period.

However It is recommended that the fix should be backported to at least 
Linux kernel versions after 6.6-stable



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-22 12:10   ` Tianyang Zhang
@ 2025-04-23  0:11     ` Andrew Morton
  2025-04-23  0:22       ` Suren Baghdasaryan
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2025-04-23  0:11 UTC (permalink / raw)
  To: Tianyang Zhang
  Cc: Harry Yoo, linux-mm, linux-kernel, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan

On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:

>
> ...
>
> >>
> >> Simultaneously starting multiple cpuset01 from LTP can quickly
> >> reproduce this issue on a multi node server when the maximum
> >> memory pressure is reached and the swap is enabled
> >>
> >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> >> ---
> > What commit does it fix and should it be backported to -stable?
> >
> > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > Andrew's mm.git repository now).
> >
> > Let's Cc the page allocator folks here!
> 
> We first identified this issue in 6.6.52-stable , and through root cause 
> analysis,
> 
> it appears the problem may have existed for a significant period.
> 
> However It is recommended that the fix should be backported to at least 
> Linux kernel versions after 6.6-stable

OK, thanks,

This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
see some review activity soon (please).



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-23  0:11     ` Andrew Morton
@ 2025-04-23  0:22       ` Suren Baghdasaryan
  2025-05-11  3:07         ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-23  0:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
	Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
>
> >
> > ...
> >
> > >>
> > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > >> reproduce this issue on a multi node server when the maximum
> > >> memory pressure is reached and the swap is enabled
> > >>
> > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > >> ---
> > > What commit does it fix and should it be backported to -stable?
> > >
> > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > Andrew's mm.git repository now).
> > >
> > > Let's Cc the page allocator folks here!
> >
> > We first identified this issue in 6.6.52-stable , and through root cause
> > analysis,
> >
> > it appears the problem may have existed for a significant period.
> >
> > However It is recommended that the fix should be backported to at least
> > Linux kernel versions after 6.6-stable
>
> OK, thanks,
>
> This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> see some review activity soon (please).

I reviewed and provided my feedback but saw neither a reply nor a
respin with proposed changes.

>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-21 20:28   ` Suren Baghdasaryan
@ 2025-04-23  2:38     ` Tianyang Zhang
  2025-04-23 15:35       ` Suren Baghdasaryan
  0 siblings, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-23  2:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Harry Yoo
  Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan

Hi, Suren

在 2025/4/22 上午4:28, Suren Baghdasaryan 写道:
> On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
>>> __alloc_pages_slowpath has no change detection for ac->nodemask
>>> in the part of retry path, while cpuset can modify it in parallel.
>>> For some processes that set mempolicy as MPOL_BIND, this results
>>> ac->nodemask changes, and then the should_reclaim_retry will
>>> judge based on the latest nodemask and jump to retry, while the
>>> get_page_from_freelist only traverses the zonelist from
>>> ac->preferred_zoneref, which selected by a expired nodemask
>>> and may cause infinite retries in some cases
>>>
>>> cpu 64:
>>> __alloc_pages_slowpath {
>>>          /* ..... */
>>> retry:
>>>          /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>>>          if (alloc_flags & ALLOC_KSWAPD)
>>>                  wake_all_kswapds(order, gfp_mask, ac);
>>>          /* cpu 1:
>>>          cpuset_write_resmask
>>>              update_nodemask
>>>                  update_nodemasks_hier
>>>                      update_tasks_nodemask
>>>                          mpol_rebind_task
>>>                           mpol_rebind_policy
>>>                            mpol_rebind_nodemask
>>>                // mempolicy->nodes has been modified,
>>>                // which ac->nodemask point to
>>>
>>>          */
>>>          /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>>>          if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>>                                   did_some_progress > 0, &no_progress_loops))
>>>                  goto retry;
>>> }
>>>
>>> Simultaneously starting multiple cpuset01 from LTP can quickly
>>> reproduce this issue on a multi node server when the maximum
>>> memory pressure is reached and the swap is enabled
>>>
>>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>>> ---
>> What commit does it fix and should it be backported to -stable?
> I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
> due to race with cpuset update").

I think this issue is unlikely to have been introduced by Patch 
902b62810a57 ,

as the infinite-reties section from

https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4568
to
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4628

where the cpuset race condition occurs remains unmodified in the logic 
of Patch 902b62810a57.

>> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
>> Andrew's mm.git repository now).
>>
>> Let's Cc the page allocator folks here!
>>
>> --
>> Cheers,
>> Harry / Hyeonggon
>>
>>>   mm/page_alloc.c | 8 ++++++++
>>>   1 file changed, 8 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index fd6b865cb1ab..1e82f5214a42 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>>        }
>>>
>>>   retry:
>>> +     /*
>>> +      * Deal with possible cpuset update races or zonelist updates to avoid
>>> +      * infinite retries.
>>> +      */
>>> +     if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>>> +         check_retry_zonelist(zonelist_iter_cookie))
>>> +             goto restart;
>>> +
> We have this check later in this block:
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
> so IIUC you effectively are moving it to be called before
> should_reclaim_retry(). If so, I think you should remove the old one
> (the one I linked earlier) as it seems to be unnecessary duplication
> at this point.
In my understanding, the code in

https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652

was introduced to prevent unnecessary OOM (Out-of-Memory) conditions 
in__alloc_pages_may_oom.

If old code is removed, the newly added code (on retry loop entry) 
cannot guarantee that the cpuset

remains valid when the flow reaches in__alloc_pages_may_oom, especially 
if scheduling occurs during this section.

Therefore, I think retaining the original code logic is necessary to 
ensure correctness under concurrency.

>
>
>>>        /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>>>        if (alloc_flags & ALLOC_KSWAPD)
>>>                wake_all_kswapds(order, gfp_mask, ac);
>>> --
>>> 2.20.1
>>>
>>>
Thanks



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-23  2:38     ` Tianyang Zhang
@ 2025-04-23 15:35       ` Suren Baghdasaryan
  2025-05-14  7:15         ` Vlastimil Babka
  0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-23 15:35 UTC (permalink / raw)
  To: Tianyang Zhang
  Cc: Harry Yoo, akpm, linux-mm, linux-kernel, Vlastimil Babka,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Tue, Apr 22, 2025 at 7:39 PM Tianyang Zhang
<zhangtianyang@loongson.cn> wrote:
>
> Hi, Suren
>
> 在 2025/4/22 上午4:28, Suren Baghdasaryan 写道:
> > On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> >>> __alloc_pages_slowpath has no change detection for ac->nodemask
> >>> in the part of retry path, while cpuset can modify it in parallel.
> >>> For some processes that set mempolicy as MPOL_BIND, this results
> >>> ac->nodemask changes, and then the should_reclaim_retry will
> >>> judge based on the latest nodemask and jump to retry, while the
> >>> get_page_from_freelist only traverses the zonelist from
> >>> ac->preferred_zoneref, which selected by a expired nodemask
> >>> and may cause infinite retries in some cases
> >>>
> >>> cpu 64:
> >>> __alloc_pages_slowpath {
> >>>          /* ..... */
> >>> retry:
> >>>          /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> >>>          if (alloc_flags & ALLOC_KSWAPD)
> >>>                  wake_all_kswapds(order, gfp_mask, ac);
> >>>          /* cpu 1:
> >>>          cpuset_write_resmask
> >>>              update_nodemask
> >>>                  update_nodemasks_hier
> >>>                      update_tasks_nodemask
> >>>                          mpol_rebind_task
> >>>                           mpol_rebind_policy
> >>>                            mpol_rebind_nodemask
> >>>                // mempolicy->nodes has been modified,
> >>>                // which ac->nodemask point to
> >>>
> >>>          */
> >>>          /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> >>>          if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> >>>                                   did_some_progress > 0, &no_progress_loops))
> >>>                  goto retry;
> >>> }
> >>>
> >>> Simultaneously starting multiple cpuset01 from LTP can quickly
> >>> reproduce this issue on a multi node server when the maximum
> >>> memory pressure is reached and the swap is enabled
> >>>
> >>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> >>> ---
> >> What commit does it fix and should it be backported to -stable?
> > I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
> > due to race with cpuset update").
>
> I think this issue is unlikely to have been introduced by Patch
> 902b62810a57 ,
>
> as the infinite-reties section from
>
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4568
> to
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4628
>
> where the cpuset race condition occurs remains unmodified in the logic
> of Patch 902b62810a57.

Yeah, you are right. After looking into it some more, 902b62810a57 is
a wrong patch to blame for this infinite loop.

>
> >> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> >> Andrew's mm.git repository now).
> >>
> >> Let's Cc the page allocator folks here!
> >>
> >> --
> >> Cheers,
> >> Harry / Hyeonggon
> >>
> >>>   mm/page_alloc.c | 8 ++++++++
> >>>   1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index fd6b865cb1ab..1e82f5214a42 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>>        }
> >>>
> >>>   retry:
> >>> +     /*
> >>> +      * Deal with possible cpuset update races or zonelist updates to avoid
> >>> +      * infinite retries.
> >>> +      */
> >>> +     if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> >>> +         check_retry_zonelist(zonelist_iter_cookie))
> >>> +             goto restart;
> >>> +
> > We have this check later in this block:
> > https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
> > so IIUC you effectively are moving it to be called before
> > should_reclaim_retry(). If so, I think you should remove the old one
> > (the one I linked earlier) as it seems to be unnecessary duplication
> > at this point.
> In my understanding, the code in
>
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
>
> was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
> in__alloc_pages_may_oom.
>
> If old code is removed, the newly added code (on retry loop entry)
> cannot guarantee that the cpuset
>
> remains valid when the flow reaches in__alloc_pages_may_oom, especially
> if scheduling occurs during this section.

Well, rescheduling can happen even between
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
and https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4657
but I see your point. Also should_reclaim_retry() does not include
zonelist change detection, so keeping the checks at
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
sounds like a good idea.

>
> Therefore, I think retaining the original code logic is necessary to
> ensure correctness under concurrency.
>
> >
> >
> >>>        /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> >>>        if (alloc_flags & ALLOC_KSWAPD)
> >>>                wake_all_kswapds(order, gfp_mask, ac);
> >>> --
> >>> 2.20.1
> >>>
> >>>
> Thanks
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-23  0:22       ` Suren Baghdasaryan
@ 2025-05-11  3:07         ` Andrew Morton
  2025-05-13 16:26           ` Suren Baghdasaryan
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2025-05-11  3:07 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
	Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Tue, 22 Apr 2025 17:22:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
> >
> > >
> > > ...
> > >
> > > >>
> > > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > > >> reproduce this issue on a multi node server when the maximum
> > > >> memory pressure is reached and the swap is enabled
> > > >>
> > > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > > >> ---
> > > > What commit does it fix and should it be backported to -stable?
> > > >
> > > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > > Andrew's mm.git repository now).
> > > >
> > > > Let's Cc the page allocator folks here!
> > >
> > > We first identified this issue in 6.6.52-stable , and through root cause
> > > analysis,
> > >
> > > it appears the problem may have existed for a significant period.
> > >
> > > However It is recommended that the fix should be backported to at least
> > > Linux kernel versions after 6.6-stable
> >
> > OK, thanks,
> >
> > This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> > see some review activity soon (please).
> 
> I reviewed and provided my feedback but saw neither a reply nor a
> respin with proposed changes.

OK, thanks.  Do you have time to put together a modified version of this?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-11  3:07         ` Andrew Morton
@ 2025-05-13 16:26           ` Suren Baghdasaryan
  2025-05-13 19:16             ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 16:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
	Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Sat, May 10, 2025 at 8:07 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Apr 2025 17:22:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
> > >
> > > >
> > > > ...
> > > >
> > > > >>
> > > > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > > > >> reproduce this issue on a multi node server when the maximum
> > > > >> memory pressure is reached and the swap is enabled
> > > > >>
> > > > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > > > >> ---
> > > > > What commit does it fix and should it be backported to -stable?
> > > > >
> > > > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > > > Andrew's mm.git repository now).
> > > > >
> > > > > Let's Cc the page allocator folks here!
> > > >
> > > > We first identified this issue in 6.6.52-stable , and through root cause
> > > > analysis,
> > > >
> > > > it appears the problem may have existed for a significant period.
> > > >
> > > > However It is recommended that the fix should be backported to at least
> > > > Linux kernel versions after 6.6-stable
> > >
> > > OK, thanks,
> > >
> > > This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> > > see some review activity soon (please).
> >
> > I reviewed and provided my feedback but saw neither a reply nor a
> > respin with proposed changes.
>
> OK, thanks.  Do you have time to put together a modified version of this?

I think the code is fine as is. Would be good to add Fixes: tag but it
will require some investigation to find the appropriate patch to
reference here.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-13 16:26           ` Suren Baghdasaryan
@ 2025-05-13 19:16             ` Andrew Morton
  2025-05-13 19:33               ` Suren Baghdasaryan
  2025-05-14  7:34               ` Vlastimil Babka
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2025-05-13 19:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
	Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> > > > This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> > > > see some review activity soon (please).
> > >
> > > I reviewed and provided my feedback but saw neither a reply nor a
> > > respin with proposed changes.
> >
> > OK, thanks.  Do you have time to put together a modified version of this?
> 
> I think the code is fine as is. Would be good to add Fixes: tag but it
> will require some investigation to find the appropriate patch to
> reference here.

Below is what is in mm-hotfixes.  It doesn't actually have any
acked-by's or reviewed-by's.

So... final call for review, please.


From: Tianyang Zhang <zhangtianyang@loongson.cn>
Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
Date: Wed, 16 Apr 2025 16:24:05 +0800

__alloc_pages_slowpath has no change detection for ac->nodemask in the
part of retry path, while cpuset can modify it in parallel.  For some
processes that set mempolicy as MPOL_BIND, this results ac->nodemask
changes, and then the should_reclaim_retry will judge based on the latest
nodemask and jump to retry, while the get_page_from_freelist only
traverses the zonelist from ac->preferred_zoneref, which selected by a
expired nodemask and may cause infinite retries in some cases

cpu 64:
__alloc_pages_slowpath {
        /* ..... */
retry:
        /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);
        /* cpu 1:
        cpuset_write_resmask
            update_nodemask
                update_nodemasks_hier
                    update_tasks_nodemask
                        mpol_rebind_task
                         mpol_rebind_policy
                          mpol_rebind_nodemask
		// mempolicy->nodes has been modified,
		// which ac->nodemask point to

        */
        /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;
}

Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
this issue on a multi node server when the maximum memory pressure is
reached and the swap is enabled

Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
+++ a/mm/page_alloc.c
@@ -4562,6 +4562,14 @@ restart:
 	}
 
 retry:
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * infinite retries.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;
+
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
_



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-13 19:16             ` Andrew Morton
@ 2025-05-13 19:33               ` Suren Baghdasaryan
  2025-05-14  7:34               ` Vlastimil Babka
  1 sibling, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 19:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
	Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Tue, May 13, 2025 at 12:16 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > > This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> > > > > see some review activity soon (please).
> > > >
> > > > I reviewed and provided my feedback but saw neither a reply nor a
> > > > respin with proposed changes.
> > >
> > > OK, thanks.  Do you have time to put together a modified version of this?
> >
> > I think the code is fine as is. Would be good to add Fixes: tag but it
> > will require some investigation to find the appropriate patch to
> > reference here.
>
> Below is what is in mm-hotfixes.  It doesn't actually have any
> acked-by's or reviewed-by's.
>
> So... final call for review, please.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
>
> From: Tianyang Zhang <zhangtianyang@loongson.cn>
> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
> Date: Wed, 16 Apr 2025 16:24:05 +0800
>
> __alloc_pages_slowpath has no change detection for ac->nodemask in the
> part of retry path, while cpuset can modify it in parallel.  For some
> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
> changes, and then the should_reclaim_retry will judge based on the latest
> nodemask and jump to retry, while the get_page_from_freelist only
> traverses the zonelist from ac->preferred_zoneref, which selected by a
> expired nodemask and may cause infinite retries in some cases
>
> cpu 64:
> __alloc_pages_slowpath {
>         /* ..... */
> retry:
>         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>         if (alloc_flags & ALLOC_KSWAPD)
>                 wake_all_kswapds(order, gfp_mask, ac);
>         /* cpu 1:
>         cpuset_write_resmask
>             update_nodemask
>                 update_nodemasks_hier
>                     update_tasks_nodemask
>                         mpol_rebind_task
>                          mpol_rebind_policy
>                           mpol_rebind_nodemask
>                 // mempolicy->nodes has been modified,
>                 // which ac->nodemask point to
>
>         */
>         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>                                  did_some_progress > 0, &no_progress_loops))
>                 goto retry;
> }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
> this issue on a multi node server when the maximum memory pressure is
> reached and the swap is enabled
>
> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  mm/page_alloc.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
> +++ a/mm/page_alloc.c
> @@ -4562,6 +4562,14 @@ restart:
>         }
>
>  retry:
> +       /*
> +        * Deal with possible cpuset update races or zonelist updates to avoid
> +        * infinite retries.
> +        */
> +       if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +           check_retry_zonelist(zonelist_iter_cookie))
> +               goto restart;
> +
>         /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>         if (alloc_flags & ALLOC_KSWAPD)
>                 wake_all_kswapds(order, gfp_mask, ac);
> _
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-23 15:35       ` Suren Baghdasaryan
@ 2025-05-14  7:15         ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-14  7:15 UTC (permalink / raw)
  To: Suren Baghdasaryan, Tianyang Zhang
  Cc: Harry Yoo, akpm, linux-mm, linux-kernel, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan

On 4/23/25 17:35, Suren Baghdasaryan wrote:
>> >> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
>> >> Andrew's mm.git repository now).
>> >>
>> >> Let's Cc the page allocator folks here!
>> >>
>> >> --
>> >> Cheers,
>> >> Harry / Hyeonggon
>> >>
>> >>>   mm/page_alloc.c | 8 ++++++++
>> >>>   1 file changed, 8 insertions(+)
>> >>>
>> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> >>> index fd6b865cb1ab..1e82f5214a42 100644
>> >>> --- a/mm/page_alloc.c
>> >>> +++ b/mm/page_alloc.c
>> >>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >>>        }
>> >>>
>> >>>   retry:
>> >>> +     /*
>> >>> +      * Deal with possible cpuset update races or zonelist updates to avoid
>> >>> +      * infinite retries.
>> >>> +      */
>> >>> +     if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>> >>> +         check_retry_zonelist(zonelist_iter_cookie))
>> >>> +             goto restart;
>> >>> +
>> > We have this check later in this block:
>> > https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
>> > so IIUC you effectively are moving it to be called before
>> > should_reclaim_retry(). If so, I think you should remove the old one
>> > (the one I linked earlier) as it seems to be unnecessary duplication
>> > at this point.
>> In my understanding, the code in
>>
>> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
>>
>> was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
>> in__alloc_pages_may_oom.
>>
>> If old code is removed, the newly added code (on retry loop entry)
>> cannot guarantee that the cpuset
>>
>> remains valid when the flow reaches in__alloc_pages_may_oom, especially
>> if scheduling occurs during this section.
> 
> Well, rescheduling can happen even between
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> and https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4657
> but I see your point. Also should_reclaim_retry() does not include

I think the rescheduling isn't a problem because what we're testing is "we
are about to oom, could it have been because we raced?" and the race would
have affected the code before #L4652. If we didn't race and yet determined
it's time for oom, a race between #L4652 and #L4657 shouldn't matter. The
get_page_from_freelist() in __alloc_pages_may_oom() isn't that important for
preventing premature oom AFAICS, given it uses high wmark.

That said, I think the newly added check could be more logically placed
above the call to should_reclaim_retry() instead of right after the retry:
label, but it's not critical.

> zonelist change detection, so keeping the checks at
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> sounds like a good idea.
> 
>>
>> Therefore, I think retaining the original code logic is necessary to
>> ensure correctness under concurrency.
>>
>> >
>> >
>> >>>        /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>> >>>        if (alloc_flags & ALLOC_KSWAPD)
>> >>>                wake_all_kswapds(order, gfp_mask, ac);
>> >>> --
>> >>> 2.20.1
>> >>>
>> >>>
>> Thanks
>>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-13 19:16             ` Andrew Morton
  2025-05-13 19:33               ` Suren Baghdasaryan
@ 2025-05-14  7:34               ` Vlastimil Babka
  2025-05-14 22:42                 ` Andrew Morton
  2025-05-15  3:19                 ` Tianyang Zhang
  1 sibling, 2 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-14  7:34 UTC (permalink / raw)
  To: Andrew Morton, Suren Baghdasaryan
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan

On 5/13/25 21:16, Andrew Morton wrote:
> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> 
>> > > > This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
>> > > > see some review activity soon (please).
>> > >
>> > > I reviewed and provided my feedback but saw neither a reply nor a
>> > > respin with proposed changes.
>> >
>> > OK, thanks.  Do you have time to put together a modified version of this?
>> 
>> I think the code is fine as is. Would be good to add Fixes: tag but it
>> will require some investigation to find the appropriate patch to
>> reference here.
> 
> Below is what is in mm-hotfixes.  It doesn't actually have any
> acked-by's or reviewed-by's.
> 
> So... final call for review, please.
> 
> 
> From: Tianyang Zhang <zhangtianyang@loongson.cn>
> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
> Date: Wed, 16 Apr 2025 16:24:05 +0800
> 
> __alloc_pages_slowpath has no change detection for ac->nodemask in the
> part of retry path, while cpuset can modify it in parallel.  For some
> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
> changes, and then the should_reclaim_retry will judge based on the latest
> nodemask and jump to retry, while the get_page_from_freelist only
> traverses the zonelist from ac->preferred_zoneref, which selected by a
> expired nodemask and may cause infinite retries in some cases
> 
> cpu 64:
> __alloc_pages_slowpath {
>         /* ..... */
> retry:
>         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>         if (alloc_flags & ALLOC_KSWAPD)
>                 wake_all_kswapds(order, gfp_mask, ac);
>         /* cpu 1:
>         cpuset_write_resmask
>             update_nodemask
>                 update_nodemasks_hier
>                     update_tasks_nodemask
>                         mpol_rebind_task
>                          mpol_rebind_policy
>                           mpol_rebind_nodemask
> 		// mempolicy->nodes has been modified,
> 		// which ac->nodemask point to
> 
>         */
>         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>                                  did_some_progress > 0, &no_progress_loops))
>                 goto retry;
> }
> 
> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
> this issue on a multi node server when the maximum memory pressure is
> reached and the swap is enabled
> 
> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").

After the discussion in this thread, Suren retracted this Fixes: suggestion.
I think it actually goes back to this one which introduced the
preferred_zoneref caching.

Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
zonelist twice")

> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

I would have placed the check bit further down, just above the
should_reclaim_retry() call, but it's not that important to hold up a fix
and can be done later.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
> 
>  mm/page_alloc.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
> +++ a/mm/page_alloc.c
> @@ -4562,6 +4562,14 @@ restart:
>  	}
>  
>  retry:
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * infinite retries.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
> +		goto restart;
> +
>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
> _
> 



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-14  7:34               ` Vlastimil Babka
@ 2025-05-14 22:42                 ` Andrew Morton
  2025-05-15  3:19                 ` Tianyang Zhang
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2025-05-14 22:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Suren Baghdasaryan, Tianyang Zhang, Harry Yoo, linux-mm,
	linux-kernel, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Wed, 14 May 2025 09:34:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 5/13/25 21:16, Andrew Morton wrote:
> > On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> > 
> >> > > > This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> >> > > > see some review activity soon (please).
> >> > >
> >> > > I reviewed and provided my feedback but saw neither a reply nor a
> >> > > respin with proposed changes.
> >> >
> >> > OK, thanks.  Do you have time to put together a modified version of this?
> >> 
> >> I think the code is fine as is. Would be good to add Fixes: tag but it
> >> will require some investigation to find the appropriate patch to
> >> reference here.
> > 
> > Below is what is in mm-hotfixes.  It doesn't actually have any
> > acked-by's or reviewed-by's.
> > 
> > So... final call for review, please.
> > 
> 
> ...
>
> After the discussion in this thread, Suren retracted this Fixes: suggestion.
> I think it actually goes back to this one which introduced the
> preferred_zoneref caching.
> 
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
> zonelist twice")

Updated.

> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-14  7:34               ` Vlastimil Babka
  2025-05-14 22:42                 ` Andrew Morton
@ 2025-05-15  3:19                 ` Tianyang Zhang
  1 sibling, 0 replies; 16+ messages in thread
From: Tianyang Zhang @ 2025-05-15  3:19 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan
  Cc: Harry Yoo, linux-mm, linux-kernel, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan

Hi,

在 2025/5/14 下午3:34, Vlastimil Babka 写道:
> On 5/13/25 21:16, Andrew Morton wrote:
>> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>>
>>>>>> This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
>>>>>> see some review activity soon (please).
>>>>> I reviewed and provided my feedback but saw neither a reply nor a
>>>>> respin with proposed changes.
>>>> OK, thanks.  Do you have time to put together a modified version of this?
>>> I think the code is fine as is. Would be good to add Fixes: tag but it
>>> will require some investigation to find the appropriate patch to
>>> reference here.
>> Below is what is in mm-hotfixes.  It doesn't actually have any
>> acked-by's or reviewed-by's.
>>
>> So... final call for review, please.
>>
>>
>> From: Tianyang Zhang <zhangtianyang@loongson.cn>
>> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
>> Date: Wed, 16 Apr 2025 16:24:05 +0800
>>
>> __alloc_pages_slowpath has no change detection for ac->nodemask in the
>> part of retry path, while cpuset can modify it in parallel.  For some
>> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
>> changes, and then the should_reclaim_retry will judge based on the latest
>> nodemask and jump to retry, while the get_page_from_freelist only
>> traverses the zonelist from ac->preferred_zoneref, which selected by a
>> expired nodemask and may cause infinite retries in some cases
>>
>> cpu 64:
>> __alloc_pages_slowpath {
>>          /* ..... */
>> retry:
>>          /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>>          if (alloc_flags & ALLOC_KSWAPD)
>>                  wake_all_kswapds(order, gfp_mask, ac);
>>          /* cpu 1:
>>          cpuset_write_resmask
>>              update_nodemask
>>                  update_nodemasks_hier
>>                      update_tasks_nodemask
>>                          mpol_rebind_task
>>                           mpol_rebind_policy
>>                            mpol_rebind_nodemask
>> 		// mempolicy->nodes has been modified,
>> 		// which ac->nodemask point to
>>
>>          */
>>          /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>>          if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>                                   did_some_progress > 0, &no_progress_loops))
>>                  goto retry;
>> }
>>
>> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
>> this issue on a multi node server when the maximum memory pressure is
>> reached and the swap is enabled
>>
>> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
>> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
> After the discussion in this thread, Suren retracted this Fixes: suggestion.
> I think it actually goes back to this one which introduced the
> preferred_zoneref caching.
>
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
> zonelist twice")
Yes, the problem should be introduced by this patch, thank you
>
>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Brendan Jackman <jackmanb@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> I would have placed the check bit further down, just above the
> should_reclaim_retry() call, but it's not that important to hold up a fix
> and can be done later.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
>> ---
>>
>>   mm/page_alloc.c |    8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
>> +++ a/mm/page_alloc.c
>> @@ -4562,6 +4562,14 @@ restart:
>>   	}
>>   
>>   retry:
>> +	/*
>> +	 * Deal with possible cpuset update races or zonelist updates to avoid
>> +	 * infinite retries.
>> +	 */
>> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>> +	    check_retry_zonelist(zonelist_iter_cookie))
>> +		goto restart;
>> +
>>   	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>>   	if (alloc_flags & ALLOC_KSWAPD)
>>   		wake_all_kswapds(order, gfp_mask, ac);
>> _
>>



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-05-15  3:20 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-16  8:24 [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race Tianyang Zhang
2025-04-21 10:00 ` Harry Yoo
2025-04-21 20:28   ` Suren Baghdasaryan
2025-04-23  2:38     ` Tianyang Zhang
2025-04-23 15:35       ` Suren Baghdasaryan
2025-05-14  7:15         ` Vlastimil Babka
2025-04-22 12:10   ` Tianyang Zhang
2025-04-23  0:11     ` Andrew Morton
2025-04-23  0:22       ` Suren Baghdasaryan
2025-05-11  3:07         ` Andrew Morton
2025-05-13 16:26           ` Suren Baghdasaryan
2025-05-13 19:16             ` Andrew Morton
2025-05-13 19:33               ` Suren Baghdasaryan
2025-05-14  7:34               ` Vlastimil Babka
2025-05-14 22:42                 ` Andrew Morton
2025-05-15  3:19                 ` Tianyang Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).