* [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
@ 2025-04-16  8:24 Tianyang Zhang
  2025-04-21 10:00 ` Harry Yoo
  0 siblings, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-16  8:24 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, Tianyang Zhang

__alloc_pages_slowpath has no change detection for ac->nodemask
in the part of retry path, while cpuset can modify it in parallel.
For some processes that set mempolicy as MPOL_BIND, this results
ac->nodemask changes, and then the should_reclaim_retry will
judge based on the latest nodemask and jump to retry, while the
get_page_from_freelist only traverses the zonelist from
ac->preferred_zoneref, which selected by a expired nodemask
and may cause infinite retries in some cases

cpu 64:
__alloc_pages_slowpath {
        /* ..... */
        retry:
        /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);
        /* cpu 1:
        cpuset_write_resmask
            update_nodemask
                update_nodemasks_hier
                    update_tasks_nodemask
                        mpol_rebind_task
                            mpol_rebind_policy
                                mpol_rebind_nodemask
        // mempolicy->nodes has been modified,
        // which ac->nodemask point to

        */
        /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;
}

Simultaneously starting multiple cpuset01 from LTP can quickly
reproduce this issue on a multi node server when the maximum
memory pressure is reached and the swap is enabled

Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
---
 mm/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd6b865cb1ab..1e82f5214a42 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 retry:
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * infinite retries.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;
+
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread
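[Editorial sketch] The race described in the patch above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: every `model_*` name is hypothetical and none of it is kernel code. The model assumes two nodes, node 1 exhausted and node 0 with free pages, an allocation attempt that walks forward only from a cached starting position (standing in for ac->preferred_zoneref), and a retry heuristic that looks at every node allowed by the current mask (standing in for should_reclaim_retry()). The "cpuset write" is simulated inline after the first failed attempt rather than on another CPU.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_NODES 2

/* Hypothetical model state; these are not kernel structures. */
struct model_state {
	unsigned int seq;                 /* bumped on every nodemask update */
	unsigned long nodemask;           /* bit n set => node n is allowed */
	long free_pages[MODEL_NR_NODES];  /* free pages per node */
};

/* First position allowed by the mask (stand-in for preferred_zoneref). */
static int model_first_allowed(const struct model_state *s)
{
	for (int nid = 0; nid < MODEL_NR_NODES; nid++)
		if (s->nodemask & (1UL << nid))
			return nid;
	return MODEL_NR_NODES;
}

/* Allocation attempt: walks forward from the cached start position only. */
static bool model_try_alloc(const struct model_state *s, int start)
{
	for (int nid = start; nid < MODEL_NR_NODES; nid++)
		if ((s->nodemask & (1UL << nid)) && s->free_pages[nid] > 0)
			return true;
	return false;
}

/* Retry heuristic: looks at every node allowed by the current mask. */
static bool model_should_retry(const struct model_state *s)
{
	for (int nid = 0; nid < MODEL_NR_NODES; nid++)
		if ((s->nodemask & (1UL << nid)) && s->free_pages[nid] > 0)
			return true;
	return false;
}

/*
 * Returns the number of attempts needed to allocate, 0 if the model gave
 * up (where the real code would head for OOM), or -1 once max_attempts
 * is exhausted (the model's stand-in for "retries forever").
 */
static int model_slowpath(struct model_state *s, bool check_at_retry,
			  int max_attempts)
{
	int attempts = 0;
	unsigned int cookie;
	int start;

	assert(max_attempts > 0);
restart:
	cookie = s->seq;
	start = model_first_allowed(s);
retry:
	/* The check the patch adds: a changed cookie forces a restart. */
	if (check_at_retry && cookie != s->seq)
		goto restart;
	if (attempts >= max_attempts)
		return -1;
	attempts++;
	if (model_try_alloc(s, start))
		return attempts;
	/* Simulated concurrent cpuset write after the first failure. */
	if (attempts == 1) {
		s->nodemask = 0x3;
		s->seq++;
	}
	if (model_should_retry(s))
		goto retry;
	return 0;
}

/* Runs the scenario: only node 1 allowed at first, and it is empty. */
int model_run(bool check_at_retry)
{
	struct model_state s = {
		.seq = 0,
		.nodemask = 0x2,
		.free_pages = { 100, 0 },
	};

	return model_slowpath(&s, check_at_retry, 1000);
}
```

In this model, `model_run(false)` keeps retrying from the stale start position until the attempt cap is hit, while `model_run(true)` notices the changed cookie, recomputes the start position and allocates from node 0 on the second attempt.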
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-16  8:24 [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race Tianyang Zhang
@ 2025-04-21 10:00 ` Harry Yoo
  2025-04-21 20:28   ` Suren Baghdasaryan
  2025-04-22 12:10   ` Tianyang Zhang
  0 siblings, 2 replies; 16+ messages in thread
From: Harry Yoo @ 2025-04-21 10:00 UTC (permalink / raw)
  To: Tianyang Zhang
  Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> __alloc_pages_slowpath has no change detection for ac->nodemask
> in the part of retry path, while cpuset can modify it in parallel.
> For some processes that set mempolicy as MPOL_BIND, this results
> ac->nodemask changes, and then the should_reclaim_retry will
> judge based on the latest nodemask and jump to retry, while the
> get_page_from_freelist only traverses the zonelist from
> ac->preferred_zoneref, which selected by a expired nodemask
> and may cause infinite retries in some cases
>
> cpu 64:
> __alloc_pages_slowpath {
>         /* ..... */
>         retry:
>         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>         if (alloc_flags & ALLOC_KSWAPD)
>                 wake_all_kswapds(order, gfp_mask, ac);
>         /* cpu 1:
>         cpuset_write_resmask
>             update_nodemask
>                 update_nodemasks_hier
>                     update_tasks_nodemask
>                         mpol_rebind_task
>                             mpol_rebind_policy
>                                 mpol_rebind_nodemask
>         // mempolicy->nodes has been modified,
>         // which ac->nodemask point to
>
>         */
>         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>                                  did_some_progress > 0, &no_progress_loops))
>                 goto retry;
> }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly
> reproduce this issue on a multi node server when the maximum
> memory pressure is reached and the swap is enabled
>
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> ---

What commit does it fix and should it be backported to -stable?

There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
Andrew's mm.git repository now).

Let's Cc the page allocator folks here!

-- 
Cheers,
Harry / Hyeonggon

>  mm/page_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fd6b865cb1ab..1e82f5214a42 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>
>  retry:
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * infinite retries.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
> +		goto restart;
> +
>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
> --
> 2.20.1
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-21 10:00 ` Harry Yoo
@ 2025-04-21 20:28   ` Suren Baghdasaryan
  2025-04-23  2:38     ` Tianyang Zhang
  2025-04-22 12:10   ` Tianyang Zhang
  1 sibling, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-21 20:28 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Tianyang Zhang, akpm, linux-mm, linux-kernel, Vlastimil Babka,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> > __alloc_pages_slowpath has no change detection for ac->nodemask
> > in the part of retry path, while cpuset can modify it in parallel.
> > For some processes that set mempolicy as MPOL_BIND, this results
> > ac->nodemask changes, and then the should_reclaim_retry will
> > judge based on the latest nodemask and jump to retry, while the
> > get_page_from_freelist only traverses the zonelist from
> > ac->preferred_zoneref, which selected by a expired nodemask
> > and may cause infinite retries in some cases
> >
> > cpu 64:
> > __alloc_pages_slowpath {
> >         /* ..... */
> >         retry:
> >         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> >         if (alloc_flags & ALLOC_KSWAPD)
> >                 wake_all_kswapds(order, gfp_mask, ac);
> >         /* cpu 1:
> >         cpuset_write_resmask
> >             update_nodemask
> >                 update_nodemasks_hier
> >                     update_tasks_nodemask
> >                         mpol_rebind_task
> >                             mpol_rebind_policy
> >                                 mpol_rebind_nodemask
> >         // mempolicy->nodes has been modified,
> >         // which ac->nodemask point to
> >
> >         */
> >         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> >         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> >                                  did_some_progress > 0, &no_progress_loops))
> >                 goto retry;
> > }
> >
> > Simultaneously starting multiple cpuset01 from LTP can quickly
> > reproduce this issue on a multi node server when the maximum
> > memory pressure is reached and the swap is enabled
> >
> > Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > ---
>
> What commit does it fix and should it be backported to -stable?

I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
due to race with cpuset update").

>
> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> Andrew's mm.git repository now).
>
> Let's Cc the page allocator folks here!
>
> --
> Cheers,
> Harry / Hyeonggon
>
> >  mm/page_alloc.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index fd6b865cb1ab..1e82f5214a42 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	}
> >
> >  retry:
> > +	/*
> > +	 * Deal with possible cpuset update races or zonelist updates to avoid
> > +	 * infinite retries.
> > +	 */
> > +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> > +	    check_retry_zonelist(zonelist_iter_cookie))
> > +		goto restart;
> > +

We have this check later in this block:
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
so IIUC you effectively are moving it to be called before
should_reclaim_retry(). If so, I think you should remove the old one
(the one I linked earlier) as it seems to be unnecessary duplication
at this point.

> >  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> >  	if (alloc_flags & ALLOC_KSWAPD)
> >  		wake_all_kswapds(order, gfp_mask, ac);
> > --
> > 2.20.1
> >
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-21 20:28   ` Suren Baghdasaryan
@ 2025-04-23  2:38     ` Tianyang Zhang
  2025-04-23 15:35       ` Suren Baghdasaryan
  0 siblings, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-23  2:38 UTC (permalink / raw)
  To: Suren Baghdasaryan, Harry Yoo
  Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan

Hi, Suren

On 2025/4/22 4:28 AM, Suren Baghdasaryan wrote:
> On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
>>> __alloc_pages_slowpath has no change detection for ac->nodemask
>>> in the part of retry path, while cpuset can modify it in parallel.
>>> For some processes that set mempolicy as MPOL_BIND, this results
>>> ac->nodemask changes, and then the should_reclaim_retry will
>>> judge based on the latest nodemask and jump to retry, while the
>>> get_page_from_freelist only traverses the zonelist from
>>> ac->preferred_zoneref, which selected by a expired nodemask
>>> and may cause infinite retries in some cases
>>>
>>> cpu 64:
>>> __alloc_pages_slowpath {
>>>         /* ..... */
>>>         retry:
>>>         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>>>         if (alloc_flags & ALLOC_KSWAPD)
>>>                 wake_all_kswapds(order, gfp_mask, ac);
>>>         /* cpu 1:
>>>         cpuset_write_resmask
>>>             update_nodemask
>>>                 update_nodemasks_hier
>>>                     update_tasks_nodemask
>>>                         mpol_rebind_task
>>>                             mpol_rebind_policy
>>>                                 mpol_rebind_nodemask
>>>         // mempolicy->nodes has been modified,
>>>         // which ac->nodemask point to
>>>
>>>         */
>>>         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>>>         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>>                                  did_some_progress > 0, &no_progress_loops))
>>>                 goto retry;
>>> }
>>>
>>> Simultaneously starting multiple cpuset01 from LTP can quickly
>>> reproduce this issue on a multi node server when the maximum
>>> memory pressure is reached and the swap is enabled
>>>
>>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>>> ---
>> What commit does it fix and should it be backported to -stable?
> I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
> due to race with cpuset update").

I think this issue is unlikely to have been introduced by patch
902b62810a57, as the infinite-retries section from

https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4568
to
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4628

where the cpuset race condition occurs remains unmodified in the logic
of patch 902b62810a57.

>> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
>> Andrew's mm.git repository now).
>>
>> Let's Cc the page allocator folks here!
>>
>> --
>> Cheers,
>> Harry / Hyeonggon
>>
>>>  mm/page_alloc.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index fd6b865cb1ab..1e82f5214a42 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>>  	}
>>>
>>>  retry:
>>> +	/*
>>> +	 * Deal with possible cpuset update races or zonelist updates to avoid
>>> +	 * infinite retries.
>>> +	 */
>>> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>>> +	    check_retry_zonelist(zonelist_iter_cookie))
>>> +		goto restart;
>>> +
> We have this check later in this block:
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
> so IIUC you effectively are moving it to be called before
> should_reclaim_retry(). If so, I think you should remove the old one
> (the one I linked earlier) as it seems to be unnecessary duplication
> at this point.

In my understanding, the code in

https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652

was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
in __alloc_pages_may_oom.

If the old code is removed, the newly added code (on retry loop entry)
cannot guarantee that the cpuset remains valid when the flow reaches
__alloc_pages_may_oom, especially if scheduling occurs during this
section.

Therefore, I think retaining the original code logic is necessary to
ensure correctness under concurrency.

>
>
>>>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>>>  	if (alloc_flags & ALLOC_KSWAPD)
>>>  		wake_all_kswapds(order, gfp_mask, ac);
>>> --
>>> 2.20.1
>>>
>>>
Thanks

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-23  2:38     ` Tianyang Zhang
@ 2025-04-23 15:35       ` Suren Baghdasaryan
  2025-05-14  7:15         ` Vlastimil Babka
  0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-23 15:35 UTC (permalink / raw)
  To: Tianyang Zhang
  Cc: Harry Yoo, akpm, linux-mm, linux-kernel, Vlastimil Babka,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Tue, Apr 22, 2025 at 7:39 PM Tianyang Zhang
<zhangtianyang@loongson.cn> wrote:
>
> Hi, Suren
>
> On 2025/4/22 4:28 AM, Suren Baghdasaryan wrote:
> > On Mon, Apr 21, 2025 at 3:00 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
> >>> __alloc_pages_slowpath has no change detection for ac->nodemask
> >>> in the part of retry path, while cpuset can modify it in parallel.
> >>> For some processes that set mempolicy as MPOL_BIND, this results
> >>> ac->nodemask changes, and then the should_reclaim_retry will
> >>> judge based on the latest nodemask and jump to retry, while the
> >>> get_page_from_freelist only traverses the zonelist from
> >>> ac->preferred_zoneref, which selected by a expired nodemask
> >>> and may cause infinite retries in some cases
> >>>
> >>> cpu 64:
> >>> __alloc_pages_slowpath {
> >>>         /* ..... */
> >>>         retry:
> >>>         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
> >>>         if (alloc_flags & ALLOC_KSWAPD)
> >>>                 wake_all_kswapds(order, gfp_mask, ac);
> >>>         /* cpu 1:
> >>>         cpuset_write_resmask
> >>>             update_nodemask
> >>>                 update_nodemasks_hier
> >>>                     update_tasks_nodemask
> >>>                         mpol_rebind_task
> >>>                             mpol_rebind_policy
> >>>                                 mpol_rebind_nodemask
> >>>         // mempolicy->nodes has been modified,
> >>>         // which ac->nodemask point to
> >>>
> >>>         */
> >>>         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
> >>>         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> >>>                                  did_some_progress > 0, &no_progress_loops))
> >>>                 goto retry;
> >>> }
> >>>
> >>> Simultaneously starting multiple cpuset01 from LTP can quickly
> >>> reproduce this issue on a multi node server when the maximum
> >>> memory pressure is reached and the swap is enabled
> >>>
> >>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> >>> ---
> >> What commit does it fix and should it be backported to -stable?
> > I think it fixes 902b62810a57 ("mm, page_alloc: fix more premature OOM
> > due to race with cpuset update").
>
> I think this issue is unlikely to have been introduced by Patch
> 902b62810a57 ,
>
> as the infinite-reties section from
>
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4568
> to
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4628
>
> where the cpuset race condition occurs remains unmodified in the logic
> of Patch 902b62810a57.

Yeah, you are right. After looking into it some more, 902b62810a57 is
the wrong patch to blame for this infinite loop.

>
> >> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> >> Andrew's mm.git repository now).
> >>
> >> Let's Cc the page allocator folks here!
> >>
> >> --
> >> Cheers,
> >> Harry / Hyeonggon
> >>
> >>>  mm/page_alloc.c | 8 ++++++++
> >>>  1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index fd6b865cb1ab..1e82f5214a42 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>>  	}
> >>>
> >>>  retry:
> >>> +	/*
> >>> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> >>> +	 * infinite retries.
> >>> +	 */
> >>> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> >>> +	    check_retry_zonelist(zonelist_iter_cookie))
> >>> +		goto restart;
> >>> +
> > We have this check later in this block:
> > https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
> > so IIUC you effectively are moving it to be called before
> > should_reclaim_retry(). If so, I think you should remove the old one
> > (the one I linked earlier) as it seems to be unnecessary duplication
> > at this point.
> In my understanding, the code in
>
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
>
> was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
> in__alloc_pages_may_oom.
>
> If old code is removed, the newly added code (on retry loop entry)
> cannot guarantee that the cpuset
>
> remains valid when the flow reaches in__alloc_pages_may_oom, especially
> if scheduling occurs during this section.

Well, rescheduling can happen even between
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
and https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4657
but I see your point. Also should_reclaim_retry() does not include
zonelist change detection, so keeping the checks at
https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
sounds like a good idea.

>
> Therefore, I think retaining the original code logic is necessary to
> ensure correctness under concurrency.
>
> >
> >
> >>>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> >>>  	if (alloc_flags & ALLOC_KSWAPD)
> >>>  		wake_all_kswapds(order, gfp_mask, ac);
> >>> --
> >>> 2.20.1
> >>>
> >>>
> Thanks
>

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-23 15:35       ` Suren Baghdasaryan
@ 2025-05-14  7:15         ` Vlastimil Babka
  0 siblings, 0 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-14  7:15 UTC (permalink / raw)
  To: Suren Baghdasaryan, Tianyang Zhang
  Cc: Harry Yoo, akpm, linux-mm, linux-kernel, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan

On 4/23/25 17:35, Suren Baghdasaryan wrote:
>> >> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
>> >> Andrew's mm.git repository now).
>> >>
>> >> Let's Cc the page allocator folks here!
>> >>
>> >> --
>> >> Cheers,
>> >> Harry / Hyeonggon
>> >>
>> >>>  mm/page_alloc.c | 8 ++++++++
>> >>>  1 file changed, 8 insertions(+)
>> >>>
>> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> >>> index fd6b865cb1ab..1e82f5214a42 100644
>> >>> --- a/mm/page_alloc.c
>> >>> +++ b/mm/page_alloc.c
>> >>> @@ -4530,6 +4530,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >>>  	}
>> >>>
>> >>>  retry:
>> >>> +	/*
>> >>> +	 * Deal with possible cpuset update races or zonelist updates to avoid
>> >>> +	 * infinite retries.
>> >>> +	 */
>> >>> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>> >>> +	    check_retry_zonelist(zonelist_iter_cookie))
>> >>> +		goto restart;
>> >>> +
>> > We have this check later in this block:
>> > https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652,
>> > so IIUC you effectively are moving it to be called before
>> > should_reclaim_retry(). If so, I think you should remove the old one
>> > (the one I linked earlier) as it seems to be unnecessary duplication
>> > at this point.
>> In my understanding, the code in
>>
>> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
>>
>> was introduced to prevent unnecessary OOM (Out-of-Memory) conditions
>> in__alloc_pages_may_oom.
>>
>> If old code is removed, the newly added code (on retry loop entry)
>> cannot guarantee that the cpuset
>>
>> remains valid when the flow reaches in__alloc_pages_may_oom, especially
>> if scheduling occurs during this section.
>
> Well, rescheduling can happen even between
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> and https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4657
> but I see your point. Also should_reclaim_retry() does not include

I think the rescheduling isn't a problem because what we're testing is "we
are about to oom, could it have been because we raced?" and the race would
have affected the code before #L4652. If we didn't race and yet determined
it's time for oom, a race between #L4652 and #L4657 shouldn't matter. The
get_page_from_freelist() in __alloc_pages_may_oom() isn't that important for
preventing premature oom AFAICS, given it uses high wmark.

That said, I think the newly added check could be more logically placed
above the call to should_reclaim_retry() instead of right after the retry:
label, but it's not critical.

> zonelist change detection, so keeping the checks at
> https://elixir.bootlin.com/linux/v6.15-rc3/source/mm/page_alloc.c#L4652
> sounds like a good idea.
>
>>
>> Therefore, I think retaining the original code logic is necessary to
>> ensure correctness under concurrency.
>>
>> >
>> >
>> >>>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>> >>>  	if (alloc_flags & ALLOC_KSWAPD)
>> >>>  		wake_all_kswapds(order, gfp_mask, ac);
>> >>> --
>> >>> 2.20.1
>> >>>
>> >>>
>> Thanks
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-21 10:00 ` Harry Yoo
  2025-04-21 20:28   ` Suren Baghdasaryan
@ 2025-04-22 12:10   ` Tianyang Zhang
  2025-04-23  0:11     ` Andrew Morton
  1 sibling, 1 reply; 16+ messages in thread
From: Tianyang Zhang @ 2025-04-22 12:10 UTC (permalink / raw)
  To: Harry Yoo
  Cc: akpm, linux-mm, linux-kernel, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

Hi.

On 2025/4/21 6:00 PM, Harry Yoo wrote:
> On Wed, Apr 16, 2025 at 04:24:05PM +0800, Tianyang Zhang wrote:
>> __alloc_pages_slowpath has no change detection for ac->nodemask
>> in the part of retry path, while cpuset can modify it in parallel.
>> For some processes that set mempolicy as MPOL_BIND, this results
>> ac->nodemask changes, and then the should_reclaim_retry will
>> judge based on the latest nodemask and jump to retry, while the
>> get_page_from_freelist only traverses the zonelist from
>> ac->preferred_zoneref, which selected by a expired nodemask
>> and may cause infinite retries in some cases
>>
>> cpu 64:
>> __alloc_pages_slowpath {
>>         /* ..... */
>>         retry:
>>         /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>>         if (alloc_flags & ALLOC_KSWAPD)
>>                 wake_all_kswapds(order, gfp_mask, ac);
>>         /* cpu 1:
>>         cpuset_write_resmask
>>             update_nodemask
>>                 update_nodemasks_hier
>>                     update_tasks_nodemask
>>                         mpol_rebind_task
>>                             mpol_rebind_policy
>>                                 mpol_rebind_nodemask
>>         // mempolicy->nodes has been modified,
>>         // which ac->nodemask point to
>>
>>         */
>>         /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>>         if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>                                  did_some_progress > 0, &no_progress_loops))
>>                 goto retry;
>> }
>>
>> Simultaneously starting multiple cpuset01 from LTP can quickly
>> reproduce this issue on a multi node server when the maximum
>> memory pressure is reached and the swap is enabled
>>
>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>> ---
> What commit does it fix and should it be backported to -stable?
>
> There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> Andrew's mm.git repository now).
>
> Let's Cc the page allocator folks here!

We first identified this issue in 6.6.52-stable, and root cause
analysis suggests the problem may have existed for a significant
period.

However, we recommend that the fix be backported to at least the
Linux kernel versions after 6.6-stable.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-22 12:10   ` Tianyang Zhang
@ 2025-04-23  0:11     ` Andrew Morton
  2025-04-23  0:22       ` Suren Baghdasaryan
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2025-04-23  0:11 UTC (permalink / raw)
  To: Tianyang Zhang
  Cc: Harry Yoo, linux-mm, linux-kernel, Vlastimil Babka,
	Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
	Johannes Weiner, Zi Yan

On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:

>
> ...
>
> >>
> >> Simultaneously starting multiple cpuset01 from LTP can quickly
> >> reproduce this issue on a multi node server when the maximum
> >> memory pressure is reached and the swap is enabled
> >>
> >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> >> ---
> > What commit does it fix and should it be backported to -stable?
> >
> > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > Andrew's mm.git repository now).
> >
> > Let's Cc the page allocator folks here!
>
> We first identified this issue in 6.6.52-stable , and through root cause
> analysis,
>
> it appears the problem may have existed for a significant period.
>
> However It is recommended that the fix should be backported to at least
> Linux kernel versions after 6.6-stable

OK, thanks,

This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
see some review activity soon (please).

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-23  0:11     ` Andrew Morton
@ 2025-04-23  0:22       ` Suren Baghdasaryan
  2025-05-11  3:07         ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-04-23  0:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
	Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
>
> >
> > ...
> >
> > >>
> > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > >> reproduce this issue on a multi node server when the maximum
> > >> memory pressure is reached and the swap is enabled
> > >>
> > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > >> ---
> > > What commit does it fix and should it be backported to -stable?
> > >
> > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > Andrew's mm.git repository now).
> > >
> > > Let's Cc the page allocator folks here!
> >
> > We first identified this issue in 6.6.52-stable , and through root cause
> > analysis,
> >
> > it appears the problem may have existed for a significant period.
> >
> > However It is recommended that the fix should be backported to at least
> > Linux kernel versions after 6.6-stable
>
> OK, thanks,
>
> This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> see some review activity soon (please).

I reviewed and provided my feedback but saw neither a reply nor a
respin with proposed changes.

>

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-04-23  0:22       ` Suren Baghdasaryan
@ 2025-05-11  3:07         ` Andrew Morton
  2025-05-13 16:26           ` Suren Baghdasaryan
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2025-05-11  3:07 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
	Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Tue, 22 Apr 2025 17:22:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
> >
> > >
> > > ...
> > >
> > > >>
> > > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > > >> reproduce this issue on a multi node server when the maximum
> > > >> memory pressure is reached and the swap is enabled
> > > >>
> > > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > > >> ---
> > > > What commit does it fix and should it be backported to -stable?
> > > >
> > > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > > Andrew's mm.git repository now).
> > > >
> > > > Let's Cc the page allocator folks here!
> > >
> > > We first identified this issue in 6.6.52-stable , and through root cause
> > > analysis,
> > >
> > > it appears the problem may have existed for a significant period.
> > >
> > > However It is recommended that the fix should be backported to at least
> > > Linux kernel versions after 6.6-stable
> >
> > OK, thanks,
> >
> > This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> > see some review activity soon (please).
>
> I reviewed and provided my feedback but saw neither a reply nor a
> respin with proposed changes.

OK, thanks.  Do you have time to put together a modified version of this?

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-11  3:07         ` Andrew Morton
@ 2025-05-13 16:26           ` Suren Baghdasaryan
  2025-05-13 19:16             ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 16:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
	Vlastimil Babka, Michal Hocko, Brendan Jackman, Johannes Weiner,
	Zi Yan

On Sat, May 10, 2025 at 8:07 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 22 Apr 2025 17:22:04 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > On Tue, Apr 22, 2025 at 5:11 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 22 Apr 2025 20:10:06 +0800 Tianyang Zhang <zhangtianyang@loongson.cn> wrote:
> > >
> > > >
> > > > ...
> > > >
> > > > >>
> > > > >> Simultaneously starting multiple cpuset01 from LTP can quickly
> > > > >> reproduce this issue on a multi node server when the maximum
> > > > >> memory pressure is reached and the swap is enabled
> > > > >>
> > > > >> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> > > > >> ---
> > > > > What commit does it fix and should it be backported to -stable?
> > > > >
> > > > > There's a new 'MEMORY MANAGEMENT - PAGE ALLOCATOR' entry (only in
> > > > > Andrew's mm.git repository now).
> > > > >
> > > > > Let's Cc the page allocator folks here!
> > > >
> > > > We first identified this issue in 6.6.52-stable , and through root cause
> > > > analysis,
> > > >
> > > > it appears the problem may have existed for a significant period.
> > > >
> > > > However It is recommended that the fix should be backported to at least
> > > > Linux kernel versions after 6.6-stable
> > >
> > > OK, thanks,
> > >
> > > This has been in mm-hotfixes-unstable for six days.  Hopefully we'll
> > > see some review activity soon (please).
> >
> > I reviewed and provided my feedback but saw neither a reply nor a
> > respin with proposed changes.
>
> OK, thanks.  Do you have time to put together a modified version of this?

I think the code is fine as is. Would be good to add Fixes: tag but it
will require some investigation to find the appropriate patch to
reference here.

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-13 16:26 ` Suren Baghdasaryan
@ 2025-05-13 19:16 ` Andrew Morton
  2025-05-13 19:33 ` Suren Baghdasaryan
  2025-05-14  7:34 ` Vlastimil Babka
  0 siblings, 2 replies; 16+ messages in thread
From: Andrew Morton @ 2025-05-13 19:16 UTC
To: Suren Baghdasaryan
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel, Vlastimil Babka,
    Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > > > see some review activity soon (please).
> > >
> > > I reviewed and provided my feedback but saw neither a reply nor a
> > > respin with proposed changes.
> >
> > OK, thanks. Do you have time to put together a modified version of this?
>
> I think the code is fine as is. Would be good to add Fixes: tag but it
> will require some investigation to find the appropriate patch to
> reference here.

Below is what is in mm-hotfixes. It doesn't actually have any
acked-by's or reviewed-by's.

So... final call for review, please.


From: Tianyang Zhang <zhangtianyang@loongson.cn>
Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
Date: Wed, 16 Apr 2025 16:24:05 +0800

__alloc_pages_slowpath has no change detection for ac->nodemask in the
part of retry path, while cpuset can modify it in parallel. For some
processes that set mempolicy as MPOL_BIND, this results ac->nodemask
changes, and then the should_reclaim_retry will judge based on the latest
nodemask and jump to retry, while the get_page_from_freelist only
traverses the zonelist from ac->preferred_zoneref, which selected by a
expired nodemask and may cause infinite retries in some cases

  cpu 64:
  __alloc_pages_slowpath {
      /* ..... */
      retry:
          /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
          if (alloc_flags & ALLOC_KSWAPD)
              wake_all_kswapds(order, gfp_mask, ac);
          /* cpu 1:
          cpuset_write_resmask
              update_nodemask
                  update_nodemasks_hier
                      update_tasks_nodemask
                          mpol_rebind_task
                              mpol_rebind_policy
                                  mpol_rebind_nodemask
          // mempolicy->nodes has been modified,
          // which ac->nodemask point to

          */
          /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
          if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                   did_some_progress > 0, &no_progress_loops))
              goto retry;
  }

Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
this issue on a multi node server when the maximum memory pressure is
reached and the swap is enabled

Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
+++ a/mm/page_alloc.c
@@ -4562,6 +4562,14 @@ restart:
 	}
 
 retry:
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * infinite retries.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;
+
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
_
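[Editorial illustration, not part of the posted patch.] The race described
in the changelog can be modelled in a few lines of userspace C. This is a
minimal sketch under stated assumptions: every name below (model_state,
model_slowpath, and so on) is hypothetical and is not a kernel API, the
sequence counter is only a stand-in for the cookie that check_retry_cpuset()
compares, and "scan from the cached preferred node onward" is a simplification
of how get_page_from_freelist walks the zonelist from ac->preferred_zoneref.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NODES 2

/* Hypothetical model state; none of these names exist in the kernel. */
struct model_state {
	unsigned int seq;                      /* bumped on every nodemask update */
	unsigned long nodemask;                /* bit n set: node n is allowed */
	unsigned long free_pages[MODEL_NODES]; /* free pages per node */
};

/* Pick the first allowed node; done once per restart and then cached. */
static int model_pick_preferred(const struct model_state *s)
{
	for (int nid = 0; nid < MODEL_NODES; nid++)
		if (s->nodemask & (1UL << nid))
			return nid;
	return -1;
}

/* Allocation attempt: scans only from the cached preferred node onward. */
static bool model_try_alloc(struct model_state *s, int preferred)
{
	for (int nid = preferred; nid >= 0 && nid < MODEL_NODES; nid++) {
		if ((s->nodemask & (1UL << nid)) && s->free_pages[nid] > 0) {
			s->free_pages[nid]--;
			return true;
		}
	}
	return false;
}

/* Retry decision: scans the whole current mask, ignoring the cached start. */
static bool model_should_retry(const struct model_state *s)
{
	for (int nid = 0; nid < MODEL_NODES; nid++)
		if ((s->nodemask & (1UL << nid)) && s->free_pages[nid] > 0)
			return true;
	return false;
}

/* Concurrent cpuset write, reduced to a mask change plus a sequence bump. */
static void model_update_nodemask(struct model_state *s, unsigned long mask)
{
	s->nodemask = mask;
	s->seq++;
}

/*
 * Returns the number of allocation attempts on success, 0 if the loop gives
 * up, or -1 if max_loops is exceeded (the "infinite retry" case). A nonzero
 * race_mask is applied once, after the first failed attempt, to simulate
 * the update arriving from another CPU.
 */
static int model_slowpath(struct model_state *s, bool check_cookie,
			  unsigned long race_mask, int max_loops)
{
	int loops = 0;
	unsigned int cookie;
	int preferred;

restart:
	cookie = s->seq;
	preferred = model_pick_preferred(s);
retry:
	/* The added check: a changed cookie invalidates the cached start. */
	if (check_cookie && cookie != s->seq)
		goto restart;
	if (++loops > max_loops)
		return -1;
	if (model_try_alloc(s, preferred))
		return loops;
	if (race_mask) {
		model_update_nodemask(s, race_mask);
		race_mask = 0;
	}
	if (model_should_retry(s))
		goto retry;
	return 0;
}
```

Starting with only node 1 allowed and exhausted, then widening the mask to
nodes 0 and 1 while node 0 has free pages, the model without the check keeps
retrying from the stale start until it hits max_loops. With the check, it
restarts, re-picks node 0 and succeeds on the second attempt.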
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-13 19:16 ` Andrew Morton
@ 2025-05-13 19:33 ` Suren Baghdasaryan
  2025-05-14  7:34 ` Vlastimil Babka
  1 sibling, 0 replies; 16+ messages in thread
From: Suren Baghdasaryan @ 2025-05-13 19:33 UTC
To: Andrew Morton
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel, Vlastimil Babka,
    Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Tue, May 13, 2025 at 12:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> > > > > see some review activity soon (please).
> > > >
> > > > I reviewed and provided my feedback but saw neither a reply nor a
> > > > respin with proposed changes.
> > >
> > > OK, thanks. Do you have time to put together a modified version of this?
> >
> > I think the code is fine as is. Would be good to add Fixes: tag but it
> > will require some investigation to find the appropriate patch to
> > reference here.
>
> Below is what is in mm-hotfixes. It doesn't actually have any
> acked-by's or reviewed-by's.
>
> So... final call for review, please.

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

>
>
> From: Tianyang Zhang <zhangtianyang@loongson.cn>
> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
> Date: Wed, 16 Apr 2025 16:24:05 +0800
>
> __alloc_pages_slowpath has no change detection for ac->nodemask in the
> part of retry path, while cpuset can modify it in parallel. For some
> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
> changes, and then the should_reclaim_retry will judge based on the latest
> nodemask and jump to retry, while the get_page_from_freelist only
> traverses the zonelist from ac->preferred_zoneref, which selected by a
> expired nodemask and may cause infinite retries in some cases
>
>   cpu 64:
>   __alloc_pages_slowpath {
>       /* ..... */
>       retry:
>           /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>           if (alloc_flags & ALLOC_KSWAPD)
>               wake_all_kswapds(order, gfp_mask, ac);
>           /* cpu 1:
>           cpuset_write_resmask
>               update_nodemask
>                   update_nodemasks_hier
>                       update_tasks_nodemask
>                           mpol_rebind_task
>                               mpol_rebind_policy
>                                   mpol_rebind_nodemask
>           // mempolicy->nodes has been modified,
>           // which ac->nodemask point to
>
>           */
>           /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>           if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>                                    did_some_progress > 0, &no_progress_loops))
>               goto retry;
>   }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
> this issue on a multi node server when the maximum memory pressure is
> reached and the swap is enabled
>
> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  mm/page_alloc.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
> +++ a/mm/page_alloc.c
> @@ -4562,6 +4562,14 @@ restart:
>  	}
>
>  retry:
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * infinite retries.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
> +		goto restart;
> +
>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
> _
>
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-13 19:16 ` Andrew Morton
  2025-05-13 19:33 ` Suren Baghdasaryan
@ 2025-05-14  7:34 ` Vlastimil Babka
  2025-05-14 22:42 ` Andrew Morton
  2025-05-15  3:19 ` Tianyang Zhang
  1 sibling, 2 replies; 16+ messages in thread
From: Vlastimil Babka @ 2025-05-14 7:34 UTC
To: Andrew Morton, Suren Baghdasaryan
Cc: Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel, Michal Hocko,
    Brendan Jackman, Johannes Weiner, Zi Yan

On 5/13/25 21:16, Andrew Morton wrote:
> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
>> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
>> > > > see some review activity soon (please).
>> > >
>> > > I reviewed and provided my feedback but saw neither a reply nor a
>> > > respin with proposed changes.
>> >
>> > OK, thanks. Do you have time to put together a modified version of this?
>>
>> I think the code is fine as is. Would be good to add Fixes: tag but it
>> will require some investigation to find the appropriate patch to
>> reference here.
>
> Below is what is in mm-hotfixes. It doesn't actually have any
> acked-by's or reviewed-by's.
>
> So... final call for review, please.
>
>
> From: Tianyang Zhang <zhangtianyang@loongson.cn>
> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
> Date: Wed, 16 Apr 2025 16:24:05 +0800
>
> __alloc_pages_slowpath has no change detection for ac->nodemask in the
> part of retry path, while cpuset can modify it in parallel. For some
> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
> changes, and then the should_reclaim_retry will judge based on the latest
> nodemask and jump to retry, while the get_page_from_freelist only
> traverses the zonelist from ac->preferred_zoneref, which selected by a
> expired nodemask and may cause infinite retries in some cases
>
>   cpu 64:
>   __alloc_pages_slowpath {
>       /* ..... */
>       retry:
>           /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>           if (alloc_flags & ALLOC_KSWAPD)
>               wake_all_kswapds(order, gfp_mask, ac);
>           /* cpu 1:
>           cpuset_write_resmask
>               update_nodemask
>                   update_nodemasks_hier
>                       update_tasks_nodemask
>                           mpol_rebind_task
>                               mpol_rebind_policy
>                                   mpol_rebind_nodemask
>           // mempolicy->nodes has been modified,
>           // which ac->nodemask point to
>
>           */
>           /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>           if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>                                    did_some_progress > 0, &no_progress_loops))
>               goto retry;
>   }
>
> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
> this issue on a multi node server when the maximum memory pressure is
> reached and the swap is enabled
>
> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").

After the discussion in this thread, Suren retracted this Fixes: suggestion.
I think it actually goes back to this one which introduced the
preferred_zoneref caching.

Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
zonelist twice")

> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

I would have placed the check bit further down, just above the
should_reclaim_retry() call, but it's not that important to hold up a fix
and can be done later.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>
>  mm/page_alloc.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
> +++ a/mm/page_alloc.c
> @@ -4562,6 +4562,14 @@ restart:
>  	}
>
>  retry:
> +	/*
> +	 * Deal with possible cpuset update races or zonelist updates to avoid
> +	 * infinite retries.
> +	 */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> +	    check_retry_zonelist(zonelist_iter_cookie))
> +		goto restart;
> +
>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
> _
>
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-14  7:34 ` Vlastimil Babka
@ 2025-05-14 22:42 ` Andrew Morton
  2025-05-15  3:19 ` Tianyang Zhang
  1 sibling, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2025-05-14 22:42 UTC
To: Vlastimil Babka
Cc: Suren Baghdasaryan, Tianyang Zhang, Harry Yoo, linux-mm, linux-kernel,
    Michal Hocko, Brendan Jackman, Johannes Weiner, Zi Yan

On Wed, 14 May 2025 09:34:53 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 5/13/25 21:16, Andrew Morton wrote:
> > On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
> >
> >> > > > This has been in mm-hotfixes-unstable for six days. Hopefully we'll
> >> > > > see some review activity soon (please).
> >> > >
> >> > > I reviewed and provided my feedback but saw neither a reply nor a
> >> > > respin with proposed changes.
> >> >
> >> > OK, thanks. Do you have time to put together a modified version of this?
> >>
> >> I think the code is fine as is. Would be good to add Fixes: tag but it
> >> will require some investigation to find the appropriate patch to
> >> reference here.
> >
> > Below is what is in mm-hotfixes. It doesn't actually have any
> > acked-by's or reviewed-by's.
> >
> > So... final call for review, please.
> >
> > ...
>
> After the discussion in this thread, Suren retracted this Fixes: suggestion.
> I think it actually goes back to this one which introduced the
> preferred_zoneref caching.
>
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
> zonelist twice")

Updated.

> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.
* Re: [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race
  2025-05-14  7:34 ` Vlastimil Babka
  2025-05-14 22:42 ` Andrew Morton
@ 2025-05-15  3:19 ` Tianyang Zhang
  1 sibling, 0 replies; 16+ messages in thread
From: Tianyang Zhang @ 2025-05-15 3:19 UTC
To: Vlastimil Babka, Andrew Morton, Suren Baghdasaryan
Cc: Harry Yoo, linux-mm, linux-kernel, Michal Hocko, Brendan Jackman,
    Johannes Weiner, Zi Yan

Hi,

On 2025/5/14 3:34 PM, Vlastimil Babka wrote:
> On 5/13/25 21:16, Andrew Morton wrote:
>> On Tue, 13 May 2025 09:26:53 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>>
>>>>>> This has been in mm-hotfixes-unstable for six days. Hopefully we'll
>>>>>> see some review activity soon (please).
>>>>> I reviewed and provided my feedback but saw neither a reply nor a
>>>>> respin with proposed changes.
>>>> OK, thanks. Do you have time to put together a modified version of this?
>>> I think the code is fine as is. Would be good to add Fixes: tag but it
>>> will require some investigation to find the appropriate patch to
>>> reference here.
>> Below is what is in mm-hotfixes. It doesn't actually have any
>> acked-by's or reviewed-by's.
>>
>> So... final call for review, please.
>>
>>
>> From: Tianyang Zhang <zhangtianyang@loongson.cn>
>> Subject: mm/page_alloc.c: avoid infinite retries caused by cpuset race
>> Date: Wed, 16 Apr 2025 16:24:05 +0800
>>
>> __alloc_pages_slowpath has no change detection for ac->nodemask in the
>> part of retry path, while cpuset can modify it in parallel. For some
>> processes that set mempolicy as MPOL_BIND, this results ac->nodemask
>> changes, and then the should_reclaim_retry will judge based on the latest
>> nodemask and jump to retry, while the get_page_from_freelist only
>> traverses the zonelist from ac->preferred_zoneref, which selected by a
>> expired nodemask and may cause infinite retries in some cases
>>
>>   cpu 64:
>>   __alloc_pages_slowpath {
>>       /* ..... */
>>       retry:
>>           /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
>>           if (alloc_flags & ALLOC_KSWAPD)
>>               wake_all_kswapds(order, gfp_mask, ac);
>>           /* cpu 1:
>>           cpuset_write_resmask
>>               update_nodemask
>>                   update_nodemasks_hier
>>                       update_tasks_nodemask
>>                           mpol_rebind_task
>>                               mpol_rebind_policy
>>                                   mpol_rebind_nodemask
>>           // mempolicy->nodes has been modified,
>>           // which ac->nodemask point to
>>
>>           */
>>           /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
>>           if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>                                    did_some_progress > 0, &no_progress_loops))
>>               goto retry;
>>   }
>>
>> Simultaneously starting multiple cpuset01 from LTP can quickly reproduce
>> this issue on a multi node server when the maximum memory pressure is
>> reached and the swap is enabled
>>
>> Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
>> Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update").
> After the discussion in this thread, Suren retracted this Fixes: suggestion.
> I think it actually goes back to this one which introduced the
> preferred_zoneref caching.
>
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a
> zonelist twice")

Yes, the problem was most likely introduced by that patch. Thank you.

>
>> Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Suren Baghdasaryan <surenb@google.com>
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Brendan Jackman <jackmanb@google.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> I would have placed the check bit further down, just above the
> should_reclaim_retry() call, but it's not that important to hold up a fix
> and can be done later.
>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
>
>> ---
>>
>>  mm/page_alloc.c |    8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> --- a/mm/page_alloc.c~mm-page_allocc-avoid-infinite-retries-caused-by-cpuset-race
>> +++ a/mm/page_alloc.c
>> @@ -4562,6 +4562,14 @@ restart:
>>  	}
>>
>>  retry:
>> +	/*
>> +	 * Deal with possible cpuset update races or zonelist updates to avoid
>> +	 * infinite retries.
>> +	 */
>> +	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
>> +	    check_retry_zonelist(zonelist_iter_cookie))
>> +		goto restart;
>> +
>>  	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>>  	if (alloc_flags & ALLOC_KSWAPD)
>>  		wake_all_kswapds(order, gfp_mask, ac);
>> _
>>
end of thread, other threads:[~2025-05-15 3:20 UTC]

Thread overview: 16+ messages
2025-04-16  8:24 [PATCH] mm/page_alloc.c: Avoid infinite retries caused by cpuset race Tianyang Zhang
2025-04-21 10:00 ` Harry Yoo
2025-04-21 20:28 ` Suren Baghdasaryan
2025-04-23  2:38 ` Tianyang Zhang
2025-04-23 15:35 ` Suren Baghdasaryan
2025-05-14  7:15 ` Vlastimil Babka
2025-04-22 12:10 ` Tianyang Zhang
2025-04-23  0:11 ` Andrew Morton
2025-04-23  0:22 ` Suren Baghdasaryan
2025-05-11  3:07 ` Andrew Morton
2025-05-13 16:26 ` Suren Baghdasaryan
2025-05-13 19:16 ` Andrew Morton
2025-05-13 19:33 ` Suren Baghdasaryan
2025-05-14  7:34 ` Vlastimil Babka
2025-05-14 22:42 ` Andrew Morton
2025-05-15  3:19 ` Tianyang Zhang