* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-23 10:34 ` David Hildenbrand
0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2022-08-23 10:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On 23.08.22 11:49, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 10:52:34AM +0200, David Hildenbrand wrote:
>> On 23.08.22 10:33, Mel Gorman wrote:
>>> On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
>>>>> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
>>>>> static DEFINE_SPINLOCK(lock);
>>>>>
>>>>> spin_lock(&lock);
>>>>> + write_seqcount_begin(&zonelist_update_seq);
>>>>>
>>>>> #ifdef CONFIG_NUMA
>>>>> memset(node_load, 0, sizeof(node_load));
>>>>> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
>>>>> #endif
>>>>> }
>>>>>
>>>>> + write_seqcount_end(&zonelist_update_seq);
>>>>> spin_unlock(&lock);
>>>>
>>>> Do we want to get rid of the static lock by using a seqlock_t instead of
>>>> a seqcount_t?
>>>>
>>>
>>> I do not think so because it's a relatively heavy lock compared to the
>>> counter and the read side.
>>
>> I was primarily asking because seqlock.h states: "Sequential locks
>> (seqlock_t): Sequence counters with an embedded spinlock for writer
>> serialization and non-preemptibility." seems to be precisely what we are
>> doing here.
>>
>>>
>>> As the read-side can be called from hardirq or softirq context, the
>>> write-side needs to disable irqs or bottom halves as well according to the
>>> documentation. That is relatively minor as the write side is rare but it's
>>> tricker because the calling context can be both IRQ or softirq so the IRQ
>>> protection would have to be used.
>>
>> Naive me would just have used write_seqlock()/write_sequnlock() and
>> read_seqbegin()/read_seqretry() to result in almost the same code as
>> with your change -- but having both mechanisms encapsulated.
>>
>>
>> Yeah, there are special write_seqlock_bh()/write_sequnlock_bh(),
>> write_sequnlock_irq() ... but IIRC we don't have to care about that at
>> all when just using the primitives as above. But most probably I am
>> missing something important.
>>
>
> You're not missing anything important, I'm just not a massive fan of the
> API naming because it's unclear from the context if it's a plain counter
> or a locked counter and felt it was better to keep the locking explicit.
>
> A seqlock version is below. I updated the comments and naming to make it
> clear the read-side is for iteration, what the locking protocol is and
> match the retry naming with the cpuset equivalent. It boots on KVM but
> would need another test from Patrick to be certain it still works. Patrick,
> would you mind testing this version please?
>
> ---8<---
> mm/page_alloc.c | 33 ++++++++++++++++++++++++++++-----
> 1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..a644c7b638a3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
> EXPORT_SYMBOL_GPL(fs_reclaim_release);
> #endif
>
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so allocation retries. Reader side does not lock and
> + * retries the allocation if zonelist changes. Writer side is protected by the
> + * embedded spin_lock.
> + */
> +DEFINE_SEQLOCK(zonelist_update_seq);
> +
> +static unsigned int zonelist_iter_begin(void)
> +{
> + return read_seqbegin(&zonelist_update_seq);
> +}
> +
> +static unsigned int check_retry_zonelist(unsigned int seq)
> +{
> + return read_seqretry(&zonelist_update_seq, seq);
> +}
> +
> /* Perform direct synchronous page reclaim */
> static unsigned long
> __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int compaction_retries;
> int no_progress_loops;
> unsigned int cpuset_mems_cookie;
> + unsigned int zonelist_iter_cookie;
> int reserve_flags;
>
> /*
> @@ -5016,6 +5035,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> no_progress_loops = 0;
> compact_priority = DEF_COMPACT_PRIORITY;
> cpuset_mems_cookie = read_mems_allowed_begin();
> + zonelist_iter_cookie = zonelist_iter_begin();
>
> /*
> * The fast path uses conservative alloc_flags to succeed only until
> @@ -5187,8 +5207,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> goto retry;
>
>
> - /* Deal with possible cpuset update races before we start OOM killing */
> - if (check_retry_cpuset(cpuset_mems_cookie, ac))
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * a unnecessary OOM kill.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> goto retry_cpuset;
>
> /* Reclaim has failed us, start killing things */
> @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
> int nid;
> int __maybe_unused cpu;
> pg_data_t *self = data;
> - static DEFINE_SPINLOCK(lock);
>
> - spin_lock(&lock);
> + write_seqlock(&zonelist_update_seq);
>
> #ifdef CONFIG_NUMA
> memset(node_load, 0, sizeof(node_load));
> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> #endif
> }
>
> - spin_unlock(&lock);
> + write_sequnlock(&zonelist_update_seq);
> }
>
> static noinline void __init
>
LGTM. The "retry_cpuset" label might deserve a better name now.
Would
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
with pages managed by the buddy allocator")
be correct?
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 10:34 ` David Hildenbrand
(?)
@ 2022-08-23 10:57 ` Michal Hocko
-1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 10:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mel Gorman, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 12:34:09, David Hildenbrand wrote:
> On 23.08.22 11:49, Mel Gorman wrote:
> > On Tue, Aug 23, 2022 at 10:52:34AM +0200, David Hildenbrand wrote:
> >> On 23.08.22 10:33, Mel Gorman wrote:
> >>> On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
> >>>>> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
> >>>>> static DEFINE_SPINLOCK(lock);
> >>>>>
> >>>>> spin_lock(&lock);
> >>>>> + write_seqcount_begin(&zonelist_update_seq);
> >>>>>
> >>>>> #ifdef CONFIG_NUMA
> >>>>> memset(node_load, 0, sizeof(node_load));
> >>>>> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
> >>>>> #endif
> >>>>> }
> >>>>>
> >>>>> + write_seqcount_end(&zonelist_update_seq);
> >>>>> spin_unlock(&lock);
> >>>>
> >>>> Do we want to get rid of the static lock by using a seqlock_t instead of
> >>>> a seqcount_t?
> >>>>
> >>>
> >>> I do not think so because it's a relatively heavy lock compared to the
> >>> counter and the read side.
> >>
> >> I was primarily asking because seqlock.h states: "Sequential locks
> >> (seqlock_t): Sequence counters with an embedded spinlock for writer
> >> serialization and non-preemptibility." seems to be precisely what we are
> >> doing here.
> >>
> >>>
> >>> As the read-side can be called from hardirq or softirq context, the
> >>> write-side needs to disable irqs or bottom halves as well according to the
> >>> documentation. That is relatively minor as the write side is rare but it's
> >>> tricker because the calling context can be both IRQ or softirq so the IRQ
> >>> protection would have to be used.
> >>
> >> Naive me would just have used write_seqlock()/write_sequnlock() and
> >> read_seqbegin()/read_seqretry() to result in almost the same code as
> >> with your change -- but having both mechanisms encapsulated.
> >>
> >>
> >> Yeah, there are special write_seqlock_bh()/write_sequnlock_bh(),
> >> write_sequnlock_irq() ... but IIRC we don't have to care about that at
> >> all when just using the primitives as above. But most probably I am
> >> missing something important.
> >>
> >
> > You're not missing anything important, I'm just not a massive fan of the
> > API naming because it's unclear from the context if it's a plain counter
> > or a locked counter and felt it was better to keep the locking explicit.
> >
> > A seqlock version is below. I updated the comments and naming to make it
> > clear the read-side is for iteration, what the locking protocol is and
> > match the retry naming with the cpuset equivalent. It boots on KVM but
> > would need another test from Patrick to be certain it still works. Patrick,
> > would you mind testing this version please?
> >
> > ---8<---
> > mm/page_alloc.c | 33 ++++++++++++++++++++++++++++-----
> > 1 file changed, 28 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e5486d47406e..a644c7b638a3 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
> > EXPORT_SYMBOL_GPL(fs_reclaim_release);
> > #endif
> >
> > +/*
> > + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> > + * have been rebuilt so allocation retries. Reader side does not lock and
> > + * retries the allocation if zonelist changes. Writer side is protected by the
> > + * embedded spin_lock.
> > + */
> > +DEFINE_SEQLOCK(zonelist_update_seq);
> > +
> > +static unsigned int zonelist_iter_begin(void)
> > +{
> > + return read_seqbegin(&zonelist_update_seq);
> > +}
> > +
> > +static unsigned int check_retry_zonelist(unsigned int seq)
> > +{
> > + return read_seqretry(&zonelist_update_seq, seq);
> > +}
> > +
> > /* Perform direct synchronous page reclaim */
> > static unsigned long
> > __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> > @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > int compaction_retries;
> > int no_progress_loops;
> > unsigned int cpuset_mems_cookie;
> > + unsigned int zonelist_iter_cookie;
> > int reserve_flags;
> >
> > /*
> > @@ -5016,6 +5035,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > no_progress_loops = 0;
> > compact_priority = DEF_COMPACT_PRIORITY;
> > cpuset_mems_cookie = read_mems_allowed_begin();
> > + zonelist_iter_cookie = zonelist_iter_begin();
> >
> > /*
> > * The fast path uses conservative alloc_flags to succeed only until
> > @@ -5187,8 +5207,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > goto retry;
> >
> >
> > - /* Deal with possible cpuset update races before we start OOM killing */
> > - if (check_retry_cpuset(cpuset_mems_cookie, ac))
> > + /*
> > + * Deal with possible cpuset update races or zonelist updates to avoid
> > + * a unnecessary OOM kill.
> > + */
> > + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> > + check_retry_zonelist(zonelist_iter_cookie))
> > goto retry_cpuset;
> >
> > /* Reclaim has failed us, start killing things */
> > @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
> > int nid;
> > int __maybe_unused cpu;
> > pg_data_t *self = data;
> > - static DEFINE_SPINLOCK(lock);
> >
> > - spin_lock(&lock);
> > + write_seqlock(&zonelist_update_seq);
> >
> > #ifdef CONFIG_NUMA
> > memset(node_load, 0, sizeof(node_load));
> > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > #endif
> > }
> >
> > - spin_unlock(&lock);
> > + write_sequnlock(&zonelist_update_seq);
> > }
> >
> > static noinline void __init
> >
>
> LGTM. The "retry_cpuset" label might deserve a better name now.
I will not object but I liked the previous version which was already in
line with the cpuset retry (read_mems_allowed_retry).
> Would
>
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator")
>
> be correct?
Yes.
--
Michal Hocko
SUSE Labs
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 10:34 ` David Hildenbrand
(?)
(?)
@ 2022-08-23 11:09 ` Mel Gorman
2022-08-23 12:18 ` Michal Hocko
-1 siblings, 1 reply; 39+ messages in thread
From: Mel Gorman @ 2022-08-23 11:09 UTC (permalink / raw)
To: David Hildenbrand
Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > #endif
> > }
> >
> > - spin_unlock(&lock);
> > + write_sequnlock(&zonelist_update_seq);
> > }
> >
> > static noinline void __init
> >
>
> LGTM. The "retry_cpuset" label might deserve a better name now.
>
Good point ... "restart"?
> Would
>
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator")
>
> be correct?
>
Not specifically because the bug is due to a zone being completely removed
resulting in a rebuild. This race probably existed ever since memory
hotremove could theoritically remove a complete zone. A Cc: Stable would
be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
that, it should be driven by a specific bug report showing that hot-remove
of a full zone was possible and triggered the race.
--
Mel Gorman
SUSE Labs
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 11:09 ` Mel Gorman
@ 2022-08-23 12:18 ` Michal Hocko
0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 12:18 UTC (permalink / raw)
To: Mel Gorman
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > > #endif
> > > }
> > >
> > > - spin_unlock(&lock);
> > > + write_sequnlock(&zonelist_update_seq);
> > > }
> > >
> > > static noinline void __init
> > >
> >
> > LGTM. The "retry_cpuset" label might deserve a better name now.
> >
>
> Good point ... "restart"?
>
> > Would
> >
> > Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> > with pages managed by the buddy allocator")
> >
> > be correct?
> >
>
> Not specifically because the bug is due to a zone being completely removed
> resulting in a rebuild. This race probably existed ever since memory
> hotremove could theoritically remove a complete zone. A Cc: Stable would
> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> that, it should be driven by a specific bug report showing that hot-remove
> of a full zone was possible and triggered the race.
I do not think so. 6aa303defb74 has changed the zonelist building and
changed the check from pfn range (populated) to managed (with a memory).
--
Michal Hocko
SUSE Labs
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-23 12:18 ` Michal Hocko
0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 12:18 UTC (permalink / raw)
To: Mel Gorman
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > > #endif
> > > }
> > >
> > > - spin_unlock(&lock);
> > > + write_sequnlock(&zonelist_update_seq);
> > > }
> > >
> > > static noinline void __init
> > >
> >
> > LGTM. The "retry_cpuset" label might deserve a better name now.
> >
>
> Good point ... "restart"?
>
> > Would
> >
> > Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> > with pages managed by the buddy allocator")
> >
> > be correct?
> >
>
> Not specifically because the bug is due to a zone being completely removed
> resulting in a rebuild. This race probably existed ever since memory
> hotremove could theoritically remove a complete zone. A Cc: Stable would
> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> that, it should be driven by a specific bug report showing that hot-remove
> of a full zone was possible and triggered the race.
I do not think so. 6aa303defb74 has changed the zonelist building and
changed the check from pfn range (populated) to managed (with a memory).
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 12:18 ` Michal Hocko
(?)
@ 2022-08-23 12:58 ` Mel Gorman
2022-08-23 13:25 ` Michal Hocko
-1 siblings, 1 reply; 39+ messages in thread
From: Mel Gorman @ 2022-08-23 12:58 UTC (permalink / raw)
To: Michal Hocko
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> > On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > > > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > > > #endif
> > > > }
> > > >
> > > > - spin_unlock(&lock);
> > > > + write_sequnlock(&zonelist_update_seq);
> > > > }
> > > >
> > > > static noinline void __init
> > > >
> > >
> > > LGTM. The "retry_cpuset" label might deserve a better name now.
> > >
> >
> > Good point ... "restart"?
> >
> > > Would
> > >
> > > Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> > > with pages managed by the buddy allocator")
> > >
> > > be correct?
> > >
> >
> > Not specifically because the bug is due to a zone being completely removed
> > resulting in a rebuild. This race probably existed ever since memory
> > hotremove could theoritically remove a complete zone. A Cc: Stable would
> > be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> > that, it should be driven by a specific bug report showing that hot-remove
> > of a full zone was possible and triggered the race.
>
> I do not think so. 6aa303defb74 has changed the zonelist building and
> changed the check from pfn range (populated) to managed (with a memory).
I'm not 100% convinced. The present_pages should have been the spanned range
minus any holes that exist in the zone. If the zone is completely removed,
the span should be zero meaning present and managed are both zero. No?
The patch *would* have made the situation worse because if a zone was
hot-removed to the point where only reserved pages were present then the
bug would still trigger. On that basis, I'm ok with adding the Fixes: as
it's at least partially true and it covers the range of -stable kernels that
are trivial to backport. Beyond that, it would need greater care and testing.
--
Mel Gorman
SUSE Labs
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 12:58 ` Mel Gorman
@ 2022-08-23 13:25 ` Michal Hocko
0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 13:25 UTC (permalink / raw)
To: Mel Gorman
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 13:58:50, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> > On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> > > On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > > > > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > > > > #endif
> > > > > }
> > > > >
> > > > > - spin_unlock(&lock);
> > > > > + write_sequnlock(&zonelist_update_seq);
> > > > > }
> > > > >
> > > > > static noinline void __init
> > > > >
> > > >
> > > > LGTM. The "retry_cpuset" label might deserve a better name now.
> > > >
> > >
> > > Good point ... "restart"?
> > >
> > > > Would
> > > >
> > > > Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> > > > with pages managed by the buddy allocator")
> > > >
> > > > be correct?
> > > >
> > >
> > > Not specifically because the bug is due to a zone being completely removed
> > > resulting in a rebuild. This race probably existed ever since memory
> > > hotremove could theoritically remove a complete zone. A Cc: Stable would
> > > be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> > > that, it should be driven by a specific bug report showing that hot-remove
> > > of a full zone was possible and triggered the race.
> >
> > I do not think so. 6aa303defb74 has changed the zonelist building and
> > changed the check from pfn range (populated) to managed (with a memory).
>
> I'm not 100% convinced. The present_pages should have been the spanned range
> minus any holes that exist in the zone. If the zone is completely removed,
> the span should be zero meaning present and managed are both zero. No?
IIRC, and David will correct me if I am mixing this up. The difference
is that zonelists are rebuilt during memory offlining and that is when
managed pages are removed from the allocator. Zone itself still has that
physical range populated and so this patch would have made a difference.
Now, you are right that this is likely possible even without that commit
but it is highly unlikely because physical hotremove is a very rare
operation and the race window would be so large that it would be likely
unfeasible.
--
Michal Hocko
SUSE Labs
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-23 13:25 ` Michal Hocko
0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 13:25 UTC (permalink / raw)
To: Mel Gorman
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 13:58:50, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> > On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> > > On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> > > > > @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> > > > > #endif
> > > > > }
> > > > >
> > > > > - spin_unlock(&lock);
> > > > > + write_sequnlock(&zonelist_update_seq);
> > > > > }
> > > > >
> > > > > static noinline void __init
> > > > >
> > > >
> > > > LGTM. The "retry_cpuset" label might deserve a better name now.
> > > >
> > >
> > > Good point ... "restart"?
> > >
> > > > Would
> > > >
> > > > Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> > > > with pages managed by the buddy allocator")
> > > >
> > > > be correct?
> > > >
> > >
> > > Not specifically because the bug is due to a zone being completely removed
> > > resulting in a rebuild. This race probably existed ever since memory
> > > hotremove could theoritically remove a complete zone. A Cc: Stable would
> > > be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> > > that, it should be driven by a specific bug report showing that hot-remove
> > > of a full zone was possible and triggered the race.
> >
> > I do not think so. 6aa303defb74 has changed the zonelist building and
> > changed the check from pfn range (populated) to managed (with a memory).
>
> I'm not 100% convinced. The present_pages should have been the spanned range
> minus any holes that exist in the zone. If the zone is completely removed,
> the span should be zero meaning present and managed are both zero. No?
IIRC, and David will correct me if I am mixing this up. The difference
is that zonelists are rebuilt during memory offlining and that is when
managed pages are removed from the allocator. Zone itself still has that
physical range populated and so this patch would have made a difference.
Now, you are right that this is likely possible even without that commit
but it is highly unlikely because physical hotremove is a very rare
operation and the race window would be so large that it would be likely
unfeasible.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 13:25 ` Michal Hocko
@ 2022-08-23 13:50 ` David Hildenbrand
-1 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2022-08-23 13:50 UTC (permalink / raw)
To: Michal Hocko, Mel Gorman
Cc: Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross
On 23.08.22 15:25, Michal Hocko wrote:
> On Tue 23-08-22 13:58:50, Mel Gorman wrote:
>> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
>>> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
>>>> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
>>>>>> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
>>>>>> #endif
>>>>>> }
>>>>>>
>>>>>> - spin_unlock(&lock);
>>>>>> + write_sequnlock(&zonelist_update_seq);
>>>>>> }
>>>>>>
>>>>>> static noinline void __init
>>>>>>
>>>>>
>>>>> LGTM. The "retry_cpuset" label might deserve a better name now.
>>>>>
>>>>
>>>> Good point ... "restart"?
>>>>
>>>>> Would
>>>>>
>>>>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
>>>>> with pages managed by the buddy allocator")
>>>>>
>>>>> be correct?
>>>>>
>>>>
>>>> Not specifically because the bug is due to a zone being completely removed
>>>> resulting in a rebuild. This race probably existed ever since memory
>>>> hotremove could theoritically remove a complete zone. A Cc: Stable would
>>>> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
>>>> that, it should be driven by a specific bug report showing that hot-remove
>>>> of a full zone was possible and triggered the race.
>>>
>>> I do not think so. 6aa303defb74 has changed the zonelist building and
>>> changed the check from pfn range (populated) to managed (with a memory).
>>
>> I'm not 100% convinced. The present_pages should have been the spanned range
>> minus any holes that exist in the zone. If the zone is completely removed,
>> the span should be zero meaning present and managed are both zero. No?
>
> IIRC, and David will correct me if I am mixing this up. The difference
> is that zonelists are rebuilt during memory offlining and that is when
> managed pages are removed from the allocator. Zone itself still has that
> physical range populated and so this patch would have made a difference.
To recap, memory offlining adjusts managed+present pages of the zone
essentially in one go. If after the adjustments, the zone is no longer
populated (present==0), we rebuild the zone lists.
Once that's done, we try shrinking the zone (start+spanned pages) --
which results in zone_start_pfn == 0 if there are no more pages. That
happens *after* rebuilding the zonelists via remove_pfn_range_from_zone().
Note that populated_zone() checks for present_pages. The actual zone
span (e.g., spanned_pages) is a different story and not of interest when
building zones or wanting to allocate memory.
>
> Now, you are right that this is likely possible even without that commit
> but it is highly unlikely because physical hotremove is a very rare
> operation and the race window would be so large that it would be likely
> unfeasible.
I think I agree that 6aa303defb74 is most likely not the origin of this.
It could only have been the origin in weird corner cases where we
actually succeed offlining one memory block (adjust present+managed) and
end up with managed=0 and present!=0 -- which barely happens in
practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
that adjusts managed pages dynamically and might provoke such a
situation on ZONE_MOVABLE)
--
Thanks,
David / dhildenb
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-23 13:50 ` David Hildenbrand
0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2022-08-23 13:50 UTC (permalink / raw)
To: Michal Hocko, Mel Gorman
Cc: Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross
On 23.08.22 15:25, Michal Hocko wrote:
> On Tue 23-08-22 13:58:50, Mel Gorman wrote:
>> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
>>> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
>>>> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
>>>>>> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
>>>>>> #endif
>>>>>> }
>>>>>>
>>>>>> - spin_unlock(&lock);
>>>>>> + write_sequnlock(&zonelist_update_seq);
>>>>>> }
>>>>>>
>>>>>> static noinline void __init
>>>>>>
>>>>>
>>>>> LGTM. The "retry_cpuset" label might deserve a better name now.
>>>>>
>>>>
>>>> Good point ... "restart"?
>>>>
>>>>> Would
>>>>>
>>>>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
>>>>> with pages managed by the buddy allocator")
>>>>>
>>>>> be correct?
>>>>>
>>>>
>>>> Not specifically because the bug is due to a zone being completely removed
>>>> resulting in a rebuild. This race probably existed ever since memory
>>>> hotremove could theoritically remove a complete zone. A Cc: Stable would
>>>> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
>>>> that, it should be driven by a specific bug report showing that hot-remove
>>>> of a full zone was possible and triggered the race.
>>>
>>> I do not think so. 6aa303defb74 has changed the zonelist building and
>>> changed the check from pfn range (populated) to managed (with a memory).
>>
>> I'm not 100% convinced. The present_pages should have been the spanned range
>> minus any holes that exist in the zone. If the zone is completely removed,
>> the span should be zero meaning present and managed are both zero. No?
>
> IIRC, and David will correct me if I am mixing this up. The difference
> is that zonelists are rebuilt during memory offlining and that is when
> managed pages are removed from the allocator. Zone itself still has that
> physical range populated and so this patch would have made a difference.
To recap, memory offlining adjusts managed+present pages of the zone
essentially in one go. If after the adjustments, the zone is no longer
populated (present==0), we rebuild the zone lists.
Once that's done, we try shrinking the zone (start+spanned pages) --
which results in zone_start_pfn == 0 if there are no more pages. That
happens *after* rebuilding the zonelists via remove_pfn_range_from_zone().
Note that populated_zone() checks for present_pages. The actual zone
span (e.g., spanned_pages) is a different story and not of interest when
building zones or wanting to allocate memory.
>
> Now, you are right that this is likely possible even without that commit
> but it is highly unlikely because physical hotremove is a very rare
> operation and the race window would be so large that it would be likely
> unfeasible.
I think I agree that 6aa303defb74 is most likely not the origin of this.
It could only have been the origin in weird corner cases where we
actually succeed offlining one memory block (adjust present+managed) and
end up with managed=0 and present!=0 -- which barely happens in
practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
that adjusts managed pages dynamically and might provoke such a
situation on ZONE_MOVABLE)
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 13:50 ` David Hildenbrand
@ 2022-08-23 13:57 ` Michal Hocko
-1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 13:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mel Gorman, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 15:50:23, David Hildenbrand wrote:
> On 23.08.22 15:25, Michal Hocko wrote:
> > On Tue 23-08-22 13:58:50, Mel Gorman wrote:
> >> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> >>> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> >>>> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> >>>>>> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> >>>>>> #endif
> >>>>>> }
> >>>>>>
> >>>>>> - spin_unlock(&lock);
> >>>>>> + write_sequnlock(&zonelist_update_seq);
> >>>>>> }
> >>>>>>
> >>>>>> static noinline void __init
> >>>>>>
> >>>>>
> >>>>> LGTM. The "retry_cpuset" label might deserve a better name now.
> >>>>>
> >>>>
> >>>> Good point ... "restart"?
> >>>>
> >>>>> Would
> >>>>>
> >>>>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> >>>>> with pages managed by the buddy allocator")
> >>>>>
> >>>>> be correct?
> >>>>>
> >>>>
> >>>> Not specifically because the bug is due to a zone being completely removed
> >>>> resulting in a rebuild. This race probably existed ever since memory
> >>>> hotremove could theoritically remove a complete zone. A Cc: Stable would
> >>>> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> >>>> that, it should be driven by a specific bug report showing that hot-remove
> >>>> of a full zone was possible and triggered the race.
> >>>
> >>> I do not think so. 6aa303defb74 has changed the zonelist building and
> >>> changed the check from pfn range (populated) to managed (with a memory).
> >>
> >> I'm not 100% convinced. The present_pages should have been the spanned range
> >> minus any holes that exist in the zone. If the zone is completely removed,
> >> the span should be zero meaning present and managed are both zero. No?
> >
> > IIRC, and David will correct me if I am mixing this up. The difference
> > is that zonelists are rebuilt during memory offlining and that is when
> > managed pages are removed from the allocator. Zone itself still has that
> > physical range populated and so this patch would have made a difference.
>
> To recap, memory offlining adjusts managed+present pages of the zone
> essentially in one go. If after the adjustments, the zone is no longer
> populated (present==0), we rebuild the zone lists.
>
> Once that's done, we try shrinking the zone (start+spanned pages) --
> which results in zone_start_pfn == 0 if there are no more pages. That
> happens *after* rebuilding the zonelists via remove_pfn_range_from_zone().
>
>
> Note that populated_zone() checks for present_pages. The actual zone
> span (e.g., spanned_pages) is a different story and not of interest when
> building zones or wanting to allocate memory.
>
> >
> > Now, you are right that this is likely possible even without that commit
> > but it is highly unlikely because physical hotremove is a very rare
> > operation and the race window would be so large that it would be likely
> > unfeasible.
>
> I think I agree that 6aa303defb74 is most likely not the origin of this.
> It could only have been the origin in weird corner cases where we
> actually succeed offlining one memory block (adjust present+managed) and
> end up with managed=0 and present!=0 -- which barely happens in
> practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> that adjusts managed pages dynamically and might provoke such a
> situation on ZONE_MOVABLE)
OK, thanks for the correction David. Then I would agree that Fixes tag
could be more confusing than helpful and your above summary would be a
great part of the changelog.
Thanks!
--
Michal Hocko
SUSE Labs
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-23 13:57 ` Michal Hocko
0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 13:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: Mel Gorman, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 15:50:23, David Hildenbrand wrote:
> On 23.08.22 15:25, Michal Hocko wrote:
> > On Tue 23-08-22 13:58:50, Mel Gorman wrote:
> >> On Tue, Aug 23, 2022 at 02:18:27PM +0200, Michal Hocko wrote:
> >>> On Tue 23-08-22 12:09:46, Mel Gorman wrote:
> >>>> On Tue, Aug 23, 2022 at 12:34:09PM +0200, David Hildenbrand wrote:
> >>>>>> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> >>>>>> #endif
> >>>>>> }
> >>>>>>
> >>>>>> - spin_unlock(&lock);
> >>>>>> + write_sequnlock(&zonelist_update_seq);
> >>>>>> }
> >>>>>>
> >>>>>> static noinline void __init
> >>>>>>
> >>>>>
> >>>>> LGTM. The "retry_cpuset" label might deserve a better name now.
> >>>>>
> >>>>
> >>>> Good point ... "restart"?
> >>>>
> >>>>> Would
> >>>>>
> >>>>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> >>>>> with pages managed by the buddy allocator")
> >>>>>
> >>>>> be correct?
> >>>>>
> >>>>
> >>>> Not specifically because the bug is due to a zone being completely removed
> >>>> resulting in a rebuild. This race probably existed ever since memory
> >>>> hotremove could theoritically remove a complete zone. A Cc: Stable would
> >>>> be appropriate as it'll apply with fuzz back to at least 5.4.210 but beyond
> >>>> that, it should be driven by a specific bug report showing that hot-remove
> >>>> of a full zone was possible and triggered the race.
> >>>
> >>> I do not think so. 6aa303defb74 has changed the zonelist building and
> >>> changed the check from pfn range (populated) to managed (with a memory).
> >>
> >> I'm not 100% convinced. The present_pages should have been the spanned range
> >> minus any holes that exist in the zone. If the zone is completely removed,
> >> the span should be zero meaning present and managed are both zero. No?
> >
> > IIRC, and David will correct me if I am mixing this up. The difference
> > is that zonelists are rebuilt during memory offlining and that is when
> > managed pages are removed from the allocator. Zone itself still has that
> > physical range populated and so this patch would have made a difference.
>
> To recap, memory offlining adjusts managed+present pages of the zone
> essentially in one go. If after the adjustments, the zone is no longer
> populated (present==0), we rebuild the zone lists.
>
> Once that's done, we try shrinking the zone (start+spanned pages) --
> which results in zone_start_pfn == 0 if there are no more pages. That
> happens *after* rebuilding the zonelists via remove_pfn_range_from_zone().
>
>
> Note that populated_zone() checks for present_pages. The actual zone
> span (e.g., spanned_pages) is a different story and not of interest when
> building zones or wanting to allocate memory.
>
> >
> > Now, you are right that this is likely possible even without that commit
> > but it is highly unlikely because physical hotremove is a very rare
> > operation and the race window would be so large that it would be likely
> > unfeasible.
>
> I think I agree that 6aa303defb74 is most likely not the origin of this.
> It could only have been the origin in weird corner cases where we
> actually succeed offlining one memory block (adjust present+managed) and
> end up with managed=0 and present!=0 -- which barely happens in
> practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> that adjusts managed pages dynamically and might provoke such a
> situation on ZONE_MOVABLE)
OK, thanks for the correction David. Then I would agree that Fixes tag
could be more confusing than helpful and your above summary would be a
great part of the changelog.
Thanks!
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 13:57 ` Michal Hocko
@ 2022-08-23 15:14 ` Mel Gorman
-1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2022-08-23 15:14 UTC (permalink / raw)
To: Michal Hocko
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
> > I think I agree that 6aa303defb74 is most likely not the origin of this.
> > It could only have been the origin in weird corner cases where we
> > actually succeed offlining one memory block (adjust present+managed) and
> > end up with managed=0 and present!=0 -- which barely happens in
> > practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> > that adjusts managed pages dynamically and might provoke such a
> > situation on ZONE_MOVABLE)
>
> OK, thanks for the correction David. Then I would agree that Fixes tag
> could be more confusing than helpful and your above summary would be a
> great part of the changelog.
>
Given that 6aa303defb74 still makes it potentially worse, it's as good a
Fixes-by as any given that anything prior to that commit would need careful
examination. The race changes shape going further back in time until memory
hot-remove was initially added and if someone needs to go that far back,
they'll also need to check if the ZLC needs special treatment.
Provisional patch and changelog is below. I'd still like to get a Tested-by
from Patrick to confirm it still fixes the problem before posting formally.
--8<--
mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
Patrick Daly reported the following problem;
NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
[0] - ZONE_MOVABLE
[1] - ZONE_NORMAL
[2] - NULL
For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
memory_offline operation removes the last page from ZONE_MOVABLE,
build_all_zonelists() & build_zonerefs_node() will update
node_zonelists as shown below. Only populated zones are added.
NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
[0] - ZONE_NORMAL
[1] - NULL
[2] - NULL
The race is simple -- page allocation could be in progress when a memory
hot-remove operation triggers a zonelist rebuild that removes zones.
The allocation request will still have a valid ac->preferred_zoneref that
is now pointing to NULL and triggers an OOM kill.
This problem probably always existed but may be slighly easier to trigger
due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
with pages managed by the buddy allocator") which distinguishes between
zones that are completely unpopulated versus zones that have valid pages
but they are all reserved. Memory hotplug had multiple stages with
timing considerations around managed/present page updates, the zonelist
rebuild and the zone span updates. As David Hildenbrand puts it
memory offlining adjusts managed+present pages of the zone
essentially in one go. If after the adjustments, the zone is no
longer populated (present==0), we rebuild the zone lists.
Once that's done, we try shrinking the zone (start+spanned
pages) -- which results in zone_start_pfn == 0 if there are no
more pages. That happens *after* rebuilding the zonelists via
remove_pfn_range_from_zone().
The only requirement to fix the race is that a page allocation request
identifies when a zonelist rebuild has happened since the allocation
request started and no page has yet been allocated. Use a seqlock_t to track
zonelist updates with a lockless read-side of the zonelist and protecting
the rebuild and update of the counter with a spinlock.
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
---
mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..216e21048ddf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
EXPORT_SYMBOL_GPL(fs_reclaim_release);
#endif
+/*
+ * Zonelists may change due to hotplug during allocation. Detect when zonelists
+ * have been rebuilt so allocation retries. Reader side does not lock and
+ * retries the allocation if zonelist changes. Writer side is protected by the
+ * embedded spin_lock.
+ */
+DEFINE_SEQLOCK(zonelist_update_seq);
+
+static unsigned int zonelist_iter_begin(void)
+{
+ return read_seqbegin(&zonelist_update_seq);
+}
+
+static unsigned int check_retry_zonelist(unsigned int seq)
+{
+ return read_seqretry(&zonelist_update_seq, seq);
+}
+
/* Perform direct synchronous page reclaim */
static unsigned long
__perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int compaction_retries;
int no_progress_loops;
unsigned int cpuset_mems_cookie;
+ unsigned int zonelist_iter_cookie;
int reserve_flags;
/*
@@ -5011,11 +5030,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
gfp_mask &= ~__GFP_ATOMIC;
-retry_cpuset:
+restart:
compaction_retries = 0;
no_progress_loops = 0;
compact_priority = DEF_COMPACT_PRIORITY;
cpuset_mems_cookie = read_mems_allowed_begin();
+ zonelist_iter_cookie = zonelist_iter_begin();
/*
* The fast path uses conservative alloc_flags to succeed only until
@@ -5187,9 +5207,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto retry;
- /* Deal with possible cpuset update races before we start OOM killing */
- if (check_retry_cpuset(cpuset_mems_cookie, ac))
- goto retry_cpuset;
+ /*
+ * Deal with possible cpuset update races or zonelist updates to avoid
+ * a unnecessary OOM kill.
+ */
+ if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+ check_retry_zonelist(zonelist_iter_cookie))
+ goto restart;
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
@@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
int nid;
int __maybe_unused cpu;
pg_data_t *self = data;
- static DEFINE_SPINLOCK(lock);
- spin_lock(&lock);
+ write_seqlock(&zonelist_update_seq);
#ifdef CONFIG_NUMA
memset(node_load, 0, sizeof(node_load));
@@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
#endif
}
- spin_unlock(&lock);
+ write_sequnlock(&zonelist_update_seq);
}
static noinline void __init
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply related [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-23 15:14 ` Mel Gorman
0 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2022-08-23 15:14 UTC (permalink / raw)
To: Michal Hocko
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
> > I think I agree that 6aa303defb74 is most likely not the origin of this.
> > It could only have been the origin in weird corner cases where we
> > actually succeed offlining one memory block (adjust present+managed) and
> > end up with managed=0 and present!=0 -- which barely happens in
> > practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> > that adjusts managed pages dynamically and might provoke such a
> > situation on ZONE_MOVABLE)
>
> OK, thanks for the correction David. Then I would agree that Fixes tag
> could be more confusing than helpful and your above summary would be a
> great part of the changelog.
>
Given that 6aa303defb74 still makes it potentially worse, it's as good a
Fixes-by as any given that anything prior to that commit would need careful
examination. The race changes shape going further back in time until memory
hot-remove was initially added and if someone needs to go that far back,
they'll also need to check if the ZLC needs special treatment.
Provisional patch and changelog is below. I'd still like to get a Tested-by
from Patrick to confirm it still fixes the problem before posting formally.
--8<--
mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
Patrick Daly reported the following problem;
NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
[0] - ZONE_MOVABLE
[1] - ZONE_NORMAL
[2] - NULL
For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
memory_offline operation removes the last page from ZONE_MOVABLE,
build_all_zonelists() & build_zonerefs_node() will update
node_zonelists as shown below. Only populated zones are added.
NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
[0] - ZONE_NORMAL
[1] - NULL
[2] - NULL
The race is simple -- page allocation could be in progress when a memory
hot-remove operation triggers a zonelist rebuild that removes zones.
The allocation request will still have a valid ac->preferred_zoneref that
is now pointing to NULL and triggers an OOM kill.
This problem probably always existed but may be slighly easier to trigger
due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
with pages managed by the buddy allocator") which distinguishes between
zones that are completely unpopulated versus zones that have valid pages
but they are all reserved. Memory hotplug had multiple stages with
timing considerations around managed/present page updates, the zonelist
rebuild and the zone span updates. As David Hildenbrand puts it
memory offlining adjusts managed+present pages of the zone
essentially in one go. If after the adjustments, the zone is no
longer populated (present==0), we rebuild the zone lists.
Once that's done, we try shrinking the zone (start+spanned
pages) -- which results in zone_start_pfn == 0 if there are no
more pages. That happens *after* rebuilding the zonelists via
remove_pfn_range_from_zone().
The only requirement to fix the race is that a page allocation request
identifies when a zonelist rebuild has happened since the allocation
request started and no page has yet been allocated. Use a seqlock_t to track
zonelist updates with a lockless read-side of the zonelist and protecting
the rebuild and update of the counter with a spinlock.
Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>
---
mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
1 file changed, 30 insertions(+), 7 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..216e21048ddf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
EXPORT_SYMBOL_GPL(fs_reclaim_release);
#endif
+/*
+ * Zonelists may change due to hotplug during allocation. Detect when zonelists
+ * have been rebuilt so allocation retries. Reader side does not lock and
+ * retries the allocation if zonelist changes. Writer side is protected by the
+ * embedded spin_lock.
+ */
+DEFINE_SEQLOCK(zonelist_update_seq);
+
+static unsigned int zonelist_iter_begin(void)
+{
+ return read_seqbegin(&zonelist_update_seq);
+}
+
+static unsigned int check_retry_zonelist(unsigned int seq)
+{
+ return read_seqretry(&zonelist_update_seq, seq);
+}
+
/* Perform direct synchronous page reclaim */
static unsigned long
__perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int compaction_retries;
int no_progress_loops;
unsigned int cpuset_mems_cookie;
+ unsigned int zonelist_iter_cookie;
int reserve_flags;
/*
@@ -5011,11 +5030,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
gfp_mask &= ~__GFP_ATOMIC;
-retry_cpuset:
+restart:
compaction_retries = 0;
no_progress_loops = 0;
compact_priority = DEF_COMPACT_PRIORITY;
cpuset_mems_cookie = read_mems_allowed_begin();
+ zonelist_iter_cookie = zonelist_iter_begin();
/*
* The fast path uses conservative alloc_flags to succeed only until
@@ -5187,9 +5207,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto retry;
- /* Deal with possible cpuset update races before we start OOM killing */
- if (check_retry_cpuset(cpuset_mems_cookie, ac))
- goto retry_cpuset;
+ /*
+ * Deal with possible cpuset update races or zonelist updates to avoid
+ * a unnecessary OOM kill.
+ */
+ if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+ check_retry_zonelist(zonelist_iter_cookie))
+ goto restart;
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
@@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
int nid;
int __maybe_unused cpu;
pg_data_t *self = data;
- static DEFINE_SPINLOCK(lock);
- spin_lock(&lock);
+ write_seqlock(&zonelist_update_seq);
#ifdef CONFIG_NUMA
memset(node_load, 0, sizeof(node_load));
@@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
#endif
}
- spin_unlock(&lock);
+ write_sequnlock(&zonelist_update_seq);
}
static noinline void __init
^ permalink raw reply related [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 15:14 ` Mel Gorman
@ 2022-08-23 15:38 ` Michal Hocko
-1 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 15:38 UTC (permalink / raw)
To: Mel Gorman
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 16:14:15, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
> > > I think I agree that 6aa303defb74 is most likely not the origin of this.
> > > It could only have been the origin in weird corner cases where we
> > > actually succeed offlining one memory block (adjust present+managed) and
> > > end up with managed=0 and present!=0 -- which barely happens in
> > > practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> > > that adjusts managed pages dynamically and might provoke such a
> > > situation on ZONE_MOVABLE)
> >
> > OK, thanks for the correction David. Then I would agree that Fixes tag
> > could be more confusing than helpful and your above summary would be a
> > great part of the changelog.
> >
>
> Given that 6aa303defb74 still makes it potentially worse, it's as good a
> Fixes-by as any given that anything prior to that commit would need careful
> examination. The race changes shape going further back in time until memory
> hot-remove was initially added and if someone needs to go that far back,
> they'll also need to check if the ZLC needs special treatment.
>
> Provisional patch and changelog is below. I'd still like to get a Tested-by
> from Patrick to confirm it still fixes the problem before posting formally.
>
> --8<--
> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
>
> Patrick Daly reported the following problem;
>
> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
> [0] - ZONE_MOVABLE
> [1] - ZONE_NORMAL
> [2] - NULL
>
> For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
> offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
> memory_offline operation removes the last page from ZONE_MOVABLE,
> build_all_zonelists() & build_zonerefs_node() will update
> node_zonelists as shown below. Only populated zones are added.
>
> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
> [0] - ZONE_NORMAL
> [1] - NULL
> [2] - NULL
>
> The race is simple -- page allocation could be in progress when a memory
> hot-remove operation triggers a zonelist rebuild that removes zones.
> The allocation request will still have a valid ac->preferred_zoneref that
> is now pointing to NULL and triggers an OOM kill.
>
> This problem probably always existed but may be slighly easier to trigger
> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator") which distinguishes between
> zones that are completely unpopulated versus zones that have valid pages
> but they are all reserved. Memory hotplug had multiple stages with
> timing considerations around managed/present page updates, the zonelist
> rebuild and the zone span updates. As David Hildenbrand puts it
>
> memory offlining adjusts managed+present pages of the zone
> essentially in one go. If after the adjustments, the zone is no
> longer populated (present==0), we rebuild the zone lists.
>
> Once that's done, we try shrinking the zone (start+spanned
> pages) -- which results in zone_start_pfn == 0 if there are no
> more pages. That happens *after* rebuilding the zonelists via
> remove_pfn_range_from_zone().
>
> The only requirement to fix the race is that a page allocation request
> identifies when a zonelist rebuild has happened since the allocation
> request started and no page has yet been allocated. Use a seqlock_t to track
> zonelist updates with a lockless read-side of the zonelist and protecting
> the rebuild and update of the counter with a spinlock.
>
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
> Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: <stable@vger.kernel.org>
> ---
> mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
> 1 file changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..216e21048ddf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
> EXPORT_SYMBOL_GPL(fs_reclaim_release);
> #endif
>
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so allocation retries. Reader side does not lock and
> + * retries the allocation if zonelist changes. Writer side is protected by the
> + * embedded spin_lock.
> + */
> +DEFINE_SEQLOCK(zonelist_update_seq);
> +
> +static unsigned int zonelist_iter_begin(void)
> +{
You likely want something like
if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
return read_seqbegin(&zonelist_update_seq);
return 0;
> + return read_seqbegin(&zonelist_update_seq);
> +}
> +
> +static unsigned int check_retry_zonelist(unsigned int seq)
> +{
if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
return read_seqretry(&zonelist_update_seq, seq);
return seq;
> + return read_seqretry(&zonelist_update_seq, seq);
> +}
> +
to avoid overhead on systems without HOTREMOVE configured.
Other than that LGTM.
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> /* Perform direct synchronous page reclaim */
> static unsigned long
> __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int compaction_retries;
> int no_progress_loops;
> unsigned int cpuset_mems_cookie;
> + unsigned int zonelist_iter_cookie;
> int reserve_flags;
>
> /*
> @@ -5011,11 +5030,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> gfp_mask &= ~__GFP_ATOMIC;
>
> -retry_cpuset:
> +restart:
> compaction_retries = 0;
> no_progress_loops = 0;
> compact_priority = DEF_COMPACT_PRIORITY;
> cpuset_mems_cookie = read_mems_allowed_begin();
> + zonelist_iter_cookie = zonelist_iter_begin();
>
> /*
> * The fast path uses conservative alloc_flags to succeed only until
> @@ -5187,9 +5207,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> goto retry;
>
>
> - /* Deal with possible cpuset update races before we start OOM killing */
> - if (check_retry_cpuset(cpuset_mems_cookie, ac))
> - goto retry_cpuset;
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * a unnecessary OOM kill.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> + goto restart;
>
> /* Reclaim has failed us, start killing things */
> page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
> int nid;
> int __maybe_unused cpu;
> pg_data_t *self = data;
> - static DEFINE_SPINLOCK(lock);
>
> - spin_lock(&lock);
> + write_seqlock(&zonelist_update_seq);
>
> #ifdef CONFIG_NUMA
> memset(node_load, 0, sizeof(node_load));
> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> #endif
> }
>
> - spin_unlock(&lock);
> + write_sequnlock(&zonelist_update_seq);
> }
>
> static noinline void __init
--
Michal Hocko
SUSE Labs
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-23 15:38 ` Michal Hocko
0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2022-08-23 15:38 UTC (permalink / raw)
To: Mel Gorman
Cc: David Hildenbrand, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue 23-08-22 16:14:15, Mel Gorman wrote:
> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
> > > I think I agree that 6aa303defb74 is most likely not the origin of this.
> > > It could only have been the origin in weird corner cases where we
> > > actually succeed offlining one memory block (adjust present+managed) and
> > > end up with managed=0 and present!=0 -- which barely happens in
> > > practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
> > > that adjusts managed pages dynamically and might provoke such a
> > > situation on ZONE_MOVABLE)
> >
> > OK, thanks for the correction David. Then I would agree that Fixes tag
> > could be more confusing than helpful and your above summary would be a
> > great part of the changelog.
> >
>
> Given that 6aa303defb74 still makes it potentially worse, it's as good a
> Fixes-by as any given that anything prior to that commit would need careful
> examination. The race changes shape going further back in time until memory
> hot-remove was initially added and if someone needs to go that far back,
> they'll also need to check if the ZLC needs special treatment.
>
> Provisional patch and changelog is below. I'd still like to get a Tested-by
> from Patrick to confirm it still fixes the problem before posting formally.
>
> --8<--
> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
>
> Patrick Daly reported the following problem;
>
> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
> [0] - ZONE_MOVABLE
> [1] - ZONE_NORMAL
> [2] - NULL
>
> For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
> offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
> memory_offline operation removes the last page from ZONE_MOVABLE,
> build_all_zonelists() & build_zonerefs_node() will update
> node_zonelists as shown below. Only populated zones are added.
>
> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
> [0] - ZONE_NORMAL
> [1] - NULL
> [2] - NULL
>
> The race is simple -- page allocation could be in progress when a memory
> hot-remove operation triggers a zonelist rebuild that removes zones.
> The allocation request will still have a valid ac->preferred_zoneref that
> is now pointing to NULL and triggers an OOM kill.
>
> This problem probably always existed but may be slighly easier to trigger
> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> with pages managed by the buddy allocator") which distinguishes between
> zones that are completely unpopulated versus zones that have valid pages
> but they are all reserved. Memory hotplug had multiple stages with
> timing considerations around managed/present page updates, the zonelist
> rebuild and the zone span updates. As David Hildenbrand puts it
>
> memory offlining adjusts managed+present pages of the zone
> essentially in one go. If after the adjustments, the zone is no
> longer populated (present==0), we rebuild the zone lists.
>
> Once that's done, we try shrinking the zone (start+spanned
> pages) -- which results in zone_start_pfn == 0 if there are no
> more pages. That happens *after* rebuilding the zonelists via
> remove_pfn_range_from_zone().
>
> The only requirement to fix the race is that a page allocation request
> identifies when a zonelist rebuild has happened since the allocation
> request started and no page has yet been allocated. Use a seqlock_t to track
> zonelist updates with a lockless read-side of the zonelist and protecting
> the rebuild and update of the counter with a spinlock.
>
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
> Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: <stable@vger.kernel.org>
> ---
> mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
> 1 file changed, 30 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..216e21048ddf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
> EXPORT_SYMBOL_GPL(fs_reclaim_release);
> #endif
>
> +/*
> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> + * have been rebuilt so allocation retries. Reader side does not lock and
> + * retries the allocation if zonelist changes. Writer side is protected by the
> + * embedded spin_lock.
> + */
> +DEFINE_SEQLOCK(zonelist_update_seq);
> +
> +static unsigned int zonelist_iter_begin(void)
> +{
You likely want something like
if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
return read_seqbegin(&zonelist_update_seq);
return 0;
> + return read_seqbegin(&zonelist_update_seq);
> +}
> +
> +static unsigned int check_retry_zonelist(unsigned int seq)
> +{
if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
return read_seqretry(&zonelist_update_seq, seq);
return seq;
> + return read_seqretry(&zonelist_update_seq, seq);
> +}
> +
to avoid overhead on systems without HOTREMOVE configured.
Other than that LGTM.
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> /* Perform direct synchronous page reclaim */
> static unsigned long
> __perform_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int compaction_retries;
> int no_progress_loops;
> unsigned int cpuset_mems_cookie;
> + unsigned int zonelist_iter_cookie;
> int reserve_flags;
>
> /*
> @@ -5011,11 +5030,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> gfp_mask &= ~__GFP_ATOMIC;
>
> -retry_cpuset:
> +restart:
> compaction_retries = 0;
> no_progress_loops = 0;
> compact_priority = DEF_COMPACT_PRIORITY;
> cpuset_mems_cookie = read_mems_allowed_begin();
> + zonelist_iter_cookie = zonelist_iter_begin();
>
> /*
> * The fast path uses conservative alloc_flags to succeed only until
> @@ -5187,9 +5207,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> goto retry;
>
>
> - /* Deal with possible cpuset update races before we start OOM killing */
> - if (check_retry_cpuset(cpuset_mems_cookie, ac))
> - goto retry_cpuset;
> + /*
> + * Deal with possible cpuset update races or zonelist updates to avoid
> + * a unnecessary OOM kill.
> + */
> + if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
> + check_retry_zonelist(zonelist_iter_cookie))
> + goto restart;
>
> /* Reclaim has failed us, start killing things */
> page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> @@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
> int nid;
> int __maybe_unused cpu;
> pg_data_t *self = data;
> - static DEFINE_SPINLOCK(lock);
>
> - spin_lock(&lock);
> + write_seqlock(&zonelist_update_seq);
>
> #ifdef CONFIG_NUMA
> memset(node_load, 0, sizeof(node_load));
> @@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
> #endif
> }
>
> - spin_unlock(&lock);
> + write_sequnlock(&zonelist_update_seq);
> }
>
> static noinline void __init
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 15:38 ` Michal Hocko
@ 2022-08-23 15:51 ` David Hildenbrand
-1 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2022-08-23 15:51 UTC (permalink / raw)
To: Michal Hocko, Mel Gorman
Cc: Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross
On 23.08.22 17:38, Michal Hocko wrote:
> On Tue 23-08-22 16:14:15, Mel Gorman wrote:
>> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
>>>> I think I agree that 6aa303defb74 is most likely not the origin of this.
>>>> It could only have been the origin in weird corner cases where we
>>>> actually succeed offlining one memory block (adjust present+managed) and
>>>> end up with managed=0 and present!=0 -- which barely happens in
>>>> practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
>>>> that adjusts managed pages dynamically and might provoke such a
>>>> situation on ZONE_MOVABLE)
>>>
>>> OK, thanks for the correction David. Then I would agree that Fixes tag
>>> could be more confusing than helpful and your above summary would be a
>>> great part of the changelog.
>>>
>>
>> Given that 6aa303defb74 still makes it potentially worse, it's as good a
>> Fixes-by as any given that anything prior to that commit would need careful
>> examination. The race changes shape going further back in time until memory
>> hot-remove was initially added and if someone needs to go that far back,
>> they'll also need to check if the ZLC needs special treatment.
>>
>> Provisional patch and changelog is below. I'd still like to get a Tested-by
>> from Patrick to confirm it still fixes the problem before posting formally.
>>
>> --8<--
>> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
>>
>> Patrick Daly reported the following problem;
>>
>> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
>> [0] - ZONE_MOVABLE
>> [1] - ZONE_NORMAL
>> [2] - NULL
>>
>> For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
>> offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
>> memory_offline operation removes the last page from ZONE_MOVABLE,
>> build_all_zonelists() & build_zonerefs_node() will update
>> node_zonelists as shown below. Only populated zones are added.
>>
>> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
>> [0] - ZONE_NORMAL
>> [1] - NULL
>> [2] - NULL
>>
>> The race is simple -- page allocation could be in progress when a memory
>> hot-remove operation triggers a zonelist rebuild that removes zones.
>> The allocation request will still have a valid ac->preferred_zoneref that
>> is now pointing to NULL and triggers an OOM kill.
>>
>> This problem probably always existed but may be slighly easier to trigger
s/slighly/slightly/
>> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
>> with pages managed by the buddy allocator") which distinguishes between
>> zones that are completely unpopulated versus zones that have valid pages
>> but they are all reserved. Memory hotplug had multiple stages with
Not necessarily reserved, simply not managed by the buddy (e.g., early
allocations, memory ballooning / virtio-mem).
>> timing considerations around managed/present page updates, the zonelist
>> rebuild and the zone span updates. As David Hildenbrand puts it
>>
>> memory offlining adjusts managed+present pages of the zone
>> essentially in one go. If after the adjustments, the zone is no
>> longer populated (present==0), we rebuild the zone lists.
>>
>> Once that's done, we try shrinking the zone (start+spanned
>> pages) -- which results in zone_start_pfn == 0 if there are no
>> more pages. That happens *after* rebuilding the zonelists via
>> remove_pfn_range_from_zone().
>>
>> The only requirement to fix the race is that a page allocation request
>> identifies when a zonelist rebuild has happened since the allocation
>> request started and no page has yet been allocated. Use a seqlock_t to track
>> zonelist updates with a lockless read-side of the zonelist and protecting
>> the rebuild and update of the counter with a spinlock.
>>
>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
>> Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Cc: <stable@vger.kernel.org>
>> ---
>> mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
>> 1 file changed, 30 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e5486d47406e..216e21048ddf 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
>> EXPORT_SYMBOL_GPL(fs_reclaim_release);
>> #endif
>>
>> +/*
>> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
>> + * have been rebuilt so allocation retries. Reader side does not lock and
>> + * retries the allocation if zonelist changes. Writer side is protected by the
>> + * embedded spin_lock.
>> + */
>> +DEFINE_SEQLOCK(zonelist_update_seq);
>> +
>> +static unsigned int zonelist_iter_begin(void)
>> +{
>
> You likely want something like
> if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> return read_seqbegin(&zonelist_update_seq);
> return 0;
>
>> + return read_seqbegin(&zonelist_update_seq);
>> +}
>> +
>> +static unsigned int check_retry_zonelist(unsigned int seq)
>> +{
> if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> return read_seqretry(&zonelist_update_seq, seq);
> return seq;
>
>> + return read_seqretry(&zonelist_update_seq, seq);
>> +}
>> +
>
> to avoid overhead on systems without HOTREMOVE configured.
>
> Other than that LGTM.
> Acked-by: Michal Hocko <mhocko@suse.com>
Makes sense to me, although I wonder how much it will matter in practice.
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Thanks,
David / dhildenb
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-23 15:51 ` David Hildenbrand
0 siblings, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2022-08-23 15:51 UTC (permalink / raw)
To: Michal Hocko, Mel Gorman
Cc: Patrick Daly, linux-mm, linux-arm-kernel, Juergen Gross
On 23.08.22 17:38, Michal Hocko wrote:
> On Tue 23-08-22 16:14:15, Mel Gorman wrote:
>> On Tue, Aug 23, 2022 at 03:57:38PM +0200, Michal Hocko wrote:
>>>> I think I agree that 6aa303defb74 is most likely not the origin of this.
>>>> It could only have been the origin in weird corner cases where we
>>>> actually succeed offlining one memory block (adjust present+managed) and
>>>> end up with managed=0 and present!=0 -- which barely happens in
>>>> practice: especially for ZONE_MOVABLE. (yeah, there is memory ballooning
>>>> that adjusts managed pages dynamically and might provoke such a
>>>> situation on ZONE_MOVABLE)
>>>
>>> OK, thanks for the correction David. Then I would agree that Fixes tag
>>> could be more confusing than helpful and your above summary would be a
>>> great part of the changelog.
>>>
>>
>> Given that 6aa303defb74 still makes it potentially worse, it's as good a
>> Fixes-by as any given that anything prior to that commit would need careful
>> examination. The race changes shape going further back in time until memory
>> hot-remove was initially added and if someone needs to go that far back,
>> they'll also need to check if the ZLC needs special treatment.
>>
>> Provisional patch and changelog is below. I'd still like to get a Tested-by
>> from Patrick to confirm it still fixes the problem before posting formally.
>>
>> --8<--
>> mm/page_alloc: Fix race condition between build_all_zonelists and page allocation
>>
>> Patrick Daly reported the following problem;
>>
>> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
>> [0] - ZONE_MOVABLE
>> [1] - ZONE_NORMAL
>> [2] - NULL
>>
>> For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the
>> offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent
>> memory_offline operation removes the last page from ZONE_MOVABLE,
>> build_all_zonelists() & build_zonerefs_node() will update
>> node_zonelists as shown below. Only populated zones are added.
>>
>> NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
>> [0] - ZONE_NORMAL
>> [1] - NULL
>> [2] - NULL
>>
>> The race is simple -- page allocation could be in progress when a memory
>> hot-remove operation triggers a zonelist rebuild that removes zones.
>> The allocation request will still have a valid ac->preferred_zoneref that
>> is now pointing to NULL and triggers an OOM kill.
>>
>> This problem probably always existed but may be slighly easier to trigger
s/slighly/slightly/
>> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
>> with pages managed by the buddy allocator") which distinguishes between
>> zones that are completely unpopulated versus zones that have valid pages
>> but they are all reserved. Memory hotplug had multiple stages with
Not necessarily reserved, simply not managed by the buddy (e.g., early
allocations, memory ballooning / virtio-mem).
>> timing considerations around managed/present page updates, the zonelist
>> rebuild and the zone span updates. As David Hildenbrand puts it
>>
>> memory offlining adjusts managed+present pages of the zone
>> essentially in one go. If after the adjustments, the zone is no
>> longer populated (present==0), we rebuild the zone lists.
>>
>> Once that's done, we try shrinking the zone (start+spanned
>> pages) -- which results in zone_start_pfn == 0 if there are no
>> more pages. That happens *after* rebuilding the zonelists via
>> remove_pfn_range_from_zone().
>>
>> The only requirement to fix the race is that a page allocation request
>> identifies when a zonelist rebuild has happened since the allocation
>> request started and no page has yet been allocated. Use a seqlock_t to track
>> zonelist updates with a lockless read-side of the zonelist and protecting
>> the rebuild and update of the counter with a spinlock.
>>
>> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
>> Reported-by: Patrick Daly <quic_pdaly@quicinc.com>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> Cc: <stable@vger.kernel.org>
>> ---
>> mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++-------
>> 1 file changed, 30 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e5486d47406e..216e21048ddf 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
>> EXPORT_SYMBOL_GPL(fs_reclaim_release);
>> #endif
>>
>> +/*
>> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
>> + * have been rebuilt so allocation retries. Reader side does not lock and
>> + * retries the allocation if zonelist changes. Writer side is protected by the
>> + * embedded spin_lock.
>> + */
>> +DEFINE_SEQLOCK(zonelist_update_seq);
>> +
>> +static unsigned int zonelist_iter_begin(void)
>> +{
>
> You likely want something like
> if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> return read_seqbegin(&zonelist_update_seq);
> return 0;
>
>> + return read_seqbegin(&zonelist_update_seq);
>> +}
>> +
>> +static unsigned int check_retry_zonelist(unsigned int seq)
>> +{
> if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> return read_seqretry(&zonelist_update_seq, seq);
> return seq;
>
>> + return read_seqretry(&zonelist_update_seq, seq);
>> +}
>> +
>
> to avoid overhead on systems without HOTREMOVE configured.
>
> Other than that LGTM.
> Acked-by: Michal Hocko <mhocko@suse.com>
Makes sense to me, although I wonder how much it will matter in practice.
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
2022-08-23 15:51 ` David Hildenbrand
@ 2022-08-24 9:45 ` Mel Gorman
-1 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2022-08-24 9:45 UTC (permalink / raw)
To: David Hildenbrand
Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue, Aug 23, 2022 at 05:51:25PM +0200, David Hildenbrand wrote:
> >> The race is simple -- page allocation could be in progress when a memory
> >> hot-remove operation triggers a zonelist rebuild that removes zones.
> >> The allocation request will still have a valid ac->preferred_zoneref that
> >> is now pointing to NULL and triggers an OOM kill.
> >>
> >> This problem probably always existed but may be slighly easier to trigger
>
> s/slighly/slightly/
>
Fixed.
> >> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> >> with pages managed by the buddy allocator") which distinguishes between
> >> zones that are completely unpopulated versus zones that have valid pages
> >> but they are all reserved. Memory hotplug had multiple stages with
>
> Not necessarily reserved, simply not managed by the buddy (e.g., early
> allocations, memory ballooning / virtio-mem).
>
Fair point, I filed all that under "reserved" but you're right, this is
clearer.
> >> +/*
> >> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> >> + * have been rebuilt so allocation retries. Reader side does not lock and
> >> + * retries the allocation if zonelist changes. Writer side is protected by the
> >> + * embedded spin_lock.
> >> + */
> >> +DEFINE_SEQLOCK(zonelist_update_seq);
> >> +
> >> +static unsigned int zonelist_iter_begin(void)
> >> +{
> >
> > You likely want something like
> > if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> > return read_seqbegin(&zonelist_update_seq);
> > return 0;
> >
> >> + return read_seqbegin(&zonelist_update_seq);
> >> +}
> >> +
> >> +static unsigned int check_retry_zonelist(unsigned int seq)
> >> +{
> > if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> > return read_seqretry(&zonelist_update_seq, seq);
> > return seq;
> >
> >> + return read_seqretry(&zonelist_update_seq, seq);
> >> +}
> >> +
> >
> > to avoid overhead on systems without HOTREMOVE configured.
> >
> > Other than that LGTM.
> > Acked-by: Michal Hocko <mhocko@suse.com>
>
It's a minor saving but fair enough. HOTREMOVE is now the only reason
zonelists can be rebuilt. While that was not always true, if it ever
changes again, it's a simple fix.
Thanks Michal.
> Makes sense to me, although I wonder how much it will matter in practice.
>
Probably none at all as it's one branch but it's still valid.
> Reviewed-by: David Hildenbrand <david@redhat.com>
>
Thanks David.
--
Mel Gorman
SUSE Labs
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 39+ messages in thread* Re: Race condition in build_all_zonelists() when offlining movable zone
@ 2022-08-24 9:45 ` Mel Gorman
0 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2022-08-24 9:45 UTC (permalink / raw)
To: David Hildenbrand
Cc: Michal Hocko, Patrick Daly, linux-mm, linux-arm-kernel,
Juergen Gross
On Tue, Aug 23, 2022 at 05:51:25PM +0200, David Hildenbrand wrote:
> >> The race is simple -- page allocation could be in progress when a memory
> >> hot-remove operation triggers a zonelist rebuild that removes zones.
> >> The allocation request will still have a valid ac->preferred_zoneref that
> >> is now pointing to NULL and triggers an OOM kill.
> >>
> >> This problem probably always existed but may be slighly easier to trigger
>
> s/slighly/slightly/
>
Fixed.
> >> due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones
> >> with pages managed by the buddy allocator") which distinguishes between
> >> zones that are completely unpopulated versus zones that have valid pages
> >> but they are all reserved. Memory hotplug had multiple stages with
>
> Not necessarily reserved, simply not managed by the buddy (e.g., early
> allocations, memory ballooning / virtio-mem).
>
Fair point, I filed all that under "reserved" but you're right, this is
clearer.
> >> +/*
> >> + * Zonelists may change due to hotplug during allocation. Detect when zonelists
> >> + * have been rebuilt so allocation retries. Reader side does not lock and
> >> + * retries the allocation if zonelist changes. Writer side is protected by the
> >> + * embedded spin_lock.
> >> + */
> >> +DEFINE_SEQLOCK(zonelist_update_seq);
> >> +
> >> +static unsigned int zonelist_iter_begin(void)
> >> +{
> >
> > You likely want something like
> > if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> > return read_seqbegin(&zonelist_update_seq);
> > return 0;
> >
> >> + return read_seqbegin(&zonelist_update_seq);
> >> +}
> >> +
> >> +static unsigned int check_retry_zonelist(unsigned int seq)
> >> +{
> > if (IS_ENABLED(CONFIG_MEMORY_HOTREMOVE))
> > return read_seqretry(&zonelist_update_seq, seq);
> > return seq;
> >
> >> + return read_seqretry(&zonelist_update_seq, seq);
> >> +}
> >> +
> >
> > to avoid overhead on systems without HOTREMOVE configured.
> >
> > Other than that LGTM.
> > Acked-by: Michal Hocko <mhocko@suse.com>
>
It's a minor saving but fair enough. HOTREMOVE is now the only reason
zonelists can be rebuilt. While that was not always true, if it ever
changes again, it's a simple fix.
Thanks Michal.
> Makes sense to me, although I wonder how much it will matter in practice.
>
Probably none at all as it's one branch but it's still valid.
> Reviewed-by: David Hildenbrand <david@redhat.com>
>
Thanks David.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread