Re: Race condition in build_all_zonelists() when offlining movable zone

From: Mel Gorman <mgorman@suse.de>
To: Michal Hocko <mhocko@suse.com>
Cc: Patrick Daly <quic_pdaly@quicinc.com>,
	linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org,
	David Hildenbrand <david@redhat.com>,
	Juergen Gross <jgross@suse.com>
Subject: Re: Race condition in build_all_zonelists() when offlining movable zone
Date: Wed, 17 Aug 2022 11:40:28 +0100	[thread overview]
Message-ID: <20220817104028.uin7cmkb4qlpgfbi@suse.de> (raw)
In-Reply-To: <YvyRvxO+FHMyuGn3@dhcp22.suse.cz>

On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote:
> > In order to address that, we should either have to call first_zones_zonelist
> > inside get_page_from_freelist if the zoneref doesn't correspond to a
> > real zone in the zonelist or we should revisit my older approach
> > referenced above.
> 
> Would this work? It is not really great to pay an overhead for unlikely
> event in the hot path but we might use a similar trick to check_retry_cpuset
> in the slowpath to detect this situation. 
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b0bcab50f0a3..bce786d7fcb4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
>  	 */
>  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
> +
> +	/*
> +	 * A race with memory offlining could alter zones on the zonelist
> +	 * e.g. dropping the top (movable) zone if it gets unpoppulated
> +	 * and so preferred_zoneref is not valid anymore
> +	 */
> +	if (unlikely(!ac->preferred_zoneref->zone))
> +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> +					ac->highest_zoneidx, ac->nodemask);
>  	z = ac->preferred_zoneref;
> +

ac->preferred_zoneref->zone could still be a valid pointer to a zone,
but an empty one so that would imply

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..38ce123af543 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (check_retry_cpuset(cpuset_mems_cookie, ac))
 		goto retry_cpuset;
 
+	/* Hotplug could have drained the preferred zone. */
+	if (!populated_zone(ac->preferred_zoneref->zone))
+		goto retry_cpuset;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)

But even that is fragile. If there were multiple zones in the zonelist
and the preferred zone was further down the list, the zone could still
be populated but a different zone than expected. It may be better to have
the same type of seq counter that restarts the allocation attempt if the
zonelist changes.

So.... this? It is seqcount only with a basic lock as there already is a
full lock on the writer side and it would appear to be overkill to protect
the reader side with read_seqbegin_or_lock as it complicates the writer side.

(boot tested only)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..158954b10724 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,22 @@ void fs_reclaim_release(gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
 
+/*
+ * Zonelists may change due to hotplug during allocation. Detect when zonelists
+ * have been rebuilt so allocation retries.
+ */
+static seqcount_t zonelist_update_seq = SEQCNT_ZERO(zonelist_update_seq);
+
+static unsigned int zonelist_update_begin(void)
+{
+	return read_seqcount_begin(&zonelist_update_seq);
+}
+
+static unsigned int zonelist_update_retry(unsigned int seq)
+{
+	return read_seqcount_retry(&zonelist_update_seq, seq);
+}
+
 /* Perform direct synchronous page reclaim */
 static unsigned long
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -5001,6 +5017,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int compaction_retries;
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
+	unsigned int zonelist_update_cookie;
 	int reserve_flags;
 
 	/*
@@ -5016,6 +5033,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	zonelist_update_cookie = zonelist_update_begin();
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -5191,6 +5209,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (check_retry_cpuset(cpuset_mems_cookie, ac))
 		goto retry_cpuset;
 
+	if (zonelist_update_retry(zonelist_update_cookie))
+		goto retry_cpuset;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
 	static DEFINE_SPINLOCK(lock);
 
 	spin_lock(&lock);
+	write_seqcount_begin(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
+	write_seqcount_end(&zonelist_update_seq);
 	spin_unlock(&lock);
 }
 
-- 
Mel Gorman
SUSE Labs

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel