From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 66E4EC25B08 for ; Wed, 17 Aug 2022 10:41:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References: Message-ID:Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=XTsDAmX10Dwh2n5KvVcaQteK9lPhpWIp6yf7Blq9OR8=; b=Na6i7GHeMcpzS3 6A/hERhi/UQpQbEO9dPqZzQ+wSPugJJPDNw15I0OMXIGeGqFTVq8/yDS4AT+cnCzPKWfeEPcyg3Pm EAT+PH0BQDhnvPKp6hidKY9Z1f8zwlm3en6biTaV9X0PKkx4X1qrvSEW4r1XUj5/zZAH1iCEKiqXp iGohXdzrYoXOPaVt9O/NzlKuTf7My6Eu7BKbgqtweUgtiJ5fusPcZhQzJB+w7+GRfPE16hDsuOJ7P Tmi7HyofCgx0Nf7CO7HT7whzN6hudT9vMyvXbcZpAa7+4xJzUin7+wtTB08MrvS417fV9v5bXmAMl S4q8gk1LUnwO8cWxuARQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1oOGTZ-000mVC-9g; Wed, 17 Aug 2022 10:40:41 +0000 Received: from smtp-out1.suse.de ([195.135.220.28]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1oOGTV-000mT8-CL for linux-arm-kernel@lists.infradead.org; Wed, 17 Aug 2022 10:40:39 +0000 Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out1.suse.de (Postfix) with ESMTP id 2C8EA34820; Wed, 17 Aug 2022 10:40:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_rsa; t=1660732834; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=7o5Whu2L2GIbaDR1rBs9yvoTnGPtapYA+1Mvdrnf19I=; b=RHCQkinvgsG2Gubh4UFRW+R9TgLm+idCWJcEfVceBE2hUbaNHhMYvBUFTRIPr8C+dBgF5i Q3SGUDIIMX7tO1OK7vaSaki2PK9QnLiR1zm1YfYTzmAXD68iM2B+5hYT6AKp4mcbZHCVxW 4OZaIXjoqOTbhXXuY/Swvto2XTnScnw= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de; s=susede2_ed25519; t=1660732834; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=7o5Whu2L2GIbaDR1rBs9yvoTnGPtapYA+1Mvdrnf19I=; b=IYcSmS1xoOgvy/omX4JuLkBMYJ1/BeySqvalYda5DITGL5l+4sBrjzBI/J3ZXoYbI2o2wd VwaBUVnhfEqpWuDg== Received: from suse.de (mgorman.udp.ovpn2.nue.suse.de [10.163.43.106]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 3ABDD2C178; Wed, 17 Aug 2022 10:40:33 +0000 (UTC) Date: Wed, 17 Aug 2022 11:40:28 +0100 From: Mel Gorman To: Michal Hocko Cc: Patrick Daly , linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, David Hildenbrand , Juergen Gross Subject: Re: Race condition in build_all_zonelists() when offlining movable zone Message-ID: <20220817104028.uin7cmkb4qlpgfbi@suse.de> References: <20220817034250.GB2473@hu-pdaly-lv.qualcomm.com> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220817_034037_614277_BF214805 X-CRM114-Status: GOOD ( 27.77 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote: > > In order to address that, we should either have to call first_zones_zonelist > > inside get_page_from_freelist if the zoneref doesn't correspond to a > > real zone in the zonelist or we should revisit my older approach > > referenced above. > > Would this work? It is not really great to pay an overhead for unlikely > event in the hot path but we might use a similar trick to check_retry_cpuset > in the slowpath to detect this situation. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index b0bcab50f0a3..bce786d7fcb4 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, > * See also __cpuset_node_allowed() comment in kernel/cpuset.c. > */ > no_fallback = alloc_flags & ALLOC_NOFRAGMENT; > + > + /* > + * A race with memory offlining could alter zones on the zonelist > + * e.g. dropping the top (movable) zone if it gets unpoppulated > + * and so preferred_zoneref is not valid anymore > + */ > + if (unlikely(!ac->preferred_zoneref->zone)) > + ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, > + ac->highest_zoneidx, ac->nodemask); > z = ac->preferred_zoneref; > + ac->preferred_zoneref->zone could still be a valid pointer to a zone, but an empty one so that would imply diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e5486d47406e..38ce123af543 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (check_retry_cpuset(cpuset_mems_cookie, ac)) goto retry_cpuset; + /* Hotplug could have drained the preferred zone. */ + if (!populated_zone(ac->preferred_zoneref->zone)) + goto retry_cpuset; + /* Reclaim has failed us, start killing things */ page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); if (page) But even that is fragile. If there were multiple zones in the zonelist and the preferred zone was further down the list, the zone could still be populated but a different zone than expected. It may be better to have the same type of seq counter that restarts the allocation attempt if the zonelist changes. So.... this? It is seqcount only with a basic lock as there already is a full lock on the writer side and it would appear to be overkill to protect the reader side with read_seqbegin_or_lock as it complicates the writer side. (boot tested only) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e5486d47406e..158954b10724 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4708,6 +4708,22 @@ void fs_reclaim_release(gfp_t gfp_mask) EXPORT_SYMBOL_GPL(fs_reclaim_release); #endif +/* + * Zonelists may change due to hotplug during allocation. Detect when zonelists + * have been rebuilt so allocation retries. + */ +static seqcount_t zonelist_update_seq = SEQCNT_ZERO(zonelist_update_seq); + +static unsigned int zonelist_update_begin(void) +{ + return read_seqcount_begin(&zonelist_update_seq); +} + +static unsigned int zonelist_update_retry(unsigned int seq) +{ + return read_seqcount_retry(&zonelist_update_seq, seq); +} + /* Perform direct synchronous page reclaim */ static unsigned long __perform_reclaim(gfp_t gfp_mask, unsigned int order, @@ -5001,6 +5017,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, int compaction_retries; int no_progress_loops; unsigned int cpuset_mems_cookie; + unsigned int zonelist_update_cookie; int reserve_flags; /* @@ -5016,6 +5033,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, no_progress_loops = 0; compact_priority = DEF_COMPACT_PRIORITY; cpuset_mems_cookie = read_mems_allowed_begin(); + zonelist_update_cookie = zonelist_update_begin(); /* * The fast path uses conservative alloc_flags to succeed only until @@ -5191,6 +5209,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (check_retry_cpuset(cpuset_mems_cookie, ac)) goto retry_cpuset; + if (zonelist_update_retry(zonelist_update_cookie)) + goto retry_cpuset; + /* Reclaim has failed us, start killing things */ page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress); if (page) @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data) static DEFINE_SPINLOCK(lock); spin_lock(&lock); + write_seqcount_begin(&zonelist_update_seq); #ifdef CONFIG_NUMA memset(node_load, 0, sizeof(node_load)); @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data) #endif } + write_seqcount_end(&zonelist_update_seq); spin_unlock(&lock); } -- Mel Gorman SUSE Labs _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel