From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from out-183.mta0.migadu.com (out-183.mta0.migadu.com [91.218.175.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AAF1A3939AF for ; Mon, 1 Jun 2026 09:54:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=91.218.175.183 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780307696; cv=none; b=HwkrP7NhFlHA1PxmHbig7+BwQWK44bwP3emYVDwgquBR2QIrHPsSNdE50BQRj25V35OM+rRURAAg8usGK99h2UoDbmRKvfUbc3o7FAJVjYtXPbOEGrt1XQJpRNyr6UJqN4g8URQ8UvO05oSpLt9QHuwO7juzDRtNCSk48CMPpos= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1780307696; c=relaxed/simple; bh=OGdFqrL9E3SKu1yTP8DKxMPkgRzCtP1c/TucLQDywyg=; h=Content-Type:Mime-Version:Subject:From:In-Reply-To:Date:Cc: Message-Id:References:To; b=dIF1yCtQSuhWI2sOsYkf1qgtAua+U36FkHOG5ARZarkS/fl4Zu+iXDgRUjeYpeYPzgyAUPZxG/ZZ6Musk96MPFCm1aeqmy7D87isPFLSYunxSPDT2VkMDqbbOiUwl23e4HIPQzLVjDePc0a1W8a7BcgrfScNFSkQgbR92OVOr9I= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=opmdGdka; arc=none smtp.client-ip=91.218.175.183 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="opmdGdka" Content-Type: text/plain; charset=us-ascii DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1780307681; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=okiM+ywmJfzOTYWBfyohEkr7uDVGfpbqCX1aCcdo0W4=; b=opmdGdka5O+wo3q8pHowq2rDoNiWwEqRQLMSPNt8JoZqoBfPy5UsmGuYotw3P6ts7Wip1b lKohENeRj+wfynbsP7USECZcFWlmg6AnJHdv+oVcW2IjMFsOBxZ653TnfS10iDnqBJUG1z JXEtE9wYw2lk5TIdUmvN1NnEaJ80wfo= Precedence: bulk X-Mailing-List: cgroups@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.600.51.1.1\)) Subject: Re: [PATCH] mm/list_lru: drain before clearing xarray entry on reparent X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Muchun Song In-Reply-To: <20260601063408.2879011-1-shakeel.butt@linux.dev> Date: Mon, 1 Jun 2026 17:54:01 +0800 Cc: Andrew Morton , Johannes Weiner , Dave Chinner , Roman Gushchin , Qi Zheng , Kairui Song , Meta kernel team , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Chris Mason Content-Transfer-Encoding: quoted-printable Message-Id: <79CD986A-2130-4FB8-804F-A543AF22342B@linux.dev> References: <20260601063408.2879011-1-shakeel.butt@linux.dev> To: Shakeel Butt X-Migadu-Flow: FLOW_OUT > On Jun 1, 2026, at 14:34, Shakeel Butt wrote: >=20 > memcg_reparent_list_lrus() clears the dying memcg's xarray entry with > xas_store(&xas, NULL) before reparenting its per-node lists into the > parent. This opens a window where a concurrent list_lru_del() arriving > for the dying memcg sees xa_load() =3D=3D NULL, walks to the parent in > lock_list_lru_of_memcg(), takes the parent's per-node lock, and calls > list_del_init() on an item still physically linked on the dying > memcg's list. >=20 > If another in-flight thread holds the dying memcg's per-node lock at > the same moment (another list_lru_del, or a list_lru_walk_one running > an isolate callback), both threads modify ->next/->prev pointers on = the > same physical list under different locks. Adjacent items can corrupt > each other's links. >=20 > Fix it by reversing the order: reparent each per-node list and mark = the > child's list lru dead and then clear the xarray entry. Any concurrent > list_lru op that finds the still-set xarray entry either takes the = dying > memcg's per-node lock (synchronizing with the drain) or sees LONG_MIN > and walks to the parent, where the items now live. >=20 > Fixes: fb56fdf8b9a2 ("mm/list_lru: split the lock to per-cgroup = scope") > Signed-off-by: Shakeel Butt > Reported-by: Chris Mason > --- > mm/list_lru.c | 20 +++++++++----------- > 1 file changed, 9 insertions(+), 11 deletions(-) >=20 > diff --git a/mm/list_lru.c b/mm/list_lru.c > index dd29bcf8eb5f..ae55a52307db 100644 > --- a/mm/list_lru.c > +++ b/mm/list_lru.c > @@ -473,26 +473,24 @@ void memcg_reparent_list_lrus(struct mem_cgroup = *memcg, struct mem_cgroup *paren > mutex_lock(&list_lrus_mutex); > list_for_each_entry(lru, &memcg_list_lrus, list) { > struct list_lru_memcg *mlru; > - XA_STATE(xas, &lru->xa, memcg->kmemcg_id); >=20 > - /* > - * Lock the Xarray to ensure no on going list_lru_memcg > - * allocation and further allocation will see = css_is_dying(). > - */ > - xas_lock_irq(&xas); > - mlru =3D xas_store(&xas, NULL); > - xas_unlock_irq(&xas); > + mlru =3D xa_load(&lru->xa, memcg->kmemcg_id); > if (!mlru) > continue; Is it possible that concurrent threads running memcg_list_lru_alloc() = could allocate a new mlru after this check passes? This could happen because = the threads haven't noticed css_is_dying() yet. We would consequently miss = the reparent operation for this list. So xas_lock_irq is necessary to = serialize CSS_DYING setting here. Right? Thanks. Muchun >=20 > /* > - * With Xarray value set to NULL, holding the lru lock = below > - * prevents list_lru_{add,del,isolate} from touching the = lru, > - * safe to reparent. > + * Reparent each per-node list and mark the child dead > + * (LONG_MIN) before clearing xarray entry otherwisw a > + * concurrent list_lru_del() may corrupt the list if it = arrives > + * after xarray clear but before reparenting as > + * lock_list_lru_of_memcg will acquire parent's lock = while the > + * item is still on child's list. > */ > for_each_node(i) > memcg_reparent_list_lru_one(lru, i, = &mlru->node[i], parent); >=20 > + xa_erase(&lru->xa, memcg->kmemcg_id); > + > /* > * Here all list_lrus corresponding to the cgroup are = guaranteed > * to remain empty, we can safely free this lru, any = further > --=20 > 2.52.0 >=20